August 19th 2023

Posted on Aug 18 • Originally published at neuml.hashnode.dev

Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.

While semantic search adds amazing capabilities, sparse keyword indexes can still add value. There may be cases where finding an exact match is important, or we just want a fast index to quickly do an initial scan of a dataset.

Both methods have their merits. What if we combine them to build a unified hybrid search capability? Can we get the best of both worlds?

This article will explore the benefits of hybrid search.

Install txtai and all dependencies.

Before diving into the benchmarks, let's briefly discuss how semantic and keyword search work.

Semantic search uses large language models to vectorize inputs into arrays of numbers. Similar concepts will have similar values. The vectors are typically stored in a vector database, which is a system that specializes in storing these numerical arrays and finding matches. Vector search transforms an input query into a vector and then runs a search to find the best conceptual results.

Keyword search tokenizes text into lists of tokens per document. These tokens are aggregated into token frequencies per document and stored in term frequency sparse arrays. At search time, the query is tokenized and the tokens of the query are compared to the tokens in the dataset. This is a more literal process. Keyword search is like string matching: it has no conceptual understanding and matches on characters and bytes.

Hybrid search combines the scores from semantic and keyword indexes.
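
As a rough illustration of the difference between the two index types described above, the following Python sketch scores the same query both ways. It is not how txtai implements either index: the three-dimensional vectors are hand-made stand-ins for language model output, and `tokenize`, `keyword_score` and `cosine` are hypothetical helper names introduced only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query, document):
    """Sum how often each query token literally appears in the document."""
    frequencies = Counter(tokenize(document))
    return sum(frequencies[token] for token in tokenize(query))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


documents = [
    "The car would not start this morning",
    "Fresh bread from the bakery",
]

# Hypothetical vectors standing in for what a language model would produce
vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]

query = "automobile engine trouble"
query_vector = [0.8, 0.2, 0.1]  # hypothetical query vector

# Keyword search: literal token overlap between query and document
keyword_scores = [keyword_score(query, doc) for doc in documents]

# Semantic search: similarity between query vector and document vectors
semantic_scores = [cosine(query_vector, vec) for vec in vectors]

print(keyword_scores)
print(semantic_scores)
```

The query shares no tokens with either document, so the keyword score is zero for both. The vector comparison still ranks the first document higher, because its (assumed) vector points in a similar direction to the query vector.
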
Given that semantic search scores are typically 0 - 1 and keyword search scores are unbounded, a method is needed to combine the results.

The two methods supported in txtai are convex combination, used when sparse scores are normalized, and reciprocal rank fusion (RRF), used when they aren't. The default method in txtai is convex combination and we'll use that.

Now it's time to benchmark the results. For these tests, we'll use the BEIR dataset. We'll also use a benchmarks script from the txtai project, which has methods to work with the BEIR dataset.

We'll select a subset of the BEIR sources for brevity. For each source, we'll benchmark a bm25 index, an embeddings index and a hybrid (combined) index.

Now let's run the benchmarks.

The results show the metrics per source and method.

The table headers list the source (dataset), index method, NDCG@10/MAP@10/RECALL@10/P@10 accuracy metrics, index time (s), search time (s) and memory usage (MB). The tables are sorted by NDCG@10 descending.

Looking at the results, we can see that hybrid search often performs better than embeddings or bm25 individually. In some cases, as with scidocs, the combination performs worse. But in aggregate, the scores are better. This holds true for the entire BEIR dataset. For some sources bm25 does best, for others embeddings do, but overall the combined hybrid scores do best.

Hybrid search isn't free, though: it is slower, as it has extra logic to combine the results. For individual queries, the difference is often negligible.

This article covered ways to improve search accuracy using a hybrid approach. We evaluated performance over a subset of the BEIR dataset to show how hybrid search, in many situations, can improve overall accuracy.

Custom datasets can also be evaluated using this method as specified in this link. This article and the associated benchmarks script can be reused to evaluate what method works best on your data.
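
To illustrate how a convex combination can merge bounded semantic scores with unbounded keyword scores, here is a minimal Python sketch. It assumes min-max normalization of the keyword scores and an equal weighting of 0.5. The function names, the `weight` parameter and the sample scores are hypothetical, and txtai's internal implementation may differ.

```python
def normalize(scores):
    """Min-max scale a {id: score} mapping into the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are equal, so treat every result as a full match
        return {uid: 1.0 for uid in scores}
    return {uid: (score - low) / (high - low) for uid, score in scores.items()}


def convex_combination(dense, sparse, weight=0.5):
    """
    Combine semantic (dense) and keyword (sparse) scores.

    weight is applied to the dense score and (1 - weight) to the
    normalized sparse score. Ids missing from one side contribute 0.
    """
    sparse = normalize(sparse)
    combined = {
        uid: weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0)
        for uid in set(dense) | set(sparse)
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only
dense_scores = {"a": 0.82, "b": 0.40, "c": 0.65}   # semantic, already 0 - 1
sparse_scores = {"a": 2.0, "b": 12.0, "d": 7.0}    # keyword, unbounded

results = convex_combination(dense_scores, sparse_scores)
for uid, score in results:
    print(uid, round(score, 3))
```

In this toy data, document "b" has a modest semantic score but the strongest keyword match, so it moves to the top once both signals are weighted equally. Changing `weight` shifts the balance toward one index or the other.
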


This post first appeared on VedVyas Articles, please read the original post: here