Similarity Score Threshold
A common problem with similarity search is that you have to supply a k value, which determines how many similar results are returned. But what if you don't know the right k value? What if you want the system to return every relevant result?
As a real-world scenario, imagine a very long document, written by a product manager, that describes a product. It might describe 10, 15, 20, 100 or more features. How do you choose a k value so that the system returns every relevant result for the question "What are all the features that product X has?"
To solve this problem, LangChain offers a feature called Recursive Similarity Search. With it, you can run a similarity search without relying solely on the k value. The system returns every result whose similarity score meets the minimum you specify, up to a configurable maximum number of results.
To use Recursive Similarity Search, create a retriever from a vector store.
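The idea described above can be sketched as a loop that keeps widening k until a result falls below the threshold or a cap is reached. This is a minimal illustration, not LangChain's actual implementation: `topK`, `searchAboveThreshold`, and the hard-coded scores are hypothetical stand-ins for a real vector store and its embeddings, and only the option names mirror the retriever's configuration.

```typescript
// Hypothetical types and helpers for illustration only.
type Scored = { text: string; score: number };

interface ThresholdOptions {
  minSimilarityScore: number; // keep results scoring at least this much
  maxK: number; // never request more than this many results
  kIncrement: number; // how much to widen k on each pass
}

// Stand-in for a vector store's top-k search: the k highest-scoring items.
function topK(corpus: Scored[], k: number): Scored[] {
  return [...corpus].sort((a, b) => b.score - a.score).slice(0, k);
}

function searchAboveThreshold(
  corpus: Scored[],
  opts: ThresholdOptions
): Scored[] {
  if (opts.kIncrement < 1 || opts.maxK < 1) {
    throw new Error("kIncrement and maxK must be at least 1");
  }
  let k = Math.min(opts.kIncrement, opts.maxK);
  for (;;) {
    const results = topK(corpus, k);
    const hits = results.filter((r) => r.score >= opts.minSimilarityScore);
    const sawBelowThreshold = hits.length < results.length;
    const corpusExhausted = results.length < k;
    // Stop once widening k can no longer add results above the threshold.
    if (sawBelowThreshold || corpusExhausted || k >= opts.maxK) {
      return hits;
    }
    k = Math.min(k + opts.kIncrement, opts.maxK);
  }
}

// Made-up scores, as if already computed against a single query.
const scoredCorpus: Scored[] = [
  { text: "a", score: 0.95 },
  { text: "b", score: 0.93 },
  { text: "c", score: 0.92 },
  { text: "d", score: 0.91 },
  { text: "e", score: 0.9 },
  { text: "f", score: 0.8 },
  { text: "g", score: 0.78 },
];

const found = searchAboveThreshold(scoredCorpus, {
  minSimilarityScore: 0.9,
  maxK: 100,
  kIncrement: 2,
});
console.log(found.length); // 5
```

With these numbers the loop requests 2, then 4, then 6 results, and stops at 6 because the sixth result scores below 0.9. A larger kIncrement means fewer searches but may fetch more results than needed on the last pass.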
Usage
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { ScoreThresholdRetriever } from "langchain/retrievers/score_threshold";

const vectorStore = await MemoryVectorStore.fromTexts(
  [
    "Buildings are made out of brick",
    "Buildings are made out of wood",
    "Buildings are made out of stone",
    "Buildings are made out of atoms",
    "Buildings are made out of building materials",
    "Cars are made out of metal",
    "Cars are made out of plastic",
  ],
  // One metadata object per text, in the same order
  [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }, { id: 6 }, { id: 7 }],
  new OpenAIEmbeddings()
);

const retriever = ScoreThresholdRetriever.fromVectorStore(vectorStore, {
  minSimilarityScore: 0.9, // Finds results with at least this similarity score
  maxK: 100, // The maximum K value to use. Set it based on your chunk size to make sure you don't run out of tokens
  kIncrement: 2, // How much to increase K by each time. It'll fetch N results, then N + kIncrement, then N + kIncrement * 2, etc.
});

const result = await retriever.getRelevantDocuments(
  "What are buildings made out of?"
);
console.log(result);

/*
  [
    Document {
      pageContent: 'Buildings are made out of building materials',
      metadata: { id: 5 }
    },
    Document {
      pageContent: 'Buildings are made out of wood',
      metadata: { id: 2 }
    },
    Document {
      pageContent: 'Buildings are made out of brick',
      metadata: { id: 1 }
    },
    Document {
      pageContent: 'Buildings are made out of stone',
      metadata: { id: 3 }
    },
    Document {
      pageContent: 'Buildings are made out of atoms',
      metadata: { id: 4 }
    }
  ]
*/
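The two "Cars" sentences are absent from the output above because their scores fall below minSimilarityScore. As a rough illustration of what such a score measures, the sketch below computes cosine similarity between toy vectors and applies the same 0.9 cutoff. It assumes the store scores by cosine similarity, which the text above does not state, and the two-dimensional vectors are invented for illustration, since real embeddings have hundreds or thousands of dimensions.

```typescript
// Cosine similarity between two equal-length vectors: 1 means same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (hypothetical values, not produced by any real model).
const queryVec = [1, 0];
const nearVec = [0.9, 0.1]; // points almost the same way as the query
const farVec = [0.1, 0.9]; // points in a very different direction

const threshold = 0.9;
console.log(cosineSimilarity(queryVec, nearVec) >= threshold); // true
console.log(cosineSimilarity(queryVec, farVec) >= threshold); // false
```

Scores depend on the embedding model, so a threshold that works for one model may be too strict or too loose for another.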
API Reference:
- MemoryVectorStore from langchain/vectorstores/memory
- OpenAIEmbeddings from langchain/embeddings/openai
- ScoreThresholdRetriever from langchain/retrievers/score_threshold