Cassandra
Only available on Node.js.
Apache Cassandraยฎ is a NoSQL, row-oriented, highly scalable and highly available database.
The latest version of Apache Cassandra natively supports Vector Similarity Search.
Setupโ
- Create an Astra DB account.
- Create a vector enabled database.
- Download your secure connect bundle and application token on your database's "Connect" tab.
- Set up the following env vars:
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY_HERE
export CASSANDRA_SCB=YOUR_CASSANDRA_SCB_HERE
export CASSANDRA_TOKEN=YOUR_CASSANDRA_TOKEN_HERE
- Install the Cassandra Node.js driver.
- npm
- Yarn
- pnpm
npm install cassandra-driver
yarn add cassandra-driver
pnpm add cassandra-driver
Indexing docsโ
import { CassandraStore } from "langchain/vectorstores/cassandra";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
const config = {
cloud: {
secureConnectBundle: process.env.CASSANDRA_SCB as string,
},
credentials: {
username: "token",
password: process.env.CASSANDRA_TOKEN as string,
},
keyspace: "test",
dimensions: 1536,
table: "test",
indices: [{ name: "name", value: "(name)" }],
primaryKey: {
name: "id",
type: "int",
},
metadataColumns: [
{
name: "name",
type: "text",
},
],
};
const vectorStore = await CassandraStore.fromTexts(
["I am blue", "Green yellow purple", "Hello there hello"],
[
{ id: 2, name: "2" },
{ id: 1, name: "1" },
{ id: 3, name: "3" },
],
new OpenAIEmbeddings(),
cassandraConfig
);
Querying docsโ
const results = await vectorStore.similaritySearch("Green yellow purple", 1);
or filtered query:
const results = await vectorStore.similaritySearch("B", 1, { name: "Bubba" });
Vector Typesโ
Cassandra supports cosine
(the default), dot_product
, and euclidean
similarity search; this is defined when the
vector store is first created, and specifed in the constructor parameter vectorType
, for example:
...,
vectorType: "dot_product",
...
Non-Equality Filtersโ
By default, filters are applied with an equality =
. For those fields that have an indices
entry, you may
provide an operator
with a string of a value supported by the index; in this case, you specify one or
more filters, as either a singleton or in a list (which will be AND
-ed together). For example:
{ name: "create_datetime", operator: ">", value: some_datetime_variable }
or
[
{ userid: userid_variable },
{ name: "create_datetime", operator: ">", value: some_date_variable },
];
Data Partitioning and Composite Keysโ
In some systems, you may wish to partition the data for various reasons, perhaps by user or by session. Data in Cassandra
is always partitioned; by default this library will partition by the first primary key field. You may specify multiple
columns which comprise the primary (unique) key of a record, and optionally indicate those fields which should be
part of the partition key. For example, the vector store could be partitioned by both userid
and collectionid
, with
additional fields docid
and docpart
making an individual entry unique:
...,
primaryKey: [
{name: "userid", type: "text", partition: true},
{name: "collectionid", type: "text", partition: true},
{name: "docid", type: "text"},
{name: "docpart", type: "int"},
],
...
When searching, you may include partition keys on the filter without defining indices
for these columns; you do
not need to specify all partition keys, but must specify those in the key first. In the above example, you could
specify a filter of {userid: userid_variable}
and {userid: userid_variable, collectionid: collectionid_variable}
,
but if you wanted to specify a filter of only {collectionid: collectionid_variable}
you would have to include
collectionid
on the indices
list.
Additional Configuration Optionsโ
In the configuration document, further optional parameters are provided; their defaults are:
...,
indices: [],
maxConcurrency: 25,
batchSize: 1,
withClause: "",
...
Parameter | Usage |
---|---|
indices | Optional, but required if using filtered queries on non-partition columns. Each metadata column (e.g. <metadata_filter_column> ) to be indexed should appear as an entry in a list, in the format [{name: "<metadata_filter_column>", value: "(<metadata_filter_column>)"}] . |
maxConcurrency | How many concurrent requests will be sent to Cassandra at a given time. |
batchSize | How many documents will be sent on a single request to Cassandra. When using a value > 1, you should ensure your batch size will not exceed the Cassandra parameter batch_size_fail_threshold_in_kb . Batches are unlogged. |
withClause | Cassandra tables may be created with an optional WITH clause; this is generally not needed but provided for completeness. |