Extraction

Most APIs and databases still work with structured data, so it is often useful to extract structured information from unstructured text. Examples include:

  • Extracting a structured row to insert into a database from a sentence
  • Extracting multiple rows to insert into a database from a long document
  • Extracting the correct API parameters from a user query

This task is closely related to output parsing. Output parsers instruct the LLM to respond in a specific format; for extraction, that format describes the data you want to pull out of the document. In addition to the output format instructions, the prompt must also contain the text you want to extract information from.

Standard output parsers are good enough for basic structuring of response data, but extraction often calls for more complicated or nested structures.
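The output-parsing idea described above can be sketched without any library: format instructions go into the prompt next to the passage, and the raw response is parsed and checked against the expected shape. This is a minimal illustration only. `PersonRow`, `FORMAT_INSTRUCTIONS`, `buildExtractionPrompt`, and `parsePersonRows` are hypothetical names, not LangChain APIs, and the model response is a hard-coded stand-in.

```typescript
// Hypothetical row shape we want to extract (not part of any library).
interface PersonRow {
  name: string;
  age: number;
}

// Format instructions that would be placed in the prompt.
const FORMAT_INSTRUCTIONS =
  'Respond only with a JSON array of objects shaped like {"name": string, "age": number}.';

// Combine the format instructions with the text to extract from.
function buildExtractionPrompt(passage: string): string {
  return `${FORMAT_INSTRUCTIONS}\n\nPassage:\n${passage}`;
}

// Parse the raw model response and check each row against the expected shape.
function parsePersonRows(raw: string): PersonRow[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array");
  }
  return parsed.map((item: unknown, i: number): PersonRow => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`Row ${i} is not an object`);
    }
    const record = item as Record<string, unknown>;
    const name = record["name"];
    const age = record["age"];
    if (typeof name !== "string" || typeof age !== "number") {
      throw new Error(`Row ${i} does not match the expected shape`);
    }
    return { name, age };
  });
}

// Stand-in for a model response; a real one would come from the LLM.
const fakeResponse = '[{"name": "jane", "age": 2}, {"name": "bob", "age": 3}]';
const rows = parsePersonRows(fakeResponse);
console.log(rows);
```

A flat check like this gets unwieldy once the target structure is nested, which is where schema-driven approaches become more attractive.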

With tool/function calling

Tool/function calling is a powerful way to perform extraction. At a high level, function calling encourages the model to respond in a structured format. By specifying one or more JSON schemas that you want the LLM to use, you can guide the LLM to "fill in the blanks" and populate proper values for the keys in the JSON.
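To make "fill in the blanks" concrete, here is a hand-written sketch of a tool definition whose parameters are a JSON schema, plus a small check that a filled-in arguments object covers the required keys. The `person` tool, `keysToFill`, and `missingRequiredKeys` are illustrative names chosen for this sketch; the outer `type`/`function`/`parameters` layout follows OpenAI's tool format.

```typescript
// Hand-written tool definition for a hypothetical "person" tool.
const personTool = {
  type: "function",
  function: {
    name: "person",
    description: "A person",
    parameters: {
      type: "object",
      properties: {
        name: { type: "string", description: "The person's name" },
        age: { type: "string", description: "The person's age" },
      },
      required: ["name", "age"],
    },
  },
} as const;

// The "blanks" the model is asked to fill are the schema's property keys.
const keysToFill = Object.keys(personTool.function.parameters.properties);
console.log(keysToFill); // [ 'name', 'age' ]

// Hypothetical helper: list required keys missing from a filled-in object.
function missingRequiredKeys(args: Record<string, unknown>): string[] {
  return personTool.function.parameters.required.filter((key) => !(key in args));
}

console.log(missingRequiredKeys({ name: "jane", age: "2" })); // []
console.log(missingRequiredKeys({ name: "bob" })); // [ 'age' ]
```

Writing the schema by hand works for small cases; a schema library keeps the definition and its descriptions in one place as the structure grows.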

Here's a concrete example using OpenAI's tool calling features. Note that this requires either the gpt-3.5-turbo-1106 or the gpt-4-1106-preview model.

We'll use Zod, a popular open source package, to define the schema, and zod-to-json-schema to convert it into the JSON schema used by OpenAI's tool format:

$ npm install zod zod-to-json-schema

import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
import { ChatPromptTemplate } from "langchain/prompts";
import { ChatOpenAI } from "langchain/chat_models/openai";
import { JsonOutputToolsParser } from "langchain/output_parsers";

// System instructions telling the model what to extract.
const EXTRACTION_TEMPLATE = `Extract and save the relevant entities mentioned \
in the following passage together with their properties.

If a property is not present and is not required in the function parameters, do not include it in the output.`;

const prompt = ChatPromptTemplate.fromMessages([
  ["system", EXTRACTION_TEMPLATE],
  ["human", "{input}"],
]);

// Schema for a single extracted entity. Note that age is declared as a string,
// so the extracted values below are strings as well.
const person = z.object({
  name: z.string().describe("The person's name"),
  age: z.string().describe("The person's age"),
});

// Bind the schema to the model as an OpenAI tool.
const model = new ChatOpenAI({
  modelName: "gpt-3.5-turbo-1106",
  temperature: 0,
}).bind({
  tools: [
    {
      type: "function",
      function: {
        name: "person",
        description: "A person",
        parameters: zodToJsonSchema(person),
      },
    },
  ],
});

// Parse the tool calls in the model's response into plain objects.
const parser = new JsonOutputToolsParser();
const chain = prompt.pipe(model).pipe(parser);

const res = await chain.invoke({
  input: "jane is 2 and bob is 3",
});

console.log(res);
/*
  [
    { name: 'person', arguments: { name: 'jane', age: '2' } },
    { name: 'person', arguments: { name: 'bob', age: '3' } }
  ]
*/
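Since the goal is usually to insert the extracted entities somewhere, a short post-processing step can turn the parsed tool calls into typed rows. The sketch below assumes the result has the shape shown in the sample output above (`name` plus `arguments`), which may differ between library versions. `ToolCall`, `ExtractedPerson`, and `toPersonRows` are hypothetical names, and the input is a hard-coded copy of that sample rather than a live model response.

```typescript
// Assumed shape of one parsed tool call, taken from the sample output.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Hypothetical row type for a database insert, with a numeric age.
interface ExtractedPerson {
  name: string;
  age: number;
}

// Keep only well-formed "person" calls and convert age to a number.
function toPersonRows(calls: ToolCall[]): ExtractedPerson[] {
  const rows: ExtractedPerson[] = [];
  for (const call of calls) {
    if (call.name !== "person") continue;
    const name = call.arguments["name"];
    const rawAge = call.arguments["age"];
    let age = Number.NaN;
    if (typeof rawAge === "number") {
      age = rawAge;
    } else if (typeof rawAge === "string" && rawAge.trim() !== "") {
      age = Number(rawAge);
    }
    // Skip entries with a missing name or an age that is not a finite number.
    if (typeof name !== "string" || !Number.isFinite(age)) continue;
    rows.push({ name, age });
  }
  return rows;
}

// Hard-coded copy of the sample output above.
const sampleResult: ToolCall[] = [
  { name: "person", arguments: { name: "jane", age: "2" } },
  { name: "person", arguments: { name: "bob", age: "3" } },
];

console.log(toPersonRows(sampleResult));
// [ { name: 'jane', age: 2 }, { name: 'bob', age: 3 } ]
```

Skipping malformed entries silently is a design choice for this sketch; depending on the use case, logging or throwing on them may be preferable.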
