Introducing the GPTScript Knowledge Tool

In this post, we're excited to share the first release of the GPTScript Knowledge tool with you. This tool brings the power of RAG (Retrieval-Augmented Generation) to your GPTScript-powered applications, whether those are personal automations running on your workstation or full-fledged web apps serving your end users. To see the team discuss and demo the tool live (as it was at the time of recording), check out this previous livestream.

What makes the GPTScript Knowledge tool unique is that it's simple, fast, and turnkey. It's a single CLI that can ingest, store, and retrieve data from many sources. It ships with its own SQLite database, which stores data as flat files, so there are no external database dependencies for you to manage - just point the tool at your documents and get started.

In the rest of this article, we'll first explain how the knowledge tool works, then give you a practical example, and end with an explanation of more advanced use cases and configuration options. If you want to follow along, start by downloading a copy of the CLI from the GitHub releases page.

How it works

A classic RAG pipeline is defined by three key elements: ingestion, storage, and retrieval. As mentioned, this tool manages all three in a single CLI. We'll start by explaining ingestion and storage.

Ingestion & Storage

The knowledge tool allows you to start ingesting whole directories of files and asking questions about them. You can even organize them into logical units or knowledge bases called datasets. As of the time of writing this article (v0.1.5), the knowledge tool supports the following file types:

  • .pdf
  • .html
  • .md
  • .txt
  • .docx
  • .odt
  • .rtf
  • .csv
  • .ipynb
  • .json

We don't rely solely on the file extension - if none is present, we try to detect the MIME type to find the best available parser for the input file. For example, your Go code (.go file extension) is parsed as plain text. You can ingest your files one by one or per folder (optionally recursively) using a single command:

# a single file
knowledge ingest myfile.pdf

# a directory
knowledge ingest myfiles

# nested directories recursively
knowledge ingest -r mynestedfilesdir

For each input file, this command will:

  1. Detect the file type and, if it's supported, convert the file contents to plain text and some useful metadata (title, page numbers, etc. as available)
  2. Split the plain text data into smaller chunks using tiktoken - optimized for the target LLM (default is gpt-4)
  3. Send the chunks to an embedding model (default is OpenAI's text-embedding-ada-002) to retrieve the vectorized representation of the text
  4. Store the vector data into an embedded vector database
  5. Create a new record for that file in an embedded relational database, associating it with the target dataset (default dataset if not specified otherwise) and all chunks that we generated before (also called documents)
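
To make step 2 more concrete, here is a minimal sketch of splitting text into overlapping chunks. It is not the tool's actual implementation: whitespace-separated words stand in for tiktoken tokens, and the `Chunk` type, the `split_into_chunks` function, and the `chunk_size` and `overlap` values are hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One chunk ("document") produced from an ingested file (hypothetical type)."""
    filename: str
    index: int
    text: str


def split_into_chunks(text: str, filename: str, chunk_size: int = 8, overlap: int = 2) -> list[Chunk]:
    """Split text into overlapping windows of whitespace-separated words.

    Assumption: words stand in for tiktoken tokens; chunk_size and overlap
    are illustrative values, not the tool's defaults.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(Chunk(filename, len(chunks), " ".join(window)))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten eleven twelve"
    for chunk in split_into_chunks(sample, "myfile.txt"):
        print(chunk.index, repr(chunk.text))
```

The overlap means that a sentence cut at a chunk boundary still appears in full in at least one of the neighboring chunks, which tends to help the similarity search later on.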

💡 Note: We're using embedded databases, so the data is persisted as a bunch of files on your disk - you can copy them around and thus use your knowledge bases on other devices if you want.

Retrieval

Now that you have your data in a high dimensional vector space, you want to get answers to your questions out of there. To do so, run the following command: knowledge retrieve <prompt | query>

# this is an example where the whole Go codebase of the knowledge tool was ingested
knowledge retrieve "How do I ingest data into a dataset using the CLI?"

This does the following:

  1. Embed your query/prompt to make it ready for the search process, meaning that it's also sent to the embedding model
  2. Run a similarity search in the vector database (the current implementation uses the dot product to determine similarity between vectors, which is the same as their cosine similarity for normalized vectors)
  3. Return the Top-K (default number of results is 10) results matching your input in JSON format (LLMs are quite good at parsing JSON). Those results contain:
    1. The actual plain text
    2. Metadata, e.g. name of the source file, page number, etc. (useful for the LLM to give source references)
    3. A similarity score indicating how well this result document matched your query
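
The similarity search in step 2 can be sketched in a few lines of plain Python. This is a toy illustration, not the vector database's code: the function names are hypothetical, and the three-dimensional vectors in the example stand in for real embeddings. It shows why the dot product is enough once vectors are normalized, and how a Top-K selection could look.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two (not necessarily normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, documents, k=10):
    """Return the k (name, score) pairs with the highest dot product.

    documents maps a name to an already normalized vector.
    """
    q = normalize(query)
    scored = [(dot(q, vec), name) for name, vec in documents.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real ones have many more dimensions.
    docs = {
        "a": normalize([1.0, 0.0, 0.0]),
        "b": normalize([1.0, 1.0, 0.0]),
        "c": normalize([0.0, 0.0, 1.0]),
    }
    print(top_k([1.0, 0.1, 0.0], docs, k=2))
```

For unit-length vectors the denominator of the cosine formula is 1, so the dot product alone gives the same ranking with less work per comparison.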

💡 Note: This is currently a pretty naive RAG implementation in that we don't do any pre-processing of the input data (we do not generate document summaries to improve the search process or anything like that) and we also don't do any post-processing on the result set (e.g. no re-ranking). Those may improve the accuracy, but at the cost of more LLM calls. We will expose more features in the future.

Example

Now, let's go through a complete example: we're going to ingest all the code in the knowledge tool's pkg/ directory, which you can find on GitHub. The files in there are all .go files (treated as plain text), except for the OpenAPI specification in pkg/docs, which comes as JSON and YAML, and some test data (PDFs).

  1. First, let's create a new dataset, so we don't mix it with any other ingested data

    $ knowledge create-dataset knowledge-code
    2024/05/07 15:30:27 INFO Created dataset id=knowledge-code
    Created dataset "knowledge-code"

  2. Now, ingest the pkg/ directory into the dataset (recursively going through all directories), but ignore the PDFs, as we won't need them here

    # assuming that we're inside the 'knowledge' repository folder locally
    $ knowledge ingest -d knowledge-code -r --ignore-extensions ".pdf" pkg/
    2024/05/07 16:40:05 INFO Ingested document filename=reset.go count=1 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/cmd/reset.go
    # ... truncated ...
    2024/05/07 16:40:21 INFO Ingested document filename=swagger.json count=5 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/docs/swagger.json
    2024/05/07 16:40:21 INFO Ingested document filename=routes.go count=5 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/server/routes.go
    2024/05/07 16:40:23 INFO Ingested document filename=docs.go count=5 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/docs/docs.go
    Ingested 38 files from "pkg/" into dataset "knowledge-code"

    Here we go, we ingested 38 files into our new dataset.

  3. Next, let's ask the knowledge base how we can ingest data using the CLI

    $ knowledge retrieve -d knowledge-code "How can I ingest data into a dataset using the CLI?" Retrieved the following 5 sources for the query "How can I ingest data into a dataset using the CLI?" from dataset "knowledge-code": [{"content":" responses: \"200\": description: OK schema: $ref: '#/definitions/types.IngestResponse' summary: Ingest content into a dataset tags: - datasets /datasets/{id}/retrieve: post: consumes: - application/json description: Retrieve content from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string produces: - application/json responses: \"200\": description: OK schema: items: $ref: '#/definitions/vectorstore.Document' type: array summary: Retrieve content from a dataset tags: - datasets /datasets/create: post: consumes: - application/json description: Create a new dataset parameters: - description: Dataset object in: body name: dataset required: true schema: $ref: '#/definitions/types.Dataset' produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/types.Dataset' summary: Create a new dataset tags: - datasetsswagger: \"2.0\"","metadata":{"filename":"swagger.yaml"},"similarity_score":0.78012925},{"content":"package cmdimport (\t\"fmt\"\t\"github.com/gptscript-ai/knowledge/pkg/client\"\t\"github.com/spf13/cobra\"\t\"strings\")type ClientIngest struct {\tClient\tDataset string `usage:\"Target Dataset ID\" short:\"d\" default:\"default\" env:\"KNOW_TARGET_DATASET\"`\tIgnoreExtensions string `usage:\"Comma-separated list of file extensions to ignore\" env:\"KNOW_INGEST_IGNORE_EXTENSIONS\"`\tConcurrency int `usage:\"Number of concurrent ingestion processes\" short:\"c\" default:\"10\" env:\"KNOW_INGEST_CONCURRENCY\"`\tRecursive bool `usage:\"Recursively ingest directories\" short:\"r\" default:\"false\" env:\"KNOW_INGEST_RECURSIVE\"`}func (s *ClientIngest) Customize(cmd *cobra.Command) {\tcmd.Use = \"ingest [--dataset \u003cdataset-id\u003e] 
\u003cpath\u003e\"\tcmd.Short = \"Ingest a file/directory into a dataset (non-recursive)\"\tcmd.Args = cobra.ExactArgs(1)}func (s *ClientIngest) Run(cmd *cobra.Command, args []string) error {\tc, err := s.getClient()\tif err != nil {\t\treturn err\t}\tdatasetID := s.Dataset\tfilePath := args[0]\tingestOpts := \u0026client.IngestPathsOpts{\t\tIgnoreExtensions: strings.Split(s.IgnoreExtensions, \",\"),\t\tConcurrency: s.Concurrency,\t\tRecursive: s.Recursive,\t}\tfilesIngested, err := c.IngestPaths(cmd.Context(), datasetID, ingestOpts, filePath)\tif err != nil {\t\treturn err\t}\tfmt.Printf(\"Ingested %d files from %q into dataset %q\\", filesIngested, filePath, datasetID)\treturn nil}","metadata":{"filename":"ingest.go"},"similarity_score":0.77108264},{"content":"\", \"consumes\": [ \"application/json\" ], \"produces\": [ \"application/json\" ], \"tags\": [ \"datasets\" ], \"summary\": \"Retrieve content from a dataset\", \"parameters\": [ { \"type\": \"string\", \"description\": \"Dataset ID\", \"name\": \"id\", \"in\": \"path\", \"required\": true } ], \"responses\": { \"200\": { \"description\": \"OK\", \"schema\": { \"type\": \"array\", \"items\": { \"$ref\": \"#/definitions/vectorstore.Document\" } } } } } } }, \"definitions\": { \"gin.H\": { \"type\": \"object\", \"additionalProperties\": {} }, \"types.Dataset\": { \"type\": \"object\", \"required\": [ \"id\" ], \"properties\": { \"embed_dim\": { \"type\": \"integer\", \"default\": 1536, \"example\": 1536 }, \"id\": { \"description\": \"Dataset ID - must be a valid RFC 1123 hostname\", \"type\": \"string\", \"format\": \"hostname_rfc1123\", \"example\": \"asst-12345\" } } }, \"types.IngestResponse\": { \"type\": \"object\", \"properties\": { \"documents\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } } } }, \"vectorstore.Document\": { \"type\": \"object\", \"properties\": { \"content\": { \"type\": \"string\" }, \"metadata\": { \"type\": \"object\", \"additionalProperties\": {} }, 
\"similarity_score\": { \"type\": \"number\" } } } }}","metadata":{"filename":"swagger.json"},"similarity_score":0.75255966},{"content":" required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/types.Dataset' summary: Get a dataset tags: - datasets /datasets/{id}/documents/{doc_id}: delete: consumes: - application/json description: Remove a document from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string - description: Document ID in: path name: doc_id required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/gin.H' summary: Remove a document from a dataset tags: - datasets /datasets/{id}/files/{file_id}: delete: consumes: - application/json description: Remove a file from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string - description: File ID in: path name: file_id required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/gin.H' summary: Remove a file from a dataset tags: - datasets /datasets/{id}/ingest: post: consumes: - application/json description: Ingest content into a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/types.IngestResponse' summary: Ingest content into a dataset tags: - datasets /datasets/{id}/retrieve: post: consumes: - application/json description: Retrieve content from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true ","metadata":{"filename":"swagger.yaml"},"similarity_score":0.7509696},{"content":"package clientimport 
(\t\"context\"\t\"fmt\"\t\"github.com/acorn-io/z\"\t\"github.com/gptscript-ai/knowledge/pkg/datastore\"\t\"github.com/gptscript-ai/knowledge/pkg/index\"\t\"github.com/gptscript-ai/knowledge/pkg/types\"\t\"github.com/gptscript-ai/knowledge/pkg/vectorstore\"\t\"os\"\t\"path/filepath\")type StandaloneClient struct {\t*datastore.Datastore}func NewStandaloneClient(ds*datastore.Datastore) (*StandaloneClient, error) {\treturn \u0026StandaloneClient{\t\tDatastore: ds,\t}, nil}func (c*StandaloneClient) CreateDataset(ctx context.Context, datasetID string) (types.Dataset, error) {\tds := types.Dataset{\t\tID: datasetID,\t\tEmbedDimension: nil,\t}\terr := c.Datastore.NewDataset(ctx, ds)\tif err != nil {\t\treturn ds, err\t}\treturn ds, nil}func (c *StandaloneClient) DeleteDataset(ctx context.Context, datasetID string) error {\treturn c.Datastore.DeleteDataset(ctx, datasetID)}func (c*StandaloneClient) GetDataset(ctx context.Context, datasetID string) (*index.Dataset, error) {\treturn c.Datastore.GetDataset(ctx, datasetID)}func (c*StandaloneClient) ListDatasets(ctx context.Context) ([]types.Dataset, error) {\treturn c.Datastore.ListDatasets(ctx)}func (c *StandaloneClient) Ingest(ctx context.Context, datasetID string, data []byte, opts datastore.IngestOpts) ([]string, error) {\treturn c.Datastore.Ingest(ctx, datasetID, data, opts)}func (c*StandaloneClient) IngestPaths(ctx context.Context, datasetID string, opts *IngestPathsOpts, paths ...string) (int, error) {\tingestFile := func(path string) error {\t\t// Gather metadata\t\tfinfo, err := os.Stat(path)\t\tif err != nil {\t\t\treturn fmt.Errorf(\"failed to stat file %s: %w\", path, err)\t\t}\t\tabspath, err := filepath.Abs(path)\t\tif err != nil {\t\t\treturn fmt.Errorf(\"failed to get absolute path for %s: %w\", path, err)\t\t}\t\tfile, err := os.ReadFile(path)\t\tif err != nil {\t\t\treturn fmt.Errorf(\"failed","metadata":{"filename":"standalone.go"},"similarity_score":0.7508112}]

    As we can see, the top 5 sources found by the similarity search have been returned in a JSON formatted list including metadata and the respective similarity scores.
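
    If you want to post-process these results yourself, the JSON list is easy to work with. The sketch below assumes the JSON array has already been separated from the leading "Retrieved the following ..." status line. The sample input is a shortened, hand-written stand-in for the output above, and `summarize_sources` and its `min_score` parameter are hypothetical names, not part of the tool.

```python
import json

# Shortened stand-in for the CLI output above; the content fields are truncated by hand.
raw = """[
  {"content": "package cmd ...", "metadata": {"filename": "ingest.go"}, "similarity_score": 0.77108264},
  {"content": "responses: ...", "metadata": {"filename": "swagger.yaml"}, "similarity_score": 0.78012925}
]"""


def summarize_sources(raw_json, min_score=0.0):
    """Return (filename, similarity_score) pairs, best match first.

    min_score is a hypothetical cut-off for dropping weak matches.
    """
    sources = json.loads(raw_json)
    kept = [s for s in sources if s["similarity_score"] >= min_score]
    kept.sort(key=lambda s: s["similarity_score"], reverse=True)
    return [(s["metadata"].get("filename", "<unknown>"), s["similarity_score"]) for s in kept]


if __name__ == "__main__":
    for filename, score in summarize_sources(raw):
        print(f"{score:.3f}  {filename}")
```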

  4. Finally, let's wrap this in GPTScript to see what the LLM replies based on the returned sources

4.1 Create a basic GPTScript with the question bundled right into it, and save it as demo.gpt

tools: retrieve

As per the dataset "knowledge-code", how can I ingest all files in the "pkg/" directory into a dataset using the CLI?

---
name: retrieve
description: Retrieve sources for an input query from datasets using the knowledge tool
args: dataset: The ID of the dataset to retrieve information from
args: prompt: Query string for the similarity search

#!knowledge retrieve -d ${dataset} ${prompt}

4.2 Now run it

$ gptscript demo.gpt
16:42:13 started [main]
16:42:13 sent [main]
16:42:13 started [retrieve(2)] [input={"dataset":"knowledge-code","prompt":"ingest files in directory CLI command"}]
16:42:13 sent [retrieve(2)]
16:42:14 ended [retrieve(2)] [output=Retrieved the following 5 sources for the query "ingest files in directory CLI command" from dataset "knowledge-code": [{"content":"package cmd"" # ... truncated (same as above) ... "metadata":{"filename":"swagger.json"},"similarity_score":0.72029245}] ]
16:42:14 continue [main]
16:42:14 sent [main]
content [1] content | Waiting for model response...
content [1] content | To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command:
content [1] content |
content [1] content | \`\`\`
content [1] content | ingest [--dataset <dataset-id>] <path>
content [1] content | \`\`\`
content [1] content |
content [1] content | This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.
16:42:20 ended [main] [output=To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command: \`\`\` ingest [--dataset <dataset-id>] <path> \`\`\` This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.]

OUTPUT:

To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command: \`\`\` ingest [--dataset <dataset-id>] <path> \`\`\` This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.

    The effective output of this is the following, which is quite a bit more helpful than the JSON blob before:

To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command:
ingest [--dataset <dataset-id>] <path>
This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.

Advanced - A GPTScript for the entire knowledge pipeline

You can of course also include all the other steps we did above in a GPTScript. We have an example ready in the GitHub repository.

tools: create_dataset, sys.find, ingest, retrieve, delete_dataset, get_dataset, list_datasets, uuidgen

Create a new Knowledge Base Dataset with a random unique ID, if it doesn't exist yet. Then, ingest the directory pkg/ into the dataset. Then, retrieve from the knowledge base how I can ingest the current working directory into a dataset from the CLI.

---
name: create_dataset
description: Create a new Dataset in the Knowledge Base
args: id: ID of the Dataset

#!knowledge create-dataset ${id}

---
name: ingest
description: Ingest a file or all files from a directory into a Knowledge Base Dataset
args: id: ID of the Dataset
args: filepath: Path to the file or directory to be ingested

#!knowledge ingest -d ${id} -r ${filepath}

---
name: retrieve
description: Retrieve information from a Knowledge Base Dataset
args: id: ID of the Dataset
args: query: Query to be executed against the Knowledge Base Dataset

#!knowledge retrieve -k 10 -d ${id} ${query}

---
name: delete_dataset
description: Delete a Dataset from the Knowledge Base
args: id: ID of the Dataset

#!knowledge delete-dataset ${id}

---
name: get_dataset
description: Get a Dataset from the Knowledge Base
args: id: ID of the Dataset

#!knowledge get-dataset ${id}

---
name: list_datasets
description: List all Datasets in the Knowledge Base

#!knowledge list-datasets

---
name: uuidgen
description: Generate a random UUID

#!uuidgen

Knowledge API - Client / Server Setup

You can run the Knowledge tool as a REST API using the knowledge server command. Visit localhost:8000/v1/docs for the OpenAPI v2 / Swagger documentation. You can also use the same CLI commands as above against a running server by setting the environment variable KNOW_SERVER_URL=http://localhost:8000/v1.
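
If you drive the CLI from another program, switching between standalone and client/server mode comes down to one environment variable. The following is a rough sketch under that assumption: `build_retrieve_call` and `retrieve` are hypothetical helper names, and only the `knowledge retrieve -d` invocation and `KNOW_SERVER_URL` come from this article.

```python
import os
import subprocess


def build_retrieve_call(dataset, query, server_url=None):
    """Return (argv, env) for a `knowledge retrieve` call.

    If server_url is given, KNOW_SERVER_URL is set so that the CLI talks to
    a running server; otherwise the caller's environment is passed through.
    """
    argv = ["knowledge", "retrieve", "-d", dataset, query]
    env = dict(os.environ)
    if server_url is not None:
        env["KNOW_SERVER_URL"] = server_url
    return argv, env


def retrieve(dataset, query, server_url=None):
    """Run the CLI and return its standard output (needs `knowledge` on PATH)."""
    argv, env = build_retrieve_call(dataset, query, server_url)
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `retrieve("knowledge-code", "How can I ingest data?", "http://localhost:8000/v1")` would send the query to a server started with knowledge server, while omitting the last argument leaves the environment unchanged.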

See what's inside your datasets

Use knowledge get-dataset to see what files you have in a dataset or use knowledge list-datasets to get an overview of your created datasets.

Options

Here are some of the important top-level configuration options (set via environment variable or CLI flag):

  • OPENAI_BASE_URL to configure the OpenAI API endpoint (default: https://api.openai.com/v1)
  • OPENAI_API_KEY to configure your OpenAI API Key (default: sk-foo which is invalid)
  • OPENAI_EMBEDDING_MODEL to configure which embedding model should be used (default: text-embedding-ada-002)
  • KNOW_SERVER_URL to configure a remote Knowledge API Server if you're not running in standalone client mode

Wrapping Up

As mentioned, you can try out the GPTScript knowledge tool by downloading the CLI from the GitHub releases page.

The knowledge tool is under active development, so expect changes in the future, including:

  • Exposure of new configuration options, e.g. chunk size and overlap
  • Improved file parsing, e.g. supporting more encodings for PDF
  • New options for pre- and post-processing to improve results
  • Interoperability with other models behind non-OpenAI-style APIs

So stay tuned here on our blog or on our YouTube channel for more exciting developments.