In this post, we're excited to share the first release of the GPTScript Knowledge tool with you. This tool brings the power of RAG (Retrieval Augmented Generation) to your GPTScript-powered applications, whether those are personal automations running on your workstation or full-fledged web apps serving your end users. To see the team discuss and demo this tool, check out this previous livestream.
What makes the GPTScript Knowledge tool unique is that it's simple, fast, and turnkey. It's a single CLI that can ingest, store, and retrieve data from many sources. It ships with its own embedded SQLite database that stores data as flat files, so there are no external database dependencies for you to manage - just point the tool at your documents and get started.
In the rest of this article, we'll first explain how the knowledge tool works, then give you a practical example, and end with a look at more advanced use cases and configuration options. If you want to follow along, start by downloading a copy of the CLI from the GitHub releases page.
A classic RAG pipeline is defined by 3 key elements: ingestion, storage and retrieval. As mentioned, this tool manages all three of those in a single CLI. We'll start by explaining ingestion and storage.
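To give you a feel for the overall flow before we go into the details, here is a minimal sketch of the whole pipeline using the commands covered in the rest of this post (the dataset name and paths are just placeholders):

# create a knowledge base (dataset), ingest a directory into it, then query it
knowledge create-dataset my-docs
knowledge ingest -d my-docs -r ./docs
knowledge retrieve -d my-docs "What does the architecture look like?"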
The knowledge tool allows you to start ingesting whole directories of files and asking questions about them. You can even organize them into logical units or knowledge bases called datasets. As of the time of writing this article (v0.1.5), the knowledge tool supports the following file types:
We don't act purely based on the file extension - if none is present, we try to detect the MIME type to find the best available parser for the input file. For example, your Go code (.go file extension) is parsed as plain text. You can ingest your files one by one or (recursively) per folder using a single command:
# a single file
knowledge ingest myfile.pdf

# a directory
knowledge ingest myfiles

# nested directories recursively
knowledge ingest -r mynestedfilesdir
For each input file, this command will:
- pick the best available parser and extract the text content
- split the content into smaller chunks
- generate an embedding vector for each chunk
- store the chunks, their embeddings, and some metadata (e.g. the filename) in the target dataset
💡 Note: We're using embedded databases, so the data is persisted as a bunch of files on your disk - you can copy them around and thus use your knowledge bases on other devices if you want.
Now that your data lives in a high-dimensional vector space, you want to get answers to your questions out of it. To do so, run the following command:

knowledge retrieve <prompt | query>
# this is an example where the whole Go codebase of the knowledge tool was ingested
knowledge retrieve "How do I ingest data into a dataset using the CLI?"
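Like ingest, retrieve accepts -d to target a specific dataset; the -k flag (which you'll see again in the full GPTScript example later in this post) controls how many sources come back. A quick sketch with placeholder values:

# query a specific dataset and ask for the top 10 sources
knowledge retrieve -d knowledge-code -k 10 "How do I ingest data into a dataset using the CLI?"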
This does the following:
- generate an embedding for your query, the same way document chunks were embedded during ingestion
- run a similarity search against the dataset's vector database
- return the most similar chunks (sources) as JSON, including their metadata and similarity scores
💡 Note: This is currently a pretty naive RAG implementation: we don't do any pre-processing of the input data (e.g. generating document summaries to improve the search process), and we don't do any post-processing on the result set (e.g. re-ranking). Both could improve accuracy, but at the cost of more LLM calls. We will expose more features in the future.
Now, let's go through a complete example: we're going to ingest the code in the knowledge tool's pkg/ directory, which you can find on GitHub. The files in there are all .go files (treated as plain text), except for the OpenAPI specification, which is included as JSON and YAML in pkg/docs, and some test data (PDFs).
First, let's create a new dataset so we don't mix this data with anything else we've ingested:
$ knowledge create-dataset knowledge-code
2024/05/07 15:30:27 INFO Created dataset id=knowledge-code
Created dataset "knowledge-code"
Now, ingest the pkg/ directory into the dataset (recursively going through all directories), but ignore the PDFs, as we won't need them here:
# assuming that we're inside the 'knowledge' repository folder locally
$ knowledge ingest -d knowledge-code -r --ignore-extensions ".pdf" pkg/
2024/05/07 16:40:05 INFO Ingested document filename=reset.go count=1 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/cmd/reset.go
# ... truncated ...
2024/05/07 16:40:21 INFO Ingested document filename=swagger.json count=5 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/docs/swagger.json
2024/05/07 16:40:21 INFO Ingested document filename=routes.go count=5 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/server/routes.go
2024/05/07 16:40:23 INFO Ingested document filename=docs.go count=5 absolute_path=/home/thklein/git/github.com/gptscript-ai/knowledge/pkg/docs/docs.go
Ingested 38 files from "pkg/" into dataset "knowledge-code"
Here we go, we ingested 38 files into our new dataset.
Next, let's ask the knowledge base how we can ingest data using the CLI:
$ knowledge retrieve -d knowledge-code "How can I ingest data into a dataset using the CLI?" Retrieved the following 5 sources for the query "How can I ingest data into a dataset using the CLI?" from dataset "knowledge-code": [{"content":" responses: \"200\": description: OK schema: $ref: '#/definitions/types.IngestResponse' summary: Ingest content into a dataset tags: - datasets /datasets/{id}/retrieve: post: consumes: - application/json description: Retrieve content from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string produces: - application/json responses: \"200\": description: OK schema: items: $ref: '#/definitions/vectorstore.Document' type: array summary: Retrieve content from a dataset tags: - datasets /datasets/create: post: consumes: - application/json description: Create a new dataset parameters: - description: Dataset object in: body name: dataset required: true schema: $ref: '#/definitions/types.Dataset' produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/types.Dataset' summary: Create a new dataset tags: - datasetsswagger: \"2.0\"","metadata":{"filename":"swagger.yaml"},"similarity_score":0.78012925},{"content":"package cmdimport (\t\"fmt\"\t\"github.com/gptscript-ai/knowledge/pkg/client\"\t\"github.com/spf13/cobra\"\t\"strings\")type ClientIngest struct {\tClient\tDataset string `usage:\"Target Dataset ID\" short:\"d\" default:\"default\" env:\"KNOW_TARGET_DATASET\"`\tIgnoreExtensions string `usage:\"Comma-separated list of file extensions to ignore\" env:\"KNOW_INGEST_IGNORE_EXTENSIONS\"`\tConcurrency int `usage:\"Number of concurrent ingestion processes\" short:\"c\" default:\"10\" env:\"KNOW_INGEST_CONCURRENCY\"`\tRecursive bool `usage:\"Recursively ingest directories\" short:\"r\" default:\"false\" env:\"KNOW_INGEST_RECURSIVE\"`}func (s *ClientIngest) Customize(cmd *cobra.Command) {\tcmd.Use = \"ingest [--dataset \u003cdataset-id\u003e] \u003cpath\u003e\"\tcmd.Short = \"Ingest a file/directory into a dataset (non-recursive)\"\tcmd.Args = cobra.ExactArgs(1)}func (s *ClientIngest) Run(cmd *cobra.Command, args []string) error {\tc, err := s.getClient()\tif err != nil {\t\treturn err\t}\tdatasetID := s.Dataset\tfilePath := args[0]\tingestOpts := \u0026client.IngestPathsOpts{\t\tIgnoreExtensions: strings.Split(s.IgnoreExtensions, \",\"),\t\tConcurrency: s.Concurrency,\t\tRecursive: s.Recursive,\t}\tfilesIngested, err := c.IngestPaths(cmd.Context(), datasetID, ingestOpts, filePath)\tif err != nil {\t\treturn err\t}\tfmt.Printf(\"Ingested %d files from %q into dataset %q\\", filesIngested, filePath, datasetID)\treturn nil}","metadata":{"filename":"ingest.go"},"similarity_score":0.77108264},{"content":"\", \"consumes\": [ \"application/json\" ], \"produces\": [ \"application/json\" ], \"tags\": [ \"datasets\" ], \"summary\": \"Retrieve content from a dataset\", \"parameters\": [ { \"type\": \"string\", \"description\": \"Dataset ID\", \"name\": \"id\", \"in\": \"path\", \"required\": true } ], \"responses\": { \"200\": { \"description\": \"OK\", \"schema\": { \"type\": \"array\", \"items\": { \"$ref\": \"#/definitions/vectorstore.Document\" } } } } } } }, \"definitions\": { \"gin.H\": { \"type\": \"object\", \"additionalProperties\": {} }, \"types.Dataset\": { \"type\": \"object\", \"required\": [ \"id\" ], \"properties\": { \"embed_dim\": { \"type\": \"integer\", \"default\": 1536, \"example\": 1536 }, \"id\": { \"description\": \"Dataset ID - must be a valid RFC 1123 hostname\", 
\"type\": \"string\", \"format\": \"hostname_rfc1123\", \"example\": \"asst-12345\" } } }, \"types.IngestResponse\": { \"type\": \"object\", \"properties\": { \"documents\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } } } }, \"vectorstore.Document\": { \"type\": \"object\", \"properties\": { \"content\": { \"type\": \"string\" }, \"metadata\": { \"type\": \"object\", \"additionalProperties\": {} }, \"similarity_score\": { \"type\": \"number\" } } } }}","metadata":{"filename":"swagger.json"},"similarity_score":0.75255966},{"content":" required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/types.Dataset' summary: Get a dataset tags: - datasets /datasets/{id}/documents/{doc_id}: delete: consumes: - application/json description: Remove a document from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string - description: Document ID in: path name: doc_id required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/gin.H' summary: Remove a document from a dataset tags: - datasets /datasets/{id}/files/{file_id}: delete: consumes: - application/json description: Remove a file from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string - description: File ID in: path name: file_id required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/gin.H' summary: Remove a file from a dataset tags: - datasets /datasets/{id}/ingest: post: consumes: - application/json description: Ingest content into a dataset by ID parameters: - description: Dataset ID in: path name: id required: true type: string produces: - application/json responses: \"200\": description: OK schema: $ref: '#/definitions/types.IngestResponse' summary: Ingest content into a dataset tags: - datasets /datasets/{id}/retrieve: post: consumes: - application/json description: Retrieve content from a dataset by ID parameters: - description: Dataset ID in: path name: id required: true ","metadata":{"filename":"swagger.yaml"},"similarity_score":0.7509696},{"content":"package clientimport (\t\"context\"\t\"fmt\"\t\"github.com/acorn-io/z\"\t\"github.com/gptscript-ai/knowledge/pkg/datastore\"\t\"github.com/gptscript-ai/knowledge/pkg/index\"\t\"github.com/gptscript-ai/knowledge/pkg/types\"\t\"github.com/gptscript-ai/knowledge/pkg/vectorstore\"\t\"os\"\t\"path/filepath\")type StandaloneClient struct {\t*datastore.Datastore}func NewStandaloneClient(ds*datastore.Datastore) (*StandaloneClient, error) {\treturn \u0026StandaloneClient{\t\tDatastore: ds,\t}, nil}func (c*StandaloneClient) CreateDataset(ctx context.Context, datasetID string) (types.Dataset, error) {\tds := types.Dataset{\t\tID: datasetID,\t\tEmbedDimension: nil,\t}\terr := c.Datastore.NewDataset(ctx, ds)\tif err != nil {\t\treturn ds, err\t}\treturn ds, nil}func (c *StandaloneClient) DeleteDataset(ctx context.Context, datasetID string) error {\treturn c.Datastore.DeleteDataset(ctx, datasetID)}func (c*StandaloneClient) GetDataset(ctx context.Context, datasetID string) (*index.Dataset, error) {\treturn c.Datastore.GetDataset(ctx, datasetID)}func (c*StandaloneClient) ListDatasets(ctx context.Context) ([]types.Dataset, error) {\treturn c.Datastore.ListDatasets(ctx)}func (c *StandaloneClient) Ingest(ctx context.Context, datasetID string, data []byte, opts datastore.IngestOpts) ([]string, error) {\treturn 
c.Datastore.Ingest(ctx, datasetID, data, opts)}func (c*StandaloneClient) IngestPaths(ctx context.Context, datasetID string, opts *IngestPathsOpts, paths ...string) (int, error) {\tingestFile := func(path string) error {\t\t// Gather metadata\t\tfinfo, err := os.Stat(path)\t\tif err != nil {\t\t\treturn fmt.Errorf(\"failed to stat file %s: %w\", path, err)\t\t}\t\tabspath, err := filepath.Abs(path)\t\tif err != nil {\t\t\treturn fmt.Errorf(\"failed to get absolute path for %s: %w\", path, err)\t\t}\t\tfile, err := os.ReadFile(path)\t\tif err != nil {\t\t\treturn fmt.Errorf(\"failed","metadata":{"filename":"standalone.go"},"similarity_score":0.7508112}]
As we can see, the top 5 sources found by the similarity search are returned as a JSON-formatted list, including metadata and the respective similarity scores.
Finally, let's wrap this in a GPTScript to see what the LLM replies based on the returned sources.
4.1 Create a basic GPTScript - we're bundling the question right into it and saving it as demo.gpt:
tools: retrieve

As per the dataset "knowledge-code", how can I ingest all files in the "pkg/" directory into a dataset using the CLI?

---
name: retrieve
description: Retrieve sources for an input query from datasets using the knowledge tool
args: dataset: The ID of the dataset to retrieve information from
args: prompt: Query string for the similarity search

#!knowledge retrieve -d ${dataset} ${prompt}
4.2 Now run it:
$ gptscript demo.gpt
16:42:13 started [main]
16:42:13 sent [main]
16:42:13 started [retrieve(2)] [input={"dataset":"knowledge-code","prompt":"ingest files in directory CLI command"}]
16:42:13 sent [retrieve(2)]
16:42:14 ended [retrieve(2)] [output=Retrieved the following 5 sources for the query "ingest files in directory CLI command" from dataset "knowledge-code": [{"content":"package cmd"" # ... truncated (same as above) ... "metadata":{"filename":"swagger.json"},"similarity_score":0.72029245}] ]
16:42:14 continue [main]
16:42:14 sent [main]
content [1] content | Waiting for model response...
content [1] content | To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command:
content [1] content |
content [1] content | \`\`\`
content [1] content | ingest [--dataset <dataset-id>] <path>
content [1] content | \`\`\`
content [1] content |
content [1] content | This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.
16:42:20 ended [main] [output=To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command: \`\`\` ingest [--dataset <dataset-id>] <path> \`\`\` This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.]

OUTPUT:
To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command:

\`\`\`
ingest [--dataset <dataset-id>] <path>
\`\`\`

This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.
The effective output of this is the following, which is quite a bit more helpful than the JSON blob before:
To ingest all files in the "pkg/" directory and subfolders into a dataset using the CLI, you can use the following command:
ingest [--dataset <dataset-id>] <path>
This command allows you to ingest a file or directory into a dataset. If you want to recursively ingest directories, ensure the recursive flag is set to true in your configuration or command line options.
You can, of course, also include all the other steps we did above in a GPTScript. We have an example ready in the GitHub repository:
tools: create_dataset, sys.find, ingest, retrieve, delete_dataset, get_dataset, list_datasets, uuidgen

Create a new Knowledge Base Dataset with a random unique ID, if it doesn't exist yet.
Then, ingest the directory pkg/ into the dataset.
Then, retrieve from the knowledge base how I can ingest the current working directory into a dataset from the CLI.

---
name: create_dataset
description: Create a new Dataset in the Knowledge Base
args: id: ID of the Dataset

#!knowledge create-dataset ${id}

---
name: ingest
description: Ingest a file or all files from a directory into a Knowledge Base Dataset
args: id: ID of the Dataset
args: filepath: Path to the file or directory to be ingested

#!knowledge ingest -d ${id} -r ${filepath}

---
name: retrieve
description: Retrieve information from a Knowledge Base Dataset
args: id: ID of the Dataset
args: query: Query to be executed against the Knowledge Base Dataset

#!knowledge retrieve -k 10 -d ${id} ${query}

---
name: delete_dataset
description: Delete a Dataset from the Knowledge Base
args: id: ID of the Dataset

#!knowledge delete-dataset ${id}

---
name: get_dataset
description: Get a Dataset from the Knowledge Base
args: id: ID of the Dataset

#!knowledge get-dataset ${id}

---
name: list_datasets
description: List all Datasets in the Knowledge Base

#!knowledge list-datasets

---
name: uuidgen
description: Generate a random UUID

#!uuidgen
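Assuming you save that script as, say, flow.gpt (the filename is just a placeholder), you run it exactly like the earlier demo, and GPTScript will call the create_dataset, ingest, and retrieve tools as needed:

gptscript flow.gpt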
You can also run the Knowledge tool as a REST API using the knowledge server command. Visit localhost:8000/v1/docs for the OpenAPI v2 / Swagger documentation. You can use the same CLI commands as above against a running server simply by setting the environment variable KNOW_SERVER_URL=http://localhost:8000/v1.
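For example - a rough sketch assuming the default address shown above - you can start the server in one terminal and talk to it from another, either via the regular CLI commands or directly against the documented REST endpoints:

# terminal 1: start the REST API (OpenAPI docs at http://localhost:8000/v1/docs)
knowledge server

# terminal 2: run the usual CLI commands against the server instead of the embedded databases
export KNOW_SERVER_URL=http://localhost:8000/v1
knowledge ingest -d knowledge-code -r pkg/

# or hit an endpoint directly, e.g. creating a dataset
# (request body per the types.Dataset schema from the Swagger excerpt above)
curl -X POST http://localhost:8000/v1/datasets/create \
  -H 'Content-Type: application/json' \
  -d '{"id": "knowledge-code"}'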
Use knowledge get-dataset to inspect a single dataset, or knowledge list-datasets to see every dataset you've created.
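For example, to look at the dataset we created earlier and to list all datasets on this machine:

knowledge get-dataset knowledge-code
knowledge list-datasets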
Here are some of the important configuration options (each can be set via CLI flag or environment variable) that you've already seen throughout this post:
- --dataset / -d (KNOW_TARGET_DATASET): the target dataset ID for ingestion and retrieval
- --recursive / -r (KNOW_INGEST_RECURSIVE): recursively ingest directories
- --ignore-extensions (KNOW_INGEST_IGNORE_EXTENSIONS): comma-separated list of file extensions to skip during ingestion
- --concurrency / -c (KNOW_INGEST_CONCURRENCY): number of concurrent ingestion processes
- -k: number of sources to return from a retrieval query
- KNOW_SERVER_URL: point the CLI at a running knowledge server instead of the embedded standalone mode
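As a small sketch (based on the env tags visible in the ingest command's source earlier in this post), the ingestion example from above could also be configured entirely via environment variables:

# equivalent to: knowledge ingest -d knowledge-code -r --ignore-extensions ".pdf" -c 5 pkg/
export KNOW_TARGET_DATASET=knowledge-code
export KNOW_INGEST_RECURSIVE=true
export KNOW_INGEST_IGNORE_EXTENSIONS=.pdf
export KNOW_INGEST_CONCURRENCY=5
knowledge ingest pkg/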
As mentioned, you can try out the GPTScript knowledge tool by downloading the CLI here.
The knowledge tool is under active development, so expect changes and new features in the future, such as the pre- and post-processing improvements (document summaries, re-ranking) mentioned earlier.
So stay tuned to our blog and our YouTube channel for more exciting developments.