OK, the title may be a little exaggerated, but still: in v0.3.0 our little Knowledge tool gained the ability to use various embedding models, or rather embedding model providers. GPTScript is our natural-language approach to programming: you write tools as prompts in English (or your native language), and AI strings them together for execution (see the initial blog post for a refresher).
What that means is that you can now define any number of embedding model providers in a newly created config file, and for every knowledge base (dataset) you create, you can choose any of them.
Important note: you can only use one embedding model per dataset. You cannot ingest one file with model A and another one with model B, as that would screw up the vector space due to different embedding types and vector dimensionalities.
As of v0.3.0 we tested the knowledge tool with the following providers and models (with no judgement of how well each of them works in terms of accuracy and performance):
text-embedding-ada-002
text-embedding-ada-002
CompendiumLabs/bge-large-en-v1.5-gguf
1
VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1
embed-english-v3.0
embed-english-v3.0
bert-cpp-minilm-v6
mistral-embed
all-MiniLM-L6-v2
mxbai-embed-large
I tested the locally running models, especially via LM-Studio and Ollama, on my development laptop with an i7-1260P and 64GB RAM. With Ollama, ingesting a 509-page PDF file took about 5 minutes. This was without enabling parallelism on Ollama, and I bet there are some settings I can tweak, as my laptop was using pretty few resources.
With the command-line flag --embedding-model-provider or the environment variable KNOW_EMBEDDING_MODEL_PROVIDER you choose which provider to use, such as openai.
Now let’s say you also have access to Google Vertex AI and have all the environment variables configured (at least VERTEX_API_KEY and VERTEX_PROJECT_ID). You can then use it like this:
    export VERTEX_API_KEY="my-super-secret-key"
    export VERTEX_PROJECT_ID="my-google-project"
    knowledge ingest -d my-vertex-powered-dataset --embedding-model-provider=vertex path/to/some/files

    # or alternatively
    export KNOW_EMBEDDING_MODEL_PROVIDER="vertex"
    knowledge ingest -d my-vertex-powered-dataset path/to/more/files
Obviously, you can define the environment variables only once per provider. Now you could use dotenv files to configure multiple settings, e.g. two different variations for Vertex, using different projects or models - or different Ollama servers, whatever.
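The dotenv approach could look roughly like the sketch below. The helper name with_env_file and the file names in the usage note are made up for illustration; the only assumption about the dotenv files is that they contain plain KEY="value" lines that a shell can source.

```shell
# Hypothetical helper: run one command with the variables from one dotenv file.
# The function body is a subshell, so nothing leaks into the calling shell.
with_env_file() (
  env_file="$1"
  shift
  set -a          # export every variable assigned while sourcing the file
  . "$env_file"   # use a path containing a slash, e.g. ./vertex-a.env
  set +a
  "$@"            # run the remaining arguments as the command
)
```

A call such as `with_env_file ./vertex-project-a.env knowledge ingest -d dataset-a --embedding-model-provider=vertex path/to/files` would then pick up one Vertex variation, and a second dotenv file would cover the other.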
This can be quite cumbersome… but don’t worry, here comes the YAML (you’ll love it - but you may use JSON as well) config file where you can define as many providers as you want and can give them different names.
Here you can see an example config that defines 3 different provider configs of which two are using the same provider type, which wouldn’t be possible with just environment variables:
    embeddings:
      providers:
        - name: my-vertex # custom name
          type: vertex # one of the provider types as shown in the list further up
          config:
            apiKey: ${SOME_VERTEX_API_KEY} # environment variables will get expanded
            model: text-embedding-004
        - name: ollama-1
          type: ollama
          config:
            model: mxbai-embed-large
        - name: ollama-2
          type: ollama
          config:
            model: nomic-embed-text
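One small gotcha if you generate such a file from a shell script: the ${SOME_VERTEX_API_KEY} placeholder should reach the file literally, so that the knowledge tool expands it rather than your shell. A minimal sketch, assuming a made-up target path under the temp directory:

```shell
# Hypothetical location for the config file; pass it to the tool with -c.
config_file="${TMPDIR:-/tmp}/knowledge-config.yaml"

# The quoted delimiter ('EOF') stops the shell from expanding ${...},
# so the placeholder is written as-is and the secret stays out of the file.
cat > "$config_file" <<'EOF'
embeddings:
  providers:
    - name: my-vertex
      type: vertex
      config:
        apiKey: ${SOME_VERTEX_API_KEY}
        model: text-embedding-004
EOF
```

With an unquoted EOF the shell would substitute the variable while writing, which bakes the key (or an empty string) into the file.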
With this config file we can now reference any provider by our custom name:
knowledge ingest -c /path/to/config.yaml --embedding-model-provider="ollama-1" -d my-ollama-1-dataset path/to/files
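Since every dataset is tied to exactly one embedding model, comparing providers means ingesting the same files into one dataset per provider. Here is a rough sketch of that loop, using only the flags shown above. The function name and the KNOWLEDGE_BIN override are my own additions for illustration, not part of the tool.

```shell
# Hypothetical helper: ingest the same files once per provider,
# each into its own dataset (one embedding model per dataset).
# KNOWLEDGE_BIN is an assumed override for the binary, handy for dry runs.
ingest_per_provider() {
  config="$1"
  files="$2"
  shift 2
  for provider in "$@"; do
    "${KNOWLEDGE_BIN:-knowledge}" ingest -c "$config" \
      --embedding-model-provider="$provider" \
      -d "my-${provider}-dataset" "$files"
  done
}
```

For example, `ingest_per_provider /path/to/config.yaml path/to/files ollama-1 ollama-2` fills my-ollama-1-dataset and my-ollama-2-dataset. Setting KNOWLEDGE_BIN=echo first prints the commands instead of running them.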
You can find more up-to-date information on this new config file and the embedding model providers that we integrated in the knowledge documentation.
With this new setup, it will soon be possible to share knowledge bases / datasets with the embedding model provider information attached. That finally removes a hurdle: before ingesting into an imported dataset, you may have to spend time figuring out exactly which model was used. With the originally used provider and model attached to the dataset, the knowledge tool just has to get your API token (from the environment or by asking you for it) and you’re set up, without any further configuration.
This will make sharing datasets and collaboratively building knowledge bases even easier!