Introducing function-calling-test-suite
Function calling is the fundamental feature that powers our flagship project, GPTScript. This makes an LLM’s ability to call functions the primary consideration when determining its suitability as a drop-in replacement for OpenAI’s gpt-4o (the current default model used by GPTScript). To quantify this ability, we decided to sink some time into building out function-calling-test-suite (FCTS), a shiny new test framework!
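If you haven’t worked with function calling directly, here’s a rough sketch of the round trip using OpenAI’s Python SDK. The get_current_weather tool below is a made-up stand-in (it isn’t part of GPTScript or FCTS); the point is the shape of the exchange: the caller advertises tools, the model replies with a tool call instead of prose, and the caller executes the function and feeds the result back for a final answer.

```python
from openai import OpenAI

client = OpenAI()

# A hypothetical tool definition; GPTScript and FCTS define their own tools.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13", messages=messages, tools=tools
)

# Instead of answering directly, the model should ask us to call the tool.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)

# The caller runs the function and sends the result back for a final answer.
messages.append(response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": '{"temperature_c": 21}',
})
final = client.chat.completions.create(
    model="gpt-4o-2024-05-13", messages=messages, tools=tools
)
print(final.choices[0].message.content)
```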
We’ll drop another blog post that delves into the specifics of the design shortly, but for now, here’s a breakdown of the framework’s key features:
- Simple YAML specs to describe test cases (sketched below)
- Optionally use gpt-4o to judge test results
- Configurable test run count (i.e. run each test N times to detect non-deterministic model responses)
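To make the first bullet a bit more concrete, here’s a minimal sketch of what a YAML test case might look like, loaded with PyYAML. The field names (name, prompt, tools, expect) are hypothetical and chosen only to convey the flavor; the real spec schema lives in the FCTS repo.

```python
import yaml  # pip install pyyaml

# A hypothetical spec, purely illustrative -- the real FCTS schema may differ.
SPEC = """
name: add-two-numbers
prompt: Add 1 and 2, then tell me the result.
tools:
  - name: add
    description: Adds two integers
    parameters:
      type: object
      properties:
        a: {type: integer}
        b: {type: integer}
expect: The final answer states that the result is 3.
"""

spec = yaml.safe_load(SPEC)
print(spec["name"], "->", spec["expect"])
```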
Now that introductions are out of the way, here’s what we’ve found with FCTS so far:
Rankings
We tested six major models with function calling support across four platforms. To account for the non-deterministic nature of generative models, we ran every test case 10 times per model, then ranked the models by overall pass rate.
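In code, that methodology boils down to something like the sketch below. The run_test_case helper is a placeholder for a real FCTS run (here it’s a weighted coin flip so the snippet executes end to end), and the spec count and model names are made up; the aggregation is the part that mirrors what we actually did.

```python
import random

def run_test_case(model: str, spec: dict) -> bool:
    """Placeholder for a real FCTS run; returns True when the model passes.
    A weighted coin flip here, just so the sketch runs end to end."""
    return random.random() < 0.9

def overall_pass_rate(model: str, specs: list[dict], runs: int = 10) -> float:
    """Run every spec `runs` times and return the fraction of passing runs."""
    passed = sum(run_test_case(model, s) for s in specs for _ in range(runs))
    return passed / (len(specs) * runs)

specs = [{"name": f"case-{i}"} for i in range(5)]  # placeholder specs
models = ["model-a", "model-b", "model-c"]         # placeholder model names

rates = {m: overall_pass_rate(m, specs) for m in models}
for rank, (model, rate) in enumerate(
    sorted(rates.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank}. {model}: {rate:.2%}")
```

Running the real suite this way produced the following rankings: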
Rank | Pass Rate | Model | Platform |
---|---|---|---|
1 | 98.24% | gpt-4o-2024-05-13 | OpenAI |
2 | 94.71% | gpt-4-turbo-2024-04-09 | OpenAI |
3 | 87.65% | claude-3-5-sonnet-20240620 | Anthropic |
4 | 72.94% | claude-3-opus-20240229 | Anthropic |
5 | 51.18% | mistral-large-2402 | La Plateforme (Mistral AI) |
6 | 48.82% | gemini-1.5-pro | Vertex AI (Google) |
A Quick Litmus Test
As mentioned earlier, GPTScript uses gpt-4o — which referenced gpt-4o-2024-05-13 at the time these rankings were compiled — by default, so we were already confident in its ability to satisfy our use cases. But to get a rough idea of how well these results stack up to reality, we also ran GPTScript on a selection of example scripts and recorded the pass rate for each model.
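Concretely, the litmus test amounted to pointing GPTScript at each model and judging the output of every example script. A rough harness for that might look like the sketch below. The --default-model flag and the provider shim reference reflect how GPTScript documented alternative model providers at the time of writing and may differ for your version, and the example list is abbreviated, so treat this as an illustration rather than the exact commands we ran.

```python
import subprocess

# Abbreviated example list; the full set is in the table below.
EXAMPLES = ["examples/bob.gpt", "examples/echo.gpt", "examples/helloworld.gpt"]

# Illustrative model/provider reference; check the GPTScript docs for the
# exact --default-model syntax and provider tools for your version.
MODEL = "claude-3-5-sonnet-20240620 from github.com/gptscript-ai/claude3-anthropic-provider"

for script in EXAMPLES:
    result = subprocess.run(
        ["gptscript", f"--default-model={MODEL}", script],
        capture_output=True,
        text=True,
    )
    # A human (or an LLM judge) still decides whether the output counts as a pass.
    print(f"{script}: exit code {result.returncode}")
    print(result.stdout)
```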
Example | gpt-4o-2024-05-13 | gpt-4-turbo-2024-04-09 | claude-3-5-sonnet-20240620 | claude-3-opus-20240229 | mistral-large-2402 | gemini-1.5-pro |
---|---|---|---|---|---|---|
bob-as-shell.gpt | pass | pass | pass | pass | pass | pass |
bob.gpt | pass | pass | pass | pass | pass | pass |
echo.gpt | pass | pass | pass | pass | pass | pass |
fac.gpt | pass | pass | pass | pass | pass | fail |
helloworld.gpt | pass | pass | pass | pass | pass | pass |
describe-code.gpt | pass | pass | fail | fail | fail | fail |
add-go-mod-dep.gpt | pass | pass | fail | fail | fail | fail |
hacker-news-headlines.gpt | pass | pass | pass | fail | fail | fail |
search.gpt | pass | pass | pass | pass | fail | pass |
json-notebook | pass | pass | pass | fail | fail | fail |
sqlite-download.gpt | pass | pass | pass | pass | fail | fail |
syntax-from-code.gpt | pass | pass | pass | pass | fail | pass |
git-commit.gpt | pass | pass | pass | fail | pass | fail |
sentiments.gpt | pass | pass | pass | fail | pass | fail |
Rank | Example Pass Rate | FCTS Pass Rate | Model |
---|---|---|---|
1 | 100% | 98.24% | gpt-4o-2024-05-13 |
2 | 100% | 94.71% | gpt-4-turbo-2024-04-09 |
3 | 85.71% | 87.65% | claude-3-5-sonnet-20240620 |
4 | 57.14% | 72.94% | claude-3-opus-20240229 |
5 | 50.00% | 51.18% | mistral-large-2402 |
6 | 42.86% | 48.82% | gemini-1.5-pro |
With the exception of claude-3-opus-20240229, which differs by ~16%, the example pass rates are within 6% of the FCTS pass rates. Although this isn’t exactly an apples-to-apples comparison, we feel the congruence is enough to warrant some confidence that FCTS is a reasonable approximation of a model’s potential performance with GPTScript.
Huzzah!
Now that we’ve convinced ourselves that our results pass muster, let’s take a closer look at the test cases.
Test Case Overview
The initial test suite spans six categories and contains a relatively small number of test cases, but we feel they cover a wide mix of typical use cases without being too overwhelming.
Category | Description |
---|---|
basic | Tests that a model can make the most basic function calls |
sequenced | Tests that a model can make function calls in a specific order |
chained | Tests that a model can pass the result of a function call to another function |
grouped | Tests that a model can identify and make groups of function calls |
semantic | Tests that a model can infer and make the correct function calls given natural language prompts and descriptions |
gptscript | Tests that a model can perform more complex tasks found in GPTScript’s example scripts. |
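To make the chained and grouped categories a bit more concrete, here’s a tiny hypothetical example (these tools aren’t actual FCTS test tools): a grouped test expects the model to notice that two independent calls can be batched, and a chained test expects it to feed their results into a third call before answering.

```python
# Hypothetical tools, purely to illustrate the category definitions above.
def get_first_name() -> str:
    return "Ada"

def get_last_name() -> str:
    return "Lovelace"

def make_greeting(first: str, last: str) -> str:
    return f"Hello, {first} {last}!"

# A "grouped" test expects the model to batch the two independent lookups;
# a "chained" test expects it to pass those results into make_greeting and
# then relay the final string to the user.
print(make_greeting(get_first_name(), get_last_name()))
```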
Test ID | Description | Categories |
---|---|---|
01_basic.yaml-0 | Asserts that the model can make a function call with a given argument and conveys the result to the user | basic |
01_basic.yaml-1 | Asserts that the model can make a function call with an ordered set of arguments and conveys the result to the user | basic |
03_sequenced.yaml-0 | Asserts that the model can make a sequence of function calls in the correct order and conveys the results to the user | sequenced |
03_sequenced.yaml-1 | Asserts that the model can make a mix of ordered and unordered function calls and conveys the results to the user | sequenced
05_chained.yaml-0 | Asserts that the model can use the result of a function call as the argument for a specified function and conveys the result to the user | chained |
05_chained.yaml-1 | Asserts that the model can use the results of a group of function calls as arguments for a single function call and conveys the result to the user | chained, grouped |
05_chained.yaml-2 | Asserts that the model can use the results of a group of function calls as arguments for successive groups of function calls and conveys the result to the user | chained, grouped |
07_semantic.yaml-0 | Asserts that the model can derive and make a function call with one argument from a prompt and conveys the result to the user | semantic, basic |
07_semantic.yaml-1 | Asserts that the model can derive and make a function call with two arguments from a prompt and conveys the result to the user | semantic, basic |
07_semantic.yaml-2 | Asserts that the model can derive and make an ordered sequence of function calls from a prompt and conveys the results to the user | sequenced, semantic |
07_semantic.yaml-3 | Asserts that the model can derive and make two function calls from the prompt, using the result of the first call as the argument for the second, and convey the result to the user | semantic, chained |
07_semantic.yaml-4 | Asserts that the model can derive and make a series of function calls from a prompt, where the results of an initial group of calls are used as arguments for a final function call, and conveys the result to the user | semantic, chained
07_semantic.yaml-5 | Asserts that the model can interpret and execute a complex series of chained steps related to creating a database and creating entries in it. | semantic, chained |
07_semantic.yaml-6 | Asserts that the model can parse a comma delimited list from one function, pass each entry to a second function, and send the gathered results of those calls to a third function. | chained, semantic, grouped |
07_semantic.yaml-7 | Asserts that the model can parse a large csv style response and make a series of chained calls for each row in the csv | semantic, chained |
07_semantic.yaml-8 | Asserts that the model can parse and transform user input based on the instructions in its system prompt. | sequenced, gptscript, semantic, chained |
07_semantic.yaml-9 | Asserts that the model can build a chain of grouped function calls. | sequenced, semantic, chained, grouped, gptscript
Note: Test ID refers to the spec file name and YAML stream index that a given spec originated from. There are “gaps” in the indices above because we’ve elided the nascent negative test category from our analysis. We did this because we’re not fully confident the category is meaningful yet. The full spec files for the entire suite, including negatives, are available for review in the FCTS repo.
Comparing Performance
Plotting the number of passed runs for each test case as a heat map makes the major differences between models stand out.
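The heat map itself isn’t reproduced here, but it’s easy to generate one from the raw pass counts. Below is a minimal matplotlib sketch over a small slice of the data: three models and three test cases, with pass counts back-filled from the per-test fail rates reported later in this post.

```python
import matplotlib.pyplot as plt
import numpy as np

# A small slice of the results: passed runs (out of 10) per test case, per model.
models = ["gpt-4o-2024-05-13", "gpt-4-turbo-2024-04-09", "claude-3-5-sonnet-20240620"]
tests = ["01_basic.yaml-0", "05_chained.yaml-1", "07_semantic.yaml-6"]
passes = np.array([
    [10, 10, 10],  # gpt-4o
    [10, 10,  2],  # gpt-4-turbo
    [10,  0,  0],  # claude-3.5-sonnet
])

fig, ax = plt.subplots()
im = ax.imshow(passes, cmap="RdYlGn", vmin=0, vmax=10)
ax.set_xticks(range(len(tests)))
ax.set_xticklabels(tests, rotation=45, ha="right")
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
fig.colorbar(im, ax=ax, label="passed runs (out of 10)")
fig.tight_layout()
plt.show()
```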
Here we can see that the gulf in performance between OpenAI and the other providers is mostly caused by failing chained and semantic test cases. Interestingly, with the exception of claude-3.5-sonnet, the non-OpenAI providers fail largely the same chained and semantic test cases across the board (05_chained.yaml-0, 05_chained.yaml-1, 07_semantic.yaml-6, and 07_semantic.yaml-8). These failures represent a whopping ~66% and 20% of the total test runs in their respective categories!
But to compare the deficits of each model in any greater fidelity, we’ll need to understand why they failed on a test-by-test basis.
gpt-4o-2024-05-13
Test ID | Fail Rate | Failure Pathology |
---|---|---|
07_semantic.yaml-4 | 30% | – Fails to properly chain groups of function calls – Hallucinates function arguments |
gpt-4-turbo-2024-04-09
Test ID | Fail Rate | Failure Pathology |
---|---|---|
07_semantic.yaml-6 | 80% | – Returns an incorrect argument after a large number of function calls |
07_semantic.yaml-9 | 10% | – Makes an unnecessary duplicate function call |
claude-3-5-sonnet-20240620
Test ID | Fail Rate | Failure Pathology |
---|---|---|
05_chained.yaml-1 | 100% | – Chains correctly – Final answer enumerates the chain of function calls invoked instead of the final evaluated result |
05_chained.yaml-2 | 10% | – Chains correctly – Final answer enumerates the chain of function calls invoked instead of the final evaluated result |
07_semantic.yaml-6 | 100% | – Halts after the first call – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
claude-3-opus-20240229
Test ID | Fail Rate | Failure Pathology |
---|---|---|
05_chained.yaml-0 | 60% | – Makes chained calls in parallel – Passes a “place holder” instead of a “real” argument |
05_chained.yaml-1 | 100% | – Chains correctly – Final answer enumerates the chain of function calls invoked instead of the final evaluated result |
05_chained.yaml-2 | 100% | – Makes chained calls in parallel – Passes a “place holder” instead of a “real” argument |
07_semantic.yaml-6 | 100% | – Halts after the first call – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-8 | 100% | – Halts without making any calls – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
mistral-large-2402
Test ID | Fail Rate | Failure Pathology |
---|---|---|
05_chained.yaml-1 | 100% | – Halts without making any calls – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
05_chained.yaml-2 | 100% | – Makes chained calls in parallel – Passes a “place holder” instead of a “real” argument |
07_semantic.yaml-2 | 30% | – Halts after the first call – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-4 | 100% | – Halts after the first call – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-5 | 100% | – Halts without making any calls – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-6 | 100% | – Makes chained calls in parallel – Passes a “place holder” instead of a “real” argument |
07_semantic.yaml-7 | 100% | – Makes chained calls in parallel – Hallucinates arguments instead of using the results of the initial call |
07_semantic.yaml-8 | 100% | – Halts after the first call – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-9 | 100% | – Halts after the first call – Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
gemini-1.5-pro
Test ID | Fail Rate | Failure Pathology |
---|---|---|
01_basic.yaml-1 | 10% | – Makes the correct tool call – Returns the raw JSON of what looks like the internal “google representation” of the call result |
05_chained.yaml-0 | 100% | – Fails to derive chain call order – Passes “unknown” literal as argument |
05_chained.yaml-1 | 100% | – Fails to derive chain call order – Passes given arguments to the wrong function |
05_chained.yaml-2 | 100% | – Makes no function calls – Returns a 500 error |
07_semantic.yaml-2 | 100% | – Chains correctly – Final answer doesn’t contain the chain’s result |
07_semantic.yaml-3 | 60% | – Chains correctly – Final answer is missing required information |
07_semantic.yaml-4 | 100% | – Chains correctly – Final answer is missing required information |
07_semantic.yaml-6 | 100% | – Returns an incorrect argument after a large number of function calls |
07_semantic.yaml-8 | 100% | – Fails to derive the chain order – Hallucinates initial argument |
07_semantic.yaml-9 | 100% | – Begins chain correctly – Adds extra escape characters to new-lines |
Thumbing through the failure pathologies above reveals a few common threads between models:
Premature Halting
claude-3-5-sonnet-20240620, claude-3-opus-20240229, mistral-large-2402, and gemini-1.5-pro all frequently halt before completing their tasks. Instead of executing the plan, they just describe what should be done. For example, claude-3-opus-20240229 stops after the first step in test 07_semantic.yaml-6, while mistral-large-2402 exhibits similar behavior in several tests, like 07_semantic.yaml-4 and 07_semantic.yaml-5. gemini-1.5-pro also halts prematurely, particularly in 05_chained.yaml-1.
Poor Chaining
claude-3-opus-20240229 and mistral-large-2402 tend to make parallel calls when they should be sequential, leading to incorrect results. This problem is evident in tests like 05_chained.yaml-2 and 07_semantic.yaml-6. gemini-1.5-pro also encounters this issue, especially in 05_chained.yaml-0 and 05_chained.yaml-1, failing to derive the correct call order.
Argument Hallucination
Hallucinating function arguments is another prevalent issue. gpt-4o-2024-05-13, mistral-large-2402, and gemini-1.5-pro all exhibit this behavior. In 07_semantic.yaml-4, gpt-4o-2024-05-13 generates arguments that were not part of the original input. Similarly, mistral-large-2402 and gemini-1.5-pro show this issue in tests 07_semantic.yaml-7 and 07_semantic.yaml-8, respectively, making up inputs on the fly.
Potential Confounding
At the moment, one factor that could throw off our results is the use of GPTScript provider shims for model providers that don’t support OpenAI’s Chat Completion API (e.g. claude-3-opus-20240229 and gemini-1.5-pro). While we’re fairly confident in our shims, there’s always the potential for unknown bugs to skew our test results. That said, we’ve tested the shims pretty thoroughly, so we expect confounding from this source to be minimal.
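To make the shim idea concrete, here’s a rough sketch of one piece of the translation involved: converting an OpenAI-style tool definition into the shape Anthropic’s Messages API expects. This is a simplified illustration rather than GPTScript’s actual shim code, and it skips the harder parts (message history, tool call results, streaming, error mapping, and so on).

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Translate one OpenAI-style tool definition into Anthropic's tool shape.

    OpenAI nests the JSON schema under function.parameters, while Anthropic's
    Messages API expects it under input_schema. Real shims also have to map
    messages, tool calls, tool results, streaming chunks, etc.
    """
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }

# A hypothetical OpenAI-style tool, reused here just to exercise the mapping.
openai_tool = {
    "type": "function",
    "function": {
        "name": "add",
        "description": "Adds two integers",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}

print(openai_tool_to_anthropic(openai_tool))
```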
Conclusion
The exercise of building a function calling test framework has been a fruitful one. It’s given us a much deeper grasp of the strengths and weaknesses of the current ecosystem’s top models. It’s also unveiled several real-world takeaways that we’ve already put to use in our other work-streams (e.g. using an LLM to test GPTScript). To us, the results indicate a real gap in performance between OpenAI and the other providers, which supports our initial decision to build GPTScript around OpenAI’s models. They’ve also made it clear which providers are out in front, and that those providers are getting even better (e.g. gpt-4o vs gpt-4-turbo and claude-3.5-sonnet vs claude-3-opus).
If you’ve found this post interesting, you may want to check out the FCTS repo and give it a spin for yourself. Feel free to join our Discord server to chat with us about it too!