OpenAI Function Calling Test Framework

A PHP-based testing framework for validating OpenAI function calling and text responses. It checks whether your AI assistant calls the right function with the right arguments, or replies with an appropriate text message, for a given user input.

🚀 Features

Function Call Testing: Validate that the AI calls the correct functions with expected parameters
Message Response Testing: Verify that the AI provides appropriate text responses when no function calls are needed
AI-Powered Meaning Validation: Uses another AI model to validate if response meanings match expectations
Colorized Console Output: Easy-to-read test results with color-coded success/failure indicators
Flexible Test Filtering: Run all tests or filter by test name patterns
Detailed Error Reporting: Comprehensive output showing what was expected vs what was received
Concurrent Execution: Run tests in parallel batches with the --concurrency (-c) flag for faster results
Repeat Mode: Run tests multiple times with a numeric argument to measure consistency
Token Usage Tracking: Automatic tracking and reporting of input/output/cached token usage per test
Timing Reports: Elapsed time per test, per batch, and overall suite execution time
Multi-Turn Conversation Testing: Full message history with function calls and function outputs in test cases

📁 Project Structure

├── test.php                 # Main test framework and runner
├── cases.json               # Test case definitions (active)
├── cases.default.json       # Default test case template
├── structure.json           # API configuration: model, tools, streaming, etc.
├── structure.default.json   # Default API configuration template
├── settings.ini.php         # API keys and configuration settings (active)
├── settings.ini.default.php # Default settings template
├── README.md                # This documentation
└── LICENSE                  # Project license

⚙️ Configuration

⚠️ Important Setup Notice
Before running the tests, you must copy the default configuration files to their active versions:

Copy settings.ini.default.php → settings.ini.php

Copy cases.default.json → cases.json

Copy structure.default.json → structure.json

Then customize these files with your specific API keys and test configurations.

Note: After copying structure.default.json to structure.json, you may want to remove its vector storage section if your tests don't need it.

settings.ini.php

Configure your API key, models, and system prompt:

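<?php
// settings.ini.php returns a plain PHP array. The closing PHP tag is
// omitted (PSR-12 style) so no stray whitespace is sent as output.
return [
    "api_key" => "your-openai-api-key",             // Your OpenAI API key
    "model_meaning_verification" => "gpt-4.1-mini", // Model used for AI-powered meaning validation
    "model_core" => "o3-mini",
    "system_prompt" => "Your system prompt here..."
];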
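Because the settings file simply returns an array, a runner can load it with PHP's `require`. The sketch below shows one way to load it and fail early on missing keys; the helper name `loadSettings` and the choice of required keys are illustrative assumptions, not code taken from test.php.

```php
<?php
// Hypothetical helper (not taken from test.php): load a settings file that
// returns an array and fail early if a required key is missing or empty.
function loadSettings(string $path, array $requiredKeys): array
{
    if (!is_file($path)) {
        throw new RuntimeException("Settings file not found: $path (copy settings.ini.default.php first)");
    }

    // `require` evaluates the file and yields its `return` value.
    $settings = require $path;

    if (!is_array($settings)) {
        throw new RuntimeException("Settings file must return an array: $path");
    }

    foreach ($requiredKeys as $key) {
        if (!isset($settings[$key]) || $settings[$key] === '') {
            throw new RuntimeException("Missing required setting: $key");
        }
    }

    return $settings;
}

// Usage (key names taken from the sample file above):
// $settings = loadSettings(__DIR__ . '/settings.ini.php', ['api_key', 'model_core']);
```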

structure.json

Defines the OpenAI API request structure including available tools and model configuration.
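The multi-turn message items used in cases.json (function_call, function_call_output, call_id) follow the input item format of OpenAI's Responses API. Assuming structure.json holds the base request body (model, tools, stream flag) and the runner adds each test's messages as the `input` field, assembling one request might look like the sketch below. The `buildRequest` helper and the merge strategy are assumptions for illustration only.

```php
<?php
// Hypothetical sketch: merge the base request from structure.json with one
// test case's conversation. Assumes the Responses API, where the
// conversation history is passed in the "input" field.
function buildRequest(string $structureJson, array $testCase): array
{
    $request = json_decode($structureJson, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($request)) {
        throw new InvalidArgumentException('structure.json must contain a JSON object');
    }

    // The test case's messages become the request input, overwriting any
    // input already present in the base structure.
    $request['input'] = $testCase['messages'];

    return $request;
}

$structure = '{"model": "o3-mini", "tools": [], "stream": false}';
$case = ['name' => 'welcome', 'messages' => [['role' => 'user', 'content' => 'Hi']]];

echo json_encode(buildRequest($structure, $case), JSON_PRETTY_PRINT), PHP_EOL;
```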

📝 Test Case Format

Test cases are defined in cases.json with the following structure (the // comments are annotations only; JSON itself does not allow comments):

{
    "name": "test_name",
    "messages": [
        {
            "role": "user",
            "content": "User input message"
        }
    ],
    "expected_output": {
        "type": "function_call|message",
        "tool": "function_name",           // For function_call type
        "arguments": {"key": "value"},     // Optional, for validating function arguments
        "meaning": "Expected meaning"      // For message type with AI validation
    },
    "ignore": false                       // Optional: skip this test when running all tests
}

expected_output.type

"function_call" — Expect the AI to call a specific tool
"message" — Expect the AI to respond with a text message (no function calls)
["function_call", "message"] — Accept either outcome (array form for multiple valid possibilities)
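One way to implement these rules is to treat the string form as a one-element list and check membership. The sketch also assumes the response counts as "function_call" when any output item is a function call and as "message" otherwise; both helper names are hypothetical.

```php
<?php
// Sketch: treat expected_output.type as a list of acceptable outcomes,
// whether it was written as a string or as an array.
function typeMatches($expectedType, string $actualType): bool
{
    $allowed = is_array($expectedType) ? $expectedType : [$expectedType];
    return in_array($actualType, $allowed, true);
}

// Assumption: a response is a "function_call" if any output item is a
// function call (other items such as reasoning are ignored), else "message".
function classifyOutput(array $outputItems): string
{
    foreach ($outputItems as $item) {
        if (($item['type'] ?? null) === 'function_call') {
            return 'function_call';
        }
    }
    return 'message';
}

var_dump(typeMatches('message', 'message'));                          // bool(true)
var_dump(typeMatches('function_call', 'message'));                    // bool(false)
var_dump(typeMatches(['function_call', 'message'], 'function_call')); // bool(true)
```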

Multi-Turn Conversations

Test cases can include full message histories with function calls and their outputs. Use "type": "message" for user/assistant messages and "type": "function_call" / "type": "function_call_output" for tool interactions:

{
    "name": "multi_turn_example",
    "messages": [
        {
            "type": "message",
            "role": "user",
            "content": "Check my balance"
        },
        {
            "type": "function_call",
            "status": "completed",
            "call_id": "fc_001",
            "name": "check_balance",
            "arguments": "{}"
        },
        {
            "type": "function_call_output",
            "call_id": "fc_001",
            "output": "You have $50 balance."
        }
    ],
    "expected_output": {"type": "message"}
}
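Each function_call_output refers back to its function_call through call_id, so a hand-written history with a mismatched ID is an easy mistake. A pre-flight check like the one below can catch it before any request is sent; `findOrphanOutputs` is a hypothetical helper, not something this README says test.php provides.

```php
<?php
// Hypothetical pre-flight check for multi-turn test cases: every
// function_call_output must reference a call_id introduced by an
// earlier function_call item in the same history.
function findOrphanOutputs(array $messages): array
{
    $seenCallIds = [];
    $orphans = [];

    foreach ($messages as $index => $item) {
        $type = $item['type'] ?? 'message'; // legacy items have no "type"
        $callId = $item['call_id'] ?? '';

        if ($type === 'function_call') {
            $seenCallIds[$callId] = true;
        } elseif ($type === 'function_call_output' && !isset($seenCallIds[$callId])) {
            $orphans[] = $index;
        }
    }

    return $orphans; // indexes of outputs with no matching earlier call
}
```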

🧪 Test Examples

Here are the current test cases included in the framework:

1. Function Call Test - Check Support Price

{
    "name": "support_price",
    "messages": [
        {
            "role": "user",
            "content": "What is paid support price?"
        }
    ],
    "expected_output": {
        "type": "function_call",
        "tool": "support_price"
    }
}

What it tests: Verifies that when a user asks about the support price, the AI calls the support_price function.

2. Function Call Test with Arguments - Password Reminder

{
    "name": "password_reminder",
    "messages": [
        {
            "role": "user", 
            "content": "I forgot my password, my e-mail is remdex@gmail.com"
        }
    ],
    "expected_output": {
        "type": "function_call",
        "tool": "password_reminder",
        "arguments": {"email": "remdex@gmail.com"}
    }
}

What it tests: Ensures the AI calls the password_reminder function with the correct email argument when the user asks for a password reminder.

3. Message Response Test - Welcome

{
    "name": "welcome",
    "messages": [
        {
            "role": "user",
            "content": "Hi"
        }
    ],
    "expected_output": {
        "type": "message"
    }
}

What it tests: Validates that the AI responds with a text message (not a function call) for greeting inputs.

4. Message Response with Meaning Validation - Unrelated Questions

{
    "name": "unrelated",
    "messages": [
        {
            "role": "user",
            "content": "Who is the president of the USA?"
        }
    ],
    "expected_output": {
        "type": "message",
        "meaning": "Answer to user question should not be provided as it is not related to the Live Helper Chat."
    }
}

What it tests: Ensures the AI declines questions unrelated to the Live Helper Chat domain, using AI-powered meaning validation to check that the response conveys the expected meaning.

5. Multi-Turn Conversation Test

{
    "name": "support_price_multi_turn",
    "messages": [
        {
            "type": "message",
            "role": "user",
            "content": "Hi"
        },
        {
            "type": "message",
            "role": "assistant",
            "content": "Hello! How can I help you?"
        },
        {
            "type": "message",
            "role": "user",
            "content": "What is paid support price?"
        }
    ],
    "expected_output": {"type": "function_call", "tool": "support_price"}
}

What it tests: Verifies the AI correctly calls a function after a multi-turn conversation history.

6. Multiple Acceptable Outcomes

{
    "name": "password_reminder_or_message",
    "messages": [
        {
            "role": "user",
            "content": "Can you help me reset my password?"
        }
    ],
    "expected_output": {"type": ["function_call", "message"]}
}

What it tests: Accepts either a function call or a message response as valid — useful when either outcome is acceptable.

🏃‍♂️ Running Tests

Run All Tests

php test.php

Tests with "ignore": true are automatically skipped when running all tests. To run an ignored test, target it by name.

Run Specific Test by Name

php test.php "password_reminder"

This will run all tests containing "password_reminder" in their name. When targeting a specific test by name, the ignore flag is bypassed.
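The selection rule described above (substring match on the name, with "ignore" honored only in a full run) can be sketched as follows. Whether test.php matches case-sensitively is not stated; this sketch assumes it does, and `selectTests` is a hypothetical name. It uses `strpos` rather than `str_contains` to stay compatible with PHP 7.4.

```php
<?php
// Sketch of test selection. With no filter, tests marked "ignore": true are
// skipped; with a filter, every test whose name contains it is selected.
function selectTests(array $cases, ?string $filter): array
{
    $selected = [];
    foreach ($cases as $case) {
        if ($filter === null || $filter === '') {
            if (!empty($case['ignore'])) {
                continue; // skipped in a full run
            }
            $selected[] = $case;
        } elseif (strpos($case['name'], $filter) !== false) {
            $selected[] = $case; // an explicit name filter bypasses "ignore"
        }
    }
    return $selected;
}
```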

Run Tests with Concurrency

php test.php --concurrency 5
php test.php -c 3 "support_price"

Executes tests in parallel batches. The first example runs all tests 5 at a time; the second runs matching tests 3 at a time.
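Running in parallel batches means splitting the run list into groups of N and finishing each group before starting the next. The README does not say which mechanism test.php uses for the concurrent requests (PHP's `curl_multi_*` functions are one common option), so the sketch below covers only the grouping step, which `array_chunk` handles directly.

```php
<?php
// Sketch: split the runs into fixed-size batches. Each batch would then be
// sent concurrently, and the next batch starts after the current one ends.
function makeBatches(array $runs, int $concurrency): array
{
    return array_chunk($runs, max(1, $concurrency));
}

$runs = ['support_price', 'password_reminder', 'welcome', 'unrelated', 'support_price_multi_turn'];
foreach (makeBatches($runs, 3) as $i => $batch) {
    echo 'Batch ' . ($i + 1) . ': ' . implode(', ', $batch) . PHP_EOL;
}
// Batch 1: support_price, password_reminder, welcome
// Batch 2: unrelated, support_price_multi_turn
```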

Repeat Tests Multiple Times

php test.php 10
php test.php 10 --concurrency 5
php test.php "welcome" 3

A numeric argument repeats all (or matching) tests that many times. Useful for measuring consistency and accumulating token stats over multiple runs.
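The invocations shown in this section combine three kinds of argument. A minimal parser that accepts all of them might look like this; the rule that a purely numeric argument is the repeat count and any other bare argument is the name filter is an assumption inferred from the examples, and `parseArgs` is a hypothetical name.

```php
<?php
// Hypothetical argument parser for:
//   php test.php [name_filter] [repeat_count] [--concurrency N | -c N]
// Assumption: a purely numeric argument is the repeat count; any other bare
// argument is the name filter.
function parseArgs(array $argv): array
{
    $options = ['filter' => null, 'repeat' => 1, 'concurrency' => 1];
    $args = array_slice($argv, 1); // drop the script name

    for ($i = 0; $i < count($args); $i++) {
        $arg = $args[$i];
        if ($arg === '--concurrency' || $arg === '-c') {
            // Consume the next argument as the batch size (minimum 1).
            $options['concurrency'] = max(1, (int) ($args[++$i] ?? 1));
        } elseif (preg_match('/^\d+$/', $arg) === 1) {
            $options['repeat'] = (int) $arg;
        } else {
            $options['filter'] = $arg;
        }
    }

    return $options;
}
```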

📊 Test Output

The framework provides detailed, colorized output:

╔══════════════════════════════════════════════════════════════╗
║                  OpenAI Function Calling Test                ║
╚══════════════════════════════════════════════════════════════╝

Running test: support_price
Expected tool: support_price
Expected type: function_call
[✓] support_price - PASS <function_call> (1.23s, 450t (320i/130o))
    → Called tools: support_price

Running test: password_reminder  
Expected tool: password_reminder
Expected type: function_call
[✓] password_reminder - PASS <function_call> (980ms, 520t (380i/140o))
    → Called tools: password_reminder
    → Arguments matched: {"email": "remdex@gmail.com"}

╔══════════════════════════════════════════════════════════════╗
║                        Test Summary                          ║
╚══════════════════════════════════════════════════════════════╝

Total Runs: 2
Passed: 2
Failed: 0
Success Rate: 100.0%

── Timing Report ──
Total time:       2.5s
Avg per test:     1.2s
Avg per run:      2.5s

── Top Token Consumers ──
#    Test Name                                          Runs      Input     Output     Cached      Total
1.   password_reminder                                      1        380        140          0        520
2.   support_price                                          1        320        130          0        450

🎉 All tests passed! 🎉

Output Legend

✓ / ✗ — Pass/Fail indicator
<function_call> / <message> — The matched expected output type
1.23s — Elapsed time for the test
450t (320i/130o) — Token usage: total (input/output). If present, cached tokens shown with c suffix
Timing Report — Shows total, per-test, and per-run averages
Top Token Consumers — Top 5 tests ranked by average total tokens per run
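In the sample output, each total equals input plus output (450 = 320 + 130). That is consistent with OpenAI's usage reporting, where cached tokens are a subset of input tokens rather than an extra amount. The sketch below reproduces the token string and the top-consumers ranking; the position of the cached count inside the parentheses is an assumption, since the legend only says it carries a c suffix.

```php
<?php
// Sketch of the per-test token string. Cached tokens are part of the input
// count, so they are not added to the total. The "/Nc" placement is assumed.
function formatTokens(int $input, int $output, int $cached = 0): string
{
    $parts = "{$input}i/{$output}o";
    if ($cached > 0) {
        $parts .= "/{$cached}c";
    }
    return ($input + $output) . "t ($parts)";
}

// Rank tests by average total tokens per run and keep the top $limit.
// $stats maps test name => ['runs' => int, 'input' => int, 'output' => int].
function topTokenConsumers(array $stats, int $limit = 5): array
{
    $averages = [];
    foreach ($stats as $name => $s) {
        $averages[$name] = ($s['input'] + $s['output']) / max(1, $s['runs']);
    }
    arsort($averages); // highest average first, names kept as keys
    return array_slice($averages, 0, $limit, true);
}

echo formatTokens(320, 130), PHP_EOL; // 450t (320i/130o)
```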

🔍 Validation Types

Function Call Validation

Verifies the correct function is called
Validates function arguments match expected values
Checks that no unexpected function calls occur
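Function call arguments come back from the model as a JSON-encoded string, so they must be decoded before comparison. The sketch below assumes a subset match: every key in expected_output.arguments must be present with an equal value, and extra arguments are tolerated. Whether test.php is stricter is not stated, and `argumentsMatch` is a hypothetical name.

```php
<?php
// Sketch of argument validation (subset match, strict value comparison).
function argumentsMatch(array $expected, string $actualJson): bool
{
    $actual = json_decode($actualJson, true);
    if (!is_array($actual)) {
        return false; // malformed or non-object arguments never match
    }
    foreach ($expected as $key => $value) {
        if (!array_key_exists($key, $actual) || $actual[$key] !== $value) {
            return false;
        }
    }
    return true;
}
```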

Message Response Validation

Ensures AI responds with text messages when appropriate
Validates no function calls are made when not expected
Uses AI-powered meaning validation for semantic correctness

AI-Powered Meaning Validation

For message tests that define a meaning, the framework asks a separate AI model (configured as model_meaning_verification) whether the actual response semantically matches the expected meaning, in the context of the original user question.
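The validation step boils down to a grading prompt and a verdict. The sketch below shows one possible shape for both; the prompt wording, the YES/NO protocol, and the helper names are assumptions for illustration, not the framework's actual prompt. The API call to the verification model is left out.

```php
<?php
// Hypothetical grading prompt for the verification model.
function buildMeaningPrompt(string $question, string $answer, string $expectedMeaning): string
{
    return "User question:\n{$question}\n\n"
        . "Assistant answer:\n{$answer}\n\n"
        . "Expected meaning:\n{$expectedMeaning}\n\n"
        . "Does the assistant answer convey the expected meaning in the context of the question? "
        . "Reply with exactly YES or NO.";
}

// Read the verdict, tolerating surrounding whitespace, punctuation, and case.
function parseVerdict(string $reply): bool
{
    return strtoupper(trim($reply, " \t\n\r\0\x0B.!")) === 'YES';
}
```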

🛠️ Adding New Tests

Open cases.json
Add a new test case following the format above
Run the tests to validate your new case

Example of adding a new test:

{
    "name": "check_account_balance", 
    "messages": [
        {
            "role": "user",
            "content": "What is my account balance?"
        }
    ],
    "expected_output": {
        "type": "function_call",
        "tool": "get_account_balance"
    }
}

Test Case Properties Reference

| Property | Type | Description |
|---|---|---|
| `name` | string | Unique test name (used for filtering) |
| `messages` | array | Array of message objects forming the conversation |
| `expected_output.type` | string or string[] | `"function_call"`, `"message"`, or an array such as `["function_call", "message"]` |
| `expected_output.tool` | string | (function_call only) Expected function name |
| `expected_output.arguments` | object | (function_call only) Expected arguments to match |
| `expected_output.meaning` | string | (message only) Semantic meaning to validate via AI |
| `ignore` | boolean | If `true`, the test is skipped when running all tests |
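The property reference above is enough to check a test case for obvious mistakes before spending API tokens on it. A minimal validator might look like this; `validateCase` is a hypothetical helper, and its messages are illustrative rather than the framework's own error text.

```php
<?php
// Hypothetical validator for one entry in cases.json, based on the property
// reference above. Returns a list of problems (an empty array means valid).
function validateCase(array $case): array
{
    $errors = [];
    $allowedTypes = ['function_call', 'message'];

    if (!isset($case['name']) || !is_string($case['name']) || $case['name'] === '') {
        $errors[] = 'name must be a non-empty string';
    }
    if (!isset($case['messages']) || !is_array($case['messages']) || $case['messages'] === []) {
        $errors[] = 'messages must be a non-empty array';
    }

    $expected = $case['expected_output'] ?? null;
    if (!is_array($expected) || !isset($expected['type'])) {
        $errors[] = 'expected_output.type is required';
        return $errors;
    }

    // Accept both the string form and the array form of "type".
    $types = is_array($expected['type']) ? $expected['type'] : [$expected['type']];
    foreach ($types as $type) {
        if (!in_array($type, $allowedTypes, true)) {
            $errors[] = 'unknown expected_output.type: ' . json_encode($type);
        }
    }

    if (isset($expected['tool']) && !in_array('function_call', $types, true)) {
        $errors[] = 'expected_output.tool only applies to function_call';
    }
    if (isset($expected['meaning']) && !in_array('message', $types, true)) {
        $errors[] = 'expected_output.meaning only applies to message';
    }
    if (isset($case['ignore']) && !is_bool($case['ignore'])) {
        $errors[] = 'ignore must be a boolean';
    }

    return $errors;
}
```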

Message Object Types

| Format | Description |
|---|---|
| `"role": "user"` / `"role": "assistant"` | Simple message (legacy format, no `type` field) |
| `"type": "message", "role": "user"` | User message (explicit format) |
| `"type": "message", "role": "assistant"` | Assistant message (explicit format) |
| `"type": "function_call"` | A function call the AI made earlier in the conversation |
| `"type": "function_call_output"` | The output returned for that function call, linked by `call_id` |

🔧 Error Handling

The framework detects and reports the following errors:

API connection errors
Invalid JSON responses
Missing or malformed test cases
Function call validation failures
Meaning validation errors

📋 Requirements

PHP 7.4 or higher
cURL extension enabled
Valid OpenAI API key
Internet connection for API calls

🤝 Contributing

Add test cases to cases.json
Update function definitions in structure.json if needed
Run tests to ensure everything works
Submit your changes

📄 License

This project is licensed under the terms specified in the LICENSE file.
