Product Introduction
1. Definition
APIEval-20 is a specialized black-box benchmark designed to evaluate the functional testing capabilities of AI agents and Large Language Models (LLMs). Specifically categorized as an automated test suite generation benchmark, it assesses an agent's ability to interpret a JSON schema and a single sample payload to create effective, bug-detecting test cases without access to source code, documentation, or implementation details.
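For illustration, the entire input for one scenario might look like the sketch below. The schema, field names, and payload values are hypothetical and are not drawn from the actual dataset; they only show the shape of the information an agent has to work with.

```python
# Hypothetical scenario input: a JSON schema plus one valid sample payload.
# Field names and values are illustrative, not taken from APIEval-20 itself.

order_schema = {
    "type": "object",
    "required": ["order_id", "total", "currency"],
    "properties": {
        "order_id": {"type": "string"},
        "total": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "coupon_code": {"type": "string"},
    },
}

sample_payload = {
    "order_id": "ORD-1042",
    "total": 59.90,
    "currency": "USD",
    "coupon_code": "WELCOME10",
}
```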
2. Core Value Proposition
APIEval-20 addresses the critical need for objective evaluation of AI-driven software testing. By utilizing live reference APIs with planted bugs rather than relying on "LLM-as-judge" frameworks, the benchmark provides a ground-truth measurement of an agent's reasoning, edge case discovery, and diagnostic accuracy. It enables developers and researchers to quantify how well an autonomous agent can "think like a QA engineer" in a realistic, information-constrained environment.
Main Features
1. Multi-Tiered Bug Spectrum
The benchmark includes 20 distinct scenarios, each containing between 3 and 8 hidden bugs categorized by complexity (one hypothetical test case per tier is sketched after this list).
- Simple Bugs: Focus on structural integrity, such as missing required fields, incorrect data types, or empty values.
- Moderate Bugs: Test field-level constraints including malformed strings (email, date formats), numeric range violations, and undocumented enum values.
- Complex Bugs: Evaluate semantic reasoning and cross-field relationships, such as mutually exclusive parameters or business logic failures (e.g., applying discounts to ineligible orders).
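To make the tiers concrete, the sketch below shows one hypothetical test case per tier against the illustrative order schema shown earlier. The payloads, expected statuses, and the bugs they would target are assumptions for illustration, not cases from the benchmark.

```python
# Hypothetical test cases, one per bug tier. "expected_status" is what a
# correct implementation should return; a planted bug is triggered when the
# live reference deviates from it.

test_cases = [
    # Simple: structural integrity -- a required field is missing entirely.
    {
        "name": "missing_required_total",
        "payload": {"order_id": "ORD-1042", "currency": "USD"},
        "expected_status": 400,
    },
    # Moderate: field-level constraint -- an undocumented enum value.
    {
        "name": "unknown_currency",
        "payload": {"order_id": "ORD-1042", "total": 59.90, "currency": "XYZ"},
        "expected_status": 400,
    },
    # Complex: business logic -- a coupon applied to an ineligible (zero-total) order.
    {
        "name": "coupon_on_ineligible_order",
        "payload": {
            "order_id": "ORD-1043",
            "total": 0,
            "currency": "USD",
            "coupon_code": "WELCOME10",
        },
        "expected_status": 422,  # assumed rejection code for this rule
    },
]
```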
2. Objective Automated Scoring Harness
Unlike qualitative evaluations, APIEval-20 uses a fully automated execution environment. Every test case generated by an agent is executed against a live reference implementation. A bug is marked as "detected" only if the API response deviates from expected correct behavior (e.g., a 200 OK instead of a 400 Bad Request, or incorrect body values). This eliminates the bias and hallucination risks associated with human or LLM-based scoring.
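A minimal sketch of such a harness is shown below, assuming the hypothetical test-case structure from the previous sketch, a placeholder endpoint URL, and the `requests` library; it is not KushoAI's actual scoring harness, and body-value comparison is omitted for brevity.

```python
import requests

BASE_URL = "https://example.invalid/api/orders"  # placeholder, not a real endpoint

def run_suite(test_cases):
    """Execute each generated test against the live reference implementation
    and record whether the observed response deviates from the expectation."""
    results = []
    for case in test_cases:
        resp = requests.post(BASE_URL, json=case["payload"], timeout=10)
        # A deviation (e.g. 200 OK where a 400 was expected) counts as a
        # triggered bug; a matching response counts as a pass.
        results.append({
            "name": case["name"],
            "triggered": resp.status_code != case["expected_status"],
        })
    return results
```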
3. Comprehensive Performance Metrics
Scoring is calculated through a weighted formula (0.0 to 1.0) across three dimensions (a minimal sketch of the weighting follows this list):
- Bug Detection Rate (70%): The primary metric measuring the percentage of total planted bugs successfully triggered.
- Coverage Score (20%): A composite of Parameter Coverage (fields exercised), Edge Case Coverage (nulls, empty strings, out-of-range values), and Input Variation (using Jaccard similarity to penalize repetitive tests).
- Efficiency Score (10%): A signal-to-noise ratio that penalizes excessively large test suites, rewarding agents that find more bugs with fewer requests.
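As a rough sketch, the weighting described above could be computed as follows. Only the 70/20/10 top-level weights come from the benchmark description; the equal averaging of the three coverage components and the exact form of the Jaccard-based variation term are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets of request features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def input_variation(feature_sets: list) -> float:
    """Higher when tests differ from one another: 1 minus the average
    pairwise Jaccard similarity, so repetitive suites are penalized."""
    pairs = [(a, b) for i, a in enumerate(feature_sets) for b in feature_sets[i + 1:]]
    if not pairs:
        return 1.0
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def final_score(detection_rate, parameter_cov, edge_case_cov, variation, efficiency):
    """Weighted 0.0-1.0 score: 70% detection, 20% coverage, 10% efficiency.
    Equal averaging inside the coverage composite is an assumption."""
    coverage = (parameter_cov + edge_case_cov + variation) / 3
    return 0.7 * detection_rate + 0.2 * coverage + 0.1 * efficiency
```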
4. Diverse Application Domains
The benchmark spans seven industry sectors to ensure versatility, including:
- E-commerce & Payments: Order flows, coupon logic, and currency conversion.
- Authentication & User Management: Token refreshes, session handling, and RBAC (Role-Based Access Control).
- Scheduling & Notifications: Recurring events, availability logic, and communication preferences.
- Search & Filtering: Pagination, sorting, and complex query construction.
Problems Solved
1. Pain Point: Evaluation Subjectivity in AI Agents
Traditional LLM benchmarks often measure text quality or code syntax rather than functional utility. APIEval-20 solves the "evaluator's dilemma" by providing a binary success/failure metric based on real-world API interactions, preventing agents from passing benchmarks through eloquent but ineffective test descriptions.
2. Target Audience
- AI Researchers & Model Developers: Benchmarking the reasoning capabilities of foundation models (GPT, Claude, Llama) in specialized engineering tasks.
- QA Engineering Teams: Evaluating autonomous testing tools to determine their readiness for integration into CI/CD pipelines.
- Enterprise Software Architects: Assessing the reliability of AI agents like Cursor, Devin, or GitHub Copilot in generating high-coverage regression suites.
- Product Managers for AI Tools: Using standardized data to demonstrate the efficacy of their automated testing products.
3. Use Cases
- Autonomous Test Generation: Creating initial regression suites for legacy APIs where documentation is missing or outdated.
- Agent Performance Comparison: Conducting head-to-head comparisons of different AI agents to see which handles complex business logic more effectively.
- Edge Case Discovery: Using AI to surface non-obvious failure modes in microservices before they reach production.
Unique Advantages
1. Differentiation: Black-Box Realism
Most API benchmarks provide the agent with full Swagger/OpenAPI documentation or implementation code. APIEval-20 restricts input to a JSON schema and one payload. This replicates the real-world scenario where developers or testers must work with limited context, forcing the agent to perform genuine semantic inference rather than simple documentation parsing.
2. Key Innovation: Semantic Relationship Testing
APIEval-20 goes beyond schema validation. While tools like Schemathesis check whether a request matches a schema, APIEval-20 tests whether the agent understands what each field represents. By planting bugs that depend on the relationship between multiple fields (e.g., if "method" is "express," then "shipping_date" cannot be null), it evaluates higher-order reasoning about business rules.
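For example, the hypothetical test case below is schema-valid at the type level (every field has an acceptable type), so a pure schema checker would accept it; only a test that encodes the cross-field rule exercises the planted bug. The field names follow the example above, and the expected status code is an assumption.

```python
# Hypothetical cross-field test: "method" == "express" means "shipping_date"
# must not be null. The payload passes type-level schema validation, so only
# a semantically aware test case exercises the rule.
cross_field_case = {
    "name": "express_shipping_without_date",
    "payload": {"order_id": "ORD-1044", "method": "express", "shipping_date": None},
    "expected_status": 400,  # assumed: a correct implementation rejects this
}
```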
Frequently Asked Questions (FAQ)
1. How does APIEval-20 differ from traditional API fuzzing tools?
Traditional fuzzing and contract-testing tools such as Schemathesis and Dredd primarily check schema compliance and structural robustness. APIEval-20 evaluates the intelligence of the test generator, specifically its ability to design targeted tests for complex business logic and semantic errors that random fuzzing often misses.
2. Is APIEval-20 a static dataset or a live evaluation?
The dataset (schemas and payloads) is hosted on Hugging Face, but the evaluation is live. Agents generate test suites, which are then executed against KushoAI’s hosted reference implementations. This "live-reference" model prevents contamination and ensures that agents cannot "cheat" by memorizing static response patterns.
3. Can I use APIEval-20 to test security vulnerabilities?
APIEval-20 currently focuses on functional correctness and business logic. However, KushoAI has announced APIEval-Security as a forthcoming expansion, which will specifically target OWASP API Security Top 10 categories, such as authentication bypass and injection flaws.
4. What constitutes a "Strong" performance on this benchmark?
A score between 0.7 and 1.0 is considered "Strong." This indicates the agent successfully identified bugs across all complexity tiers (simple, moderate, and complex), achieved broad field coverage, and maintained a lean, efficient test suite comparable to what a senior QA engineer would produce.
