Testing Hierarchy
llm-app-test is designed to complement existing approaches. We recommend this testing hierarchy:
1. Behavioral Testing (llm-app-test)
- Fast, cost-effective first line of testing
- Validates IF your LLM application is even working as intended
- Tests core functionality and behavior
- Must pass before proceeding to benchmarking
- Failure indicates fundamental problems with the application
2. Benchmarking and Performance Evaluation
- Much slower and more expensive
- Only run AFTER behavioral tests pass
- Measures HOW WELL the application performs. In our view this blurs into LLM evaluation; it should still be done, just not as the first line of defence against broken apps, given the time and cost required.
- Tests performance metrics and response quality
- Used for optimization and model selection
That said, we plan to build a benchmarking system, targeted for 0.3.0b1, that reports metrics on how well your system complies with its behavioral specifications.
[!IMPORTANT] Always ensure behavioral tests pass before running benchmarks. A failing behavioral test indicates your application is fundamentally broken, and no amount of performance optimization will fix incorrect behavior.
Example Flow:
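# 1. First, run behavioral tests (the assertion is expected to raise on a mismatch)
behavioral_asserter.assert_behavioral_match(result, "Expected behavior description")
behavioral_tests_pass = True  # reached only if the assertion above did not raise

# 2. Only if behavioral tests pass, run benchmarks
if behavioral_tests_pass:
    run_performance_benchmarks()  # placeholder for your own benchmark suite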
This hierarchy ensures:
- Core functionality is correct before optimization
- Clear separation of behavior and performance testing
- Efficient use of compute resources and API calls
- Structured approach to LLM application testing
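As a fuller illustration of the example flow above, here is a minimal, self-contained sketch of the gate. The names `GateReport` and `run_gated`, the stand-in checks, and the latency figure are all hypothetical and not part of llm-app-test. The sketch assumes each behavioral check signals failure by raising `AssertionError`; if your asserter raises a different exception type, catch that instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class GateReport:
    """Outcome of one gated run (hypothetical helper, not part of llm-app-test)."""
    behavioral_failures: List[str] = field(default_factory=list)
    # None means the benchmarks were skipped because a behavioral check failed.
    benchmark_results: Optional[Dict[str, float]] = None

    @property
    def behavioral_tests_pass(self) -> bool:
        return not self.behavioral_failures


def run_gated(
    behavioral_checks: Dict[str, Callable[[], None]],
    run_performance_benchmarks: Callable[[], Dict[str, float]],
) -> GateReport:
    """Run every behavioral check; run benchmarks only if none of them failed."""
    report = GateReport()
    for name, check in behavioral_checks.items():
        try:
            check()  # assumed to raise AssertionError on a behavioral mismatch
        except AssertionError as exc:
            report.behavioral_failures.append(f"{name}: {exc}")
    if report.behavioral_tests_pass:
        report.benchmark_results = run_performance_benchmarks()
    return report


# Stand-in checks. In a real suite each one would call
# behavioral_asserter.assert_behavioral_match(result, "...") on live output.
def passing_check() -> None:
    assert "hello" in "Hello! How can I help?".lower()


def failing_check() -> None:
    assert "refund" in "I cannot help with that.", "reply should mention the refund policy"


def fake_benchmarks() -> Dict[str, float]:
    return {"mean_latency_s": 1.2}  # made-up number, for illustration only


ok = run_gated({"greeting": passing_check}, fake_benchmarks)
broken = run_gated({"greeting": passing_check, "refund": failing_check}, fake_benchmarks)

print(ok.behavioral_tests_pass, ok.benchmark_results)
print(broken.behavioral_failures, broken.benchmark_results)
```

Collecting every behavioral failure before deciding, rather than stopping at the first one, is a design choice: it costs a few extra cheap checks but gives a complete picture of what is broken, while the expensive benchmark call is still skipped entirely.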