Behavioral Testing Reliability - Real-World Industry Specific Testing
⚠️ Important Notice About Semantic Testing
This documentation refers to tests originally named "semantic assertion". We have since deprecated the term "semantic" in favor of "behavioral testing", which more accurately describes what the library does. The underlying implementation and reliability testing remain valid because the core functionality is identical; only the conceptual framing has changed.
Overview
This page documents our extensive testing of llm-app-test against real-world industry use cases. We designed this test suite to validate the library's reliability with the kind of content that LLM applications would probably generate in production environments.
The first test case in this suite initially exhibited behavioral non-determinism. Upon investigation, this was caused by a legitimate semantic boundary in the test case - a situation where the expected and actual outputs sit at the edge of what could be considered equivalent behaviors. As the library's author (a lawyer by training with three years of litigation experience), I had to carefully analyze the semantic boundaries to identify the exact conditions causing the non-determinism. This analysis and its implications are documented in detail here.
The remaining 9 test cases maintained a 100% pass rate across 1,700 runs. For simplicity and clarity, this page focuses on the 1,200 runs in which all 10 cases in the suite achieved a 100% pass rate.
The relevant logs for the testing covered by this page can be found here.
Test Suite Design
The test suite covers 10 key industries where LLMs are likely already seeing active use:
- Healthcare (Patient Education)
- Financial Services (Portfolio Reporting)
- Media/Entertainment (Content Recommendations)
- Legal (Document Summarization)
- Manufacturing (Maintenance Prediction)
- E-commerce (Product Descriptions)
- Education (Assignment Feedback)
- Real Estate (Property Listings)
- Human Resources (Interview Feedback)
- Customer Service (Ticket Response)
The test suite consists of 5 positive cases (the assertion is expected to pass) and 5 negative cases (the assertion is expected to raise SemanticAssertionError).
Testing Scale
- Total Runs: 1,200 (600 on Windows, 600 on Linux (Pop!_OS))
- Tests per Run: 10
- Total Test Executions: 12,000
- Pass Rate: 100%
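The source does not describe how the repeated runs were driven. As a minimal sketch, assuming each run can be wrapped in a function that returns True when all 10 tests pass, a tally loop might look like the following. The name `pass_rate` and the file name in the usage comment are hypothetical.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], runs: int) -> float:
    """Call run_once `runs` times and return the fraction of runs that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if run_once())
    return passed / runs


# Hypothetical usage: one run = one pytest invocation of the suite file.
# import subprocess
# rate = pass_rate(
#     lambda: subprocess.run(["pytest", "test_real_world.py"]).returncode == 0,
#     runs=600,
# )
```

A pytest exit code of 0 means every collected test passed, so a rate of 1.0 over 600 runs on each platform would correspond to the figures listed above.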
Cross-references:
- See Test Configuration for setup details
- See Test Results for detailed analysis
- Full test logs available in reliability_testing
Test Characteristics
Each test was designed to reflect:
- Realistic content length
- Industry-specific terminology
- Common formatting patterns
- Typical validation requirements
- Real-world edge cases
Claude 3.5 Sonnet was used to generate the test cases to make them more realistic, but all testing was done with GPT-4o.
Test Results
Format Compliance
- Zero format violations across 12,000 executions
- Consistent PASS/FAIL behavior
Cross-Platform Reliability
- Identical behavior on Windows and Linux
- No platform-specific issues detected
Content Processing
- Successfully handled varying content lengths
- Maintained accuracy across different domains
- Consistent behavior with specialized terminology
Cost Analysis
Running the test suite demonstrated the following costs:
- Single Run (10 tests): US$0.014
- 100 Runs (1,000 tests): US$1.40
- Repeated runs for reliability testing (12,000 tests): US$16.80
Cost Breakdown:
- Per Test Cost: ~US$0.0014 (US$16.80 / 12,000 tests)
This demonstrates that comprehensive behavioral testing remains economically viable even at scale. The cost per test is minimal considering the confidence gained in library reliability.
Key Cost Insights:
- Linear cost scaling with test volume
- Predictable pricing for planning purposes
- Reasonable expense for production validation
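The linear scaling can be checked with a few lines of arithmetic. This sketch assumes the approximate per-test figure of US$0.0014 reported above stays constant; `estimated_cost` is an illustrative helper, not part of the library, and actual costs will vary with model pricing and content length.

```python
PER_TEST_COST_USD = 0.0014  # approximate per-test cost reported on this page
TESTS_PER_RUN = 10          # size of the test suite


def estimated_cost(runs: int) -> float:
    """Estimate the total cost in USD for `runs` full runs of the suite."""
    return round(runs * TESTS_PER_RUN * PER_TEST_COST_USD, 4)


for runs in (1, 100, 1200):
    tests = runs * TESTS_PER_RUN
    print(f"{runs:>5} runs ({tests:>6} tests): US${estimated_cost(runs):.3f}")
```

The three rows correspond to the single-run, 100-run, and 12,000-test figures listed in this section.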
Test Configuration
All tests used library defaults:
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o
LLM_TEMPERATURE=0.0
LLM_MAX_TOKENS=4096
LLM_MAX_RETRIES=2
LLM_TIMEOUT=10.0 # Added for OpenAI in 0.1.0b5 using the underlying LangChain implementation in dev branch
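To make the types behind these defaults explicit, here is a minimal sketch that parses the same keys from a mapping such as `os.environ`. It is not the library's own loader: `LLMSettings` and `load_settings` are hypothetical names, and reading the values from environment variables is an assumption made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical container mirroring the defaults listed above."""
    provider: str
    model: str
    temperature: float
    max_tokens: int
    max_retries: int
    timeout: float


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read each setting from `env`, falling back to the documented default."""
    return LLMSettings(
        provider=env.get("LLM_PROVIDER", "openai"),
        model=env.get("LLM_MODEL", "gpt-4o"),
        temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        max_tokens=int(env.get("LLM_MAX_TOKENS", "4096")),
        max_retries=int(env.get("LLM_MAX_RETRIES", "2")),
        timeout=float(env.get("LLM_TIMEOUT", "10.0")),
    )


# Passing an empty mapping yields the defaults used for the runs on this page.
print(load_settings({}))
```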
The assert_semantic_match method (Update: deprecated and replaced with the identical assert_behavioral_match) was also slightly modified:
if result.startswith("FAIL"):
    raise SemanticAssertionError(
        "Semantic assertion failed",
        reason=result.split("FAIL: ")[1]
    )
# Section below added to cause failure in the event of format violation
elif result.startswith("PASS"):
    pass
else:
    raise RuntimeError(
        f"Format Non-compliance Detected {result}"
    )
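To see the three branches in isolation, here is a self-contained sketch. `SemanticAssertionError` is redefined as a stand-in so the block runs without the library, and `check_result` is a hypothetical wrapper name. The sketch also uses `str.partition` in place of `split(...)[1]`; that is a swapped-in choice, not the library's code. With the snippet as shown above, a bare "FAIL" reply with no ": " separator would raise an IndexError rather than a SemanticAssertionError.

```python
class SemanticAssertionError(AssertionError):
    """Stand-in for the library's exception so this sketch runs on its own."""

    def __init__(self, message: str, reason: str = ""):
        super().__init__(f"{message}: {reason}" if reason else message)
        self.reason = reason


def check_result(result: str) -> None:
    """Raise on FAIL, return on PASS, and reject any other reply format."""
    if result.startswith("FAIL"):
        # partition returns an empty reason instead of raising on a bare "FAIL"
        _, _, reason = result.partition("FAIL: ")
        raise SemanticAssertionError(
            "Semantic assertion failed",
            reason=reason or "no reason given",
        )
    if result.startswith("PASS"):
        return
    # Any reply that is neither PASS nor FAIL counts as a format violation
    raise RuntimeError(f"Format Non-compliance Detected {result}")


for reply in ("PASS", "FAIL: missing emergency steps", "Looks fine to me"):
    try:
        check_result(reply)
        print(f"{reply!r}: passed")
    except (SemanticAssertionError, RuntimeError) as exc:
        print(f"{reply!r}: {type(exc).__name__}")
```

Raising a RuntimeError for the third case is what allows format violations to be counted separately from ordinary assertion failures.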
The prompts to the asserter LLM (which sits behind assert_semantic_match, since deprecated and replaced with the identical assert_behavioral_match) were:
DEFAULT_SYSTEM_PROMPT = """You are a testing system. Your job is to determine if an actual output matches the expected behavior.
Important: You can only respond with EXACTLY:
1. 'PASS' if it matches, or
2. 'FAIL: <reason>' if it doesn't match.
Any other type of response will mean disaster which as a testing system, you are meant to prevent.
Be strict but consider semantic meaning rather than exact wording."""
DEFAULT_HUMAN_PROMPT = """
Expected Behavior: {expected_behavior}
Actual Output: {actual}
Does the actual output match the expected behavior? Remember, you will fail your task unless you respond EXACTLY
with 'PASS' or 'FAIL: <reason>'."""
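The human prompt has two placeholders, `{expected_behavior}` and `{actual}`. How the library fills them is not shown here (it may go through a prompt-template class), so the sketch below simply assumes `str.format`; `build_human_prompt` is a hypothetical helper name.

```python
DEFAULT_HUMAN_PROMPT = """
Expected Behavior: {expected_behavior}
Actual Output: {actual}
Does the actual output match the expected behavior? Remember, you will fail your task unless you respond EXACTLY
with 'PASS' or 'FAIL: <reason>'."""


def build_human_prompt(expected_behavior: str, actual: str) -> str:
    """Fill the two placeholders of the human prompt template."""
    return DEFAULT_HUMAN_PROMPT.format(
        expected_behavior=expected_behavior,
        actual=actual,
    )


prompt = build_human_prompt(
    expected_behavior="A greeting that addresses the user by name",
    actual="Hello, Alice! {welcome}",
)
print(prompt)
```

With `str.format`, only the template is parsed for placeholders, so braces inside the substituted values (as in the sample `actual` string) are passed through literally.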
Test Suite Code
⚠️ Note About Test Code:
The test suite shown uses the deprecated SemanticAssertion class and assert_semantic_match method. These tests remain valid as the underlying implementation is identical in the new BehavioralAssertion class and assert_behavioral_match method. The only change is in terminology to better reflect the testing approach.
import pytest

from llm_app_test.semantic_assert.semantic_assert import SemanticAssertion
from llm_app_test.exceptions.test_exceptions import (
    SemanticAssertionError
)


class TestRealWorldSemanticAssertion:
    """Test suite for semantic matching across diverse industry-specific LLM applications.
This test class is specifically designed to validate semantic matching capabilities
across a wide range of real-world LLM application outputs. It contains both positive
and negative test cases that represent actual use cases where LLMs are being used
in production environments.
Industries Covered:
- Healthcare (Patient Education)
- Financial Services (Portfolio Reporting)
- Media/Entertainment (Content Recommendations)
- Legal (Document Summarization)
- Manufacturing (Maintenance Prediction)
- E-commerce (Product Descriptions)
- Education (Assignment Feedback)
- Real Estate (Property Listings)
- Human Resources (Interview Feedback)
- Customer Service (Ticket Response)
Test Structure:
- Each test validates specific industry requirements
- Mix of positive and negative test cases
- Focus on realistic content length and complexity
- Industry-specific terminology and formatting
- Comprehensive coverage of common LLM outputs
Purpose:
This test suite is designed for brute force reliability testing of the semantic
matcher. It ensures the library can handle diverse, real-world content while
maintaining consistent behavior across multiple test runs.
Usage:
These tests are intended to be run multiple times (1000+) to validate the
consistency and reliability of the semantic matching functionality across
different contexts and content types.
    """

    @pytest.fixture
    def asserter(self):
        return SemanticAssertion()

    def test_patient_education_diabetes_management(self, asserter):
        """Test semantic matching for patient education content about diabetes management. Failure is expected because
        this does not contain emergency response steps."""
        actual = """
Understanding and Managing Type 2 Diabetes
Type 2 diabetes is a chronic condition that affects how your body processes blood sugar (glucose).
While this condition is serious, it can be effectively managed through lifestyle changes and,
when necessary, medication. This guide will help you understand the key aspects of diabetes
management.
Blood Sugar Monitoring:
Regular blood sugar monitoring is essential. Your target blood glucose levels should typically
be 80-130 mg/dL before meals and less than 180 mg/dL two hours after meals. However, your
healthcare provider may set different targets based on your individual needs. Keep a log of
your readings to identify patterns and adjust your management strategy accordingly.
Dietary Considerations:
A balanced diet is crucial for managing type 2 diabetes. Focus on:
- Controlling portion sizes
- Choosing high-fiber, low-glycemic foods
- Limiting refined carbohydrates and processed sugars
- Including lean proteins and healthy fats
- Spacing meals evenly throughout the day
Physical Activity:
Regular exercise helps control blood sugar levels by improving insulin sensitivity. Aim for:
- At least 150 minutes of moderate-intensity aerobic activity weekly
- Resistance training 2-3 times per week
- Daily movement, even if just short walks
Always check your blood sugar before and after exercise, and carry a fast-acting
carbohydrate source.
Medication Management:
If prescribed, take diabetes medications as directed. Common medications include:
- Metformin (helps reduce glucose production)
- Sulfonylureas (increase insulin production)
- DPP-4 inhibitors (help maintain blood sugar control)
Never adjust or stop medications without consulting your healthcare provider.
Warning Signs:
Learn to recognize and respond to:
- Hypoglycemia (low blood sugar): shakiness, sweating, confusion
- Hyperglycemia (high blood sugar): increased thirst, frequent urination, fatigue
Seek immediate medical attention if you experience severe symptoms or sustained
high blood sugar levels.
Regular Health Monitoring:
Schedule regular check-ups with your healthcare team, including:
- HbA1c tests every 3-6 months
- Annual eye examinations
- Regular foot checks
- Kidney function tests
- Cholesterol level monitoring
Remember, diabetes management is a journey, not a destination. Small, consistent
steps in the right direction can lead to significant improvements in your health
and quality of life.
        """

        expected = """A medical education document that must:
        1. Contain an overview section explaining the condition
        2. List specific numerical guidelines (blood sugar ranges, exercise minutes)
        3. Include structured sections for diet, exercise, and medication
        4. Provide clear warning signs AND detailed emergency response procedures
        5. End with follow-up care instructions"""

        with pytest.raises(SemanticAssertionError) as excinfo:
            asserter.assert_semantic_match(actual, expected)
        assert "Semantic assertion failed" in str(excinfo.value)

    def test_investment_portfolio_report_generation(self, asserter):
        """Test semantic matching for investment portfolio report generation. Tests that the report
        contains all required sections and maintains professional financial terminology."""
        actual = """
Q4 2023 Portfolio Performance Summary
Portfolio Overview:
Your investment portfolio has demonstrated resilient performance during Q4 2023,
achieving a total return of 8.2% against our benchmark index return of 7.5%.
Total portfolio value stands at $1,245,000 as of December 31, 2023.
Asset Allocation Analysis:
Current allocation stands at:
- Equities: 65% ($809,250)
- US Large Cap: 40% ($498,000)
- International Developed: 15% ($186,750)
- Emerging Markets: 10% ($124,500)
- Fixed Income: 25% ($311,250)
- Government Bonds: 15% ($186,750)
- Corporate Bonds: 10% ($124,500)
- Alternative Investments: 10% ($124,500)
- Real Estate: 5% ($62,250)
- Commodities: 5% ($62,250)
Performance Attribution:
Key contributors to performance:
1. US Technology sector outperformance (+12.3%)
2. Emerging Markets recovery (+9.1%)
3. Corporate Bond yield optimization (+4.2%)
Risk Metrics:
- Portfolio Beta: 0.85
- Sharpe Ratio: 1.45
- Maximum Drawdown: -5.2%
- Standard Deviation: 12.3%
Rebalancing Recommendations:
Based on current market conditions and your investment objectives:
1. Consider increasing Fixed Income allocation by 2%
2. Reduce US Large Cap exposure by 3%
3. Increase Emerging Markets exposure by 1%
Market Outlook:
Looking ahead to 2024, we anticipate:
- Continued monetary policy normalization
- Potential emerging markets opportunities
- Heightened focus on quality factors in equity selection
Next Steps:
1. Schedule quarterly review meeting
2. Discuss rebalancing recommendations
3. Update investment policy statement if needed
        """

        expected = """A professional investment portfolio report that must:
        1. Present portfolio performance with specific metrics
        2. Detail current asset allocation with percentages
        3. Include risk analysis metrics
        4. Provide forward-looking recommendations
        5. Maintain formal financial terminology
        6. Include clear next steps or action items"""

        asserter.assert_semantic_match(actual, expected)

    def test_content_recommendation_missing_viewing_patterns(self, asserter):
        """Test semantic matching for content recommendations. Should fail due to missing viewing patterns
        and user preferences section."""
        actual = """
Personalized Content Recommendations - User Profile #A1234
Generated: November 22, 2024
Recommended Content Queue:
1. "Climate Pioneers" (Documentary Series)
- Episode length: 45 minutes
- New episodes available
2. "Global Power Play" (Political Drama)
- Episode length: 55 minutes
- Features actors from previously watched content
3. "Earth's Tipping Points" (Scientific Documentary)
- Episode length: 40 minutes
- Recently added to platform
Engagement Optimization:
- Scheduled new episode alerts
- Downloadable content for offline viewing
- Similar content suggestions refreshed weekly
- Customized language preferences maintained
Content Accessibility:
All recommended content includes your preferred subtitle options and is
available in HD quality. Downloads are enabled for offline viewing during
your upcoming travel dates.
        """

        expected = """A personalized content recommendation document that must:
        1. Include the viewing patterns and preferences of the user
        2. List recommended content with clear reasoning
        3. Provide matching percentages or relevance metrics
        4. Include viewing optimization suggestions
        5. Address content accessibility features"""

        with pytest.raises(SemanticAssertionError) as excinfo:
            asserter.assert_semantic_match(actual, expected)
        assert "Semantic assertion failed" in str(excinfo.value)

    def test_legal_document_summary_generation(self, asserter):
        """Test semantic matching for legal document summary generation. Tests that the summary
        maintains accuracy while being accessible to non-legal readers."""
        actual = """
Contract Summary Analysis
Document Reference: MSA-2024-0892
Date of Analysis: November 22, 2024
Agreement Overview:
Software Development Master Services Agreement between TechCorp Inc. ("Provider")
and GlobalEnterprises LLC ("Client") for the development and maintenance of
enterprise software solutions.
Key Terms and Conditions:
1. Service Scope
- Custom software development services
- System integration capabilities
- Ongoing maintenance and support
- Security compliance implementations
2. Financial Terms
- Base development fee: $750,000
- Monthly maintenance: $15,000
- Change request rate: $200/hour
- Payment terms: Net 30
3. Performance Standards
- 99.9% system availability
- 4-hour response time for critical issues
- Monthly performance reporting
- Quarterly service reviews
4. Intellectual Property Rights
- Client owns all custom development
- Provider retains rights to pre-existing IP
- Joint ownership of derivative works
- Limited license for provider tools
5. Term and Termination
- Initial term: 36 months
- Automatic renewal: 12-month periods
- 90-day termination notice required
- Immediate termination for material breach
Risk Assessment:
- Medium risk: Data protection obligations
- Low risk: Service level commitments
- Low risk: IP ownership structure
- Medium risk: Change management process
Next Steps:
1. Legal team review of data protection terms
2. Technical team validation of SLAs
3. Finance approval of payment terms
4. Compliance review of security standards
        """

        expected = """A legal document summary that must:
        1. Identify key parties and document type
        2. List main contractual terms
        3. Include specific numerical values (costs, dates, metrics)
        4. Provide risk assessment
        5. Outline required actions or next steps"""

        asserter.assert_semantic_match(actual, expected)

    def test_maintenance_prediction_missing_historical_context(self, asserter):
        """Test semantic matching for maintenance prediction report. Should fail due to
        missing historical maintenance context and pattern analysis."""
        actual = """
Equipment Maintenance Analysis
Machine ID: CNC-1234
Analysis Date: November 22, 2024
Current Status Summary:
The CNC machine is showing early indicators of potential bearing wear in the main spindle.
Recommended action is to schedule maintenance within the next 2 weeks.
Operational Parameters:
- Current Runtime: 2,450 hours
- Average Daily Usage: 18 hours
- Last Maintenance: October 15, 2024
Immediate Recommendations:
1. Schedule bearing inspection
2. Monitor vibration levels daily
3. Prepare replacement parts
4. Plan for 4-hour maintenance window
Impact Assessment:
- Production Impact: Minimal if addressed within 2 weeks
- Resource Requirements: Standard maintenance team
- Parts Cost Estimate: $2,500
        """

        expected = """A maintenance prediction report that must:
        1. Include current machine status
        2. Provide historical maintenance patterns
        3. Show failure prediction confidence levels
        4. List specific maintenance recommendations
        5. Include impact assessment and timeline"""

        with pytest.raises(SemanticAssertionError) as excinfo:
            asserter.assert_semantic_match(actual, expected)
        assert "Semantic assertion failed" in str(excinfo.value)

    def test_product_description_generation(self, asserter):
        """Test semantic matching for e-commerce product description generation. Tests that the description
        includes all required elements of an effective product listing."""
        actual = """
Smart Home Security Camera - Model HC2000
Transform your home security with our latest AI-powered camera system. This next-generation
device combines advanced motion detection with crystal-clear 4K video quality, perfect for
both indoor and outdoor monitoring.
Key Features:
- 4K Ultra HD resolution with HDR
- 160° wide-angle view
- Advanced AI motion detection
- Two-way audio communication
- Night vision up to 30 feet
- Weather-resistant (IP66 rated)
Smart Integration:
Works seamlessly with major platforms including:
- Amazon Alexa
- Google Home
- Apple HomeKit
- IFTTT
Technical Specifications:
- Dimensions: 3.2" x 3.2" x 5.1"
- Weight: 12.3 oz
- Power: AC adapter or rechargeable battery
- Storage: Cloud or local SD card (up to 256GB)
- Connectivity: 2.4GHz/5GHz WiFi
What's in the Box:
- HC2000 Camera
- Mounting bracket
- Power adapter
- Quick start guide
- Screws and anchors
Perfect for:
- Home security
- Baby monitoring
- Pet watching
- Front door monitoring
30-day money-back guarantee
2-year manufacturer warranty
Free technical support
        """

        expected = """An e-commerce product description that must:
        1. Include clear product name and model
        2. List key features and specifications
        3. Specify technical details and compatibility
        4. Describe package contents
        5. Include warranty and support information"""

        asserter.assert_semantic_match(actual, expected)

    def test_assignment_feedback_missing_improvement_steps(self, asserter):
        """Test semantic matching for student assignment feedback. Should fail due to
        missing specific improvement steps and learning objectives."""
        actual = """
Assignment Feedback
Student ID: STU-2024-456
Assignment: Research Paper on Climate Change
Submission Date: November 22, 2024
Overall Assessment:
Your research paper demonstrates good understanding of climate change basics.
The writing is clear and well-structured, with appropriate use of scientific
terminology throughout the document.
Strengths:
- Strong introduction that sets context
- Good use of current scientific data
- Clear paragraph structure
- Proper citation format
Areas Noted:
- Some statistical interpretations could be more precise
- Additional peer-reviewed sources would strengthen arguments
- Conclusion could be more comprehensive
Grade: B+ (88/100)
Additional Comments:
The paper shows promise and indicates solid research skills. Your analysis
of temperature data trends was particularly well-done. Consider expanding
your discussion of potential mitigation strategies in future work.
        """

        expected = """An assignment feedback document that must:
        1. Include basic assignment and student information
        2. Provide specific strengths and weaknesses
        3. List concrete steps for improvement
        4. Reference specific learning objectives
        5. Include grading criteria and score"""

        with pytest.raises(SemanticAssertionError) as excinfo:
            asserter.assert_semantic_match(actual, expected)
        assert "Semantic assertion failed" in str(excinfo.value)

    def test_real_estate_listing_generation(self, asserter):
        """Test semantic matching for real estate listing generation. Tests that the listing
        includes all essential elements of an effective property description."""
        actual = """
Stunning Modern Oasis in Prime Location
123 Maple Avenue, Riverside Heights
Discover urban elegance in this meticulously updated contemporary home, where
modern luxury meets practical living. This 2,400 sq ft residence seamlessly
blends indoor and outdoor living spaces.
Property Highlights:
- 4 bedrooms, 2.5 bathrooms
- Built: 2018
- Lot size: 0.25 acres
- Two-car attached garage
- Energy-efficient smart home features
Interior Features:
The open-concept main level showcases:
- Chef's kitchen with quartz countertops
- Custom Italian cabinetry
- Premium stainless steel appliances
- Expansive living room with 12-foot ceilings
- Primary suite with spa-inspired bathroom
Outdoor Living:
- Professional landscaping
- Covered patio with built-in BBQ
- Low-maintenance xeriscaping
- Private backyard retreat
Location Benefits:
- Walking distance to Central Park
- Top-rated school district
- 10 minutes to downtown
- Easy access to major highways
Recent Updates:
- New HVAC system (2023)
- Smart home integration
- Updated LED lighting
- Fresh interior paint
Price: $875,000
Available for immediate viewing
Virtual tour link: [URL]
        """

        expected = """A real estate listing that must:
        1. Include property overview and key features
        2. List specific amenities and updates
        3. Describe location benefits
        4. Use engaging, descriptive language
        5. Provide essential details (size, bedrooms, price)"""

        asserter.assert_semantic_match(actual, expected)

    def test_interview_feedback_missing_criteria(self, asserter):
        """Test semantic matching for interview feedback generation. Should fail due to
        missing evaluation criteria and specific examples."""
        actual = """
Interview Feedback Summary
Candidate ID: INT-2024-789
Position: Senior Software Engineer
Interview Date: November 22, 2024
Overall Impression:
The candidate demonstrated strong technical knowledge and communicated well
throughout the interview. They showed enthusiasm for the role and our company's
mission.
Discussion Points:
- Previous experience with cloud architecture
- Team collaboration approaches
- Problem-solving methodology
- Career goals and aspirations
Technical Discussion:
Candidate showed familiarity with:
- Microservices architecture
- CI/CD pipelines
- Cloud platforms (AWS, Azure)
- Agile development practices
Cultural Fit:
Appears to align well with our company values and team dynamics.
Demonstrated good communication skills and collaborative mindset.
Next Steps:
Proceed with reference checks if moving forward.
Schedule follow-up with hiring manager.
        """

        expected = """An interview feedback document that must:
        1. Include candidate and position information
        2. List specific evaluation criteria with ratings
        3. Provide concrete examples of responses
        4. Include technical assessment scores
        5. Offer clear hiring recommendation"""

        with pytest.raises(SemanticAssertionError) as excinfo:
            asserter.assert_semantic_match(actual, expected)
        assert "Semantic assertion failed" in str(excinfo.value)

    def test_customer_service_ticket_response(self, asserter):
        """Test semantic matching for customer service ticket analysis and response generation."""
        actual = """
Ticket Analysis and Response
Ticket ID: CS-2024-1122
Priority: Medium
Category: Product Return
Customer Query Summary:
Customer purchased a wireless headphone (Model: WH-1000XM4) three days ago
and is experiencing connectivity issues with their iPhone 13. Initial
troubleshooting steps were attempted without success.
Issue Analysis:
- Product is within return window (3 of 30 days)
- Common compatibility issue identified
- Troubleshooting already attempted
- Customer tone indicates frustration
Recommended Response:
Dear [Customer Name],
Thank you for reaching out about the connectivity issues with your WH-1000XM4
headphones. I understand how frustrating technical issues can be, especially
with a new purchase.
Based on your description, I can offer you two immediate solutions:
1. Advanced Troubleshooting:
- Reset the headphones (detailed steps attached)
- Update iPhone Bluetooth settings
- Install latest firmware
2. Hassle-free Return:
- Generate return label through our portal
- Full refund processed within 3 business days
- Free return shipping
Would you prefer to try the advanced troubleshooting steps, or would you like
to proceed with the return? I'm here to help with either option.
Next Steps:
- Await customer preference
- Prepare return label if requested
- Schedule follow-up within 24 hours
Response Tone: Empathetic and Solution-focused
Support Resources: KB-2345, RT-6789
        """

        expected = """A customer service response that must:
        1. Include ticket categorization and priority
        2. Summarize the customer's issue accurately
        3. Provide multiple solution options
        4. Include clear next steps
        5. Maintain appropriate tone and empathy"""

        asserter.assert_semantic_match(actual, expected)
Conclusion
This real-world test suite demonstrates that llm-app-test can reliably handle the kind of content that LLM applications generate in production environments.
The 100% pass rate across 12,000 executions provides strong evidence of the library's reliability for real-world use cases.
However, we emphasize that we cannot guarantee perfect determinism, given the nature of LLMs. What we are confident of is that this library is "good enough" for production software.
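One way to put a number on "strong evidence but no guarantee" is an upper confidence bound on the failure rate after zero observed failures. The sketch below is an illustration rather than an analysis performed in the original testing. It assumes the 12,000 executions can be treated as independent trials, which is a simplification because the same 10 cases were repeated at temperature 0. `failure_rate_upper_bound` is a hypothetical helper name.

```python
def failure_rate_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-execution failure rate after zero failures.

    With true failure rate p, the chance of seeing no failures in n independent
    trials is (1 - p) ** n. Solving (1 - p) ** n = 1 - confidence for p gives
    the largest rate still consistent with the clean record.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


bound = failure_rate_upper_bound(12_000)
print(f"95% upper bound on failure rate: {bound:.6f}")
```

Under those assumptions the bound comes out at roughly 0.00025, or about one failure in 4,000 executions, close to the familiar "rule of three" approximation 3/n. It describes this suite only and says nothing about content unlike the 10 cases tested.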
Issue reporting
If you experience any issues, especially with library reliability, please let us know. Thanks!