Testing Patterns

This guide demonstrates common behavioral testing patterns using llm-app-test. For API reference and syntax, see API Documentation.

Pattern: Basic Behavioral Testing

Scenario: Testing simple greeting behaviour

Implementation:

actual = "Hello Alice, how are you today?" 

expected_behavior = """A polite greeting that:
    Addresses the person by name (Alice)
    Asks about their wellbeing"""

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Recognises personal address
Identifies greeting context
Validates wellbeing inquiry

Pattern: Basic Behavioral Matching

Scenario: Testing simple factual statements

Implementation:

actual = "The sky is blue"

expected_behavior = "A statement about the color of the sky"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Shows direct behavioral matching
Clear relationship between statement and expectation
Passes when meaning aligns

Pattern: Expected Behavioral Mismatch

Scenario: Validating behavioral mismatch detection

Implementation:

actual = "The sky is blue"

expected_behavior = "A statement about the weather forecast"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because the actual statement is about sky colour
Expected behaviour asks for weather forecast information
Demonstrates how behavioral mismatches are caught
Shows when assertions will fail in your tests

Pattern: Multilingual Behavioral Testing

Scenario: Testing behavioral understanding across languages

Implementation:

actual = "Bonjour, comment allez-vous?"

expected_behavior = "A polite greeting asking about wellbeing"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Demonstrates language-agnostic understanding
Shows cross-language behavioral matching
Validates international content handling

Pattern: Technical Documentation Testing

Scenario: Testing technical concept explanations

Implementation:

actual = """The TCP handshake is a three-way process where the client 
         sends SYN, server responds with SYN-ACK, and client confirms with ACK"""

expected_behavior = "An explanation of the TCP connection establishment process"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Validates technical accuracy
Handles specialised terminology
Maintains precision in behavioral assessment

Pattern: Contextual Disambiguation

Scenario: Testing understanding of ambiguous terms

Implementation:

actual = "The bank was steep and covered in wildflowers"

expected_behavior = "A description of a riverbank or hillside, not a financial institution"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Shows contextual understanding
Handles ambiguous terms
Validates specific meaning exclusions

Pattern: Sentiment Analysis

Scenario: Testing subtle emotional content

Implementation:

actual = "While the presentation wasn't perfect, it showed promise"

expected_behavior = "A constructive criticism with mixed but generally positive sentiment"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Detects nuanced sentiment
Understands mixed emotions
Validates overall tone

Pattern: Long-Form Content

Scenario: Testing comprehension of detailed explanations

Implementation:

actual = """Machine learning is a subset of artificial intelligence 
         that enables systems to learn and improve from experience without 
         explicit programming. It focuses on developing computer programs 
         that can access data and use it to learn for themselves."""

expected_behavior = "A comprehensive definition of machine learning emphasizing autonomous learning and data usage"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Handles longer text
Maintains context
Captures key concepts

Pattern: Subtle Sentiment Mismatch

Scenario: Testing detection of subtle sentiment differences

Implementation:

actual = "The project was completed on time, though there were some hiccups"

expected_behavior = "A statement expressing complete satisfaction with project execution"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because actual statement indicates mixed satisfaction
Expected behaviour suggests complete satisfaction
Shows sensitivity to subtle emotional differences

Pattern: Technical Context Mismatch

Scenario: Testing technical meaning precision

Implementation:

actual = "The function returns a pointer to the memory address"

expected_behavior = "A description of a function that returns the value stored at a memory location"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because returning a pointer is different from returning a stored value
Shows precision in technical context
Validates technical accuracy

Pattern: Ambiguous Reference Testing

Scenario: Testing handling of context-dependent terms

Implementation:

actual = "The bank processed the transaction after reviewing the account"

expected_behavior = "A description of a riverbank's geological formation process"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because contexts are completely different
Shows strong contextual understanding
Validates semantic boundaries in behavioral matching

Pattern: Temporal Context

Scenario: Testing time-based behavioral understanding

Implementation:

actual = "I will have completed the task by tomorrow"

expected_behavior = "A statement about a task that was completed in the past"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because of tense mismatch
Shows temporal awareness
Validates time context

Pattern: Logical Implication

Scenario: Testing logical relationship understanding

Implementation:

actual = "If it rains, the ground will be wet"

expected_behavior = "A statement indicating that wet ground always means it has rained"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because of reversed logical implication
Shows logical relationship understanding
Validates causality direction

Pattern: Multi-Step Reasoning

Scenario: Testing complex logical chains

Implementation:

actual = """When water freezes, it expands by approximately 9% in volume. 
This expansion creates less dense ice that floats according to Archimedes' principle of displacement. 
Because Arctic sea ice is already floating in the ocean, its melting doesn't significantly affect sea levels - 
it's already displacing its weight in water. However, land-based glaciers in places like Greenland 
aren't currently displacing any ocean water. When these glaciers melt, they add entirely new water volume 
to the oceans, making them a primary contributor to sea level rise."""

expected_behavior = """A multi-step scientific explanation.
Must maintain logical consistency across all steps."""

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Handles complex logical chains
Maintains consistency across steps
Validates scientific reasoning

Pattern: Nonsensical Content

Scenario: Testing handling of grammatically correct but meaningless content

Implementation:

actual = "The colorless green ideas sleep furiously"

expected_behavior = "A grammatically correct but semantically nonsensical statement"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Recognizes grammatical structure
Identifies semantic nonsense
Validates meta-understanding

Pattern: Extended Narrative

Scenario: Testing long-form narrative understanding

Implementation:

actual = """
        The Roman Empire's rise began with modest origins in central Italy. What started as a small 
        settlement along the Tiber River would eventually become one of history's most influential 
        civilizations. In the early days, Rome was ruled by kings, but this system was overthrown 
        in 509 BCE, giving birth to the Roman Republic.

        During the Republic, Rome expanded its territory through military conquest and diplomatic 
        alliances. The Roman army became increasingly professional, developing innovative tactics 
        and technologies. This military success brought wealth and power, but also internal 
        challenges. Social tensions grew between patricians and plebeians, leading to significant 
        political reforms.

        By the 1st century BCE, the Republic faced severe internal strife. Military commanders 
        like Marius, Sulla, and eventually Julius Caesar accumulated unprecedented power. Caesar's 
        crossing of the Rubicon in 49 BCE marked a point of no return. His assassination in 44 BCE 
        led to another civil war, ultimately resulting in his adopted heir Octavian becoming 
        Augustus, the first Roman Emperor.

        Augustus transformed Rome into an empire while maintaining a facade of republican 
        institutions. He implemented sweeping reforms in administration, military organization, 
        and public works. The Pax Romana that followed brought unprecedented peace and prosperity 
        across the Mediterranean world. Trade flourished, cities grew, and Roman culture spread 
        throughout the empire.
        """

expected_behavior = """A historical narrative that:
1. Maintains chronological progression
2. Shows cause-and-effect relationships
3. Develops consistent themes (power, governance, military)
4. Connects multiple historical events coherently
5. Demonstrates character development (e.g., Caesar to Augustus)
"""

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Handles extended narratives
Maintains thematic consistency
Validates complex relationships
Shows chronological understanding

Pattern: Emoji Quantity Testing

Scenario: Testing recognition of repeated emojis

Implementation:

actual = "🤖" * 100

expected_behavior = "A lot of emojis"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Handles repeated Unicode characters
Recognises quantity concepts
Validates emoji processing

Pattern: Emoji Quantity Mismatch

Scenario: Testing quantity recognition accuracy

Implementation:

actual = "🤖" * 100

expected_behavior = "Only a few emojis"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails due to quantity mismatch
Shows quantity awareness
Validates numerical understanding

Pattern: Mixed Unicode Content ⚠️ Known Reliability Issue

Scenario: Testing complex Unicode combinations and repetitive patterns

Observed Behavior

Test Case 1: Strict Pattern Matching

actual = "🤖👾" * 50 + "こんにちは" * 20 + "🌈" * 30

expected_behavior = "A mix of emojis and Japanese text"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Results:

✅ Success Rate: 96% (48/50 runs)
❌ Failure Rate: 4% (2/50 runs)
🔍 Failure Analysis:
- Occurs primarily during increased API latency
- GPT-4o occasionally interprets sequential patterns as "distinct collections" rather than "mixed content"
- Failure message example: "This is not a mix as there is a distinct collection of emojis followed by Japanese text and then a collection of rainbows"

Recommended Implementation

Test Case 2: Pattern-Agnostic Matching

actual = "🤖👾" * 50 + "こんにちは" * 20 + "🌈" * 30

expected = "More than one type of emoji and Japanese text regardless of order"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Results:

✅ Success Rate: 100% (preliminary)
⚠️ Extended testing in progress
🔍 Monitoring prompt effectiveness across different test scenarios

Best Practices

Use pattern-agnostic assertions for repetitive Unicode content
Consider implementing retry logic for critical tests
Monitor API response times during failures
Use enhanced prompts for complex Unicode pattern testing

Ongoing Investigation

Testing various prompt configurations to improve reliability
Monitoring performance impact of different prompt strategies
Collecting data on failure patterns with different prompt versions

Pattern: Multilingual Emoji Spam

Scenario: Testing repeated multilingual content

Implementation:

actual = "Hello你好Bonjour🌈" * 50

expected_behavior = "A repetitive greeting in multiple languages with rainbows"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Handles multilingual text
Recognises repetitive patterns
Validates mixed content types

Pattern: ASCII Art Recognition

Scenario: Testing complex ASCII art patterns

Implementation:

actual = """
(╯°□°)╯︵ ┻━┻
""" * 20

expected_behavior = "Multiple instances of table-flipping ASCII art"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Recognises ASCII art patterns
Understands visual representations
Validates repeated patterns

Pattern: Extreme Whitespace

Scenario: Testing handling of excessive spacing

Implementation:

actual = "hello    " + " " * 1000 + "    world" + "\n" * 500

expected_behavior = "A greeting with excessive spacing"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Handles extreme whitespace
Maintains semantic meaning
Validates text normalisation

Pattern: Number Pattern Recognition

Scenario: Testing numerical pattern understanding

Implementation:

actual = "".join([str(i % 10) for i in range(1000)])

expected_behavior = "A long sequence of repeating numbers"

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ✅ PASS

Recognises numerical patterns
Handles long repetitive sequences
Validates pattern understanding

Pattern: Patient Education Content Testing

Scenario: Testing medical education content for diabetes management

Implementation:

actual = """
Understanding and Managing Type 2 Diabetes

Type 2 diabetes is a chronic condition that affects how your body processes blood sugar (glucose). 
While this condition is serious, it can be effectively managed through lifestyle changes and, 
when necessary, medication. This guide will help you understand the key aspects of diabetes 
management.

Blood Sugar Monitoring:
Regular blood sugar monitoring is essential. Your target blood glucose levels should typically 
be 80-130 mg/dL before meals and less than 180 mg/dL two hours after meals. However, your 
healthcare provider may set different targets based on your individual needs. Keep a log of 
your readings to identify patterns and adjust your management strategy accordingly.

Dietary Considerations:
A balanced diet is crucial for managing type 2 diabetes. Focus on:
- Controlling portion sizes
- Choosing high-fiber, low-glycemic foods
- Limiting refined carbohydrates and processed sugars
- Including lean proteins and healthy fats
- Spacing meals evenly throughout the day

Physical Activity:
Regular exercise helps control blood sugar levels by improving insulin sensitivity. Aim for:
- At least 150 minutes of moderate-intensity aerobic activity weekly
- Resistance training 2-3 times per week
- Daily movement, even if just short walks
Always check your blood sugar before and after exercise, and carry a fast-acting 
carbohydrate source.

Medication Management:
If prescribed, take diabetes medications as directed. Common medications include:
- Metformin (helps reduce glucose production)
- Sulfonylureas (increase insulin production)
- DPP-4 inhibitors (help maintain blood sugar control)
Never adjust or stop medications without consulting your healthcare provider.

Warning Signs:
Learn to recognize and respond to:
- Hypoglycemia (low blood sugar): shakiness, sweating, confusion
- Hyperglycemia (high blood sugar): increased thirst, frequent urination, fatigue
Seek immediate medical attention if you experience severe symptoms or sustained 
high blood sugar levels.

Regular Health Monitoring:
Schedule regular check-ups with your healthcare team, including:
- HbA1c tests every 3-6 months
- Annual eye examinations
- Regular foot checks
- Kidney function tests
- Cholesterol level monitoring

Remember, diabetes management is a journey, not a destination. Small, consistent 
steps in the right direction can lead to significant improvements in your health 
and quality of life.
"""

expected = """A medical education document that must:
1. Contain an overview section explaining the condition
2. List specific numerical guidelines (blood sugar ranges, exercise minutes)
3. Include structured sections for diet, exercise, and medication
4. Provide clear warning signs AND detailed emergency response procedures
5. End with follow-up care instructions"""

behavioral_assert.assert_behavioral_match(actual, expected_behavior)

Result: ❌ FAIL

Fails because emergency response procedures are missing
Fails because follow-up care instructions are incomplete
Shows precision in medical content requirements
Validates structured information presentation

Pattern: Investment Portfolio Report Testing

Scenario: Testing professional investment portfolio report generation

Implementation:

actual = """
Q4 2023 Portfolio Performance Summary

Portfolio Overview:
Your investment portfolio has demonstrated resilient performance during Q4 2023, 
achieving a total return of 8.2% against our benchmark index return of 7.5%. 
Total portfolio value stands at $1,245,000 as of December 31, 2023.

Asset Allocation Analysis:
Current allocation stands at:
- Equities: 65% ($809,250)
  - US Large Cap: 40% ($498,000)
  - International Developed: 15% ($186,750)
  - Emerging Markets: 10% ($124,500)
- Fixed Income: 25% ($311,250)
  - Government Bonds: 15% ($186,750)
  - Corporate Bonds: 10% ($124,500)
- Alternative Investments: 10% ($124,500)
  - Real Estate: 5% ($62,250)
  - Commodities: 5% ($62,250)

Performance Attribution:
Key contributors to performance:
1. US Technology sector outperformance (+12.3%)
2. Emerging Markets recovery (+9.1%)
3. Corporate Bond yield optimization (+4.2%)

Risk Metrics:
- Portfolio Beta: 0.85
- Sharpe Ratio: 1.45
- Maximum Drawdown: -5.2%
- Standard Deviation: 12.3%

Rebalancing Recommendations:
Based on current market conditions and your investment objectives:
1. Consider increasing Fixed Income allocation by 2%
2. Reduce US Large Cap exposure by 3%
3. Increase Emerging Markets exposure by 1%

Market Outlook:
Looking ahead to 2024, we anticipate:
- Continued monetary policy normalization
- Potential emerging markets opportunities
- Heightened focus on quality factors in equity selection

Next Steps:
1. Schedule quarterly review meeting
2. Discuss rebalancing recommendations
3. Update investment policy statement if needed
"""

expected = """A professional investment portfolio report that must:
1. Present portfolio performance with specific metrics
2. Detail current asset allocation with percentages
3. Include risk analysis metrics
4. Provide forward-looking recommendations
5. Maintain formal financial terminology
6. Include clear next steps or action items"""

asserter.assert_behavioral_match(actual, expected)

Result: ✅ PASS

Shows comprehensive financial reporting structure
Validates precise numerical data presentation
Confirms professional financial terminology
Demonstrates clear action items and recommendations
Verifies complete risk metric inclusion

Pattern: Personalized Content Recommendation Testing

Scenario: Testing personalized content recommendation generation

Implementation:

actual = """
Personalized Content Recommendations - User Profile #A1234
Generated: November 22, 2024

Recommended Content Queue:
1. "Climate Pioneers" (Documentary Series)
- Episode length: 45 minutes
- New episodes available

2. "Global Power Play" (Political Drama)
- Episode length: 55 minutes
- Features actors from previously watched content

3. "Earth's Tipping Points" (Scientific Documentary)
- Episode length: 40 minutes
- Recently added to platform

Engagement Optimization:
- Scheduled new episode alerts
- Downloadable content for offline viewing
- Similar content suggestions refreshed weekly
- Customized language preferences maintained

Content Accessibility:
All recommended content includes your preferred subtitle options and is 
available in HD quality. Downloads are enabled for offline viewing during 
your upcoming travel dates.
"""

expected = """A personalized content recommendation document that must:
1. Include the viewing patterns and preferences of the user
2. List recommended content with clear reasoning
3. Provide matching percentages or relevance metrics
4. Include viewing optimization suggestions
5. Address content accessibility features"""

asserter.assert_behavioral_match(actual, expected)

Result: ❌ FAIL

Fails because relevance metrics (percentages) are missing
Lacks explicit matching percentages or relevance scores
Shows content recommendations without quantified relevance
Demonstrates engagement optimization suggestions
Validates content accessibility features

Pattern: Legal Document Summary Testing

Scenario: Testing legal document summary generation

Implementation:

actual = """
Contract Summary Analysis
Document Reference: MSA-2024-0892
Date of Analysis: November 22, 2024

Agreement Overview:
Software Development Master Services Agreement between TechCorp Inc. ("Provider") 
and GlobalEnterprises LLC ("Client") for the development and maintenance of 
enterprise software solutions.

Key Terms and Conditions:
1. Service Scope
- Custom software development services
- System integration capabilities
- Ongoing maintenance and support
- Security compliance implementations

2. Financial Terms
- Base development fee: $750,000
- Monthly maintenance: $15,000
- Change request rate: $200/hour
- Payment terms: Net 30

3. Performance Standards
- 99.9% system availability
- 4-hour response time for critical issues
- Monthly performance reporting
- Quarterly service reviews

4. Intellectual Property Rights
- Client owns all custom development
- Provider retains rights to pre-existing IP
- Joint ownership of derivative works
- Limited license for provider tools

5. Term and Termination
- Initial term: 36 months
- Automatic renewal: 12-month periods
- 90-day termination notice required
- Immediate termination for material breach

Risk Assessment:
- Medium risk: Data protection obligations
- Low risk: Service level commitments
- Low risk: IP ownership structure
- Medium risk: Change management process

Next Steps:
1. Legal team review of data protection terms
2. Technical team validation of SLAs
3. Finance approval of payment terms
4. Compliance review of security standards
"""

expected = """A legal document summary that must:
1. Identify key parties and document type
2. List main contractual terms
3. Include specific numerical values (costs, dates, metrics)
4. Provide risk assessment
5. Outline required actions or next steps"""

asserter.assert_behavioral_match(actual, expected)

Result: ✅ PASS

Shows comprehensive legal document structure
Validates precise contractual terms
Confirms specific numerical values inclusion
Demonstrates clear risk assessment
Verifies actionable next steps

Pattern: Maintenance Prediction Testing

Scenario: Testing maintenance prediction report generation for CNC equipment

Implementation:

actual = """
Equipment Maintenance Analysis
Machine ID: CNC-1234
Analysis Date: November 22, 2024

Current Status Summary:
The CNC machine is showing early indicators of potential bearing wear in the main spindle.
Recommended action is to schedule maintenance within the next 2 weeks.

Operational Parameters:
- Current Runtime: 2,450 hours
- Average Daily Usage: 18 hours
- Last Maintenance: October 15, 2024

Immediate Recommendations:
1. Schedule bearing inspection
2. Monitor vibration levels daily
3. Prepare replacement parts
4. Plan for 4-hour maintenance window

Impact Assessment:
- Production Impact: Minimal if addressed within 2 weeks
- Resource Requirements: Standard maintenance team
- Parts Cost Estimate: $2,500
"""

expected = """A maintenance prediction report that must:
1. Include current machine status
2. Provide historical maintenance patterns
3. Show failure prediction confidence levels
4. List specific maintenance recommendations
5. Include impact assessment and timeline"""

asserter.assert_behavioral_match(actual, expected)

Result: ❌ FAIL

Fails because historical maintenance patterns are incomplete
Missing failure prediction confidence levels
Shows current status and recommendations
Includes basic impact assessment
Demonstrates timeline considerations

Pattern: E-commerce Product Description Testing

Scenario: Testing product description for a smart home security camera

Implementation:

actual = """
Smart Home Security Camera - Model HC2000

Transform your home security with our latest AI-powered camera system. This next-generation 
device combines advanced motion detection with crystal-clear 4K video quality, perfect for 
both indoor and outdoor monitoring.

Key Features:
- 4K Ultra HD resolution with HDR
- 160° wide-angle view
- Advanced AI motion detection
- Two-way audio communication
- Night vision up to 30 feet
- Weather-resistant (IP66 rated)

Smart Integration:
Works seamlessly with major platforms including:
- Amazon Alexa
- Google Home
- Apple HomeKit
- IFTTT

Technical Specifications:
- Dimensions: 3.2" x 3.2" x 5.1"
- Weight: 12.3 oz
- Power: AC adapter or rechargeable battery
- Storage: Cloud or local SD card (up to 256GB)
- Connectivity: 2.4GHz/5GHz WiFi

What's in the Box:
- HC2000 Camera
- Mounting bracket
- Power adapter
- Quick start guide
- Screws and anchors

Perfect for:
- Home security
- Baby monitoring
- Pet watching
- Front door monitoring

30-day money-back guarantee
2-year manufacturer warranty
Free technical support
"""

expected = """An e-commerce product description that must:
1. Include clear product name and model
2. List key features and specifications
3. Specify technical details and compatibility
4. Describe package contents
5. Include warranty and support information"""

asserter.assert_behavioral_match(actual, expected)

Result: ✅ PASS

Shows comprehensive product information structure
Validates technical specifications inclusion
Confirms compatibility details
Demonstrates complete package contents listing
Verifies warranty and support information

Pattern: Assignment Feedback Testing

Scenario: Testing student assignment feedback generation

Implementation:

actual = """
Assignment Feedback
Student ID: STU-2024-456
Assignment: Research Paper on Climate Change
Submission Date: November 22, 2024

Overall Assessment:
Your research paper demonstrates good understanding of climate change basics.
The writing is clear and well-structured, with appropriate use of scientific
terminology throughout the document.

Strengths:
- Strong introduction that sets context
- Good use of current scientific data
- Clear paragraph structure
- Proper citation format

Areas Noted:
- Some statistical interpretations could be more precise
- Additional peer-reviewed sources would strengthen arguments
- Conclusion could be more comprehensive

Grade: B+ (88/100)

Additional Comments:
The paper shows promise and indicates solid research skills. Your analysis
of temperature data trends was particularly well-done. Consider expanding
your discussion of potential mitigation strategies in future work.
"""

expected = """An assignment feedback document that must:
1. Include basic assignment and student information
2. Provide specific strengths and weaknesses
3. List concrete steps for improvement
4. Reference specific learning objectives
5. Include grading criteria and score"""

asserter.assert_behavioral_match(actual, expected)

Result: ❌ FAIL

Fails because concrete improvement steps are missing
Missing specific learning objectives
Shows basic assignment information
Includes strengths and weaknesses
Demonstrates grading information

Pattern: Real Estate Listing Testing

Scenario: Testing real estate property listing generation

Implementation:

actual = """
Stunning Modern Oasis in Prime Location
123 Maple Avenue, Riverside Heights

Discover urban elegance in this meticulously updated contemporary home, where 
modern luxury meets practical living. This 2,400 sq ft residence seamlessly 
blends indoor and outdoor living spaces.

Property Highlights:
- 4 bedrooms, 2.5 bathrooms
- Built: 2018
- Lot size: 0.25 acres
- Two-car attached garage
- Energy-efficient smart home features

Interior Features:
The open-concept main level showcases:
- Chef's kitchen with quartz countertops
- Custom Italian cabinetry
- Premium stainless steel appliances
- Expansive living room with 12-foot ceilings
- Primary suite with spa-inspired bathroom

Outdoor Living:
- Professional landscaping
- Covered patio with built-in BBQ
- Low-maintenance xeriscaping
- Private backyard retreat

Location Benefits:
- Walking distance to Central Park
- Top-rated school district
- 10 minutes to downtown
- Easy access to major highways

Recent Updates:
- New HVAC system (2023)
- Smart home integration
- Updated LED lighting
- Fresh interior paint

Price: $875,000
Available for immediate viewing
Virtual tour link: [URL]
"""

expected = """A real estate listing that must:
1. Include property overview and key features
2. List specific amenities and updates
3. Describe location benefits
4. Use engaging, descriptive language
5. Provide essential details (size, bedrooms, price)"""

asserter.assert_behavioral_match(actual, expected)

Result: ✅ PASS

Shows comprehensive property information
Validates specific amenities and features
Confirms location benefits inclusion
Demonstrates engaging descriptive language
Verifies essential property details

Pattern: Interview Feedback Testing

Scenario: Testing interview feedback generation for technical position

Implementation:

def test_interview_feedback_missing_criteria(self, asserter):
"""Test semantic matching for interview feedback generation. Should fail due to
missing evaluation criteria and specific examples."""
actual = """
Interview Feedback Summary
Candidate ID: INT-2024-789
Position: Senior Software Engineer
Interview Date: November 22, 2024

Overall Impression:
The candidate demonstrated strong technical knowledge and communicated well
throughout the interview. They showed enthusiasm for the role and our company's
mission.

Discussion Points:
- Previous experience with cloud architecture
- Team collaboration approaches
- Problem-solving methodology
- Career goals and aspirations

Technical Discussion:
Candidate showed familiarity with:
- Microservices architecture
- CI/CD pipelines
- Cloud platforms (AWS, Azure)
- Agile development practices

Cultural Fit:
Appears to align well with our company values and team dynamics.
Demonstrated good communication skills and collaborative mindset.

Next Steps:
Proceed with reference checks if moving forward.
Schedule follow-up with hiring manager.
"""

expected = """An interview feedback document that must:
1. Include candidate and position information
2. List specific evaluation criteria with ratings
3. Provide concrete examples of responses
4. Include technical assessment scores
5. Offer clear hiring recommendation"""

asserter.assert_behavioral_match(actual, expected)

Result: ❌ FAIL

Fails because evaluation criteria lack ratings
Missing concrete response examples
Missing technical assessment scores
Shows basic candidate information
Lacks clear hiring recommendation

Pattern: Customer Service Response Testing

Scenario: Testing customer service ticket response generation

Implementation:

actual = """
Ticket Analysis and Response
Ticket ID: CS-2024-1122
Priority: Medium
Category: Product Return

Customer Query Summary:
Customer purchased a wireless headphone (Model: WH-1000XM4) three days ago
and is experiencing connectivity issues with their iPhone 13. Initial
troubleshooting steps were attempted without success.

Issue Analysis:
- Product is within return window (3 of 30 days)
- Common compatibility issue identified
- Troubleshooting already attempted
- Customer tone indicates frustration

Recommended Response:
Dear [Customer Name],

Thank you for reaching out about the connectivity issues with your WH-1000XM4
headphones. I understand how frustrating technical issues can be, especially
with a new purchase.

Based on your description, I can offer you two immediate solutions:

1. Advanced Troubleshooting:
- Reset the headphones (detailed steps attached)
- Update iPhone Bluetooth settings
- Install latest firmware

2. Hassle-free Return:
- Generate return label through our portal
- Full refund processed within 3 business days
- Free return shipping

Would you prefer to try the advanced troubleshooting steps, or would you like
to proceed with the return? I'm here to help with either option.

Next Steps:
- Await customer preference
- Prepare return label if requested
- Schedule follow-up within 24 hours

Response Tone: Empathetic and Solution-focused
Support Resources: KB-2345, RT-6789
"""

expected = """A customer service response that must:
1. Include ticket categorization and priority
2. Summarize the customer's issue accurately
3. Provide multiple solution options
4. Include clear next steps
5. Maintain appropriate tone and empathy"""

asserter.assert_behavioral_match(actual, expected)

Result: ✅ PASS

Shows comprehensive ticket information structure
Validates accurate issue summary
Confirms multiple solution options
Demonstrates clear next steps
Verifies empathetic and professional tone