Retrieval-Augmented Generation with LangChain

Nov 12 2024 · Python 3.12, LangChain 0.3.x, JupyterLab 4.2.4

Lesson 05: Evaluating & Optimizing RAG Systems

Assessing a RAG Pipeline Demo


In this demo, you’ll use DeepEval, a popular open-source LLM evaluation framework. It has a simple and intuitive set of APIs you’ll soon use to assess SportsBuddy. Open your JupyterLab instance with the following command:

jupyter lab

Then, install DeepEval:

pip install -U deepeval

Import the evaluate function, the LLMTestCase class, and the three contextual metrics, then create an instance of each metric:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
  ContextualPrecisionMetric,
  ContextualRecallMetric,
  ContextualRelevancyMetric
)

contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
test_case = LLMTestCase(
  input="Which programmes were dropped from the 2024 Olympics?",
  actual_output=(
    "Four events were dropped from weightlifting for the 2024 Olympics. "
    "Additionally, in canoeing, two sprint events were replaced with two "
    "slalom events. The overall event total for canoeing remained at 16."
  ),
  expected_output="Four events were dropped from weightlifting.",
  retrieval_context=[
    "Four events were dropped from weightlifting."
  ]
)
evaluate(
  test_cases=[test_case],
  metrics=[contextual_precision, contextual_recall, contextual_relevancy]
)
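
If you’d rather inspect a single metric inside the notebook instead of producing a full report, each metric can also score the test case directly. This is a minimal sketch, assuming DeepEval’s measure() method and the score and reason attributes on metric objects:

# Score one metric against the test case and inspect the judge's verdict.
contextual_precision.measure(test_case)
print(contextual_precision.score)
print(contextual_precision.reason)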

Save the evaluation code as deepeval-sportsbuddy-test.py and run it:

python deepeval-sportsbuddy-test.py
======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False, 
    evaluation model: gpt-4o, reason: The score is 1.00 because the 
    context directly answers the question by stating 'Four events 
    were dropped from weightlifting.' Great job!, error: None)
  - ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False, 
    evaluation model: gpt-4o, reason: The score is 1.00 because the 
    expected output perfectly matches the content in the first node 
    of the retrieval context. Great job!, error: None)
  - ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False, 
    evaluation model: gpt-4o, reason: The score is 0.00 because the
    context only mentions 'Four events were dropped from weightlifting' 
    without specifying which programmes or providing a comprehensive 
    list of dropped programmes from the 2024 Olympics., error: None)

For test case:

  - input: Which programmes were dropped from the 2024 Olympics?
  - actual output: Four events were dropped from weightlifting for 
    the 2024 Olympics. Additionally, in canoeing, two sprint events 
    were replaced with two slalom events. The overall event total 
    for canoeing remained at 16.
  - expected output: Four events were dropped from weightlifting.
  - context: None
  - retrieval context: ['Four events were dropped from weightlifting.']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 100.00% pass rate
Contextual Recall: 100.00% pass rate
Contextual Relevancy: 0.00% pass rate

======================================================================
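
Contextual Relevancy is reported as failed because its score fell below the default threshold of 0.5 shown in the summary. The metrics are configurable; the snippet below is a sketch, assuming the threshold, model, and include_reason keyword arguments on DeepEval’s metric constructors, that tightens the relevancy check and pins the judge model:

from deepeval.metrics import ContextualRelevancyMetric

# Require a score of at least 0.7 and explicitly choose the evaluation model.
# These keyword arguments are assumptions about the installed DeepEval version;
# adjust them if your version differs.
contextual_relevancy = ContextualRelevancyMetric(
  threshold=0.7,
  model="gpt-4o",
  include_reason=True  # keep the judge's explanation in the report
)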

Next, assess the generated answer itself with the Answer Relevancy and Faithfulness metrics, reusing the same test case:

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()

evaluate(
  test_cases=[test_case],
  metrics=[answer_relevancy, faithfulness]
)
=====================================================================

Metrics Summary

  - ✅ Answer Relevancy (score: 0.6666666666666666, threshold: 0.5, 
    strict: False, evaluation model: gpt-4o, reason: The score is 0.67 
    because while the response contains relevant information, it veers 
    off-topic by discussing the overall event total for canoeing, 
    which does not directly answer the specific question about which 
    programmes were dropped from the 2024 Olympics., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation 
    model: gpt-4o, reason: The score is 1.00 because there are no 
    contradictions, indicating a perfect alignment between the actual 
    output and the retrieval context. Great job maintaining accuracy!,
    error: None)

For test case:

  - input: Which programmes were dropped from the 2024 Olympics?
  - actual output: Four events were dropped from weightlifting for 
    the 2024 Olympics. Additionally, in canoeing, two sprint events 
    were replaced with two slalom events. The overall event total 
    for canoeing remained at 16.
  - expected output: Four events were dropped from weightlifting.
  - context: None
  - retrieval context: ['Four events were dropped from weightlifting.']

======================================================================

Overall Metric Pass Rates

Answer Relevancy: 100.00% pass rate
Faithfulness: 100.00% pass rate

======================================================================
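
So far the test case has been written by hand. To assess SportsBuddy end to end, you can build an LLMTestCase from a live response instead. The sketch below assumes a rag_chain created with LangChain’s create_retrieval_chain in an earlier lesson, which returns an answer string and a context list of documents; adjust the variable name and keys to match your own pipeline:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
  ContextualPrecisionMetric,
  ContextualRecallMetric,
  ContextualRelevancyMetric,
  AnswerRelevancyMetric,
  FaithfulnessMetric
)

query = "Which programmes were dropped from the 2024 Olympics?"
# rag_chain is assumed to be the SportsBuddy retrieval chain from earlier
# lessons; create_retrieval_chain returns "answer" and "context" keys.
response = rag_chain.invoke({"input": query})

live_test_case = LLMTestCase(
  input=query,
  actual_output=response["answer"],
  expected_output="Four events were dropped from weightlifting.",
  retrieval_context=[doc.page_content for doc in response["context"]]
)

# Score retrieval and generation in a single report.
evaluate(
  test_cases=[live_test_case],
  metrics=[
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    ContextualRelevancyMetric(),
    AnswerRelevancyMetric(),
    FaithfulnessMetric()
  ]
)

Because the test case is built from live outputs, you can re-run the same report after changing SportsBuddy’s retriever, chunking, or prompt and compare scores across runs.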