In this demo, you’ll use DeepEval, a popular open-source LLM evaluation framework. It has a simple and intuitive set of APIs that you’ll soon use to assess SportsBuddy. Open your Jupyter Lab instance, then start your test script by importing DeepEval’s retriever metrics and creating an instance of each:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
Next is to create a test case. A DeepEval test case is as simple as creating an instance of LLMTestCase and passing it the relevant fields. Because you’ll be evaluating SportsBuddy, open this lesson’s starter project in Jupyter Lab. Here, you’ll see the question and response. Visit the 2024 Summer Olympics Wikipedia page to get the retrieval context relevant to the question. Back in your Python file, create the test case:
test_case = LLMTestCase(
    input="Which programmes were dropped from the 2024 Olympics?",
    # Adjacent string literals are joined into a single string.
    actual_output=(
        "Four events were dropped from weightlifting for the "
        "2024 Olympics. Additionally, in canoeing, two sprint events "
        "were replaced with two slalom events. The overall event "
        "total for canoeing remained at 16."
    ),
    expected_output="Four events were dropped from weightlifting.",
    retrieval_context=[
        """Four events were dropped from weightlifting."""
    ],
)
An LLMTestCase requires your query, the LLM’s output, your expected output so DeepEval has a good reference point, and a retrieval context so DeepEval has a good idea of the kind of context your RAG uses to produce its answer. Pretty straightforward. Then pass the test case to all three metrics for evaluation.
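The evaluation call for the three retriever metrics isn’t shown at this point. Here is a minimal sketch, assuming it mirrors the evaluate(test_cases=..., metrics=...) call used for the generator metrics later in this demo. The helper name run_retriever_evaluation is hypothetical and not part of DeepEval; it only wraps the call so the import happens when you invoke it.

```python
def run_retriever_evaluation(test_case, metrics):
    """Score one test case against the given metrics (hypothetical helper)."""
    # Imported lazily, so defining this helper doesn't require DeepEval
    # to be importable; calling it does, along with a configured judge model.
    from deepeval import evaluate

    # Same call shape as the generator-metrics example later in this demo.
    return evaluate(test_cases=[test_case], metrics=metrics)
```

In the test script, you’d call it with the objects created earlier: run_retriever_evaluation(test_case, [contextual_precision, contextual_recall, contextual_relevancy]). Calling evaluate directly with the same two arguments works just as well.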
Return to your terminal and run the file with the following command:
python deepeval-sportsbuddy-test.py
Here’s that test’s result:
======================================================================
Metrics Summary
- ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
context directly answers the question by stating 'Four events
were dropped from weightlifting.' Great job!, error: None)
- ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
expected output perfectly matches the content in the first node
of the retrieval context. Great job!, error: None)
- ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 0.00 because the
context only mentions 'Four events were dropped from weightlifting'
without specifying which programmes or providing a comprehensive
list of dropped programmes from the 2024 Olympics., error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Contextual Precision: 100.00% pass rate
Contextual Recall: 100.00% pass rate
Contextual Relevancy: 0.00% pass rate
======================================================================
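Each metric in the summary above reports a score and a threshold, and the pass rates are the share of test cases that passed. Here is a minimal sketch in plain Python, assuming a metric passes when its score is at least the threshold (0.5 in this output). The helper names passes and pass_rate are hypothetical and not part of DeepEval.

```python
# Scores copied from the metrics summary above (one test case per metric).
scores = {
    "Contextual Precision": [1.0],
    "Contextual Recall": [1.0],
    "Contextual Relevancy": [0.0],
}


def passes(score: float, threshold: float = 0.5) -> bool:
    """Assumption: a metric passes when its score is at least the threshold."""
    return score >= threshold


def pass_rate(metric_scores: list[float], threshold: float = 0.5) -> float:
    """Percentage of test cases whose score meets the threshold."""
    passed = sum(passes(s, threshold) for s in metric_scores)
    return 100.0 * passed / len(metric_scores)


for name, metric_scores in scores.items():
    print(f"{name}: {pass_rate(metric_scores):.2f}% pass rate")
```

With a single test case, each pass rate can only be 0% or 100%, which is why one failing metric shows up as a 0.00% pass rate.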
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
evaluate(
    test_cases=[test_case],
    metrics=[answer_relevancy, faithfulness],
)
Rerun the script and check the result:
=====================================================================
Metrics Summary
- ✅ Answer Relevancy (score: 0.6666666666666666, threshold: 0.5,
strict: False, evaluation model: gpt-4o, reason: The score is 0.67
because while the response contains relevant information, it veers
off-topic by discussing the overall event total for canoeing,
which does not directly answer the specific question about which
programmes were dropped from the 2024 Olympics., error: None)
- ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation
model: gpt-4o, reason: The score is 1.00 because there are no
contradictions, indicating a perfect alignment between the actual
output and the retrieval context. Great job maintaining accuracy!,
error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Answer Relevancy: 100.00% pass rate
Faithfulness: 100.00% pass rate
======================================================================
For answer relevancy, you get just about two-thirds of the full score. DeepEval does a great job by giving you the reason for that score. It says that although the answer is okay, it introduced other information that slightly digressed from the question. It would have scored better if it stayed on the topic or, better still, left out the extra information.
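The answer relevancy score of 0.67 reported above is consistent with counting statements: DeepEval’s documentation describes this metric as the number of relevant statements divided by the total number of statements in the output. Here is a rough sketch, assuming the judge split the actual output into the three statements below and flagged only the last one as irrelevant. The split and the verdicts are hard-coded guesses; in DeepEval, the evaluation model produces them.

```python
# Hypothetical statement split of the actual output, with hand-assigned verdicts.
statements = [
    ("Four events were dropped from weightlifting for the 2024 Olympics.", True),
    ("In canoeing, two sprint events were replaced with two slalom events.", True),
    ("The overall event total for canoeing remained at 16.", False),
]

# Answer relevancy = relevant statements / total statements.
relevant = sum(1 for _, is_relevant in statements if is_relevant)
answer_relevancy = relevant / len(statements)
print(answer_relevancy)
```

Two relevant statements out of three gives two-thirds, which lines up with the reason in the output: the sentence about canoeing’s overall event total doesn’t address the question.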
SportsBuddy, however, appears to be faithful, at least for that test. You’ll have to run a good number of tests to get a good overview of the state of your RAG.
Up next, you’ll learn about query analysis.
This content was released on Nov 12 2024. The official support period is 6 months from this date.
Demonstrate how to evaluate a RAG app.