In this demo, you’ll use DeepEval, a popular open-source LLM evaluation framework. It has a simple and intuitive set of APIs you’ll soon use to assess SportsBuddy. Open your Jupyter Lab instance with the following command:
jupyter lab
Install DeepEval with:
pip install -U deepeval
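DeepEval's built-in metrics are judged by an LLM; as the results later in this lesson show, the default evaluation model is gpt-4o, so DeepEval needs an OpenAI API key. A minimal setup, assuming you export the key as an environment variable before launching Jupyter Lab (the value below is a placeholder):

```shell
# Make your OpenAI key available to DeepEval (placeholder value shown)
export OPENAI_API_KEY="<your-openai-api-key>"
```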
You’ll first test the retriever component. Create a new Python file named deepeval-sportsbuddy-test.py. Import DeepEval’s classes for contextual precision, recall, and relevancy:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)
contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
Next up is to create a test case. A DeepEval test case is an instance of LLMTestCase and contains your sample content in it. Because you’ll be evaluating SportsBuddy, base the test on its subject matter: the 2024 Olympic Games. Here, you’ll use a question and response, with the 2024 Summer Olympics Wikipedia data supplying the retrieved context relevant to the question. Back in your Python file, create the test case:
test_case = LLMTestCase(
    input="Which programmes were dropped from the 2024 Olympics?",
    actual_output=(
        "Four events were dropped from weightlifting for the 2024 Olympics. "
        "Additionally, in canoeing, two sprint events were replaced with two "
        "slalom events. The overall event total for canoeing remained at 16."
    ),
    expected_output="Four events were dropped from weightlifting.",
    retrieval_context=[
        """Four events were dropped from weightlifting."""
    ]
)
An LLMTestCase requires your query, the LLM’s output, your expected output to give DeepEval a good reference point, and a retrieval context to give DeepEval a good idea of the text or content your RAG used to produce its answer. Pretty straightforward. Edit the test file to add the three metrics for evaluation:
Return to your terminal and run the file with the following command:
python deepeval-sportsbuddy-test.py
Here’s that test’s result:
======================================================================
Metrics Summary
- ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
context directly answers the question by stating 'Four events
were dropped from weightlifting.' Great job!, error: None)
- ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
expected output perfectly matches the content in the first node
of the retrieval context. Great job!, error: None)
- ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 0.00 because the
context only mentions 'Four events were dropped from weightlifting'
without specifying which programmes or providing a comprehensive
list of dropped programmes from the 2024 Olympics., error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Contextual Precision: 100.00% pass rate
Contextual Recall: 100.00% pass rate
Contextual Relevancy: 0.00% pass rate
======================================================================
score: The overall score. It ranges from 0 to 1 and is affected by the threshold and strict parameters.
threshold: A float value that defaults to 0.5. Any score below it is a fail, and any value above it is a pass.
strict: A Boolean value that denotes a binary score. That’s a 1 for pass or 0 for fail. When set to false, the score can range between 0 and 1. It’s false by default. When true, it overrides the threshold, setting it to 1.
evaluation model: Defaults to gpt-4o. This refers to the LLM DeepEval uses to evaluate the metric. You can provide your custom LLM if you wish.
reason: A reason for the given score.
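The interaction between score, threshold, and strict boils down to a few lines of plain Python. This is a sketch of the grading rule as described above, not DeepEval’s internal code:

```python
def passes(score: float, threshold: float = 0.5, strict: bool = False) -> bool:
    # strict mode overrides the threshold, forcing it to 1:
    # only a perfect score passes.
    effective_threshold = 1.0 if strict else threshold
    return score >= effective_threshold

# The contextual relevancy result above: 0.0 against a 0.5 threshold fails.
print(passes(0.0))                # False
print(passes(1.0))                # True
print(passes(0.67, strict=True))  # False: strict demands a perfect 1.0
```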
From the results above, precision and recall look great. But contextual relevancy doesn’t. This could mean your indexed corpus doesn’t have enough content for your RAG to give a detailed response to your question, or that the question itself is poorly phrased. In this case, it might be both. The indexed content indeed has very little information about the question. And the question says “programmes” when the right terminology would be “events.” This immediately gives you a clue as to which part of your RAG could use more attention.
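For intuition on the precision score, DeepEval’s documentation describes contextual precision as a weighted cumulative precision over the ranked retrieval nodes: relevant nodes near the top of the ranking score higher. A stdlib-only sketch of that formula (an illustration, not DeepEval’s actual implementation):

```python
def contextual_precision(relevance_flags: list[int]) -> float:
    # relevance_flags: 1 if the retrieval node at that rank is relevant
    # to the expected output, else 0, in ranked order.
    total_relevant = sum(relevance_flags)
    if total_relevant == 0:
        return 0.0
    weighted_sum = 0.0
    seen_relevant = 0
    for k, flag in enumerate(relevance_flags, start=1):
        if flag:
            seen_relevant += 1
            weighted_sum += seen_relevant / k  # precision at rank k
    return weighted_sum / total_relevant

# The single retrieval node above is relevant and ranked first:
print(contextual_precision([1]))     # 1.0
# A relevant node buried below an irrelevant one is penalized:
print(contextual_precision([0, 1]))  # 0.5
```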
Now, on to more evaluation metrics. For the generation component, you’ll measure the answer relevancy and faithfulness metrics. Return to your Python file and add the following:
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
evaluate(
    test_cases=[test_case],
    metrics=[answer_relevancy, faithfulness]
)
Rerun the script and check the results:
=====================================================================
Metrics Summary
- ✅ Answer Relevancy (score: 0.6666666666666666, threshold: 0.5,
strict: False, evaluation model: gpt-4o, reason: The score is 0.67
because while the response contains relevant information, it veers
off-topic by discussing the overall event total for canoeing,
which does not directly answer the specific question about which
programmes were dropped from the 2024 Olympics., error: None)
- ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation
model: gpt-4o, reason: The score is 1.00 because there are no
contradictions, indicating a perfect alignment between the actual
output and the retrieval context. Great job maintaining accuracy!,
error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Answer Relevancy: 100.00% pass rate
Faithfulness: 100.00% pass rate
======================================================================
For answer relevancy, there are a few things to note in this last phase. DeepEval does a great job by telling you the reason for that score. It says that although the answer is okay, it introduces extra information that slightly deviates from the question. It could have scored better if it stayed on the topic at hand or, better still, left out the extra information.
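The 0.67 makes sense when you picture how DeepEval’s documentation describes answer relevancy: the output is broken into statements, each is judged relevant or not to the input, and the score is the relevant fraction. A rough reconstruction of this test case’s grading (the relevance judgments are inferred from the reason string above, not pulled from DeepEval):

```python
# Statements extracted from the actual output (paraphrased):
statements = [
    "Four events were dropped from weightlifting for the 2024 Olympics.",
    "In canoeing, two sprint events were replaced with two slalom events.",
    "The overall event total for canoeing remained at 16.",
]
# Per the reason string, the canoeing event total is judged off-topic:
relevant = [True, True, False]

score = sum(relevant) / len(statements)
print(round(score, 2))  # 0.67
```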
SportsBuddy, however, appears to be faithful, at least for this test. And these techniques are fast. You’ll have to run a good number of tests to get a good overview of the state of your RAG. Those tests might range from elaborate to simple.
Ey cezv, xia’xy zuurn iruoq zeuzg azivqdel.
This content was released on Nov 12 2024. The official support period is 6 months from this date.
Demonstrate how to evaluate a RAG app.