Introduction

This research aimed to benchmark the performance of several large language models (LLMs) in the context of GreenM's operations. Specifically, we evaluated how effectively these models analyze client feedback on healthcare services and answer inquiries about corporate documents. The goal was to assess the adequacy, integrity, and usefulness of the models' responses and thereby support a more objective human evaluation of their performance.

Objectives

  • Evaluate the performance of selected LLMs in analyzing user feedback.
  • Assess the quality of responses generated by LLMs to inquiries about corporate documents.
  • Identify strengths and weaknesses of each model to guide future improvements.

Results

Eleven respondents took part in the benchmark study, evaluating the outputs of the GPT-4o, GPT-3.5, Claude 3 Opus, and Llama 3 8b models in a blind review, covering both the analysis of comments and the answers to questions about the corporate document. The first table below shows the total points each model received across all respondents for each benchmark, along with the price per 1,000 tokens; the price column is provided for comparison only and is not added to the overall score. The second table shows the average score per respondent (AVG PER RESP) for each model, calculated as the total score divided by the 11 respondents, out of a possible 60 points per benchmark.


MODELS          BENCH 1   BENCH 2   TOTAL   PRICE PER 1,000 TOKENS ($)
GPT-4o          462       326       788     0.01
GPT-3.5         463       501       964     0.001
Claude 3 Opus   412       560       972     0.045
Llama 3 8b      408       382       790     0.00586
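
The price column is the simple average of each model's input and output token prices (see the second footnote under the respondent table). A minimal sketch of that blending, where the input and output rates are illustrative placeholders rather than figures reported here:

```python
# Blend input and output token prices into a single per-1,000-token figure,
# as described in the pricing footnote. Rates below are illustrative placeholders.
def blended_price(input_per_1k: float, output_per_1k: float) -> float:
    """Simple average of input and output prices per 1,000 tokens."""
    return (input_per_1k + output_per_1k) / 2

# e.g. a model priced at $0.005 per 1K input tokens and $0.015 per 1K output
# tokens blends to $0.01 per 1K tokens.
print(blended_price(0.005, 0.015))  # 0.01
```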


MODELS          AVG PER RESP (TOTAL)   AVG PER RESP (BENCH 1)   AVG PER RESP (BENCH 2)
GPT-4o          71.64                  42.00                    29.64
GPT-3.5         87.64                  42.09                    45.55
Claude 3 Opus   88.36                  37.45                    50.91
Llama 3 8b      71.82                  37.09                    34.73
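
The AVG PER RESP figures above follow directly from the totals in the first table: each value is the model's total score divided by the 11 respondents. A minimal sketch of that calculation:

```python
# Average score per respondent: total points divided by the number of respondents.
N_RESPONDENTS = 11

totals = {  # (Benchmark 1 total, Benchmark 2 total) from the first table
    "GPT-4o": (462, 326),
    "GPT-3.5": (463, 501),
    "Claude 3 Opus": (412, 560),
    "Llama 3 8b": (408, 382),
}

for model, (bench1, bench2) in totals.items():
    avg_total = (bench1 + bench2) / N_RESPONDENTS
    print(f"{model}: {avg_total:.2f} total, "
          f"{bench1 / N_RESPONDENTS:.2f} bench 1, {bench2 / N_RESPONDENTS:.2f} bench 2")
```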

We also tracked the activity of the 11 respondents during the evaluation and averaged the number of points each distributed while completing the questionnaire. In the future, this will allow us to validate how conscientiously questionnaires were filled out and will help prevent careless responses from distorting the benchmark results.

RESPONDENT ID   TOTAL BENCH 1   TOTAL BENCH 2   TOTAL SUM
1               173             140             313
2               160             168             328
3               171             176             347
4               176             154             330
5               129             156             285
6               187             179             366
7               154             156             310
8               128             196             324
9               207             195             402
10              120             115             235
11              140             134             274
AVG TOTAL       159             161             319
* Note that for Benchmark #1 we used fine-tuned Llama 3 and GPT-3.5 models, while for Benchmark #2 we used the general-purpose versions.
** The price is the average of the input and output token prices.
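
The report does not yet fix a procedure for validating conscientiousness. One possible approach, sketched below under the assumption that a simple deviation threshold is sufficient, is to flag respondents whose total distributed points fall far from the group average:

```python
# Flag respondents whose distributed points deviate strongly from the group mean.
# The 25% threshold is an illustrative assumption, not part of the reported methodology.
respondent_totals = {
    1: 313, 2: 328, 3: 347, 4: 330, 5: 285, 6: 366,
    7: 310, 8: 324, 9: 402, 10: 235, 11: 274,
}

mean_total = sum(respondent_totals.values()) / len(respondent_totals)  # ~319

for respondent_id, total in respondent_totals.items():
    deviation = abs(total - mean_total) / mean_total
    if deviation > 0.25:  # more than 25% away from the average
        print(f"Respondent {respondent_id}: total {total} deviates "
              f"{deviation:.0%} from the mean ({mean_total:.0f}) - review manually")
```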

Assessment

Participants were expected to be familiar with the input data, the results from the LLMs, and the evaluation criteria outlined in our methodology. This enabled effective participation in the study and helped achieve our goal of rating the LLMs.

Benchmark #1

The questionnaire included 20 questions, each assessing how well the outputs of the four models met one of the analysis criteria. The study covered the analysis results for four pieces of feedback.

For each feedback item, participants were shown the feedback text along with the analysis results from the four selected language models. The feedback analysis included:

  • The categories used for evaluation,
  • Alerts that appeared in cases of extremely critical feedback or exceptionally positive service,
  • Suggestions for service improvement based on the evaluation,
  • The automatically generated response to the feedback content,
  • A summary providing concise information about the main content of the comment and its level of criticality.

The feedback was sourced from NHS Choices reviews posted up to 17 December 2017. This dataset is publicly available in the Harvard Dataverse.

Methodology for evaluating feedback analysis

A scoring system ranging from 0 to 3 was used to evaluate the quality of the feedback analysis on five parameters (a scoring sketch follows the list below):

  1. Categories
    • 0: Categories were missing or irrelevant.
    • 1: Categories were somewhat accurate but lacked detail or rationale.
    • 2: Categories were mostly accurate and informative but might have missed minor points.
    • 3: Categories were detailed, accurate, and provided a comprehensive rationale.
  2. Alert Reason
    • If no alert was provided, the reasonability of its absence was evaluated from 0 to 3 (0 - alert definitely should have been provided; 3 - alert absence was fully reasonable).
    • If an alert was provided:
      • 0: Alert reason was irrelevant.
      • 1: Alert reason was somewhat related but lacked clarity or completeness.
      • 2: Alert reason was mostly accurate and concise but could have been refined.
      • 3: Alert reason was precise, accurate, and succinctly captured the main issue.
  3. Suggestions
    • 0: Suggestions were missing for negative comments (they might not have been provided for positive comments) or irrelevant.
    • 1: Suggestions were partially relevant but lacked detail or completeness.
    • 2: Suggestions addressed most issues but might have missed minor points or lacked full practicality.
    • 3: Suggestions were thorough, relevant, and actionable, covering all key points.
  4. Auto-response
    • 0: Auto-response was missing or inappropriate.
    • 1: Auto-response acknowledged the issue but lacked friendliness, formality, or clarity.
    • 2: Auto-response was mostly friendly, formal, and clear but could have been improved.
    • 3: Auto-response was well-crafted, balancing friendliness and formality, and clearly addressed the user’s opinion.
  5. Summary
    • 0: Summary was missing or irrelevant.
    • 1: Summary was somewhat clear but lacked completeness or conciseness.
    • 2: Summary was mostly clear and concise but might have missed minor details.
    • 3: Summary was comprehensive, clear, concise, and included all required elements.
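
As a concrete illustration of this rubric, the sketch below sums one respondent's 0-3 ratings for a single model's analysis of a single feedback item; the example ratings are hypothetical and not taken from the study:

```python
# Per-question scoring for Benchmark #1: each model's analysis of one feedback
# item is rated 0-3 on each of the five parameters below, so a single feedback
# item contributes at most 5 * 3 = 15 points per model (4 items -> 60 points).
BENCH1_PARAMETERS = ["categories", "alert_reason", "suggestions", "auto_response", "summary"]
MAX_POINTS_PER_PARAMETER = 3

def score_feedback_analysis(ratings: dict[str, int]) -> int:
    """Sum a respondent's 0-3 ratings for one model's analysis of one feedback item."""
    for parameter in BENCH1_PARAMETERS:
        if not 0 <= ratings[parameter] <= MAX_POINTS_PER_PARAMETER:
            raise ValueError(f"{parameter} must be rated between 0 and 3")
    return sum(ratings[parameter] for parameter in BENCH1_PARAMETERS)

# Illustrative ratings for one model on one feedback item (not real study data).
example = {"categories": 3, "alert_reason": 2, "suggestions": 2, "auto_response": 3, "summary": 3}
print(score_feedback_analysis(example))  # 13 out of a possible 15
```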

Maximum Score for Benchmark #1

Each question had a maximum score of 3, and there were 20 questions. Therefore, the maximum score for each model in Benchmark #1 was:

20 questions * 3 points = 60 points.

Benchmark #2

For the second benchmark, participants were shown the answers of the four selected language models to questions about the GreenM company and staff management (regulations, procedures, benefits). The GreenM Handbook served as the informational resource for evaluating the answers.

Methodology for Evaluating Answers

For each of the five answers, a scoring system ranging from 0 to 3 was used to evaluate the quality of the answer on four parameters (an aggregation sketch follows the list below):

  1. Informativeness
    • 0: The answer did not provide useful information.
    • 1: The answer contained more non-useful information than useful information.
    • 2: The answer contained more useful information than non-useful information.
    • 3: The answer provided only useful information.
  2. Correctness
    • 0: The answer contained factual errors or incorrect information.
    • 1: The answer seemed to contain significantly more errors than correct information.
    • 2: The answer seemed to be correct, but some details were missing.
    • 3: The answer was factually correct and contained no errors.
  3. Relevance
    • 0: The model’s response did not address the question.
    • 1: The model’s answer was more irrelevant than relevant.
    • 2: The model’s answer was more relevant than irrelevant.
    • 3: The model’s answer was fully relevant to the question.
  4. Emotional Responsiveness
    • 0: The model’s response seemed ruder or contained more inappropriate reactions than expected.
    • 1: The model ignored or misinterpreted the user’s emotional state, possibly reacting inappropriately or rudely.
    • 2: The model responded with less empathy or kindness than expected but in a polite way.
    • 3: The model responded appropriately to the user’s emotional state by providing empathy or adequate emotional feedback.
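
To illustrate how these ratings roll up to the 60-point maximum, the sketch below aggregates one respondent's 0-3 ratings for a single model across the five answers; the example ratings are hypothetical and not taken from the study:

```python
# Aggregate a respondent's Benchmark #2 ratings for one model: five answers,
# each rated 0-3 on the four parameters below, for a maximum of 4 * 3 * 5 = 60 points.
BENCH2_PARAMETERS = ["informativeness", "correctness", "relevance", "emotional_responsiveness"]

def score_model_answers(ratings_per_answer: list[dict[str, int]]) -> int:
    """Sum 0-3 ratings over all answers and parameters for one model."""
    total = 0
    for ratings in ratings_per_answer:
        total += sum(ratings[parameter] for parameter in BENCH2_PARAMETERS)
    return total

# Illustrative ratings for one model across five answers (not real study data).
example_ratings = [
    {"informativeness": 3, "correctness": 3, "relevance": 3, "emotional_responsiveness": 2},
    {"informativeness": 2, "correctness": 2, "relevance": 3, "emotional_responsiveness": 3},
    {"informativeness": 3, "correctness": 1, "relevance": 2, "emotional_responsiveness": 2},
    {"informativeness": 2, "correctness": 3, "relevance": 3, "emotional_responsiveness": 3},
    {"informativeness": 3, "correctness": 3, "relevance": 2, "emotional_responsiveness": 2},
]
print(score_model_answers(example_ratings))  # 50 out of a possible 60
```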

Maximum Score for Benchmark #2

Each question had a maximum score of 3 per parameter; there were 4 parameters and 5 questions. Therefore, the maximum score for each model in Benchmark #2 was:

4 parameters * 3 points * 5 questions = 60 points.

Total Maximum Score

The total maximum score for each model, combining both benchmarks, was:

60 points (Benchmark #1) + 60 points (Benchmark #2) = 120 points.
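
For completeness, a one-line check of the arithmetic behind these maxima:

```python
# Maximum scores implied by the questionnaire structure.
BENCH1_MAX = 20 * 3                   # 20 questions x 3 points = 60
BENCH2_MAX = 4 * 3 * 5                # 4 parameters x 3 points x 5 questions = 60
TOTAL_MAX = BENCH1_MAX + BENCH2_MAX   # 120
print(BENCH1_MAX, BENCH2_MAX, TOTAL_MAX)  # 60 60 120
```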