Introduction

This research aimed to benchmark the performance of several large language models (LLMs) in the context of GreenM's operations. Specifically, we evaluated how effectively these models analyze client feedback on healthcare services and answer inquiries about corporate documents. The goal was to assess the adequacy, integrity, and usefulness of the models' responses and thereby support a more objective human evaluation of their performance.

Objectives

  • Evaluate the performance of selected LLMs in analyzing user feedback.
  • Assess the quality of responses generated by LLMs to inquiries about corporate documents.
  • Identify strengths and weaknesses of each model to guide future improvements.

Results

Eleven respondents took part in the benchmark study, evaluating the outputs of the GPT-4o, GPT-3.5, Claude 3 Opus, and Llama 3 8b models in a blind review, covering both the analysis of comments and the answers to questions about the corporate document. The first table below shows the total points each model received across all respondents for each benchmark, along with the price per 1,000 tokens; the price column is provided for comparison only and is not added to the overall score. The second table shows the average score per respondent (AVG PER RESP) for each model, calculated as the total score divided by the 11 respondents, out of a possible 60 points per benchmark.


MODELS          BENCH 1   BENCH 2   TOTAL   PRICE PER 1,000 TOKENS ($)
GPT-4o          462       326       788     0.01
GPT-3.5         463       501       964     0.001
Claude 3 Opus   412       560       972     0.045
Llama 3 8b      408       382       790     0.00586
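
The price column is the simple average of each model's input and output token prices (see the second footnote under the respondent table). A minimal sketch of that blending, where the input and output rates are illustrative placeholders rather than figures reported here:

```python
# Blend input and output token prices into a single per-1,000-token figure,
# as described in the pricing footnote. Rates below are illustrative placeholders.
def blended_price(input_per_1k: float, output_per_1k: float) -> float:
    """Simple average of input and output prices per 1,000 tokens."""
    return (input_per_1k + output_per_1k) / 2

# e.g. a model priced at $0.005 per 1K input tokens and $0.015 per 1K output
# tokens blends to $0.01 per 1K tokens.
print(blended_price(0.005, 0.015))  # 0.01
```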


MODELS          AVG PER RESP (TOTAL)   AVG PER RESP (BENCH 1)   AVG PER RESP (BENCH 2)
GPT-4o          71.64                  42.00                    29.64
GPT-3.5         87.64                  42.09                    45.55
Claude 3 Opus   88.36                  37.45                    50.91
Llama 3 8b      71.82                  37.09                    34.73
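
The AVG PER RESP figures above follow directly from the totals in the first table: each value is the model's total score divided by the 11 respondents. A minimal sketch of that calculation:

```python
# Average score per respondent: total points divided by the number of respondents.
N_RESPONDENTS = 11

totals = {  # (Benchmark 1 total, Benchmark 2 total) from the first table
    "GPT-4o": (462, 326),
    "GPT-3.5": (463, 501),
    "Claude 3 Opus": (412, 560),
    "Llama 3 8b": (408, 382),
}

for model, (bench1, bench2) in totals.items():
    avg_total = (bench1 + bench2) / N_RESPONDENTS
    print(f"{model}: {avg_total:.2f} total, "
          f"{bench1 / N_RESPONDENTS:.2f} bench 1, {bench2 / N_RESPONDENTS:.2f} bench 2")
```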

We also tracked the activity of the 11 respondents during the evaluation and averaged the number of points each distributed while completing the questionnaire. In the future, this will allow us to validate how conscientiously questionnaires were filled out and will help prevent careless responses from distorting the benchmark results.

RESPONDENT ID   TOTAL BENCH 1   TOTAL BENCH 2   TOTAL SUM
1               173             140             313
2               160             168             328
3               171             176             347
4               176             154             330
5               129             156             285
6               187             179             366
7               154             156             310
8               128             196             324
9               207             195             402
10              120             115             235
11              140             134             274
AVG TOTAL       159             161             319
* Note that for Benchmark #1 we used fine-tuned Llama 3 and GPT-3.5 models, while for Benchmark #2 we used the general-purpose versions.
** The price is the average of the input and output token prices.
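
The report does not yet fix a procedure for validating conscientiousness. One possible approach, sketched below under the assumption that a simple deviation threshold is sufficient, is to flag respondents whose total distributed points fall far from the group average:

```python
# Flag respondents whose distributed points deviate strongly from the group mean.
# The 25% threshold is an illustrative assumption, not part of the reported methodology.
respondent_totals = {
    1: 313, 2: 328, 3: 347, 4: 330, 5: 285, 6: 366,
    7: 310, 8: 324, 9: 402, 10: 235, 11: 274,
}

mean_total = sum(respondent_totals.values()) / len(respondent_totals)  # ~319

for respondent_id, total in respondent_totals.items():
    deviation = abs(total - mean_total) / mean_total
    if deviation > 0.25:  # more than 25% away from the average
        print(f"Respondent {respondent_id}: total {total} deviates "
              f"{deviation:.0%} from the mean ({mean_total:.0f}) - review manually")
```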

Assessment

Participants were expected to be familiar with the input data, the results from the LLMs, and the evaluation criteria outlined in our methodology. This enabled effective participation in the study and helped achieve our goal of rating the LLMs.

Benchmark #1

The questionnaire included 20 questions, each assessing how well the outputs of the four models met one of the analysis criteria. The study covered the analysis results for four pieces of feedback.

For each feedback item, participants were shown the feedback text along with the analysis results from the four selected language models. The feedback analysis included:

  • The categories used for evaluation,
  • Alerts that appeared in cases of extremely critical feedback or exceptionally positive service,
  • Suggestions for service improvement based on the evaluation,
  • The automatically generated response to the feedback content,
  • A summary providing concise information about the main content of the comment and its level of criticality.

The feedback was sourced from NHS Choices reviews posted up to 17 December 2017. This dataset is publicly available in the Harvard Dataverse.

Methodology for evaluating feedback analysis

A scoring system ranging from 0 to 3 was used to evaluate the quality of the feedback analysis on five parameters (a scoring sketch follows the list below):

  1. Categories
    • 0: Categories were missing or irrelevant.
    • 1: Categories were somewhat accurate but lacked detail or rationale.
    • 2: Categories were mostly accurate and informative but might have missed minor points.
    • 3: Categories were detailed, accurate, and provided a comprehensive rationale.
  2. Alert Reason
    • If no alert was provided, the reasonability of its absence was evaluated from 0 to 3 (0 - alert definitely should have been provided; 3 - alert absence was fully reasonable).
    • If an alert was provided:
      • 0: Alert reason was irrelevant.
      • 1: Alert reason was somewhat related but lacked clarity or completeness.
      • 2: Alert reason was mostly accurate and concise but could have been refined.
      • 3: Alert reason was precise, accurate, and succinctly captured the main issue.
  3. Suggestions
    • 0: Suggestions were missing for negative comments (they might not have been provided for positive comments) or irrelevant.
    • 1: Suggestions were partially relevant but lacked detail or completeness.
    • 2: Suggestions addressed most issues but might have missed minor points or lacked full practicality.
    • 3: Suggestions were thorough, relevant, and actionable, covering all key points.
  4. Auto-response
    • 0: Auto-response was missing or inappropriate.
    • 1: Auto-response acknowledged the issue but lacked friendliness, formality, or clarity.
    • 2: Auto-response was mostly friendly, formal, and clear but could have been improved.
    • 3: Auto-response was well-crafted, balancing friendliness and formality, and clearly addressed the user’s opinion.
  5. Summary
    • 0: Summary was missing or irrelevant.
    • 1: Summary was somewhat clear but lacked completeness or conciseness.
    • 2: Summary was mostly clear and concise but might have missed minor details.
    • 3: Summary was comprehensive, clear, concise, and included all required elements.
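
As a concrete illustration of this rubric, the sketch below sums one respondent's 0-3 ratings for a single model's analysis of a single feedback item; the example ratings are hypothetical and not taken from the study:

```python
# Per-question scoring for Benchmark #1: each model's analysis of one feedback
# item is rated 0-3 on each of the five parameters below, so a single feedback
# item contributes at most 5 * 3 = 15 points per model (4 items -> 60 points).
BENCH1_PARAMETERS = ["categories", "alert_reason", "suggestions", "auto_response", "summary"]
MAX_POINTS_PER_PARAMETER = 3

def score_feedback_analysis(ratings: dict[str, int]) -> int:
    """Sum a respondent's 0-3 ratings for one model's analysis of one feedback item."""
    for parameter in BENCH1_PARAMETERS:
        if not 0 <= ratings[parameter] <= MAX_POINTS_PER_PARAMETER:
            raise ValueError(f"{parameter} must be rated between 0 and 3")
    return sum(ratings[parameter] for parameter in BENCH1_PARAMETERS)

# Illustrative ratings for one model on one feedback item (not real study data).
example = {"categories": 3, "alert_reason": 2, "suggestions": 2, "auto_response": 3, "summary": 3}
print(score_feedback_analysis(example))  # 13 out of a possible 15
```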

Maximum Score for Benchmark #1

Each question had a maximum score of 3, and there were 20 questions. Therefore, the maximum score for each model in Benchmark #1 was:

20 questions * 3 points = 60 points.

Benchmark #2

For the second benchmark, participants were shown the answers of the four selected language models to questions about the GreenM company and staff management (regulations, procedures, benefits). The GreenM Handbook served as the informational resource for evaluating the answers.

Methodology for Evaluating Answers

For each of the five answers, a scoring system ranging from 0 to 3 was used to evaluate the quality of the answer on four parameters (an aggregation sketch follows the list below):

  1. Informativeness
    • 0: The answer did not provide useful information.
    • 1: The answer contained more non-useful information than useful information.
    • 2: The answer contained more useful information than non-useful information.
    • 3: The answer provided only useful information.
  2. Correctness
    • 0: The answer contained factual errors or incorrect information.
    • 1: The answer seemed to contain significantly more errors than correct information.
    • 2: The answer seemed to be correct, but some details were missing.
    • 3: The answer was factually correct and contained no errors.
  3. Relevance
    • 0: The model’s response did not address the question.
    • 1: The model’s answer was more irrelevant than relevant.
    • 2: The model’s answer was more relevant than irrelevant.
    • 3: The model’s answer was fully relevant to the question.
  4. Emotional Responsiveness
    • 0: The model’s response seemed ruder or contained more inappropriate reactions than expected.
    • 1: The model ignored or misinterpreted the user’s emotional state, possibly reacting inappropriately or rudely.
    • 2: The model responded with less empathy or kindness than expected but in a polite way.
    • 3: The model responded appropriately to the user’s emotional state by providing empathy or adequate emotional feedback.
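
To illustrate how these ratings roll up to the 60-point maximum, the sketch below aggregates one respondent's 0-3 ratings for a single model across the five answers; the example ratings are hypothetical and not taken from the study:

```python
# Aggregate a respondent's Benchmark #2 ratings for one model: five answers,
# each rated 0-3 on the four parameters below, for a maximum of 4 * 3 * 5 = 60 points.
BENCH2_PARAMETERS = ["informativeness", "correctness", "relevance", "emotional_responsiveness"]

def score_model_answers(ratings_per_answer: list[dict[str, int]]) -> int:
    """Sum 0-3 ratings over all answers and parameters for one model."""
    total = 0
    for ratings in ratings_per_answer:
        total += sum(ratings[parameter] for parameter in BENCH2_PARAMETERS)
    return total

# Illustrative ratings for one model across five answers (not real study data).
example_ratings = [
    {"informativeness": 3, "correctness": 3, "relevance": 3, "emotional_responsiveness": 2},
    {"informativeness": 2, "correctness": 2, "relevance": 3, "emotional_responsiveness": 3},
    {"informativeness": 3, "correctness": 1, "relevance": 2, "emotional_responsiveness": 2},
    {"informativeness": 2, "correctness": 3, "relevance": 3, "emotional_responsiveness": 3},
    {"informativeness": 3, "correctness": 3, "relevance": 2, "emotional_responsiveness": 2},
]
print(score_model_answers(example_ratings))  # 50 out of a possible 60
```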

Maximum Score for Benchmark #2

Each question had a maximum score of 3 per parameter; there were 4 parameters and 5 questions. Therefore, the maximum score for each model in Benchmark #2 was:

4 parameters * 3 points * 5 questions = 60 points.

Total Maximum Score

The total maximum score for each model, combining both benchmarks, was:

60 points (Benchmark #1) + 60 points (Benchmark #2) = 120 points.
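
For completeness, a one-line check of the arithmetic behind these maxima:

```python
# Maximum scores implied by the questionnaire structure.
BENCH1_MAX = 20 * 3                   # 20 questions x 3 points = 60
BENCH2_MAX = 4 * 3 * 5                # 4 parameters x 3 points x 5 questions = 60
TOTAL_MAX = BENCH1_MAX + BENCH2_MAX   # 120
print(BENCH1_MAX, BENCH2_MAX, TOTAL_MAX)  # 60 60 120
```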