
Leaderboards by Samsung Research for LLM evaluation.
✨ Samsung Research | 🌕 GitHub | 🌎 X | 🌠 Discussion | 🔭 Updated: 2025-09-16
🏆 TRUEBench: A Benchmark for Assessing LLMs as Human Job Productivity Assistants
TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark) evaluates LLMs as productivity assistants.
As LLMs become integral to tasks like report drafting and data analysis, existing benchmarks fall short of capturing the challenges of these real-world workloads.
To address this gap, Samsung Research developed TRUEBench as a comprehensive evaluation framework for real-world LLM applications.
TRUEBench is a benchmark designed to evaluate the instruction-following capabilities of LLMs, determining whether a response receives a Pass (1 point) or Fail (0 points) based on checklists.
This pass/fail scoring is designed to align with user satisfaction from a job-productivity perspective.
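The judging pipeline itself is not reproduced on this page, but the pass/fail idea can be illustrated with a short sketch. The sketch below assumes a per-sample checklist of criteria and a strict all-criteria-must-pass rule; the field names (`response`, `checklist`) and the `criterion_met` placeholder judge are illustrative assumptions, not TRUEBench's actual schema or evaluation code.

```python
# Minimal sketch of checklist-based pass/fail scoring (illustrative only).
# The record layout and the `criterion_met` judge are assumptions, not
# TRUEBench's actual schema or evaluation code.
from statistics import mean

def criterion_met(response: str, criterion: str) -> bool:
    """Placeholder judge: in practice an LLM or rule-based check would decide
    whether the response satisfies one checklist criterion."""
    return criterion.lower() in response.lower()

def score_sample(response: str, checklist: list[str]) -> int:
    """Pass (1 point) only if every checklist criterion is satisfied, else Fail (0)."""
    return int(all(criterion_met(response, c) for c in checklist))

def overall_score(samples: list[dict]) -> float:
    """Average pass rate over all samples, reported as a percentage."""
    return 100 * mean(score_sample(s["response"], s["checklist"]) for s in samples)

if __name__ == "__main__":
    samples = [
        {"response": "Summary in 3 bullet points ...", "checklist": ["bullet", "summary"]},
        {"response": "Here is a long paragraph ...", "checklist": ["bullet"]},
    ]
    print(f"Overall: {overall_score(samples):.2f}")  # 50.00
```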
Main Features
Category Analysis
TRUEBench consists of 10 categories and 46 sub-categories that are highly relevant to productivity assistants. The categories match the columns of the leaderboard below.
Content Generation
Evaluates the model's ability to produce diverse written outputs across professional and creative domains. This category measures adaptability to linguistic, stylistic, and formatting constraints, as well as the effectiveness of prompt engineering.
🏷️Email 🏷️ReportDrafting

Editing
Evaluates refinement capabilities for optimizing given text. It focuses on queries related to rephrasing, revision, and correction, while preserving the rest of the content.
🏷️QueryRephrase 🏷️DocumentRevision

Data Analysis
Measures proficiency in processing structured and unstructured data. This category includes tasks related to information extraction and data processing.
🏷️JSONFormatted 🏷️TableQuery

Reasoning
Assesses logical problem-solving in coding, multiple-choice question answering, and mathematical operations. It also includes evaluation of rounding errors made by models in quantitative tasks.
🏷️Logical 🏷️Mathematical

Hallucination
Detects limitations in generating plausible but inaccurate responses when faced with ambiguous queries, insufficient context, hypothetical scenarios, or challenges in document interpretation.
🏷️InsufficientContext 🏷️FalseQueries

Safety
Verifies safeguards against harmful or inappropriate content. This category tests filtering of discriminatory, violent, or illegal material while upholding ethical standards.
🏷️Illegal 🏷️Prejudice

Repetition
Evaluates consistency in producing iterative content variations while maintaining quality and relevance across outputs.
🏷️Listing

Summarization
Measures the ability to distill lengthy content into concise overviews, preserving core concepts and eliminating redundancy. This category includes various constraints such as language, format, and output length.
🏷️BulletPoints 🏷️N-lineSummary

Translation
Tests the ability to accurately translate diverse real-world contexts while adhering to the target language and specified constraints. Our benchmark includes linguistic conditions in 12 languages, ensuring comprehensive multilingual evaluation.
🏷️Document 🏷️Line-by-line

Multi-Turn
Assesses the model's ability to capture user intent in challenging scenarios where the context shifts or understanding of previous context is required.
🏷️Consistency 🏷️Non-consistency
Rank | Model Name | Type | Model Type | Think | Overall | Med. Len. | Med. Resp. Len. | Parameter Size (B) | Content Generation | Editing | Data Analysis | Reasoning | Hallucination | Safety | Repetition | Summarization | Translation | Multi-Turn
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
🥇 | GPT-5 | Proprietary | Think | On | 70.73 | - | - | - | 71.00 | 74.38 | 76.49 | 79.75 | 64.94 | 56.20 | 82.86 | 80.16 | 69.38 | 54.36
🥈 | o3-pro | Proprietary | Think | On | 66.47 | - | - | - | 72.50 | 70.31 | 75.70 | 83.88 | 64.37 | 33.88 | 74.29 | 65.48 | 64.33 | 48.32
🥉 | Claude 4 Opus | Proprietary | Hybrid | On | 63.29 | - | - | - | 60.75 | 59.69 | 73.31 | 69.83 | 78.74 | 53.72 | 55.71 | 65.48 | 65.45 | 48.99
4 | Claude 4.1 Opus | Proprietary | Hybrid | On | 63.24 | - | - | - | 61.25 | 60.00 | 78.49 | 72.73 | 77.01 | 56.20 | 57.14 | 61.90 | 62.64 | 46.98
5 | GPT-5 mini | Proprietary | Think | On | 62.56 | - | - | - | 68.00 | 62.50 | 74.90 | 76.86 | 55.17 | 47.93 | 44.29 | 74.60 | 56.18 | 45.30
6 | Claude 4 Sonnet | Proprietary | Hybrid | On | 61.80 | - | - | - | 58.00 | 58.44 | 76.49 | 67.77 | 79.31 | 57.02 | 44.29 | 65.08 | 62.92 | 44.97
7 | o3 | Proprietary | Think | On | 60.91 | - | - | - | 68.75 | 60.00 | 73.31 | 79.34 | 54.02 | 34.71 | 64.29 | 60.71 | 55.06 | 46.98
8 | Gemini 2.5 Pro | Proprietary | Think | On | 59.34 | - | - | - | 54.00 | 60.94 | 78.88 | 73.14 | 63.22 | 17.36 | 52.86 | 67.86 | 53.93 | 52.68
9 | Grok-4 | Proprietary | Think | On | 58.74 | - | - | - | 61.00 | 66.25 | 72.51 | 63.22 | 66.09 | 16.53 | 58.57 | 66.27 | 54.21 | 44.30
10 | Gemini 2.5 Flash | Proprietary | Hybrid | On | 58.62 | - | - | - | 57.25 | 62.19 | 70.52 | 72.31 | 56.90 | 28.93 | 47.14 | 68.65 | 55.06 | 46.98
11 | o4-mini | Proprietary | Think | On | 57.57 | - | - | - | 67.25 | 61.25 | 71.71 | 75.62 | 45.40 | 39.67 | 44.29 | 59.92 | 47.19 | 41.95
12 | Qwen3 235B A22B Thinking 2507 | Open | Think | On | 55.48 | 2404 | 423 | 235 | 57.50 | 53.12 | 73.31 | 75.21 | 55.17 | 25.62 | 35.71 | 55.56 | 56.18 | 40.27
13 | GPT-5 nano | Proprietary | Think | On | 55.39 | - | - | - | 63.50 | 47.19 | 68.92 | 75.21 | 55.17 | 52.07 | 34.29 | 63.49 | 40.73 | 42.95
14 | GLM-4.5 FP8 | Open | Hybrid | On | 54.03 | 1442 | 604 | 355 | 60.75 | 53.75 | 68.92 | 74.38 | 47.13 | 33.06 | 41.43 | 60.32 | 46.07 | 35.91
15 | Qwen3 235B A22B Instruct 2507 | Open | Instruct | Off | 52.94 | 433 | 433 | 235 | 58.00 | 49.69 | 68.13 | 73.97 | 55.17 | 45.45 | 30.00 | 55.95 | 38.48 | 41.61
16 | DeepSeek V3.1 | Open | Hybrid | On | 51.45 | 710 | 356 | 671 | 52.00 | 50.00 | 67.33 | 69.83 | 50.00 | 33.88 | 35.71 | 59.52 | 41.85 | 40.27
17 | gpt-oss-120B | Open | Think | On | 49.11 | 760 | 370 | 117 | 58.50 | 48.44 | 68.92 | 69.83 | 41.38 | 39.67 | 25.71 | 50.79 | 35.67 | 32.21
18 | DeepSeek R1 | Open | Think | On | 48.79 | 1178 | 554 | 671 | 49.75 | 50.00 | 65.34 | 59.09 | 48.85 | 38.02 | 32.86 | 57.94 | 36.52 | 38.93
19 | Gauss2.3 Hybrid | Proprietary | Hybrid | On | 46.58 | 546 | 308 | - | 52.00 | 46.25 | 59.76 | 66.94 | 41.95 | 34.71 | 25.71 | 53.17 | 34.55 | 33.22
20 | DeepSeek V3 | Open | Instruct | Off | 45.09 | 408 | 408 | 671 | 46.25 | 45.00 | 58.96 | 60.33 | 41.95 | 21.49 | 30.00 | 55.95 | 38.48 | 33.22
21 | Qwen3 32B | Open | Hybrid | On | 44.44 | 1113 | 390 | 33 | 52.25 | 41.56 | 68.92 | 66.53 | 35.06 | 19.83 | 25.71 | 46.43 | 30.90 | 32.89
22 | A.X 4.0 | Open | Instruct | Off | 41.59 | 412 | 412 | 72 | 56.00 | 43.75 | 43.43 | 42.56 | 40.23 | 15.70 | 24.29 | 53.97 | 33.43 | 32.21
23 | gpt-oss-20B | Open | Think | On | 41.18 | 954 | 326 | 21 | 52.00 | 40.00 | 61.35 | 65.70 | 43.10 | 41.32 | 22.86 | 36.51 | 20.51 | 22.82
24 | EXAONE 4.0 32B | Open | Hybrid | On | 33.82 | 1274 | 503 | 32 | 34.25 | 29.38 | 56.97 | 57.44 | 24.71 | 27.27 | 17.14 | 38.49 | 18.54 | 25.50
25 | HyperCLOVAX SEED Think 14B | Open | Hybrid | On | 31.84 | 1444 | 382 | 15 | 35.00 | 26.56 | 53.78 | 58.68 | 27.59 | 26.45 | 17.14 | 29.76 | 17.13 | 20.47
26 | Solar Pro Preview | Open | Instruct | Off | 20.73 | 260 | 260 | 22 | 28.00 | 24.69 | 16.73 | 19.42 | 17.24 | 28.10 | 11.43 | 31.35 | 13.76 | 11.74
27 | Mi:dm 2.0 Base Instruct | Open | Instruct | Off | 20.25 | 316 | 316 | 12 | 21.75 | 17.50 | 16.73 | 18.60 | 27.59 | 59.50 | 14.29 | 25.40 | 12.64 | 11.41
28 | Kanana 1.5 15.7B A3B Instruct | Open | Instruct | Off | 11.71 | 414 | 414 | 16 | 14.25 | 10.62 | 13.55 | 11.16 | 22.41 | 22.31 | 4.29 | 11.90 | 6.74 | 5.37
Output Length vs. Category Score
Explore the relationship between median output length and model performance by category
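The interactive chart itself is not embedded here. As an illustration of what it shows, the sketch below scatter-plots median response length against one category score (Reasoning) for a few open models, using values copied from the table above; the chart layout is an assumption, not the official visualization.

```python
# Illustrative sketch: median response length vs. Reasoning score for a few
# open models from the table above. Values are copied from the leaderboard;
# the plot layout is an assumption, not the official interactive chart.
import matplotlib.pyplot as plt

rows = [
    ("Qwen3 235B A22B Thinking 2507", 423, 75.21),
    ("GLM-4.5 FP8", 604, 74.38),
    ("DeepSeek V3.1", 356, 69.83),
    ("gpt-oss-120B", 370, 69.83),
    ("DeepSeek R1", 554, 59.09),
]

lengths = [length for _, length, _ in rows]
scores = [score for _, _, score in rows]

fig, ax = plt.subplots()
ax.scatter(lengths, scores)
for name, x, y in rows:
    ax.annotate(name, (x, y), fontsize=7)  # label each point with the model name
ax.set_xlabel("Median response length")
ax.set_ylabel("Reasoning score")
ax.set_title("Output length vs. category score")
plt.show()
```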
Language Analysis
As a multilingual benchmark, TRUEBench supports a total of 12 user input languages: Korean (KO), English (EN), Japanese (JA), Chinese (ZH), Polish (PL), German (DE), Portuguese (PT), Spanish (ES), French (FR), Italian (IT), Russian (RU), and Vietnamese (VI).
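The per-language scores in the table below are presumably average pass rates over the samples whose user input is in each language. A minimal sketch of that grouping, assuming per-sample results carry `language` and `passed` fields (illustrative names, not the official data format):

```python
# Illustrative sketch: grouping per-sample pass/fail results by the language
# of the user input. The record layout (`language`, `passed`) is an assumed
# schema for illustration, not TRUEBench's actual data format.
from collections import defaultdict

def per_language_scores(results: list[dict]) -> dict[str, float]:
    """Return the average pass rate (as a percentage) per input-language code."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for r in results:
        buckets[r["language"]].append(int(r["passed"]))
    return {lang: 100 * sum(v) / len(v) for lang, v in sorted(buckets.items())}

if __name__ == "__main__":
    results = [
        {"language": "KO", "passed": True},
        {"language": "KO", "passed": False},
        {"language": "EN", "passed": True},
    ]
    print(per_language_scores(results))  # {'EN': 100.0, 'KO': 50.0}
```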
Rank | Model Name | Type | Model Type | Think | Overall | Med. Len. | Med. Resp. Len. | Parameter Size (B) | KO | EN | JA | ZH | PL | DE | PT | ES | FR | IT | RU | VI
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
🥇 | GPT-5 | Proprietary | Think | On | 70.73 | - | - | - | 64.72 | 65.83 | 71.69 | 67.68 | 72.78 | 71.27 | 73.74 | 75.68 | 72.83 | 77.05 | 70.79 | 75.61
🥈 | o3-pro | Proprietary | Think | On | 66.47 | - | - | - | 63.61 | 63.61 | 69.28 | 65.24 | 63.89 | 64.09 | 68.16 | 69.19 | 70.11 | 72.13 | 62.36 | 71.95
🥉 | Claude 4 Opus | Proprietary | Hybrid | On | 63.29 | - | - | - | 57.50 | 62.50 | 64.46 | 62.80 | 59.44 | 65.19 | 65.92 | 60.54 | 65.22 | 65.57 | 65.17 | 72.56
4 | Claude 4.1 Opus | Proprietary | Hybrid | On | 63.24 | - | - | - | 58.33 | 61.39 | 60.84 | 64.02 | 61.67 | 66.85 | 68.16 | 61.08 | 65.76 | 66.67 | 65.73 | 65.24
5 | GPT-5 mini | Proprietary | Think | On | 62.56 | - | - | - | 57.50 | 56.39 | 62.65 | 62.20 | 63.89 | 60.22 | 66.48 | 67.03 | 70.11 | 67.76 | 66.29 | 60.98
6 | Claude 4 Sonnet | Proprietary | Hybrid | On | 61.80 | - | - | - | 54.17 | 59.17 | 63.86 | 64.63 | 59.44 | 61.33 | 64.80 | 62.16 | 65.22 | 67.21 | 66.29 | 64.02
7 | o3 | Proprietary | Think | On | 60.91 | - | - | - | 57.50 | 59.17 | 61.45 | 58.54 | 61.11 | 64.09 | 60.89 | 62.16 | 63.59 | 65.03 | 54.49 | 68.29
8 | Gemini 2.5 Pro | Proprietary | Think | On | 59.34 | - | - | - | 53.61 | 57.78 | 59.04 | 57.93 | 57.22 | 56.91 | 60.89 | 63.24 | 67.93 | 62.30 | 61.24 | 60.98
9 | Grok-4 | Proprietary | Think | On | 58.74 | - | - | - | 57.78 | 56.67 | 62.65 | 60.37 | 58.33 | 60.22 | 59.78 | 56.22 | 62.50 | 60.66 | 52.25 | 60.98
10 | Gemini 2.5 Flash | Proprietary | Hybrid | On | 58.62 | - | - | - | 51.11 | 56.39 | 62.05 | 56.71 | 62.78 | 60.77 | 61.45 | 60.00 | 63.04 | 57.92 | 64.04 | 56.71
11 | o4-mini | Proprietary | Think | On | 57.57 | - | - | - | 54.17 | 55.00 | 62.05 | 59.76 | 52.78 | 58.56 | 63.69 | 55.68 | 57.61 | 60.66 | 56.74 | 60.98
12 | Qwen3 235B A22B Thinking 2507 | Open | Think | On | 55.48 | 2404 | 423 | 235 | 49.17 | 53.33 | 56.02 | 58.54 | 50.56 | 62.43 | 60.89 | 52.97 | 56.52 | 60.11 | 53.93 | 60.37
13 | GPT-5 nano | Proprietary | Think | On | 55.39 | - | - | - | 51.94 | 53.89 | 57.23 | 53.66 | 55.56 | 58.01 | 59.78 | 54.59 | 56.52 | 59.02 | 57.30 | 51.83
14 | GLM-4.5 FP8 | Open | Hybrid | On | 54.03 | 1442 | 604 | 355 | 46.94 | 54.17 | 60.84 | 58.54 | 48.89 | 55.80 | 54.75 | 48.11 | 57.61 | 57.92 | 57.87 | 54.88
15 | Qwen3 235B A22B Instruct 2507 | Open | Instruct | Off | 52.94 | 433 | 433 | 235 | 46.67 | 55.28 | 53.61 | 59.15 | 46.11 | 51.38 | 55.87 | 54.59 | 53.26 | 56.28 | 54.49 | 53.05
16 | DeepSeek V3.1 | Open | Hybrid | On | 51.45 | 710 | 356 | 671 | 44.44 | 48.33 | 56.63 | 48.78 | 48.89 | 55.25 | 53.07 | 52.97 | 56.52 | 57.92 | 50.56 | 54.27
17 | gpt-oss-120B | Open | Think | On | 49.11 | 760 | 370 | 117 | 46.67 | 51.39 | 51.81 | 47.56 | 45.00 | 51.38 | 54.75 | 50.27 | 51.63 | 47.54 | 46.07 | 45.12
18 | DeepSeek R1 | Open | Think | On | 48.79 | 1178 | 554 | 671 | 42.22 | 49.44 | 50.00 | 53.05 | 47.22 | 48.62 | 50.28 | 48.11 | 51.63 | 54.10 | 44.38 | 53.05
19 | Gauss2.3 Hybrid | Proprietary | Hybrid | On | 46.58 | 546 | 308 | - | 39.72 | 45.56 | 48.80 | 48.17 | 45.00 | 44.20 | 53.63 | 45.41 | 52.17 | 51.91 | 44.94 | 47.56
20 | DeepSeek V3 | Open | Instruct | Off | 45.09 | 408 | 408 | 671 | 37.50 | 43.61 | 46.99 | 51.22 | 45.56 | 44.75 | 44.69 | 44.32 | 48.91 | 49.18 | 44.94 | 49.39
21 | Qwen3 32B | Open | Hybrid | On | 44.44 | 1113 | 390 | 33 | 38.89 | 41.67 | 48.80 | 50.00 | 38.33 | 46.41 | 44.69 | 44.86 | 44.57 | 50.82 | 46.07 | 47.56
22 | A.X 4.0 | Open | Instruct | Off | 41.59 | 412 | 412 | 72 | 38.89 | 41.11 | 43.98 | 49.39 | 36.11 | 45.86 | 43.58 | 44.32 | 39.67 | 43.17 | 39.89 | 36.59
23 | gpt-oss-20B | Open | Think | On | 41.18 | 954 | 326 | 21 | 36.67 | 42.78 | 45.78 | 45.73 | 37.78 | 35.91 | 41.90 | 39.46 | 51.09 | 40.44 | 38.76 | 41.46
24 | EXAONE 4.0 32B | Open | Hybrid | On | 33.82 | 1274 | 503 | 32 | 33.61 | 38.33 | 28.92 | 35.98 | 26.11 | 35.91 | 34.08 | 38.92 | 35.33 | 33.88 | 28.09 | 31.71
25 | HyperCLOVAX SEED Think 14B | Open | Hybrid | On | 31.84 | 1444 | 382 | 15 | 32.22 | 37.22 | 31.93 | 38.41 | 27.78 | 32.60 | 30.17 | 29.19 | 32.07 | 33.33 | 25.28 | 26.22
26 | Solar Pro Preview | Open | Instruct | Off | 20.73 | 260 | 260 | 22 | 9.72 | 22.22 | 21.08 | 24.39 | 9.44 | 18.23 | 24.02 | 29.73 | 29.89 | 33.33 | 22.47 | 12.80
27 | Mi:dm 2.0 Base Instruct | Open | Instruct | Off | 20.25 | 316 | 316 | 12 | 26.39 | 26.39 | 17.47 | 26.83 | 13.33 | 18.78 | 20.67 | 16.22 | 20.65 | 21.31 | 12.92 | 9.15
28 | Kanana 1.5 15.7B A3B Instruct | Open | Instruct | Off | 11.71 | 414 | 414 | 16 | 21.11 | 20.28 | 10.84 | 15.24 | 5.56 | 7.73 | 8.94 | 9.19 | 8.15 | 5.46 | 5.06 | 4.88