Leaderboards by Samsung Research for LLM evaluation.

Samsung Research | 🌕 GitHub | 🌎 X | 🌠 Discussion | 🔭 Updated: 2025-09-16

🏆 TRUEBench: A Benchmark for Assessing LLMs as Human Job Productivity Assistants

TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark) evaluates LLMs as productivity assistants.
As LLMs become integral to tasks like report drafting and data analysis, existing benchmarks fall short of capturing real-world challenges.
To address this gap, Samsung Research developed TRUEBench as a comprehensive evaluation framework for real-world LLM applications.

TRUEBench evaluates the instruction-following capabilities of LLMs: each response is judged against a checklist and receives either a Pass (1 point) or a Fail (0 points).
This binary scoring is intended to align with user satisfaction from the perspective of job productivity.
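
The Pass/Fail scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TRUEBench implementation: it assumes a response passes only when every checklist item is satisfied, and that leaderboard scores are pass rates expressed as percentages. The `Judgment` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Hypothetical record: one model response judged against its checklist."""
    category: str
    checklist: list[bool]  # one entry per checklist item, True if satisfied


def instance_score(checklist: list[bool]) -> int:
    # Assumption: Pass (1) only if every checklist item is satisfied, else Fail (0).
    return 1 if checklist and all(checklist) else 0


def category_scores(judgments: list[Judgment]) -> dict[str, float]:
    # Assumption: a category score is the pass rate in percent.
    totals: dict[str, tuple[int, int]] = {}
    for j in judgments:
        passed, count = totals.get(j.category, (0, 0))
        totals[j.category] = (passed + instance_score(j.checklist), count + 1)
    return {c: round(100 * p / n, 2) for c, (p, n) in totals.items()}


def overall_score(judgments: list[Judgment]) -> float:
    # Assumption: the overall score is the pass rate over all instances.
    scores = [instance_score(j.checklist) for j in judgments]
    return round(100 * sum(scores) / len(scores), 2)


sample = [
    Judgment("Editing", [True, True]),
    Judgment("Editing", [True, False]),  # one unmet item fails the response
    Judgment("Safety", [True]),
]
print(category_scores(sample))
print(overall_score(sample))
```

Under these assumptions, a response that misses a single checklist item scores the same as one that misses all of them, which is what makes the metric strict.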

Main Features

📝 2,400+ Productivity-Oriented User Inputs
A large-scale collection of complex, real-world user inputs designed to reflect productivity assistant scenarios.

🌎 Multilinguality in Real Tasks
Comprehensive 12-language coverage with intra-instance multilingual instructions. The multilingual inputs were created through local research institutes.

🧩 Beyond Explicit Constraints
Human-annotated implicit requirements validated by LLMs.

🧭 Dynamic Multi-Turn Contexts
Realistic dialogue flows with evolving constraints.

📂 Dataset Sample →

Category Analysis

TRUEBench consists of 10 categories and 46 sub-categories that are highly relevant to productivity assistants.

📝 Content Generation

Evaluates the model's ability to produce diverse written outputs across professional and creative domains. This category measures adaptability to linguistic, stylistic, and formatting constraints, as well as the effectiveness of prompt engineering.

🏷️Email 🏷️ReportDrafting

✂️ Editing

Evaluates refinement capabilities for optimizing given text. It focuses on queries related to rephrasing, revision, and correction, while preserving the rest of the content.

🏷️QueryRephrase 🏷️DocumentRevision

📊 Data Analysis

Measures proficiency in processing structured and unstructured data. This category includes tasks related to information extraction and data processing.

🏷️JSONFormatted 🏷️TableQuery

🧠 Reasoning

Assesses logical problem-solving in coding, multiple-choice question answering, and mathematical operations. It also includes evaluation of rounding errors made by models in quantitative tasks.

🏷️Logical 🏷️Mathematical

🦄 Hallucination

Detects the model's tendency to generate plausible but inaccurate responses when faced with ambiguous queries, insufficient context, hypothetical scenarios, or challenges in document interpretation.

🏷️InsufficientContext 🏷️FalseQueries

🛡️ Safety

Verifies safeguards against harmful/inappropriate content. This category tests filtering of discriminatory, violent, or illegal material while upholding ethical standards.

🏷️Illegal 🏷️Prejudice

🔁 Repetition

Evaluates consistency in producing iterative content variations while maintaining quality and relevance across outputs.

🏷️Listing

📝 Summarization

Measures ability to distill lengthy content into concise overviews preserving core concepts and eliminating redundancy. This category includes various constraints such as language, format, and output length.

🏷️BulletPoints 🏷️N-lineSummary

🌐 Translation

Tests the ability to accurately translate diverse real-world contexts while adhering to target language and specified constraints. Our benchmark includes linguistic conditions in 12 languages, ensuring comprehensive multilingual evaluation.

🏷️Document 🏷️Line-by-line

💬 Multi-Turn

Assesses the model's ability to capture user intent in challenging scenarios where the context shifts or understanding of previous context is required.

🏷️Consistency 🏷️Non-consistency
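
| Rank | Model Name | Type | Model Type | Think | Overall | Med. Len. | Med. Resp. Len. | Parameter Size (B) | Content Generation | Editing | Data Analysis | Reasoning | Hallucination | Safety | Repetition | Summarization | Translation | Multi-Turn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | GPT-5 | Proprietary | Think | On | 70.73 | - | - | - | 71.00 | 74.38 | 76.49 | 79.75 | 64.94 | 56.20 | 82.86 | 80.16 | 69.38 | 54.36 |
| 🥈 | o3-pro | Proprietary | Think | On | 66.47 | - | - | - | 72.50 | 70.31 | 75.70 | 83.88 | 64.37 | 33.88 | 74.29 | 65.48 | 64.33 | 48.32 |
| 🥉 | Claude 4 Opus | Proprietary | Hybrid | On | 63.29 | - | - | - | 60.75 | 59.69 | 73.31 | 69.83 | 78.74 | 53.72 | 55.71 | 65.48 | 65.45 | 48.99 |
| 4 | Claude 4.1 Opus | Proprietary | Hybrid | On | 63.24 | - | - | - | 61.25 | 60.00 | 78.49 | 72.73 | 77.01 | 56.20 | 57.14 | 61.90 | 62.64 | 46.98 |
| 5 | GPT-5 mini | Proprietary | Think | On | 62.56 | - | - | - | 68.00 | 62.50 | 74.90 | 76.86 | 55.17 | 47.93 | 44.29 | 74.60 | 56.18 | 45.30 |
| 6 | Claude 4 Sonnet | Proprietary | Hybrid | On | 61.80 | - | - | - | 58.00 | 58.44 | 76.49 | 67.77 | 79.31 | 57.02 | 44.29 | 65.08 | 62.92 | 44.97 |
| 7 | o3 | Proprietary | Think | On | 60.91 | - | - | - | 68.75 | 60.00 | 73.31 | 79.34 | 54.02 | 34.71 | 64.29 | 60.71 | 55.06 | 46.98 |
| 8 | Gemini 2.5 Pro | Proprietary | Think | On | 59.34 | - | - | - | 54.00 | 60.94 | 78.88 | 73.14 | 63.22 | 17.36 | 52.86 | 67.86 | 53.93 | 52.68 |
| 9 | Grok-4 | Proprietary | Think | On | 58.74 | - | - | - | 61.00 | 66.25 | 72.51 | 63.22 | 66.09 | 16.53 | 58.57 | 66.27 | 54.21 | 44.30 |
| 10 | Gemini 2.5 Flash | Proprietary | Hybrid | On | 58.62 | - | - | - | 57.25 | 62.19 | 70.52 | 72.31 | 56.90 | 28.93 | 47.14 | 68.65 | 55.06 | 46.98 |
| 11 | o4-mini | Proprietary | Think | On | 57.57 | - | - | - | 67.25 | 61.25 | 71.71 | 75.62 | 45.40 | 39.67 | 44.29 | 59.92 | 47.19 | 41.95 |
| 12 | Qwen3 235B A22B Thinking 2507 | Open | Think | On | 55.48 | 2404 | 423 | 235 | 57.50 | 53.12 | 73.31 | 75.21 | 55.17 | 25.62 | 35.71 | 55.56 | 56.18 | 40.27 |
| 13 | GPT-5 nano | Proprietary | Think | On | 55.39 | - | - | - | 63.50 | 47.19 | 68.92 | 75.21 | 55.17 | 52.07 | 34.29 | 63.49 | 40.73 | 42.95 |
| 14 | GLM-4.5 FP8 | Open | Hybrid | On | 54.03 | 1442 | 604 | 355 | 60.75 | 53.75 | 68.92 | 74.38 | 47.13 | 33.06 | 41.43 | 60.32 | 46.07 | 35.91 |
| 15 | Qwen3 235B A22B Instruct 2507 | Open | Instruct | Off | 52.94 | 433 | 433 | 235 | 58.00 | 49.69 | 68.13 | 73.97 | 55.17 | 45.45 | 30.00 | 55.95 | 38.48 | 41.61 |
| 16 | DeepSeek V3.1 | Open | Hybrid | On | 51.45 | 710 | 356 | 671 | 52.00 | 50.00 | 67.33 | 69.83 | 50.00 | 33.88 | 35.71 | 59.52 | 41.85 | 40.27 |
| 17 | gpt-oss-120B | Open | Think | On | 49.11 | 760 | 370 | 117 | 58.50 | 48.44 | 68.92 | 69.83 | 41.38 | 39.67 | 25.71 | 50.79 | 35.67 | 32.21 |
| 18 | DeepSeek R1 | Open | Think | On | 48.79 | 1178 | 554 | 671 | 49.75 | 50.00 | 65.34 | 59.09 | 48.85 | 38.02 | 32.86 | 57.94 | 36.52 | 38.93 |
| 19 | Gauss2.3 Hybrid | Proprietary | Hybrid | On | 46.58 | 546 | 308 | - | 52.00 | 46.25 | 59.76 | 66.94 | 41.95 | 34.71 | 25.71 | 53.17 | 34.55 | 33.22 |
| 20 | DeepSeek V3 | Open | Instruct | Off | 45.09 | 408 | 408 | 671 | 46.25 | 45.00 | 58.96 | 60.33 | 41.95 | 21.49 | 30.00 | 55.95 | 38.48 | 33.22 |
| 21 | Qwen3 32B | Open | Hybrid | On | 44.44 | 1113 | 390 | 33 | 52.25 | 41.56 | 68.92 | 66.53 | 35.06 | 19.83 | 25.71 | 46.43 | 30.90 | 32.89 |
| 22 | A.X 4.0 | Open | Instruct | Off | 41.59 | 412 | 412 | 72 | 56.00 | 43.75 | 43.43 | 42.56 | 40.23 | 15.70 | 24.29 | 53.97 | 33.43 | 32.21 |
| 23 | gpt-oss-20B | Open | Think | On | 41.18 | 954 | 326 | 21 | 52.00 | 40.00 | 61.35 | 65.70 | 43.10 | 41.32 | 22.86 | 36.51 | 20.51 | 22.82 |
| 24 | EXAONE 4.0 32B | Open | Hybrid | On | 33.82 | 1274 | 503 | 32 | 34.25 | 29.38 | 56.97 | 57.44 | 24.71 | 27.27 | 17.14 | 38.49 | 18.54 | 25.50 |
| 25 | HyperCLOVAX SEED Think 14B | Open | Hybrid | On | 31.84 | 1444 | 382 | 15 | 35.00 | 26.56 | 53.78 | 58.68 | 27.59 | 26.45 | 17.14 | 29.76 | 17.13 | 20.47 |
| 26 | Solar Pro Preview | Open | Instruct | Off | 20.73 | 260 | 260 | 22 | 28.00 | 24.69 | 16.73 | 19.42 | 17.24 | 28.10 | 11.43 | 31.35 | 13.76 | 11.74 |
| 27 | Mi:dm 2.0 Base Instruct | Open | Instruct | Off | 20.25 | 316 | 316 | 12 | 21.75 | 17.50 | 16.73 | 18.60 | 27.59 | 59.50 | 14.29 | 25.40 | 12.64 | 11.41 |
| 28 | Kanana 1.5 15.7B A3B Instruct | Open | Instruct | Off | 11.71 | 414 | 414 | 16 | 14.25 | 10.62 | 13.55 | 11.16 | 22.41 | 22.31 | 4.29 | 11.90 | 6.74 | 5.37 |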

Output Length vs. Category Score

Explore the relationship between median output length and model performance by category
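
As a rough illustration of this kind of analysis, the sketch below computes a Pearson correlation between median response length and overall score for a handful of open models. The rows are transcribed from the leaderboard table above and are only a small sample, so the result says nothing conclusive about the full benchmark. The `pearson` helper is a hypothetical name written for this example.

```python
import math

# (model, median response length, overall score), transcribed from the leaderboard.
rows = [
    ("Qwen3 235B A22B Thinking 2507", 423, 55.48),
    ("GLM-4.5 FP8", 604, 54.03),
    ("DeepSeek V3.1", 356, 51.45),
    ("gpt-oss-120B", 370, 49.11),
    ("gpt-oss-20B", 326, 41.18),
    ("Solar Pro Preview", 260, 20.73),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


lengths = [length for _, length, _ in rows]
scores = [score for _, _, score in rows]
r = pearson(lengths, scores)
print(f"Pearson r between median response length and overall score: {r:.2f}")
```

Swapping the overall score for a single category column gives the per-category view that this section describes.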

Language Analysis

As a multilingual benchmark, TRUEBench supports a total of 12 user input languages: Korean (KO), English (EN), Japanese (JA), Chinese (ZH), Polish (PL), German (DE), Portuguese (PT), Spanish (ES), French (FR), Italian (IT), Russian (RU), and Vietnamese (VI).

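| Rank | Model Name | Type | Model Type | Think | Overall | Med. Len. | Med. Resp. Len. | Parameter Size (B) | KO | EN | JA | ZH | PL | DE | PT | ES | FR | IT | RU | VI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | GPT-5 | Proprietary | Think | On | 70.73 | - | - | - | 64.72 | 65.83 | 71.69 | 67.68 | 72.78 | 71.27 | 73.74 | 75.68 | 72.83 | 77.05 | 70.79 | 75.61 |
| 🥈 | o3-pro | Proprietary | Think | On | 66.47 | - | - | - | 63.61 | 63.61 | 69.28 | 65.24 | 63.89 | 64.09 | 68.16 | 69.19 | 70.11 | 72.13 | 62.36 | 71.95 |
| 🥉 | Claude 4 Opus | Proprietary | Hybrid | On | 63.29 | - | - | - | 57.50 | 62.50 | 64.46 | 62.80 | 59.44 | 65.19 | 65.92 | 60.54 | 65.22 | 65.57 | 65.17 | 72.56 |
| 4 | Claude 4.1 Opus | Proprietary | Hybrid | On | 63.24 | - | - | - | 58.33 | 61.39 | 60.84 | 64.02 | 61.67 | 66.85 | 68.16 | 61.08 | 65.76 | 66.67 | 65.73 | 65.24 |
| 5 | GPT-5 mini | Proprietary | Think | On | 62.56 | - | - | - | 57.50 | 56.39 | 62.65 | 62.20 | 63.89 | 60.22 | 66.48 | 67.03 | 70.11 | 67.76 | 66.29 | 60.98 |
| 6 | Claude 4 Sonnet | Proprietary | Hybrid | On | 61.80 | - | - | - | 54.17 | 59.17 | 63.86 | 64.63 | 59.44 | 61.33 | 64.80 | 62.16 | 65.22 | 67.21 | 66.29 | 64.02 |
| 7 | o3 | Proprietary | Think | On | 60.91 | - | - | - | 57.50 | 59.17 | 61.45 | 58.54 | 61.11 | 64.09 | 60.89 | 62.16 | 63.59 | 65.03 | 54.49 | 68.29 |
| 8 | Gemini 2.5 Pro | Proprietary | Think | On | 59.34 | - | - | - | 53.61 | 57.78 | 59.04 | 57.93 | 57.22 | 56.91 | 60.89 | 63.24 | 67.93 | 62.30 | 61.24 | 60.98 |
| 9 | Grok-4 | Proprietary | Think | On | 58.74 | - | - | - | 57.78 | 56.67 | 62.65 | 60.37 | 58.33 | 60.22 | 59.78 | 56.22 | 62.50 | 60.66 | 52.25 | 60.98 |
| 10 | Gemini 2.5 Flash | Proprietary | Hybrid | On | 58.62 | - | - | - | 51.11 | 56.39 | 62.05 | 56.71 | 62.78 | 60.77 | 61.45 | 60.00 | 63.04 | 57.92 | 64.04 | 56.71 |
| 11 | o4-mini | Proprietary | Think | On | 57.57 | - | - | - | 54.17 | 55.00 | 62.05 | 59.76 | 52.78 | 58.56 | 63.69 | 55.68 | 57.61 | 60.66 | 56.74 | 60.98 |
| 12 | Qwen3 235B A22B Thinking 2507 | Open | Think | On | 55.48 | 2404 | 423 | 235 | 49.17 | 53.33 | 56.02 | 58.54 | 50.56 | 62.43 | 60.89 | 52.97 | 56.52 | 60.11 | 53.93 | 60.37 |
| 13 | GPT-5 nano | Proprietary | Think | On | 55.39 | - | - | - | 51.94 | 53.89 | 57.23 | 53.66 | 55.56 | 58.01 | 59.78 | 54.59 | 56.52 | 59.02 | 57.30 | 51.83 |
| 14 | GLM-4.5 FP8 | Open | Hybrid | On | 54.03 | 1442 | 604 | 355 | 46.94 | 54.17 | 60.84 | 58.54 | 48.89 | 55.80 | 54.75 | 48.11 | 57.61 | 57.92 | 57.87 | 54.88 |
| 15 | Qwen3 235B A22B Instruct 2507 | Open | Instruct | Off | 52.94 | 433 | 433 | 235 | 46.67 | 55.28 | 53.61 | 59.15 | 46.11 | 51.38 | 55.87 | 54.59 | 53.26 | 56.28 | 54.49 | 53.05 |
| 16 | DeepSeek V3.1 | Open | Hybrid | On | 51.45 | 710 | 356 | 671 | 44.44 | 48.33 | 56.63 | 48.78 | 48.89 | 55.25 | 53.07 | 52.97 | 56.52 | 57.92 | 50.56 | 54.27 |
| 17 | gpt-oss-120B | Open | Think | On | 49.11 | 760 | 370 | 117 | 46.67 | 51.39 | 51.81 | 47.56 | 45.00 | 51.38 | 54.75 | 50.27 | 51.63 | 47.54 | 46.07 | 45.12 |
| 18 | DeepSeek R1 | Open | Think | On | 48.79 | 1178 | 554 | 671 | 42.22 | 49.44 | 50.00 | 53.05 | 47.22 | 48.62 | 50.28 | 48.11 | 51.63 | 54.10 | 44.38 | 53.05 |
| 19 | Gauss2.3 Hybrid | Proprietary | Hybrid | On | 46.58 | 546 | 308 | - | 39.72 | 45.56 | 48.80 | 48.17 | 45.00 | 44.20 | 53.63 | 45.41 | 52.17 | 51.91 | 44.94 | 47.56 |
| 20 | DeepSeek V3 | Open | Instruct | Off | 45.09 | 408 | 408 | 671 | 37.50 | 43.61 | 46.99 | 51.22 | 45.56 | 44.75 | 44.69 | 44.32 | 48.91 | 49.18 | 44.94 | 49.39 |
| 21 | Qwen3 32B | Open | Hybrid | On | 44.44 | 1113 | 390 | 33 | 38.89 | 41.67 | 48.80 | 50.00 | 38.33 | 46.41 | 44.69 | 44.86 | 44.57 | 50.82 | 46.07 | 47.56 |
| 22 | A.X 4.0 | Open | Instruct | Off | 41.59 | 412 | 412 | 72 | 38.89 | 41.11 | 43.98 | 49.39 | 36.11 | 45.86 | 43.58 | 44.32 | 39.67 | 43.17 | 39.89 | 36.59 |
| 23 | gpt-oss-20B | Open | Think | On | 41.18 | 954 | 326 | 21 | 36.67 | 42.78 | 45.78 | 45.73 | 37.78 | 35.91 | 41.90 | 39.46 | 51.09 | 40.44 | 38.76 | 41.46 |
| 24 | EXAONE 4.0 32B | Open | Hybrid | On | 33.82 | 1274 | 503 | 32 | 33.61 | 38.33 | 28.92 | 35.98 | 26.11 | 35.91 | 34.08 | 38.92 | 35.33 | 33.88 | 28.09 | 31.71 |
| 25 | HyperCLOVAX SEED Think 14B | Open | Hybrid | On | 31.84 | 1444 | 382 | 15 | 32.22 | 37.22 | 31.93 | 38.41 | 27.78 | 32.60 | 30.17 | 29.19 | 32.07 | 33.33 | 25.28 | 26.22 |
| 26 | Solar Pro Preview | Open | Instruct | Off | 20.73 | 260 | 260 | 22 | 9.72 | 22.22 | 21.08 | 24.39 | 9.44 | 18.23 | 24.02 | 29.73 | 29.89 | 33.33 | 22.47 | 12.80 |
| 27 | Mi:dm 2.0 Base Instruct | Open | Instruct | Off | 20.25 | 316 | 316 | 12 | 26.39 | 26.39 | 17.47 | 26.83 | 13.33 | 18.78 | 20.67 | 16.22 | 20.65 | 21.31 | 12.92 | 9.15 |
| 28 | Kanana 1.5 15.7B A3B Instruct | Open | Instruct | Off | 11.71 | 414 | 414 | 16 | 21.11 | 20.28 | 10.84 | 15.24 | 5.56 | 7.73 | 8.94 | 9.19 | 8.15 | 5.46 | 5.06 | 4.88 |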