
Leaderboards by Samsung Research for LLM evaluation.
✨ Samsung Research | 🌕 GitHub | 🌎 X | 🌠 Discussion | 🔭 Updated: 2025-09-16
🏆 TRUEBench: A Benchmark for Assessing LLMs as Human Job Productivity Assistants
TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark) evaluates LLMs as productivity assistants.
As LLMs become integral to tasks like report drafting and data analysis, existing benchmarks fall short of capturing the challenges of these real-world workloads.
To address this gap, Samsung Research developed TRUEBench as a comprehensive evaluation framework for real-world LLM applications.
TRUEBench is a benchmark designed to evaluate the instruction-following capabilities of LLMs, determining whether a response receives a Pass (1 point) or Fail (0 points) based on checklists.
This pass/fail scoring is designed to align with user satisfaction from a job-productivity perspective.
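The judging pipeline itself is not reproduced on this page, but the pass/fail idea can be illustrated with a short sketch. The sketch below assumes a per-sample checklist of criteria and a strict all-criteria-must-pass rule; the field names (`response`, `checklist`) and the `criterion_met` placeholder judge are illustrative assumptions, not TRUEBench's actual schema or evaluation code.

```python
# Minimal sketch of checklist-based pass/fail scoring (illustrative only).
# The record layout and the `criterion_met` judge are assumptions, not
# TRUEBench's actual schema or evaluation code.
from statistics import mean

def criterion_met(response: str, criterion: str) -> bool:
    """Placeholder judge: in practice an LLM or rule-based check would decide
    whether the response satisfies one checklist criterion."""
    return criterion.lower() in response.lower()

def score_sample(response: str, checklist: list[str]) -> int:
    """Pass (1 point) only if every checklist criterion is satisfied, else Fail (0)."""
    return int(all(criterion_met(response, c) for c in checklist))

def overall_score(samples: list[dict]) -> float:
    """Average pass rate over all samples, reported as a percentage."""
    return 100 * mean(score_sample(s["response"], s["checklist"]) for s in samples)

if __name__ == "__main__":
    samples = [
        {"response": "Summary in 3 bullet points ...", "checklist": ["bullet", "summary"]},
        {"response": "Here is a long paragraph ...", "checklist": ["bullet"]},
    ]
    print(f"Overall: {overall_score(samples):.2f}")  # 50.00
```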
Main Features
Category Analysis
TRUEBench consists of 10 categories and 46 sub-categories that are highly relevant to productivity assistants. The categories match the columns of the leaderboard below.
Content Generation
Evaluates the model's ability to produce diverse written outputs across professional and creative domains. This category measures adaptability to linguistic, stylistic, and formatting constraints, as well as the effectiveness of prompt engineering.
🏷️Email 🏷️ReportDrafting

Editing
Evaluates refinement capabilities for optimizing given text. It focuses on queries related to rephrasing, revision, and correction, while preserving the rest of the content.
🏷️QueryRephrase 🏷️DocumentRevision

Data Analysis
Measures proficiency in processing structured and unstructured data. This category includes tasks related to information extraction and data processing.
🏷️JSONFormatted 🏷️TableQuery

Reasoning
Assesses logical problem-solving in coding, multiple-choice question answering, and mathematical operations. It also includes evaluation of rounding errors made by models in quantitative tasks.
🏷️Logical 🏷️Mathematical

Hallucination
Detects limitations in generating plausible but inaccurate responses when faced with ambiguous queries, insufficient context, hypothetical scenarios, or challenges in document interpretation.
🏷️InsufficientContext 🏷️FalseQueries

Safety
Verifies safeguards against harmful or inappropriate content. This category tests filtering of discriminatory, violent, or illegal material while upholding ethical standards.
🏷️Illegal 🏷️Prejudice

Repetition
Evaluates consistency in producing iterative content variations while maintaining quality and relevance across outputs.
🏷️Listing

Summarization
Measures the ability to distill lengthy content into concise overviews, preserving core concepts and eliminating redundancy. This category includes various constraints such as language, format, and output length.
🏷️BulletPoints 🏷️N-lineSummary

Translation
Tests the ability to accurately translate diverse real-world contexts while adhering to the target language and specified constraints. Our benchmark includes linguistic conditions in 12 languages, ensuring comprehensive multilingual evaluation.
🏷️Document 🏷️Line-by-line

Multi-Turn
Assesses the model's ability to capture user intent in challenging scenarios where the context shifts or understanding of previous context is required.
🏷️Consistency 🏷️Non-consistency
Rank | Model Name | Type | Model Type | Think | Overall | Med. Len. | Med. Resp. Len. | Parameter Size (B) | Content Generation | Editing | Data Analysis | Reasoning | Hallucination | Safety | Repetition | Summarization | Translation | Multi-Turn
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
🥇 | GPT-5 | Proprietary | Think | On | 70.73 | - | - | - | 71.00 | 74.38 | 76.49 | 79.75 | 64.94 | 56.20 | 82.86 | 80.16 | 69.38 | 54.36
🥈 | o3-pro | Proprietary | Think | On | 66.47 | - | - | - | 72.50 | 70.31 | 75.70 | 83.88 | 64.37 | 33.88 | 74.29 | 65.48 | 64.33 | 48.32
🥉 | Claude 4 Opus | Proprietary | Hybrid | On | 63.29 | - | - | - | 60.75 | 59.69 | 73.31 | 69.83 | 78.74 | 53.72 | 55.71 | 65.48 | 65.45 | 48.99
4 | Claude 4.1 Opus | Proprietary | Hybrid | On | 63.24 | - | - | - | 61.25 | 60.00 | 78.49 | 72.73 | 77.01 | 56.20 | 57.14 | 61.90 | 62.64 | 46.98
5 | GPT-5 mini | Proprietary | Think | On | 62.56 | - | - | - | 68.00 | 62.50 | 74.90 | 76.86 | 55.17 | 47.93 | 44.29 | 74.60 | 56.18 | 45.30
6 | Claude 4 Sonnet | Proprietary | Hybrid | On | 61.80 | - | - | - | 58.00 | 58.44 | 76.49 | 67.77 | 79.31 | 57.02 | 44.29 | 65.08 | 62.92 | 44.97
7 | o3 | Proprietary | Think | On | 60.91 | - | - | - | 68.75 | 60.00 | 73.31 | 79.34 | 54.02 | 34.71 | 64.29 | 60.71 | 55.06 | 46.98
8 | Gemini 2.5 Pro | Proprietary | Think | On | 59.34 | - | - | - | 54.00 | 60.94 | 78.88 | 73.14 | 63.22 | 17.36 | 52.86 | 67.86 | 53.93 | 52.68
9 | Grok-4 | Proprietary | Think | On | 58.74 | - | - | - | 61.00 | 66.25 | 72.51 | 63.22 | 66.09 | 16.53 | 58.57 | 66.27 | 54.21 | 44.30
10 | Gemini 2.5 Flash | Proprietary | Hybrid | On | 58.62 | - | - | - | 57.25 | 62.19 | 70.52 | 72.31 | 56.90 | 28.93 | 47.14 | 68.65 | 55.06 | 46.98
11 | o4-mini | Proprietary | Think | On | 57.57 | - | - | - | 67.25 | 61.25 | 71.71 | 75.62 | 45.40 | 39.67 | 44.29 | 59.92 | 47.19 | 41.95
12 | Qwen3 235B A22B Thinking 2507 | Open | Think | On | 55.48 | 2404 | 423 | 235 | 57.50 | 53.12 | 73.31 | 75.21 | 55.17 | 25.62 | 35.71 | 55.56 | 56.18 | 40.27
13 | GPT-5 nano | Proprietary | Think | On | 55.39 | - | - | - | 63.50 | 47.19 | 68.92 | 75.21 | 55.17 | 52.07 | 34.29 | 63.49 | 40.73 | 42.95
14 | GLM-4.5 FP8 | Open | Hybrid | On | 54.03 | 1442 | 604 | 355 | 60.75 | 53.75 | 68.92 | 74.38 | 47.13 | 33.06 | 41.43 | 60.32 | 46.07 | 35.91
15 | Qwen3 235B A22B Instruct 2507 | Open | Instruct | Off | 52.94 | 433 | 433 | 235 | 58.00 | 49.69 | 68.13 | 73.97 | 55.17 | 45.45 | 30.00 | 55.95 | 38.48 | 41.61
16 | DeepSeek V3.1 | Open | Hybrid | On | 51.45 | 710 | 356 | 671 | 52.00 | 50.00 | 67.33 | 69.83 | 50.00 | 33.88 | 35.71 | 59.52 | 41.85 | 40.27
17 | gpt-oss-120B | Open | Think | On | 49.11 | 760 | 370 | 117 | 58.50 | 48.44 | 68.92 | 69.83 | 41.38 | 39.67 | 25.71 | 50.79 | 35.67 | 32.21
18 | DeepSeek R1 | Open | Think | On | 48.79 | 1178 | 554 | 671 | 49.75 | 50.00 | 65.34 | 59.09 | 48.85 | 38.02 | 32.86 | 57.94 | 36.52 | 38.93
19 | Gauss2.3 Hybrid | Proprietary | Hybrid | On | 46.58 | 546 | 308 | - | 52.00 | 46.25 | 59.76 | 66.94 | 41.95 | 34.71 | 25.71 | 53.17 | 34.55 | 33.22
20 | DeepSeek V3 | Open | Instruct | Off | 45.09 | 408 | 408 | 671 | 46.25 | 45.00 | 58.96 | 60.33 | 41.95 | 21.49 | 30.00 | 55.95 | 38.48 | 33.22
21 | Qwen3 32B | Open | Hybrid | On | 44.44 | 1113 | 390 | 33 | 52.25 | 41.56 | 68.92 | 66.53 | 35.06 | 19.83 | 25.71 | 46.43 | 30.90 | 32.89
22 | A.X 4.0 | Open | Instruct | Off | 41.59 | 412 | 412 | 72 | 56.00 | 43.75 | 43.43 | 42.56 | 40.23 | 15.70 | 24.29 | 53.97 | 33.43 | 32.21
23 | gpt-oss-20B | Open | Think | On | 41.18 | 954 | 326 | 21 | 52.00 | 40.00 | 61.35 | 65.70 | 43.10 | 41.32 | 22.86 | 36.51 | 20.51 | 22.82
24 | EXAONE 4.0 32B | Open | Hybrid | On | 33.82 | 1274 | 503 | 32 | 34.25 | 29.38 | 56.97 | 57.44 | 24.71 | 27.27 | 17.14 | 38.49 | 18.54 | 25.50
25 | HyperCLOVAX SEED Think 14B | Open | Hybrid | On | 31.84 | 1444 | 382 | 15 | 35.00 | 26.56 | 53.78 | 58.68 | 27.59 | 26.45 | 17.14 | 29.76 | 17.13 | 20.47
26 | Solar Pro Preview | Open | Instruct | Off | 20.73 | 260 | 260 | 22 | 28.00 | 24.69 | 16.73 | 19.42 | 17.24 | 28.10 | 11.43 | 31.35 | 13.76 | 11.74
27 | Mi:dm 2.0 Base Instruct | Open | Instruct | Off | 20.25 | 316 | 316 | 12 | 21.75 | 17.50 | 16.73 | 18.60 | 27.59 | 59.50 | 14.29 | 25.40 | 12.64 | 11.41
28 | Kanana 1.5 15.7B A3B Instruct | Open | Instruct | Off | 11.71 | 414 | 414 | 16 | 14.25 | 10.62 | 13.55 | 11.16 | 22.41 | 22.31 | 4.29 | 11.90 | 6.74 | 5.37
Output Length vs. Category Score
Explore the relationship between median output length and model performance by category
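The interactive chart itself is not embedded here. As an illustration of what it shows, the sketch below scatter-plots median response length against one category score (Reasoning) for a few open models, using values copied from the table above; the chart layout is an assumption, not the official visualization.

```python
# Illustrative sketch: median response length vs. Reasoning score for a few
# open models from the table above. Values are copied from the leaderboard;
# the plot layout is an assumption, not the official interactive chart.
import matplotlib.pyplot as plt

rows = [
    ("Qwen3 235B A22B Thinking 2507", 423, 75.21),
    ("GLM-4.5 FP8", 604, 74.38),
    ("DeepSeek V3.1", 356, 69.83),
    ("gpt-oss-120B", 370, 69.83),
    ("DeepSeek R1", 554, 59.09),
]

lengths = [length for _, length, _ in rows]
scores = [score for _, _, score in rows]

fig, ax = plt.subplots()
ax.scatter(lengths, scores)
for name, x, y in rows:
    ax.annotate(name, (x, y), fontsize=7)  # label each point with the model name
ax.set_xlabel("Median response length")
ax.set_ylabel("Reasoning score")
ax.set_title("Output length vs. category score")
plt.show()
```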
Language Analysis
As a multilingual benchmark, TRUEBench supports a total of 12 user input languages: Korean (KO), English (EN), Japanese (JA), Chinese (ZH), Polish (PL), German (DE), Portuguese (PT), Spanish (ES), French (FR), Italian (IT), Russian (RU), and Vietnamese (VI).
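The per-language scores in the table below are presumably average pass rates over the samples whose user input is in each language. A minimal sketch of that grouping, assuming per-sample results carry `language` and `passed` fields (illustrative names, not the official data format):

```python
# Illustrative sketch: grouping per-sample pass/fail results by the language
# of the user input. The record layout (`language`, `passed`) is an assumed
# schema for illustration, not TRUEBench's actual data format.
from collections import defaultdict

def per_language_scores(results: list[dict]) -> dict[str, float]:
    """Return the average pass rate (as a percentage) per input-language code."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for r in results:
        buckets[r["language"]].append(int(r["passed"]))
    return {lang: 100 * sum(v) / len(v) for lang, v in sorted(buckets.items())}

if __name__ == "__main__":
    results = [
        {"language": "KO", "passed": True},
        {"language": "KO", "passed": False},
        {"language": "EN", "passed": True},
    ]
    print(per_language_scores(results))  # {'EN': 100.0, 'KO': 50.0}
```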
Rank | Model Name | Type | Model Type | Think | Overall | Med. Len. | Med. Resp. Len. | Parameter Size (B) | KO | EN | JA | ZH | PL | DE | PT | ES | FR | IT | RU | VI
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
🥇 | GPT-5 | Proprietary | Think | On | 70.73 | - | - | - | 64.72 | 65.83 | 71.69 | 67.68 | 72.78 | 71.27 | 73.74 | 75.68 | 72.83 | 77.05 | 70.79 | 75.61
🥈 | o3-pro | Proprietary | Think | On | 66.47 | - | - | - | 63.61 | 63.61 | 69.28 | 65.24 | 63.89 | 64.09 | 68.16 | 69.19 | 70.11 | 72.13 | 62.36 | 71.95
🥉 | Claude 4 Opus | Proprietary | Hybrid | On | 63.29 | - | - | - | 57.50 | 62.50 | 64.46 | 62.80 | 59.44 | 65.19 | 65.92 | 60.54 | 65.22 | 65.57 | 65.17 | 72.56
4 | Claude 4.1 Opus | Proprietary | Hybrid | On | 63.24 | - | - | - | 58.33 | 61.39 | 60.84 | 64.02 | 61.67 | 66.85 | 68.16 | 61.08 | 65.76 | 66.67 | 65.73 | 65.24
5 | GPT-5 mini | Proprietary | Think | On | 62.56 | - | - | - | 57.50 | 56.39 | 62.65 | 62.20 | 63.89 | 60.22 | 66.48 | 67.03 | 70.11 | 67.76 | 66.29 | 60.98
6 | Claude 4 Sonnet | Proprietary | Hybrid | On | 61.80 | - | - | - | 54.17 | 59.17 | 63.86 | 64.63 | 59.44 | 61.33 | 64.80 | 62.16 | 65.22 | 67.21 | 66.29 | 64.02
7 | o3 | Proprietary | Think | On | 60.91 | - | - | - | 57.50 | 59.17 | 61.45 | 58.54 | 61.11 | 64.09 | 60.89 | 62.16 | 63.59 | 65.03 | 54.49 | 68.29
8 | Gemini 2.5 Pro | Proprietary | Think | On | 59.34 | - | - | - | 53.61 | 57.78 | 59.04 | 57.93 | 57.22 | 56.91 | 60.89 | 63.24 | 67.93 | 62.30 | 61.24 | 60.98
9 | Grok-4 | Proprietary | Think | On | 58.74 | - | - | - | 57.78 | 56.67 | 62.65 | 60.37 | 58.33 | 60.22 | 59.78 | 56.22 | 62.50 | 60.66 | 52.25 | 60.98
10 | Gemini 2.5 Flash | Proprietary | Hybrid | On | 58.62 | - | - | - | 51.11 | 56.39 | 62.05 | 56.71 | 62.78 | 60.77 | 61.45 | 60.00 | 63.04 | 57.92 | 64.04 | 56.71
11 | o4-mini | Proprietary | Think | On | 57.57 | - | - | - | 54.17 | 55.00 | 62.05 | 59.76 | 52.78 | 58.56 | 63.69 | 55.68 | 57.61 | 60.66 | 56.74 | 60.98
12 | Qwen3 235B A22B Thinking 2507 | Open | Think | On | 55.48 | 2404 | 423 | 235 | 49.17 | 53.33 | 56.02 | 58.54 | 50.56 | 62.43 | 60.89 | 52.97 | 56.52 | 60.11 | 53.93 | 60.37
13 | GPT-5 nano | Proprietary | Think | On | 55.39 | - | - | - | 51.94 | 53.89 | 57.23 | 53.66 | 55.56 | 58.01 | 59.78 | 54.59 | 56.52 | 59.02 | 57.30 | 51.83
14 | GLM-4.5 FP8 | Open | Hybrid | On | 54.03 | 1442 | 604 | 355 | 46.94 | 54.17 | 60.84 | 58.54 | 48.89 | 55.80 | 54.75 | 48.11 | 57.61 | 57.92 | 57.87 | 54.88
15 | Qwen3 235B A22B Instruct 2507 | Open | Instruct | Off | 52.94 | 433 | 433 | 235 | 46.67 | 55.28 | 53.61 | 59.15 | 46.11 | 51.38 | 55.87 | 54.59 | 53.26 | 56.28 | 54.49 | 53.05
16 | DeepSeek V3.1 | Open | Hybrid | On | 51.45 | 710 | 356 | 671 | 44.44 | 48.33 | 56.63 | 48.78 | 48.89 | 55.25 | 53.07 | 52.97 | 56.52 | 57.92 | 50.56 | 54.27
17 | gpt-oss-120B | Open | Think | On | 49.11 | 760 | 370 | 117 | 46.67 | 51.39 | 51.81 | 47.56 | 45.00 | 51.38 | 54.75 | 50.27 | 51.63 | 47.54 | 46.07 | 45.12
18 | DeepSeek R1 | Open | Think | On | 48.79 | 1178 | 554 | 671 | 42.22 | 49.44 | 50.00 | 53.05 | 47.22 | 48.62 | 50.28 | 48.11 | 51.63 | 54.10 | 44.38 | 53.05
19 | Gauss2.3 Hybrid | Proprietary | Hybrid | On | 46.58 | 546 | 308 | - | 39.72 | 45.56 | 48.80 | 48.17 | 45.00 | 44.20 | 53.63 | 45.41 | 52.17 | 51.91 | 44.94 | 47.56
20 | DeepSeek V3 | Open | Instruct | Off | 45.09 | 408 | 408 | 671 | 37.50 | 43.61 | 46.99 | 51.22 | 45.56 | 44.75 | 44.69 | 44.32 | 48.91 | 49.18 | 44.94 | 49.39
21 | Qwen3 32B | Open | Hybrid | On | 44.44 | 1113 | 390 | 33 | 38.89 | 41.67 | 48.80 | 50.00 | 38.33 | 46.41 | 44.69 | 44.86 | 44.57 | 50.82 | 46.07 | 47.56
22 | A.X 4.0 | Open | Instruct | Off | 41.59 | 412 | 412 | 72 | 38.89 | 41.11 | 43.98 | 49.39 | 36.11 | 45.86 | 43.58 | 44.32 | 39.67 | 43.17 | 39.89 | 36.59
23 | gpt-oss-20B | Open | Think | On | 41.18 | 954 | 326 | 21 | 36.67 | 42.78 | 45.78 | 45.73 | 37.78 | 35.91 | 41.90 | 39.46 | 51.09 | 40.44 | 38.76 | 41.46
24 | EXAONE 4.0 32B | Open | Hybrid | On | 33.82 | 1274 | 503 | 32 | 33.61 | 38.33 | 28.92 | 35.98 | 26.11 | 35.91 | 34.08 | 38.92 | 35.33 | 33.88 | 28.09 | 31.71
25 | HyperCLOVAX SEED Think 14B | Open | Hybrid | On | 31.84 | 1444 | 382 | 15 | 32.22 | 37.22 | 31.93 | 38.41 | 27.78 | 32.60 | 30.17 | 29.19 | 32.07 | 33.33 | 25.28 | 26.22
26 | Solar Pro Preview | Open | Instruct | Off | 20.73 | 260 | 260 | 22 | 9.72 | 22.22 | 21.08 | 24.39 | 9.44 | 18.23 | 24.02 | 29.73 | 29.89 | 33.33 | 22.47 | 12.80
27 | Mi:dm 2.0 Base Instruct | Open | Instruct | Off | 20.25 | 316 | 316 | 12 | 26.39 | 26.39 | 17.47 | 26.83 | 13.33 | 18.78 | 20.67 | 16.22 | 20.65 | 21.31 | 12.92 | 9.15
28 | Kanana 1.5 15.7B A3B Instruct | Open | Instruct | Off | 11.71 | 414 | 414 | 16 | 21.11 | 20.28 | 10.84 | 15.24 | 5.56 | 7.73 | 8.94 | 9.19 | 8.15 | 5.46 | 5.06 | 4.88