English WSD benchmark

SenseBench Leaderboard

SenseBench measures how well language models disambiguate English words: each model sees a word in its sentence context together with its candidate WordNet senses and must answer with the index of the correct sense. Every row is recomputed from verified, fully auditable run artifacts on the lexEN dataset, and anyone can submit a run by pull request.

Verified Runs: 184
Models: 56
Top Accuracy: 95.60%
Dataset: lexen-v1

The Score against and Sense granularity controls re-score the entire leaderboard (table, chart, Pareto frontier, ranks, and pairwise tests) against the chosen gold labels and sense granularity. The default is lexEN v1 · WordNet fine-grained.

Reference Baselines

System | Accuracy | Dataset | Provenance
MFS (WordNet first sense) | 61.55% ±1.39% | lexen-v1 | Computed at build time. Most frequent sense baseline: WordNet 3.0's first (frequency-ranked) sense for the target lemma and part of speech, computed directly on the dataset items.
BEM | 79.65% ±1.13% | lexen-v1 | Published predictions. Bi-Encoder Model (Blevins & Zettlemoyer 2020); per-item predictions released by Maru et al. 2022, scored on this dataset's items.
ESCHER | 81.42% ±1.08% | lexen-v1 | Reproduced predictions. ESCHER (Barba et al. 2021; SemCor training); predictions reproduced by Glite, scored on this dataset's items.
ConSeC | 84.88% ±0.99% | lexen-v1 | Reproduced predictions. ConSeC (Barba et al. 2021); predictions reproduced by Glite (SemCor + WordNet Gloss+Examples training, 82.9 F1 on Raganato ALL), scored on this dataset's items.

The baselines are classic WSD systems, scored from per-item predictions on exactly the same dataset items as the model runs and with the same correctness rule. They appear as dashed lines on the chart.

Rank | Provider · License · Quantization · Hardware | Accuracy | ± (95% CI) | Cost | Prompt
1 | OpenAI · Proprietary | 95.60% | ±0.59% | $10,700 | p001
2 | OpenAI · Proprietary | 95.25% | ±0.60% | $6,077 | p001
3 | OpenAI · Proprietary | 95.19% | ±0.60% | $7,718 | p001
4 | OpenAI · Proprietary | 95.15% | ±0.62% | $6,233 | p004
5 | OpenAI · Proprietary | 95.15% | ±0.60% | $11,301 | p004
6 | OpenAI · Proprietary | 95.10% | ±0.62% | $6,289 | p003
7 | OpenAI · Proprietary | 95.00% | ±0.61% | $3,516 | p003
8 | OpenAI · Proprietary | 95.00% | ±0.62% | $5,040 | p001
9 | OpenAI · Proprietary | 94.98% | ±0.63% | $4,562 | p003
10 | Google · Proprietary | 94.92% | ±0.61% | $7,227 | p001
11 | Google · Proprietary | 94.61% | ±0.63% | $3,691 | p001
12 | Anthropic · Proprietary | 94.57% | ±0.64% | $7,285 | p001
13 | Google · Proprietary | 94.53% | ±0.64% | $7,586 | p004
14 | Google · Proprietary | 94.43% | ±0.63% | $4,711 | p001
15 | Google · Proprietary | 94.40% | ±0.65% | $3,081 | p003
16 | OpenAI · Proprietary | 94.22% | ±0.67% | $5,625 | p002
17 | Google · Proprietary | 94.20% | ±0.64% | $5,557 | p004
18 | Anthropic · Proprietary | 94.16% | ±0.65% | $5,243 | p001
19 | Google · Proprietary | 94.16% | ±0.63% | $5,408 | p001
20 | OpenAI · Proprietary | 94.03% | ±0.67% | $3,790 | p002
21 | Anthropic · Proprietary | 93.85% | ±0.67% | $4,834 | p001
22 | Anthropic · Proprietary | 93.81% | ±0.66% | $7,655 | p004
23 | Google · Open weights · fp8 · H100 80GB | 93.73% | ±0.68% | $327 | p003
24 | Anthropic · Proprietary | 93.68% | ±0.66% | $4,850 | p001
25 | Anthropic · Proprietary | 93.62% | ±0.68% | $4,994 | p001
26 | Google · Open weights · fp8 · H100 80GB | 93.44% | ±0.69% | $390 | p004
27 | Google · Open weights · fp8 · H100 80GB | 93.38% | ±0.69% | $368 | p001
28 | Z.ai · Open weights | 93.38% | ±0.70% | $3,333 | p001
29 | Moonshot · Open weights | 93.31% | ±0.69% | $2,565 | p001
30 | xAI · Proprietary | 93.13% | ±0.71% | $2,583 | p001
31 | xAI · Proprietary | 93.09% | ±0.69% | $1,763 | p001
32 | Anthropic · Proprietary | 93.01% | ±0.73% | $4,894 | p001
33 | OpenAI · Proprietary | 92.88% | ±0.73% | $2,260 | p003
34 | xAI · Proprietary | 92.88% | ±0.71% | $2,816 | p001
35 | Alibaba · Open weights | 92.82% | ±0.74% | $2,152 | p002
36 | Alibaba · Open weights | 92.70% | ±0.73% | $794 | p001
37 | Anthropic · Proprietary | 92.70% | ±0.74% | $3,933 | p001
38 | Moonshot · Open weights | 92.66% | ±0.74% | $3,091 | p001
39 | Google · Proprietary | 92.57% | ±0.73% | $847 | p002
40 | DeepSeek · Open weights | 92.41% | ±0.75% | $1,984 | p001
41 | Google · Open weights · fp8 · H100 80GB | 92.39% | ±0.74% | $35.44 | p001
42 | Moonshot · Open weights | 92.39% | ±0.76% | $2,615 | p002
43 | Google · Open weights · fp8 · H100 80GB | 92.29% | ±0.75% | $20.36 | p003
44 | Z.ai · Open weights | 92.22% | ±0.75% | $3,246 | p002
45 | Anthropic · Proprietary | 92.22% | ±0.74% | $3,338 | p003
46 | Anthropic · Proprietary | 91.79% | ±0.74% | $2,121 | p002
47 | DeepSeek · Open weights | 91.50% | ±0.77% | $1,338 | p002
48 | Alibaba · Open weights | 91.46% | ±0.78% | $659 | p002
49 | OpenAI · Proprietary | 91.22% | ±0.82% | $690 | p003
50 | Google · Open weights · fp8 · H100 80GB | 91.20% | ±0.79% | $11.49 | p002
51 | OpenAI · Proprietary | 90.99% | ±0.82% | $386 | p003
52 | DeepSeek · Open weights | 90.93% | ±0.83% | $132 | p001
53 | OpenAI · Proprietary | 90.89% | ±0.81% | $512 | p003
54 | OpenAI · Proprietary | 90.72% | ±0.81% | $803 | p001
55 | Anthropic · Proprietary | 90.72% | ±0.84% | $2,141 | p001
56 | MiniMax · Open weights | 90.62% | ±0.81% | $545 | p001
57 | OpenAI · Proprietary | 90.54% | ±0.81% | $480 | p001
58 | OpenAI · Proprietary | 90.48% | ±0.83% | $217 | p003
59 | Anthropic · Proprietary | 90.48% | ±0.81% | $3,348 | p001
60 | Google · Proprietary | 90.41% | ±0.83% | $171 | p001
61 | OpenAI · Proprietary | 90.10% | ±0.84% | $685 | p001
62 | OpenAI · Proprietary | 90.04% | ±0.83% | $363 | p002
63 | OpenAI · Proprietary | 89.94% | ±0.84% | $306 | p001
64 | DeepSeek · Open weights | 89.69% | ±0.85% | $88.64 | p002
65 | Google · Proprietary | 89.61% | ±0.82% | $674 | p002
66 | Alibaba · Open weights · fp8 · H200 141GB | 89.51% | ±0.85% | $39.99 | p001
67 | OpenAI · Proprietary | 89.49% | ±0.88% | $374 | p002
68 | Google · Open weights · fp8 · A100 80GB | 89.43% | ±0.88% | $13.08 | p001
69 | Google · Proprietary | 89.41% | ±0.86% | $2,529 | p001
70 | Alibaba · Open weights · fp8 · H100 80GB | 89.36% | ±0.86% | $25.80 | p001
71 | Google · Open weights · fp8 · H200 141GB | 89.32% | ±0.87% | $15.25 | p001
72 | Alibaba · Open weights · gptq-int4 · B300 SXM6 AC | 89.24% | ±0.87% | $157 | p001
73 | Google · Open weights · fp8 · H100 80GB | 89.20% | ±0.87% | $9.29 | p001
74 | OpenAI · Proprietary | 89.14% | ±0.88% | $866 | p003
75 | Google · Open weights · fp8 · H100 80GB | 89.12% | ±0.88% | $5.20 | p003
76 | Google · Proprietary | 89.08% | ±0.88% | $59.42 | p002
77 | Alibaba · Open weights · bf16 · A100 80GB | 89.06% | ±0.86% | $48.37 | p001
78 | Alibaba · Open weights · bf16 · H200 141GB | 88.97% | ±0.87% | $60.98 | p001
79 | Alibaba · Open weights · bf16 · H100 80GB | 88.89% | ±0.87% | $36.96 | p001
80 | Anthropic · Proprietary | 88.89% | ±0.88% | $809 | p002
81 | Z.ai · Open weights · awq-int4 · B300 SXM6 AC | 88.71% | ±0.89% | $222 | p001
82 | OpenAI · Proprietary | 88.69% | ±0.87% | $1,281 | p001
83 | OpenAI · Proprietary | 87.86% | ±0.92% | $465 | p002
84 | Anthropic · Proprietary | 87.60% | ±0.92% | $1,997 | p002
85 | Meta · Open weights · awq-int4 · B300 SXM6 AC | 87.55% | ±0.94% | $1,062 | p001
86 | Google · Open weights · fp8 · H100 80GB | 87.25% | ±0.95% | $2.63 | p002
87 | Google · Open weights · fp8 · H200 141GB | 87.25% | ±0.96% | $6.95 | p002
88 | Alibaba · Open weights · fp8 · H100 80GB | 87.20% | ±0.96% | $9.56 | p002
89 | Alibaba · Open weights · fp8 · H200 141GB | 87.18% | ±0.96% | $15.27 | p002
90 | Z.ai · Open weights · awq-int4 · B300 SXM6 AC | 87.12% | ±0.95% | $89.29 | p002
91 | Google · Open weights · fp8 · A100 80GB | 87.10% | ±0.99% | $4.27 | p002
92 | Alibaba · Open weights · fp8 · H200 141GB | 87.06% | ±0.95% | $44.35 | p001
93 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 87.02% | ±0.96% | $122 | p001
94 | Alibaba · Open weights · gptq-int4 · B300 SXM6 AC | 86.69% | ±0.95% | $61.20 | p002
95 | OpenAI · Proprietary | 86.63% | ±0.92% | $145 | p003
96 | Alibaba · Open weights · awq-int4 · H200 141GB | 86.59% | ±0.99% | $92.19 | p001
97 | Alibaba · Open weights · bf16 · H100 80GB | 86.55% | ±0.98% | $13.67 | p002
98 | Alibaba · Open weights · bf16 · H200 141GB | 86.55% | ±0.98% | $22.97 | p002
99 | OpenAI · Proprietary | 86.48% | ±0.95% | $209 | p003
100 | Alibaba · Open weights · bf16 · A100 80GB | 86.46% | ±0.97% | $17.78 | p002
101 | OpenAI · Proprietary | 86.26% | ±0.96% | $221 | p001
102 | Google · Open weights · fp8 · H100 80GB | 85.58% | ±1.01% | $23.34 | p003
103 | Alibaba · Open weights · fp8 · H100 80GB | 85.56% | ±1.00% | $8.77 | p001
104 | Alibaba · Open weights · fp8 · H200 141GB | 85.50% | ±1.00% | $14.52 | p001
105 | OpenAI · Proprietary | 85.50% | ±1.00% | $209 | p002
106 | Alibaba · Open weights · fp8 · H200 141GB | 85.48% | ±0.99% | $19.64 | p002
107 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 85.29% | ±0.97% | $35.77 | p002
108 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 85.27% | ±1.00% | $56.86 | p001
109 | Cohere · Open weights · fp8 · H200 141GB | 85.23% | ±1.01% | $137 | p001
110 | Google · Open weights · fp8 · H100 80GB | 85.19% | ±1.00% | $55.12 | p001
111 | OpenAI · Proprietary | 85.00% | ±1.01% | $185 | p001
112 | Alibaba · Open weights · awq-int4 · A100 80GB | 84.84% | ±1.00% | $53.98 | p002
113 | Alibaba · Open weights · awq-int4 · H100 80GB | 84.82% | ±1.02% | $33.17 | p002
114 | Alibaba · Open weights · awq-int4 · H200 141GB | 84.82% | ±1.02% | $54.97 | p002
115 | Alibaba · Open weights · awq-int4 · H200 141GB | 84.74% | ±1.01% | $36.16 | p002
116 | Alibaba · Open weights · bf16 · H200 141GB | 84.45% | ±1.02% | $18.11 | p001
117 | OpenAI · Proprietary | 84.32% | ±1.02% | $982 | p001
118 | Alibaba · Open weights · awq-int4 · H200 141GB | 84.08% | ±1.03% | $93.38 | p001
119 | NVIDIA · Open weights · fp8 · H200 141GB | 83.93% | ±1.02% | $21.54 | p003
120 | OpenAI · Proprietary | 83.93% | ±0.99% | $105 | p002
121 | OpenAI · Proprietary | 83.77% | ±1.04% | $127 | p003
122 | OpenAI · Proprietary | 83.69% | ±1.07% | $256 | p001
123 | OpenAI · Proprietary | 83.56% | ±1.02% | $63.56 | p003
124 | OpenAI · Proprietary | 83.56% | ±1.06% | $400 | p001
125 | Google · Open weights · fp8 · H200 141GB | 83.44% | ±1.02% | $12.11 | p001
126 | Alibaba · Open weights · awq-int4 · A100 80GB | 83.44% | ±1.04% | $138 | p001
127 | Alibaba · Open weights · awq-int4 · H200 141GB | 83.42% | ±1.03% | $143 | p001
128 | Alibaba · Open weights · awq-int4 · H100 80GB | 83.40% | ±1.03% | $84.77 | p001
129 | Google · Open weights · fp8 · H100 80GB | 83.30% | ±1.02% | $19.61 | p002
130 | OpenAI · Proprietary | 82.90% | ±1.05% | $35.26 | p002
131 | Alibaba · Open weights · fp8 · H100 80GB | 82.72% | ±1.04% | $2.97 | p002
132 | Alibaba · Open weights · fp8 · H200 141GB | 82.70% | ±1.03% | $8.07 | p002
133 | OpenAI · Proprietary | 82.70% | ±1.07% | $96.14 | p001
134 | NVIDIA · Open weights · fp8 · H200 141GB | 82.35% | ±1.06% | $16.69 | p002
135 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 82.27% | ±1.08% | $23.12 | p002
136 | Alibaba · Open weights · bf16 · H200 141GB | 82.10% | ±1.09% | $9.31 | p002
137 | NVIDIA · Open weights · fp8 · H200 141GB | 81.86% | ±1.12% | $43.65 | p001
138 | Meta · Open weights · awq-int4 · B300 SXM6 AC | 81.53% | ±1.09% | $81.26 | p002
139 | Google · Open weights · fp8 · H100 80GB | 81.44% | ±1.10% | $3.68 | p003
140 | Alibaba · Open weights · awq-int4 · H200 141GB | 81.03% | ±1.10% | $36.78 | p002
141 | Alibaba · Open weights · bf16 · A100 80GB | 80.91% | ±1.13% | $15.04 | p001
142 | Alibaba · Open weights · bf16 · H200 141GB | 80.89% | ±1.15% | $20.13 | p001
143 | Alibaba · Open weights · bf16 · H100 80GB | 80.79% | ±1.13% | $11.36 | p001
144 | OpenAI · Proprietary | 80.54% | ±1.14% | $93.99 | p002
145 | Z.ai · Open weights · bf16 · A100 80GB | 80.46% | ±1.12% | $4.98 | p002
146 | Z.ai · Open weights · bf16 · A100 80GB | 79.78% | ±1.11% | $13.56 | p001
147 | Mistral AI · Open weights · fp8 · H200 141GB | 79.39% | ±1.11% | $13.49 | p002
148 | Mistral AI · Open weights · fp8 · H100 80GB | 79.28% | ±1.13% | $8.62 | p002
149 | Mistral AI · Open weights · fp8 · A100 80GB | 79.00% | ±1.13% | $20.47 | p002
150 | Google · Open weights · fp8 · H200 141GB | 78.89% | ±1.15% | $5.26 | p002
151 | Google · Open weights · fp8 · H100 80GB | 78.89% | ±1.10% | $19.89 | p003
152 | Google · Open weights · fp8 · H100 80GB | 78.87% | ±1.15% | $2.25 | p002
153 | Tencent · Open weights · gptq-int4 · A100 80GB | 78.73% | ±1.16% | $33.08 | p001
154 | Z.ai · Open weights · bf16 · A100 80GB | 77.60% | ±1.17% | $12.19 | p001
155 | Mistral AI · Open weights · fp8 · A100 80GB | 77.17% | ±1.19% | $49.07 | p001
156 | Google · Open weights · fp8 · H100 80GB | 77.12% | ±1.17% | $15.35 | p002
157 | OpenAI · Proprietary | 76.98% | ±1.17% | $125 | p001
158 | Mistral AI · Open weights · fp8 · H100 80GB | 76.75% | ±1.22% | $18.68 | p001
159 | Mistral AI · Open weights · fp8 · H200 141GB | 76.61% | ±1.25% | $27.54 | p001
160 | Alibaba · Open weights · bf16 · H100 80GB | 76.34% | ±1.17% | $4.37 | p002
161 | Alibaba · Open weights · bf16 · H200 141GB | 76.28% | ±1.19% | $7.90 | p002
162 | Z.ai · Open weights · bf16 · A100 80GB | 75.70% | ±1.19% | $4.43 | p002
163 | AllenAI · Open weights · fp8 · H100 80GB | 73.98% | ±1.21% | $24.76 | p001
164 | AllenAI · Open weights · fp8 · H100 80GB | 73.32% | ±1.25% | $9.73 | p002
165 | OpenAI · Proprietary | 73.28% | ±1.23% | $23.50 | p002
166 | IBM · Open weights · fp8 · H200 141GB | 70.17% | ±1.25% | $29.62 | p001
167 | Google · Open weights · fp8 · H200 141GB | 70.13% | ±1.27% | $10.22 | p001
168 | OpenAI · Proprietary | 70.07% | ±1.18% | $64.03 | p001
169 | IBM · Open weights · fp8 · H100 80GB | 69.92% | ±1.27% | $17.81 | p001
170 | NVIDIA · Open weights · fp8 · H100 80GB | 69.22% | ±1.36% | $2.94 | p002
171 | IBM · Open weights · bf16 · A100 80GB | 69.02% | ±1.26% | $28.12 | p001
172 | NVIDIA · Open weights · fp8 · H100 80GB | 65.34% | ±1.33% | $4.68 | p003
173 | Google · Open weights · fp8 · H100 80GB | 65.07% | ±1.33% | $2.80 | p003
174 | Microsoft · Open weights · bf16 · A100 80GB | 63.98% | ±1.29% | $5.80 | p001
175 | NVIDIA · Open weights · fp8 · H100 80GB | 63.55% | ±1.38% | $7.19 | p001
176 | Google · Open weights · fp8 · H100 80GB | 61.65% | ±1.37% | $1.75 | p002
177 | Google · Open weights · fp8 · H200 141GB | 61.63% | ±1.36% | $5.76 | p002
178 | OpenAI · Proprietary | 59.06% | ±1.36% | $37.51 | p001
179 | Meta · Open weights · bf16 · A100 80GB | 58.82% | ±1.41% | $11.34 | p001
180 | Meta · Open weights · fp8 · H200 141GB | 58.05% | ±1.42% | $11.91 | p001
181 | Meta · Open weights · fp8 · H100 80GB | 57.99% | ±1.41% | $6.19 | p001
182 | Meta · Open weights · fp8 · H100 80GB | 44.31% | ±1.39% | $2.53 | p002
183 | Meta · Open weights · fp8 · H200 141GB | 44.21% | ±1.38% | $5.33 | p002
184 | Meta · Open weights · bf16 · A100 80GB | 39.85% | ±1.37% | $4.28 | p002

Compare Selected Runs

Select up to six runs from the table.

Method Notes

Public rows are rebuilt from verified run artifacts in the repository. Accuracy is recomputed from predictions, and official runs are checked against the registered dataset hash and prompt ID.
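
As a rough illustration of the recomputation described above, the sketch below scores per-item predictions against gold labels and fingerprints a dataset file. The record layout (item ID, predicted sense index, set of acceptable gold indices), the choice of SHA-256, the rule that a missing prediction counts as incorrect, and both function names are assumptions made for this example; the repository's actual artifact format may differ.

```python
import hashlib


def dataset_hash(raw_bytes: bytes) -> str:
    """Hex digest of the raw dataset bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(raw_bytes).hexdigest()


def accuracy(predictions: dict[str, int], gold: dict[str, set[int]]) -> float:
    """Fraction of dataset items whose predicted sense index is a gold sense.

    Assumption: an item with no prediction counts as incorrect, so every
    run is scored over the same denominator.
    """
    if not gold:
        raise ValueError("gold labels are empty")
    correct = sum(
        1 for item_id, senses in gold.items()
        if predictions.get(item_id) in senses
    )
    return correct / len(gold)


# Hypothetical toy data: three items, one with two acceptable gold senses.
gold = {"item-1": {0}, "item-2": {1, 2}, "item-3": {3}}
predictions = {"item-1": 0, "item-2": 2, "item-3": 1}

toy_accuracy = accuracy(predictions, gold)  # 2 of the 3 items are correct
```

A verification step along these lines would compare `dataset_hash(...)` with the registered hash before trusting the recomputed accuracy.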

The ± value under each accuracy is half the width of a fixed-seed bootstrap 95% confidence interval. The small range under each rank lists the positions a run could plausibly occupy among the visible rows given overlapping intervals.
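
A minimal sketch of how such a half-width and rank range could be computed is shown below. It assumes a percentile bootstrap over per-item 0/1 correctness flags, and it assumes that two runs can swap places whenever their intervals overlap. The resample count, the seed, and the function names are illustrative, not the leaderboard's actual settings.

```python
import random


def bootstrap_half_width(correct: list[int], n_resamples: int = 2000,
                         seed: int = 0) -> float:
    """Half the width of a percentile-bootstrap 95% CI for accuracy.

    `correct` holds one 0/1 flag per item. The fixed seed makes the
    interval reproducible from one rebuild to the next.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return (upper - lower) / 2


def rank_range(index: int, intervals: list[tuple[float, float]]) -> tuple[int, int]:
    """Best and worst 1-based rank for one run, given (low, high) intervals.

    Assumed rule: only runs whose whole interval lies above (or below)
    this run's interval are counted as certainly better (or worse).
    """
    low, high = intervals[index]
    better = sum(1 for j, (l, _) in enumerate(intervals) if j != index and l > high)
    worse = sum(1 for j, (_, h) in enumerate(intervals) if j != index and h < low)
    return better + 1, len(intervals) - worse


# Hypothetical run: 900 correct answers out of 1,000 items.
flags = [1] * 900 + [0] * 100
half_width = bootstrap_half_width(flags)

# Hypothetical intervals: the first two overlap, the third is clearly lower.
intervals = [(0.950, 0.960), (0.945, 0.955), (0.900, 0.910)]
ranges = [rank_range(i, intervals) for i in range(len(intervals))]
```

With these toy intervals the first two runs could each hold rank 1 or 2, while the third is fixed at rank 3.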

The Score against and Sense granularity controls re-score every run and baseline against one of nine scoring schemes: the lexEN v1, Maru 2022 (ALLamended), or original Raganato 2017 gold labels, each at WordNet 3.0 fine-grained sense or at one of two coarse concept levels — Glite or CSI. A prediction is coarse-correct when its WordNet sense maps to the same coarse concept as a gold sense (so coarse accuracy is always at least the fine-grained accuracy). The default and official score is lexEN v1 · WordNet fine-grained. See label schemes and coarsening for what each option means.
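
The coarse-correctness rule can be sketched as follows. The sense identifiers, the concept names, and the mapping are hypothetical, and treating an unmapped sense as its own one-member concept is an assumption made here so that a fine-grained match always counts as a coarse match.

```python
def concept(sense: str, concept_of: dict[str, str]) -> str:
    """Coarse concept for a sense; an unmapped sense is its own concept."""
    return concept_of.get(sense, sense)


def is_fine_correct(pred: str, gold: set[str]) -> bool:
    """Fine-grained rule: the predicted sense is one of the gold senses."""
    return pred in gold


def is_coarse_correct(pred: str, gold: set[str],
                      concept_of: dict[str, str]) -> bool:
    """Coarse rule: the prediction shares a concept with some gold sense."""
    pred_concept = concept(pred, concept_of)
    return any(concept(g, concept_of) == pred_concept for g in gold)


# Hypothetical mapping: two senses share concept_a, a third is concept_b.
concept_of = {
    "sense_a1": "concept_a",
    "sense_a2": "concept_a",
    "sense_b1": "concept_b",
}

# (prediction, gold senses) pairs for three toy items.
items = [
    ("sense_a1", {"sense_a1"}),  # exact match: fine and coarse correct
    ("sense_a2", {"sense_a1"}),  # same concept: coarse correct only
    ("sense_b1", {"sense_a1"}),  # different concept: wrong at both levels
]

fine_acc = sum(is_fine_correct(p, g) for p, g in items) / len(items)
coarse_acc = sum(is_coarse_correct(p, g, concept_of) for p, g in items) / len(items)
```

On the toy items, one of three is correct at the fine-grained level and two of three at the coarse level, which illustrates why coarse accuracy cannot fall below fine-grained accuracy.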

The Pareto frontier uses the visible rows and the selected chart metric (cost per million items or machine-hours per 1M items): higher accuracy is better, and a lower metric value is better. Starred rows are on the frontier.
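
One way to compute such a frontier is sketched below: a row is kept unless some other row is at least as accurate and at least as cheap, and strictly better on one of the two. The function name and the row layout are assumptions; the sample values are the accuracy and cost figures of the runs ranked 1, 2, 3, 7, and 23 in the table, so the result describes this five-row subset only, not the full leaderboard.

```python
def pareto_frontier(rows: list[tuple[str, float, float]]) -> list[str]:
    """Labels of rows not dominated on (higher accuracy, lower metric)."""
    frontier = []
    for label, acc, cost in rows:
        dominated = any(
            other_acc >= acc and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for _, other_acc, other_cost in rows
        )
        if not dominated:
            frontier.append(label)
    return frontier


# (label, accuracy in %, cost in dollars) for five rows of the table.
rows = [
    ("rank 1", 95.60, 10700.0),
    ("rank 2", 95.25, 6077.0),
    ("rank 3", 95.19, 7718.0),
    ("rank 7", 95.00, 3516.0),
    ("rank 23", 93.73, 327.0),
]

starred = pareto_frontier(rows)
```

Within this subset, the rank 3 row drops off the frontier because the rank 2 row is both more accurate and cheaper.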

Machine-hours per 1M items is recorded for self-hosted runs only. It measures the per-item evaluation loop on the recorded machine, excluding model and asset loading; cloud API runs are excluded from that chart.
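
The normalization to one million items is plain arithmetic, sketched below under the assumption that a run records the wall-clock seconds spent in the evaluation loop on a single machine. The function and parameter names are hypothetical.

```python
def machine_hours_per_million(eval_loop_seconds: float, n_items: int) -> float:
    """Scale the measured evaluation-loop time to one million items.

    Model and asset loading time must already be excluded from
    `eval_loop_seconds`.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    hours = eval_loop_seconds / 3600
    return hours / n_items * 1_000_000


# Hypothetical run: 5,000 items evaluated in 30 minutes (1,800 seconds).
example_hours = machine_hours_per_million(1800, 5000)  # about 100 machine-hours
```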

Reference baselines are classic WSD systems scored from per-item predictions on the same dataset items; dashed chart lines show them for context.

Comparing selected runs on the same dataset uses McNemar's test on paired per-item correctness, which detects differences that overlapping confidence intervals can miss.
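
For illustration, the sketch below implements the exact (binomial) form of McNemar's test on two runs' per-item 0/1 correctness flags. Whether the leaderboard uses the exact form or the chi-square approximation is not stated here, so the choice of variant, the function name, and the toy counts are all assumptions.

```python
from math import comb


def mcnemar_exact(correct_a: list[int], correct_b: list[int]) -> float:
    """Two-sided exact McNemar p-value for paired 0/1 correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("runs must cover the same items in the same order")
    # Only discordant items (one run right, the other wrong) carry information.
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical pair of runs over 100 items: 80 both correct, 10 correct
# for A only, 2 correct for B only, 8 wrong for both.
run_a = [1] * 80 + [1] * 10 + [0] * 2 + [0] * 8
run_b = [1] * 80 + [0] * 10 + [1] * 2 + [0] * 8

p_value = mcnemar_exact(run_a, run_b)  # 158 / 4096, roughly 0.039
```

Because the 80 jointly correct and 8 jointly wrong items cancel out, the test looks only at the 12 disagreements, which is why it can separate two runs whose individual confidence intervals overlap.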