English WSD benchmark
SenseBench Leaderboard
SenseBench measures how well language models disambiguate English words: each model sees a word in its sentence context together with its candidate WordNet senses and must answer with the index of the correct sense. Every row is recomputed from verified, fully auditable run artifacts on the lexEN dataset, and anyone can submit a run by pull request.
- Verified Runs: 184
- Models: 56
- Top Accuracy: 95.60%
- Dataset: lexen-v1
Reference Baselines
| System | Accuracy | Dataset | Provenance |
|---|---|---|---|
| MFS (WordNet first sense) · Computed at build time | 61.55% ±1.39% | lexen-v1 | Most frequent sense baseline: WordNet 3.0's first (frequency-ranked) sense for the target lemma and part of speech, computed directly on the dataset items. |
| Bi-Encoder Model · Published predictions | 79.65% ±1.13% | lexen-v1 | Bi-Encoder Model (Blevins & Zettlemoyer 2020); per-item predictions released by Maru et al. 2022, scored on this dataset's items. |
| ESCHER · Reproduced predictions | 81.42% ±1.08% | lexen-v1 | ESCHER (Barba et al. 2021; SemCor training); predictions reproduced by Glite, scored on this dataset's items. |
| ConSeC · Reproduced predictions | 84.88% ±0.99% | lexen-v1 | ConSeC (Barba et al. 2021); predictions reproduced by Glite (SemCor + WordNet Gloss+Examples training, 82.9 F1 on Raganato ALL), scored on this dataset's items. |
These baselines are classic WSD systems scored from per-item predictions on exactly the same dataset items as the model runs, with the same correctness rule. They appear as dashed lines on the chart.
| Rank | Provider · Access · Precision · GPU | Accuracy (±95% CI half-width) | Cost | Prompt |
|---|---|---|---|---|
| 1 | OpenAI · Proprietary | ★ 95.60% ±0.59% | $10,700 | p001 |
| 2 | OpenAI · Proprietary | ★ 95.25% ±0.60% | $6,077 | p001 |
| 3 | OpenAI · Proprietary | 95.19% ±0.60% | $7,718 | p001 |
| 4 | OpenAI · Proprietary | 95.15% ±0.62% | $6,233 | p004 |
| 5 | OpenAI · Proprietary | 95.15% ±0.60% | $11,301 | p004 |
| 6 | OpenAI · Proprietary | 95.10% ±0.62% | $6,289 | p003 |
| 7 | OpenAI · Proprietary | ★ 95.00% ±0.61% | $3,516 | p003 |
| 8 | OpenAI · Proprietary | 95.00% ±0.62% | $5,040 | p001 |
| 9 | OpenAI · Proprietary | 94.98% ±0.63% | $4,562 | p003 |
| 10 | Google · Proprietary | 94.92% ±0.61% | $7,227 | p001 |
| 11 | Google · Proprietary | 94.61% ±0.63% | $3,691 | p001 |
| 12 | Anthropic · Proprietary | 94.57% ±0.64% | $7,285 | p001 |
| 13 | Google · Proprietary | 94.53% ±0.64% | $7,586 | p004 |
| 14 | Google · Proprietary | 94.43% ±0.63% | $4,711 | p001 |
| 15 | Google · Proprietary | ★ 94.40% ±0.65% | $3,081 | p003 |
| 16 | OpenAI · Proprietary | 94.22% ±0.67% | $5,625 | p002 |
| 17 | Google · Proprietary | 94.20% ±0.64% | $5,557 | p004 |
| 18 | Anthropic · Proprietary | 94.16% ±0.65% | $5,243 | p001 |
| 19 | Google · Proprietary | 94.16% ±0.63% | $5,408 | p001 |
| 20 | OpenAI · Proprietary | 94.03% ±0.67% | $3,790 | p002 |
| 21 | Anthropic · Proprietary | 93.85% ±0.67% | $4,834 | p001 |
| 22 | Anthropic · Proprietary | 93.81% ±0.66% | $7,655 | p004 |
| 23 | Google · Open weights · fp8 · H100 80GB | ★ 93.73% ±0.68% | $327 | p003 |
| 24 | Anthropic · Proprietary | 93.68% ±0.66% | $4,850 | p001 |
| 25 | Anthropic · Proprietary | 93.62% ±0.68% | $4,994 | p001 |
| 26 | Google · Open weights · fp8 · H100 80GB | 93.44% ±0.69% | $390 | p004 |
| 27 | Google · Open weights · fp8 · H100 80GB | 93.38% ±0.69% | $368 | p001 |
| 28 | Z.ai · Open weights | 93.38% ±0.70% | $3,333 | p001 |
| 29 | Moonshot · Open weights | 93.31% ±0.69% | $2,565 | p001 |
| 30 | xAI · Proprietary | 93.13% ±0.71% | $2,583 | p001 |
| 31 | xAI · Proprietary | 93.09% ±0.69% | $1,763 | p001 |
| 32 | Anthropic · Proprietary | 93.01% ±0.73% | $4,894 | p001 |
| 33 | OpenAI · Proprietary | 92.88% ±0.73% | $2,260 | p003 |
| 34 | xAI · Proprietary | 92.88% ±0.71% | $2,816 | p001 |
| 35 | Alibaba · Open weights | 92.82% ±0.74% | $2,152 | p002 |
| 36 | Alibaba · Open weights | 92.70% ±0.73% | $794 | p001 |
| 37 | Anthropic · Proprietary | 92.70% ±0.74% | $3,933 | p001 |
| 38 | Moonshot · Open weights | 92.66% ±0.74% | $3,091 | p001 |
| 39 | Google · Proprietary | 92.57% ±0.73% | $847 | p002 |
| 40 | DeepSeek · Open weights | 92.41% ±0.75% | $1,984 | p001 |
| 41 | Google · Open weights · fp8 · H100 80GB | ★ 92.39% ±0.74% | $35.44 | p001 |
| 42 | Moonshot · Open weights | 92.39% ±0.76% | $2,615 | p002 |
| 43 | Google · Open weights · fp8 · H100 80GB | ★ 92.29% ±0.75% | $20.36 | p003 |
| 44 | Z.ai · Open weights | 92.22% ±0.75% | $3,246 | p002 |
| 45 | Anthropic · Proprietary | 92.22% ±0.74% | $3,338 | p003 |
| 46 | Anthropic · Proprietary | 91.79% ±0.74% | $2,121 | p002 |
| 47 | DeepSeek · Open weights | 91.50% ±0.77% | $1,338 | p002 |
| 48 | Alibaba · Open weights | 91.46% ±0.78% | $659 | p002 |
| 49 | OpenAI · Proprietary | 91.22% ±0.82% | $690 | p003 |
| 50 | Google · Open weights · fp8 · H100 80GB | ★ 91.20% ±0.79% | $11.49 | p002 |
| 51 | OpenAI · Proprietary | 90.99% ±0.82% | $386 | p003 |
| 52 | DeepSeek · Open weights | 90.93% ±0.83% | $132 | p001 |
| 53 | OpenAI · Proprietary | 90.89% ±0.81% | $512 | p003 |
| 54 | OpenAI · Proprietary | 90.72% ±0.81% | $803 | p001 |
| 55 | Anthropic · Proprietary | 90.72% ±0.84% | $2,141 | p001 |
| 56 | MiniMax · Open weights | 90.62% ±0.81% | $545 | p001 |
| 57 | OpenAI · Proprietary | 90.54% ±0.81% | $480 | p001 |
| 58 | OpenAI · Proprietary | 90.48% ±0.83% | $217 | p003 |
| 59 | Anthropic · Proprietary | 90.48% ±0.81% | $3,348 | p001 |
| 60 | Google · Proprietary | 90.41% ±0.83% | $171 | p001 |
| 61 | OpenAI · Proprietary | 90.10% ±0.84% | $685 | p001 |
| 62 | OpenAI · Proprietary | 90.04% ±0.83% | $363 | p002 |
| 63 | OpenAI · Proprietary | 89.94% ±0.84% | $306 | p001 |
| 64 | DeepSeek · Open weights | 89.69% ±0.85% | $88.64 | p002 |
| 65 | Google · Proprietary | 89.61% ±0.82% | $674 | p002 |
| 66 | Alibaba · Open weights · fp8 · H200 141GB | 89.51% ±0.85% | $39.99 | p001 |
| 67 | OpenAI · Proprietary | 89.49% ±0.88% | $374 | p002 |
| 68 | Google · Open weights · fp8 · A100 80GB | 89.43% ±0.88% | $13.08 | p001 |
| 69 | Google · Proprietary | 89.41% ±0.86% | $2,529 | p001 |
| 70 | Alibaba · Open weights · fp8 · H100 80GB | 89.36% ±0.86% | $25.80 | p001 |
| 71 | Google · Open weights · fp8 · H200 141GB | 89.32% ±0.87% | $15.25 | p001 |
| 72 | Alibaba · Open weights · gptq-int4 · B300 SXM6 AC | 89.24% ±0.87% | $157 | p001 |
| 73 | Google · Open weights · fp8 · H100 80GB | ★ 89.20% ±0.87% | $9.29 | p001 |
| 74 | OpenAI · Proprietary | 89.14% ±0.88% | $866 | p003 |
| 75 | Google · Open weights · fp8 · H100 80GB | ★ 89.12% ±0.88% | $5.20 | p003 |
| 76 | Google · Proprietary | 89.08% ±0.88% | $59.42 | p002 |
| 77 | Alibaba · Open weights · bf16 · A100 80GB | 89.06% ±0.86% | $48.37 | p001 |
| 78 | Alibaba · Open weights · bf16 · H200 141GB | 88.97% ±0.87% | $60.98 | p001 |
| 79 | Alibaba · Open weights · bf16 · H100 80GB | 88.89% ±0.87% | $36.96 | p001 |
| 80 | Anthropic · Proprietary | 88.89% ±0.88% | $809 | p002 |
| 81 | Z.ai · Open weights · awq-int4 · B300 SXM6 AC | 88.71% ±0.89% | $222 | p001 |
| 82 | OpenAI · Proprietary | 88.69% ±0.87% | $1,281 | p001 |
| 83 | OpenAI · Proprietary | 87.86% ±0.92% | $465 | p002 |
| 84 | Anthropic · Proprietary | 87.60% ±0.92% | $1,997 | p002 |
| 85 | Meta · Open weights · awq-int4 · B300 SXM6 AC | 87.55% ±0.94% | $1,062 | p001 |
| 86 | Google · Open weights · fp8 · H100 80GB | ★ 87.25% ±0.95% | $2.63 | p002 |
| 87 | Google · Open weights · fp8 · H200 141GB | 87.25% ±0.96% | $6.95 | p002 |
| 88 | Alibaba · Open weights · fp8 · H100 80GB | 87.20% ±0.96% | $9.56 | p002 |
| 89 | Alibaba · Open weights · fp8 · H200 141GB | 87.18% ±0.96% | $15.27 | p002 |
| 90 | Z.ai · Open weights · awq-int4 · B300 SXM6 AC | 87.12% ±0.95% | $89.29 | p002 |
| 91 | Google · Open weights · fp8 · A100 80GB | 87.10% ±0.99% | $4.27 | p002 |
| 92 | Alibaba · Open weights · fp8 · H200 141GB | 87.06% ±0.95% | $44.35 | p001 |
| 93 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 87.02% ±0.96% | $122 | p001 |
| 94 | Alibaba · Open weights · gptq-int4 · B300 SXM6 AC | 86.69% ±0.95% | $61.20 | p002 |
| 95 | OpenAI · Proprietary | 86.63% ±0.92% | $145 | p003 |
| 96 | Alibaba · Open weights · awq-int4 · H200 141GB | 86.59% ±0.99% | $92.19 | p001 |
| 97 | Alibaba · Open weights · bf16 · H100 80GB | 86.55% ±0.98% | $13.67 | p002 |
| 98 | Alibaba · Open weights · bf16 · H200 141GB | 86.55% ±0.98% | $22.97 | p002 |
| 99 | OpenAI · Proprietary | 86.48% ±0.95% | $209 | p003 |
| 100 | Alibaba · Open weights · bf16 · A100 80GB | 86.46% ±0.97% | $17.78 | p002 |
| 101 | OpenAI · Proprietary | 86.26% ±0.96% | $221 | p001 |
| 102 | Google · Open weights · fp8 · H100 80GB | 85.58% ±1.01% | $23.34 | p003 |
| 103 | Alibaba · Open weights · fp8 · H100 80GB | 85.56% ±1.00% | $8.77 | p001 |
| 104 | Alibaba · Open weights · fp8 · H200 141GB | 85.50% ±1.00% | $14.52 | p001 |
| 105 | OpenAI · Proprietary | 85.50% ±1.00% | $209 | p002 |
| 106 | Alibaba · Open weights · fp8 · H200 141GB | 85.48% ±0.99% | $19.64 | p002 |
| 107 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 85.29% ±0.97% | $35.77 | p002 |
| 108 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 85.27% ±1.00% | $56.86 | p001 |
| 109 | Cohere · Open weights · fp8 · H200 141GB | 85.23% ±1.01% | $137 | p001 |
| 110 | Google · Open weights · fp8 · H100 80GB | 85.19% ±1.00% | $55.12 | p001 |
| 111 | OpenAI · Proprietary | 85.00% ±1.01% | $185 | p001 |
| 112 | Alibaba · Open weights · awq-int4 · A100 80GB | 84.84% ±1.00% | $53.98 | p002 |
| 113 | Alibaba · Open weights · awq-int4 · H100 80GB | 84.82% ±1.02% | $33.17 | p002 |
| 114 | Alibaba · Open weights · awq-int4 · H200 141GB | 84.82% ±1.02% | $54.97 | p002 |
| 115 | Alibaba · Open weights · awq-int4 · H200 141GB | 84.74% ±1.01% | $36.16 | p002 |
| 116 | Alibaba · Open weights · bf16 · H200 141GB | 84.45% ±1.02% | $18.11 | p001 |
| 117 | OpenAI · Proprietary | 84.32% ±1.02% | $982 | p001 |
| 118 | Alibaba · Open weights · awq-int4 · H200 141GB | 84.08% ±1.03% | $93.38 | p001 |
| 119 | NVIDIA · Open weights · fp8 · H200 141GB | 83.93% ±1.02% | $21.54 | p003 |
| 120 | OpenAI · Proprietary | 83.93% ±0.99% | $105 | p002 |
| 121 | OpenAI · Proprietary | 83.77% ±1.04% | $127 | p003 |
| 122 | OpenAI · Proprietary | 83.69% ±1.07% | $256 | p001 |
| 123 | OpenAI · Proprietary | 83.56% ±1.02% | $63.56 | p003 |
| 124 | OpenAI · Proprietary | 83.56% ±1.06% | $400 | p001 |
| 125 | Google · Open weights · fp8 · H200 141GB | 83.44% ±1.02% | $12.11 | p001 |
| 126 | Alibaba · Open weights · awq-int4 · A100 80GB | 83.44% ±1.04% | $138 | p001 |
| 127 | Alibaba · Open weights · awq-int4 · H200 141GB | 83.42% ±1.03% | $143 | p001 |
| 128 | Alibaba · Open weights · awq-int4 · H100 80GB | 83.40% ±1.03% | $84.77 | p001 |
| 129 | Google · Open weights · fp8 · H100 80GB | 83.30% ±1.02% | $19.61 | p002 |
| 130 | OpenAI · Proprietary | 82.90% ±1.05% | $35.26 | p002 |
| 131 | Alibaba · Open weights · fp8 · H100 80GB | 82.72% ±1.04% | $2.97 | p002 |
| 132 | Alibaba · Open weights · fp8 · H200 141GB | 82.70% ±1.03% | $8.07 | p002 |
| 133 | OpenAI · Proprietary | 82.70% ±1.07% | $96.14 | p001 |
| 134 | NVIDIA · Open weights · fp8 · H200 141GB | 82.35% ±1.06% | $16.69 | p002 |
| 135 | Alibaba · Open weights · fp8 · B300 SXM6 AC | 82.27% ±1.08% | $23.12 | p002 |
| 136 | Alibaba · Open weights · bf16 · H200 141GB | 82.10% ±1.09% | $9.31 | p002 |
| 137 | NVIDIA · Open weights · fp8 · H200 141GB | 81.86% ±1.12% | $43.65 | p001 |
| 138 | Meta · Open weights · awq-int4 · B300 SXM6 AC | 81.53% ±1.09% | $81.26 | p002 |
| 139 | Google · Open weights · fp8 · H100 80GB | 81.44% ±1.10% | $3.68 | p003 |
| 140 | Alibaba · Open weights · awq-int4 · H200 141GB | 81.03% ±1.10% | $36.78 | p002 |
| 141 | Alibaba · Open weights · bf16 · A100 80GB | 80.91% ±1.13% | $15.04 | p001 |
| 142 | Alibaba · Open weights · bf16 · H200 141GB | 80.89% ±1.15% | $20.13 | p001 |
| 143 | Alibaba · Open weights · bf16 · H100 80GB | 80.79% ±1.13% | $11.36 | p001 |
| 144 | OpenAI · Proprietary | 80.54% ±1.14% | $93.99 | p002 |
| 145 | Z.ai · Open weights · bf16 · A100 80GB | 80.46% ±1.12% | $4.98 | p002 |
| 146 | Z.ai · Open weights · bf16 · A100 80GB | 79.78% ±1.11% | $13.56 | p001 |
| 147 | Mistral AI · Open weights · fp8 · H200 141GB | 79.39% ±1.11% | $13.49 | p002 |
| 148 | Mistral AI · Open weights · fp8 · H100 80GB | 79.28% ±1.13% | $8.62 | p002 |
| 149 | Mistral AI · Open weights · fp8 · A100 80GB | 79.00% ±1.13% | $20.47 | p002 |
| 150 | Google · Open weights · fp8 · H200 141GB | 78.89% ±1.15% | $5.26 | p002 |
| 151 | Google · Open weights · fp8 · H100 80GB | 78.89% ±1.10% | $19.89 | p003 |
| 152 | Google · Open weights · fp8 · H100 80GB | ★ 78.87% ±1.15% | $2.25 | p002 |
| 153 | Tencent · Open weights · gptq-int4 · A100 80GB | 78.73% ±1.16% | $33.08 | p001 |
| 154 | Z.ai · Open weights · bf16 · A100 80GB | 77.60% ±1.17% | $12.19 | p001 |
| 155 | Mistral AI · Open weights · fp8 · A100 80GB | 77.17% ±1.19% | $49.07 | p001 |
| 156 | Google · Open weights · fp8 · H100 80GB | 77.12% ±1.17% | $15.35 | p002 |
| 157 | OpenAI · Proprietary | 76.98% ±1.17% | $125 | p001 |
| 158 | Mistral AI · Open weights · fp8 · H100 80GB | 76.75% ±1.22% | $18.68 | p001 |
| 159 | Mistral AI · Open weights · fp8 · H200 141GB | 76.61% ±1.25% | $27.54 | p001 |
| 160 | Alibaba · Open weights · bf16 · H100 80GB | 76.34% ±1.17% | $4.37 | p002 |
| 161 | Alibaba · Open weights · bf16 · H200 141GB | 76.28% ±1.19% | $7.90 | p002 |
| 162 | Z.ai · Open weights · bf16 · A100 80GB | 75.70% ±1.19% | $4.43 | p002 |
| 163 | AllenAI · Open weights · fp8 · H100 80GB | 73.98% ±1.21% | $24.76 | p001 |
| 164 | AllenAI · Open weights · fp8 · H100 80GB | 73.32% ±1.25% | $9.73 | p002 |
| 165 | OpenAI · Proprietary | 73.28% ±1.23% | $23.50 | p002 |
| 166 | IBM · Open weights · fp8 · H200 141GB | 70.17% ±1.25% | $29.62 | p001 |
| 167 | Google · Open weights · fp8 · H200 141GB | 70.13% ±1.27% | $10.22 | p001 |
| 168 | OpenAI · Proprietary | 70.07% ±1.18% | $64.03 | p001 |
| 169 | IBM · Open weights · fp8 · H100 80GB | 69.92% ±1.27% | $17.81 | p001 |
| 170 | NVIDIA · Open weights · fp8 · H100 80GB | 69.22% ±1.36% | $2.94 | p002 |
| 171 | IBM · Open weights · bf16 · A100 80GB | 69.02% ±1.26% | $28.12 | p001 |
| 172 | NVIDIA · Open weights · fp8 · H100 80GB | 65.34% ±1.33% | $4.68 | p003 |
| 173 | Google · Open weights · fp8 · H100 80GB | 65.07% ±1.33% | $2.80 | p003 |
| 174 | Microsoft · Open weights · bf16 · A100 80GB | 63.98% ±1.29% | $5.80 | p001 |
| 175 | NVIDIA · Open weights · fp8 · H100 80GB | 63.55% ±1.38% | $7.19 | p001 |
| 176 | Google · Open weights · fp8 · H100 80GB | ★ 61.65% ±1.37% | $1.75 | p002 |
| 177 | Google · Open weights · fp8 · H200 141GB | 61.63% ±1.36% | $5.76 | p002 |
| 178 | OpenAI · Proprietary | 59.06% ±1.36% | $37.51 | p001 |
| 179 | Meta · Open weights · bf16 · A100 80GB | 58.82% ±1.41% | $11.34 | p001 |
| 180 | Meta · Open weights · fp8 · H200 141GB | 58.05% ±1.42% | $11.91 | p001 |
| 181 | Meta · Open weights · fp8 · H100 80GB | 57.99% ±1.41% | $6.19 | p001 |
| 182 | Meta · Open weights · fp8 · H100 80GB | 44.31% ±1.39% | $2.53 | p002 |
| 183 | Meta · Open weights · fp8 · H200 141GB | 44.21% ±1.38% | $5.33 | p002 |
| 184 | Meta · Open weights · bf16 · A100 80GB | 39.85% ±1.37% | $4.28 | p002 |
Method Notes
Public rows are rebuilt from verified run artifacts in the repository. Accuracy is recomputed from predictions, and official runs are checked against the registered dataset hash and prompt ID.
The ± value under each accuracy is half the width of a fixed-seed bootstrap 95% confidence interval. The small range under each rank lists the positions a run could plausibly occupy among the visible rows given overlapping intervals.
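As a rough illustration of the interval described above, the sketch below computes a percentile-bootstrap half-width from per-item 0/1 correctness. The function name, the resample count of 1,000, the seed value, and the choice of the percentile method are assumptions made for illustration; the leaderboard's actual settings are not stated here.

```python
import random


def bootstrap_half_width(correct, n_resamples=1000, seed=0, level=0.95):
    """Half the width of a percentile-bootstrap CI for accuracy.

    `correct` is a list of 0/1 per-item outcomes. The seed is fixed so the
    same predictions always yield the same interval (assumed settings).
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    alpha = (1.0 - level) / 2.0
    lower = accuracies[int(alpha * (n_resamples - 1))]
    upper = accuracies[int((1.0 - alpha) * (n_resamples - 1))]
    return (upper - lower) / 2.0


# Hypothetical run: 900 correct answers out of 1,000 items.
outcomes = [1] * 900 + [0] * 100
half_width = bootstrap_half_width(outcomes)
print(f"90.00% ±{100 * half_width:.2f}%")
```

Because the random generator is seeded, calling the function twice on the same outcomes returns the same half-width, which is what makes a published ± value reproducible.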
The Score against and Sense granularity controls re-score every run and baseline under one of nine scoring schemes: three gold-label sets (lexEN v1, Maru 2022 (ALLamended), or the original Raganato 2017 labels), each scored at the WordNet 3.0 fine-grained sense level or at one of two coarse concept levels, Glite or CSI. A prediction is coarse-correct when its WordNet sense maps to the same coarse concept as a gold sense, so coarse accuracy is always at least the fine-grained accuracy. The default and official score is lexEN v1 · WordNet fine-grained. See label schemes and coarsening for what each option means.
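The coarse-correctness rule can be sketched as a lookup through a sense-to-concept mapping. Everything named here is hypothetical: the function names, the sense identifiers (`bank#1` and so on), and the concept labels are invented for illustration and are not taken from the Glite or CSI inventories.

```python
def is_fine_correct(predicted_sense, gold_senses):
    """Fine-grained rule: the predicted sense must itself be a gold sense."""
    return predicted_sense in gold_senses


def is_coarse_correct(predicted_sense, gold_senses, sense_to_concept):
    """Coarse rule: the prediction shares a coarse concept with a gold sense."""
    if predicted_sense in gold_senses:
        # A fine-grained hit trivially maps to the same concept as itself,
        # which is why coarse accuracy can never fall below fine accuracy.
        return True
    predicted_concept = sense_to_concept.get(predicted_sense)
    if predicted_concept is None:
        return False
    return any(
        sense_to_concept.get(gold) == predicted_concept for gold in gold_senses
    )


# Invented mapping: two senses share one concept, the third stands apart.
EXAMPLE_MAP = {
    "bank#1": "INSTITUTION",
    "bank#2": "INSTITUTION",
    "bank#3": "LANDFORM",
}

print(is_fine_correct("bank#2", ["bank#1"]))                 # wrong sense
print(is_coarse_correct("bank#2", ["bank#1"], EXAMPLE_MAP))  # same concept
```

In this toy case the prediction misses at the fine-grained level but counts as correct at the coarse level, because both senses map to the same concept.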
The Pareto frontier uses the visible rows and the selected chart metric (cost per million items or machine-hours per 1M items): higher accuracy is better, and a lower metric value is better. Starred rows are on the frontier.
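A minimal sketch of the frontier rule follows, assuming each row is reduced to a label, an accuracy, and one metric value where lower is better. The function name and row layout are illustrative. The sample figures are copied from five leaderboard rows, on the assumption that the dollar column is the selected chart metric.

```python
def pareto_frontier(rows):
    """Return the rows that no other row beats on both axes.

    Each row is (label, accuracy, metric): higher accuracy is better and a
    lower metric value is better.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal metric values the more
    # accurate row comes first. A row joins the frontier only if it is more
    # accurate than every cheaper row seen so far.
    for row in sorted(rows, key=lambda r: (r[2], -r[1])):
        if row[1] > best_accuracy:
            frontier.append(row)
            best_accuracy = row[1]
    return frontier


rows = [
    ("rank 1", 95.60, 10700.0),
    ("rank 2", 95.25, 6077.0),
    ("rank 3", 95.19, 7718.0),
    ("rank 7", 95.00, 3516.0),
    ("rank 23", 93.73, 327.0),
]
for label, accuracy, metric in pareto_frontier(rows):
    print(label, accuracy, metric)
```

Rank 3 drops out because rank 2 is both cheaper and more accurate; the other four rows each offer something no cheaper row does, which matches which of these rows carry a star in the table.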
Machine-hours per 1M items is recorded for self-hosted runs only. It measures the per-item evaluation loop on the recorded machine, excluding model and asset loading; cloud API runs are excluded from that chart.
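The measurement can be sketched as a timer that starts only after loading finishes, followed by a linear scale-up to one million items. The function names and the single-machine assumption are illustrative, not the benchmark harness itself.

```python
import time


def machine_hours_per_million(eval_seconds, n_items):
    """Scale the measured evaluation-loop time to 1,000,000 items."""
    return (eval_seconds / 3600.0) * (1_000_000 / n_items)


def timed_eval(load_model, predict, items):
    """Run predictions and time only the per-item loop (one machine assumed)."""
    model = load_model()  # model and asset loading is deliberately not timed
    start = time.perf_counter()
    predictions = [predict(model, item) for item in items]
    elapsed = time.perf_counter() - start
    return predictions, machine_hours_per_million(elapsed, len(items))


# Hypothetical figures: 5,000 items evaluated in 36 seconds.
print(machine_hours_per_million(36.0, 5000))
```

Extrapolating linearly assumes throughput stays constant at scale, which is a simplification for batching-sensitive serving stacks.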
Reference baselines are classic WSD systems scored from per-item predictions on the same dataset items; dashed chart lines show them for context.
Comparing selected runs on the same dataset uses McNemar's test on paired per-item correctness, which detects differences that overlapping confidence intervals can miss.
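To make the paired comparison concrete, here is a sketch of McNemar's test in its exact (binomial) form. The text does not say which variant the leaderboard uses, so the exact form, the function name, and the toy data are assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness.

    Returns (b, c, p_value), where b counts items only run A got right and
    c counts items only run B got right. Items where both runs agree carry
    no information about which run is better and are ignored.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("runs must be scored on the same items")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis each discordant item favours A or B with
    # probability 1/2, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy data: A alone is right on 8 items, B alone on 1, both right on 20.
run_a = [1] * 8 + [0] * 1 + [1] * 20
run_b = [0] * 8 + [1] * 1 + [1] * 20
print(mcnemar_exact(run_a, run_b))
```

The test looks only at the items where the two runs disagree, so it can separate runs whose accuracies differ by less than their individual confidence intervals.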