LLMBenchmarks Data

Source Notebook

Results from the Wolfram LLM Benchmarking Project

Examples

Basic Examples (1) 

Obtain the benchmark data:

In[1]:=
ResourceData["LLMBenchmarks Data"]
Out[1]=
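
The queries below refer to columns such as "Model", "Vendor", "CorrectSyntax" and "CorrectFunctionality". As a quick check, a minimal sketch (assuming the resource returns a Dataset of row associations) that lists the column names of the first row:

data = ResourceData["LLMBenchmarks Data"];
Keys@First@Normal[data]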

Visualizations (1) 

Display a bar chart of the top 10 models by code generation correctness:

In[2]:=
ResourceData["LLMBenchmarks Data"][
 BarChart[Reverse@Take[#, 10], BarOrigin -> Left] &, Labeled[#CorrectFunctionality, #Model] &]
Out[2]=
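
The same two-operator query form (the second function maps over each row, the first is applied to the collected result) supports other plots as well. A sketch, assuming "CorrectSyntax" and "CorrectFunctionality" are both numeric scores, that plots one against the other:

ResourceData["LLMBenchmarks Data"][
 ListPlot[#, AxesLabel -> {"CorrectSyntax", "CorrectFunctionality"}] &,
 {#CorrectSyntax, #CorrectFunctionality} &]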

Analysis (4) 

Get the top three models by code generation correctness:

In[3]:=
Query[TakeLargestBy["CorrectFunctionality", 3], "Model"][
 ResourceData["LLMBenchmarks Data"]]
Out[3]=
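
To see the scores alongside the names, the second operator in the Query can select several columns from each row instead of just "Model"; a sketch under the same column-name assumptions:

Query[TakeLargestBy["CorrectFunctionality", 3], {"Model", "CorrectFunctionality"}][
 ResourceData["LLMBenchmarks Data"]]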

Select all models from Meta:

In[4]:=
Query[Select[#Vendor == "Meta" &], "Model"][ResourceData["LLMBenchmarks Data"]]
Out[4]=
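
A related sketch, assuming the "Vendor" column as above, that counts how many benchmarked models each vendor contributes:

ResourceData["LLMBenchmarks Data"][Counts, "Vendor"]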

Select the top model for each vendor:

In[5]:=
ResourceData["LLMBenchmarks Data"][GroupBy["Vendor"], TakeLargestBy["CorrectFunctionality", 1], "Model"]
Out[5]=

Sort the vendors by their average model score on generating valid Wolfram Language syntax:

In[6]:=
ResourceData["LLMBenchmarks Data"][
 ReverseSort@GroupBy[#, #Vendor & -> (#CorrectSyntax &), Mean] &]
Out[6]=
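
The resulting per-vendor means can be fed straight into a chart; a sketch, assuming the query above yields an association of vendor-to-mean-score pairs:

means = Normal@ResourceData["LLMBenchmarks Data"][
    ReverseSort@GroupBy[#, #Vendor & -> (#CorrectSyntax &), Mean] &];
BarChart[means, ChartLabels -> Keys[means], BarOrigin -> Left]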

Wolfram Research, "LLMBenchmarks Data" from the Wolfram Data Repository (2025)  

Data Resource History

Source Metadata

Publisher Information