LLMBenchmarks Data

Results from the Wolfram LLM Benchmarking Project

Examples

Basic Examples (1) 

Obtain the benchmark data:

In[1]:=
ResourceData["LLMBenchmarks Data"]

Visualizations (3) 

Display a bar chart with the top 10 models:

In[2]:=
BarChart[
 ConstructColumns[
    "value" -> Function[Labeled[#CorrectFunctionality, #Model]]]@
   SortBy["CorrectFunctionality"]@
    MaximalBy[ResourceData["LLMBenchmarks Data"],
     "CorrectFunctionality", 10] -> "value", BarOrigin -> Left]
Out[2]=

Plot the correct functionality score of every model against its release date:

In[3]:=
ListPlot[ResourceData["LLMBenchmarks Data"] -> {"ReleaseDate", "CorrectFunctionality"},
 AxesLabel -> {"ReleaseDate", "CorrectFunctionality"}]
Out[3]=

Plot the correct functionality score of Google models against their release dates:

In[4]:=
ListPlot[Select[#Vendor == "Google" &]@
   ResourceData["LLMBenchmarks Data"] -> {"ReleaseDate", "CorrectFunctionality"},
 AxesLabel -> {"ReleaseDate", "CorrectFunctionality"}]
Out[4]=

Analysis (4) 

Get the top three models by code generation correctness:

In[5]:=
Query[TakeLargestBy["CorrectFunctionality", 3], "Model"][
 ResourceData["LLMBenchmarks Data"]]
Out[5]=

Select all models from Meta:

In[6]:=
Query[Select[#Vendor == "Meta" &], "Model"][
 ResourceData["LLMBenchmarks Data"]]
Out[6]=

Select the first listed model for each vendor (the vendor's top model if the rows are ordered by score):

In[7]:=
AggregateRows[ResourceData["LLMBenchmarks Data"],
 "Model" -> Function[First@#Model], "Vendor"]
Out[7]=

Sort the vendors by their average model score on generating valid Wolfram Language syntax:

In[8]:=
AggregateRows[ResourceData["LLMBenchmarks Data"],
  "AverageCorrectSyntax" -> Function[Mean[#CorrectSyntax]], "Vendor"] //
 ReverseSortBy["AverageCorrectSyntax"]
Out[8]=

Wolfram Research, "LLMBenchmarks Data" from the Wolfram Data Repository (2025)  
