LLMBenchmarks Data

Results from the Wolfram LLM Benchmarking Project

Examples

Basic Examples (1) 

Obtain the benchmark data:

In[1]:=
ResourceData["LLMBenchmarks Data"]

Visualizations (1) 

Display a bar chart with the top 10 models:

In[3]:=
ResourceData["LLMBenchmarks Data"][
 BarChart[Reverse@Take[#, 10], BarOrigin -> Left] &, Labeled[#CorrectFunctionality, #Model] &]

Analysis (4) 

Get the top three models by code generation correctness:

In[4]:=
Query[TakeLargestBy["CorrectFunctionality", 3], "Model"][
 ResourceData["LLMBenchmarks Data"]]

Select all models from Meta:

In[5]:=
Query[Select[#Vendor == "Meta" &], "Model"][ResourceData["LLMBenchmarks Data"]]

Select the top model for each vendor:

In[6]:=
ResourceData["LLMBenchmarks Data"][GroupBy["Vendor"], TakeLargestBy["CorrectFunctionality", 1], "Model"]

Sort the vendors by their average model score on generating valid Wolfram Language syntax:

In[7]:=
ResourceData["LLMBenchmarks Data"][
 ReverseSort@GroupBy[#, #Vendor & -> (#CorrectSyntax &), Mean] &]
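
The three-argument form of GroupBy used above can be checked on a small hand-made list before running it on the full dataset. The following is a minimal sketch: the vendors, models, and scores are hypothetical and not taken from the benchmark data, and it assumes only the "Vendor" and "CorrectSyntax" keys that appear in the example above.

```wolfram
(* Hypothetical toy rows for illustration only; not real benchmark results *)
toy = {
   <|"Vendor" -> "A", "Model" -> "a1", "CorrectSyntax" -> 90|>,
   <|"Vendor" -> "A", "Model" -> "a2", "CorrectSyntax" -> 70|>,
   <|"Vendor" -> "B", "Model" -> "b1", "CorrectSyntax" -> 60|>
   };

(* Group rows by vendor, keep only the syntax score, reduce each group with Mean,
   then sort the resulting association by value in descending order *)
ReverseSort@GroupBy[toy, (#Vendor &) -> (#CorrectSyntax &), Mean]
(* <|"A" -> 80, "B" -> 60|> *)
```

The rule `key -> value` tells GroupBy which part of each row defines the group and which part is collected, so Mean is applied to a plain list of scores for each vendor.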

Wolfram Research, "LLMBenchmarks Data" from the Wolfram Data Repository (2025)  
