id,name,unit,description,createdAt,updatedAt,code,coverage,timespan,datasetId,sourceId,shortUnit,display,columnOrder,originalMetadata,grapherConfigAdmin,shortName,catalogPath,dimensions,schemaVersion,processingLevel,processingLog,titlePublic,titleVariant,attributionShort,attribution,descriptionShort,descriptionFromProducer,descriptionKey,descriptionProcessing,licenses,license,grapherConfigETL,type,sort,dataChecksum,metadataChecksum
736556,Average score on Atari games relative to humans - state of the art,%,"Average performance across 57 Atari 2600 games, such as Frogger and Pac-Man. Measured relative to human performance.",2023-07-03 14:54:57,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Average score relative to humans (100%)"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_atari_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_atari_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],d7f9dcc0ad874eb1e2fe7260b12e3342,563d791fd517d44c90296e9f932d64a0
736555,Top-5 accuracy - state of the art,%,"The top-5 accuracy measure is used to assess how frequently a model's top five predictions include the correct answer from a list of 1000 options. Here's an example to illustrate what this benchmark tests:

When an image classification model is presented with an image of an animal, it will assign probabilities to each possible label. Based on these probabilities, the model generates its top five predictions out of a total of 1000 animal labels. For instance, the model might output the following predictions as its top five guesses: * 		Cat * 		Dog * 		Elephant * 		Lion * 		Tiger

Suppose the correct label for the image is ""dog."" If ""dog"" appears among the model's top five predictions, then the model's prediction is considered correct according to the top-5 accuracy metric.

On the other hand, if the correct label is ""giraffe"" and ""giraffe"" is not included in the model's top five predictions, then the model's prediction would be considered incorrect based on the top-5 accuracy measure.

To calculate the top-5 accuracy, researchers evaluate the model's performance on a large dataset with known labels. They compute the percentage of examples in the dataset where the correct label is present within the model's top five predictions out of the 1000 possible options. This measure provides a broader perspective on the model's performance by considering whether the correct answer is among its top guesses, even if it's not the model's absolute top prediction.
",2023-07-03 14:54:57,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Top-5 accuracy"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,papers_with_code_imagenet_top5_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#papers_with_code_imagenet_top5_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],62c1f0df7ed63b479161c02b89807098,4424413581606c35ff8203d92e7deb4e
736554,Top-1 accuracy - state of the art,%,"The top-1 accuracy measure is used to assess how frequently a model's absolute top prediction matches the correct answer from a given set of options.  Here's an example to illustrate what this benchmark tests:

Imagine an image classification model that is presented with an image of an animal. The model assigns probabilities to each potential label and generates its highest-confidence prediction. For instance, when analyzing an image, the model might predict ""Cat"" as the most probable label. To evaluate the model's accuracy using the top-1 measure, researchers compare this prediction with the correct label. If the model's top prediction matches the correct label (e.g., if the actual animal in the image is indeed a cat), then the model's prediction is considered correct according to the top-1 accuracy metric. On the other hand, if the model's top prediction does not match the correct label (e.g., if the image shows a dog, but the model predicts a cat), then the model's prediction is considered incorrect based on the top-1 accuracy measure. To calculate the top-1 accuracy, researchers analyze the model's performance on a large dataset where the correct labels are known. They determine the percentage of examples in the dataset where the model's highest-confidence prediction matches the actual label.

This measure provides a focused evaluation of the model's ability to make accurate predictions by considering only its absolute top guess.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Top-1 accuracy"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,papers_with_code_imagenet_top1_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#papers_with_code_imagenet_top1_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],c3edb83eb26f04725e38dd677ed265d8,407fbeaf907a93a51011e14827e7880e
736553,Accuracy on STEM subjects knowledge tests - state of the art,%,"This benchmark assesses the accuracy of models in STEM subjects knowledge based on the MMLU benchmark.
The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""STEM"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_stem_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_stem_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],88bb055bc368ea6c24195bad08a4d0a6,62b43560ab33d6efb5e0775cc63ed8dd
736552,Accuracy on Social Sciences knowledge tests - state of the art,%,"This benchmark assesses the accuracy of models in social sciences knowledge based on the MMLU benchmark.
The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Social Sciences"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_social_sciences_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_social_sciences_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],f04e4879d7e4233495cd8af8659c155e,f4b77284c73a6b1fb22a6b85b763271e
736551,Accuracy on other knowledge tests - state of the art,%,"This benchmark assesses the average accuracy of models across subjects other than STEM, humanities, social sciences based on the MMLU benchmark.
The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Other subjects"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_other_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_other_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],bea126b32654692af657904a6632b056,cf77260a3f43400fbba2407e9dfe826b
736550,Performance on math and problem-solving tasks - state of the art,%,"This benchmark assesses the accuracy of models on math and problem solving tasks based on the MATH benchmark.
The MATH benchmark consists of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Math"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_math_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_math_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],c137694412389a7ce8227c4b838c3b88,d52a4bc3ace23a783eff1c9425d6e787
736549,Average accuracy on all knowledge tests - state of the art,%,"This benchmark assesses the average accuracy of models across all subjects based on the MMLU benchmark.
The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""All knowledge tests"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_language_average_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_language_average_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],af0bf91d65df364d67dd163d4e44a25e,0d9d524a5be665904706435b4fb42429
736548,Accuracy on Humanities knowledge tests - state of the art,%,"This benchmark assesses the accuracy of models in humanities knowledge based on the MMLU benchmark.
The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Humanities"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_humanities_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_humanities_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],24961a79aa262e5cce8769cfa124d269,c91d9a55d6ab1d0591cf78c233f8a562
736547,Coding performance on interviews - state of the art,%,"This benchmark assesses the accuracy of models in coding interviews based on the APPS benchmark. The APPS benchmark focuses on coding ability and problem-solving in a natural language context, simulating the evaluation process employed during human programmer interviews. It presents coding problems in unrestricted natural language and evaluates the correctness of solutions.
The coding tasks within this benchmark are sourced from open-access coding websites such as Codeforces and Kattis. These tasks cover a spectrum of difficulty levels, ranging from introductory to collegiate competition level. The benchmark measures the accuracy of models in solving programming tasks specifically tailored for coding interviews.
",2023-07-03 14:54:56,2024-07-08 16:38:16,,,,6103,29583,%,"{""name"": ""Coding interviews"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_code_any_interview_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_code_any_interview_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],44c47793e87683e34b61c0e2880952b6,b72a908de22a8f69842216484807e768
736546,Coding performance on competitions - state of the art,%,"This benchmark measures the accuracy of models in coding competitions based on the APPS benchmark. The APPS benchmark focuses on coding ability and problem-solving in a natural language context. It aims to replicate the evaluation process used for human programmers by presenting coding problems in unrestricted natural language and assessing the correctness of solutions.
The coding tasks included in this benchmark are sourced from open-access coding websites such as Codeforces and Kattis. These tasks span a range of difficulty levels, from introductory to collegiate competition level. The benchmark evaluates the accuracy of models in solving programming tasks specifically designed for coding competitions.
",2023-07-03 14:54:56,2024-07-08 16:38:15,,,,6103,29583,%,"{""name"": ""Coding competitions"", ""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_code_any_competition_state_of_the_art,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks_state_of_the_art/papers_with_code_benchmarks_state_of_the_art#performance_code_any_competition_state_of_the_art,,1,,,,,,,,,[],,,,,float,[],e38c6810553c8c10b11ca0cd40cb53f6,c1218e88d9542613ef75a81d5a877570