id,name,unit,description,createdAt,updatedAt,code,coverage,timespan,datasetId,sourceId,shortUnit,display,columnOrder,originalMetadata,grapherConfigAdmin,shortName,catalogPath,dimensions,schemaVersion,processingLevel,processingLog,titlePublic,titleVariant,attributionShort,attribution,descriptionShort,descriptionFromProducer,descriptionKey,descriptionProcessing,licenses,license,grapherConfigETL,type,sort,dataChecksum,metadataChecksum
736545,Average score on Atari games relative to humans (100%),%,"Average performance across 57 Atari 2600 games, such as Frogger and Pac-Man. Measured relative to human performance.",2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 0}",0,,,performance_atari,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_atari,,1,,,,,,,,,[],,,,,float,[],debbb42b8e0418e91d54651ef2936daf,600894df7ff422d688a1765249bba1f1
736544,Average score on Atari games relative to humans - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:32,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_atari_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_atari_improved,,1,,,,,,,,,[],,,,,string,[],686ff21987bb60f088eb68224744ff0e,0ec21035225b9b60867b29b72dce261f
736543,Top-5 accuracy - state of the art color code,%,,2023-07-03 14:54:48,2024-07-08 15:20:32,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 0}",0,,,papers_with_code_imagenet_top5_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#papers_with_code_imagenet_top5_improved,,1,,,,,,,,,[],,,,,string,[],f42064b3b9a6c49fb5f9142dd026d002,df3cbc0fca766a6c39ffb7029b9e57ee
736542,Top-1 accuracy - state of the art color code,%,,2023-07-03 14:54:48,2024-07-08 15:20:32,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 0}",0,,,papers_with_code_imagenet_top1_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#papers_with_code_imagenet_top1_improved,,1,,,,,,,,,[],,,,,string,[],0ddc701cb29eabe7a6b287dac24a3f23,86d7b79e8fd5ee68d2a42d7cbd7d61d5
736541,Top-5 accuracy,%,"The top-5 accuracy measure is used to assess how frequently a model's top five predictions include the correct answer from a list of 1000 options. Here's an example to illustrate what this benchmark tests: When an image classification model is presented with an image of an animal, it will assign probabilities to each possible label. Based on these probabilities, the model generates its top five predictions out of a total of 1000 animal labels. For instance, the model might output the following predictions as its top five guesses: * Cat * Dog * Elephant * Lion * Tiger Suppose the correct label for the image is ""dog."" If ""dog"" appears among the model's top five predictions, then the model's prediction is considered correct according to the top-5 accuracy metric. On the other hand, if the correct label is ""giraffe"" and ""giraffe"" is not included in the model's top five predictions, then the model's prediction would be considered incorrect based on the top-5 accuracy measure. To calculate the top-5 accuracy, researchers evaluate the model's performance on a large dataset with known labels. They compute the percentage of examples in the dataset where the correct label is present within the model's top five predictions out of the 1000 possible options. This measure provides a broader perspective on the model's performance by considering whether the correct answer is among its top guesses, even if it's not the model's absolute top prediction.",2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 0}",0,,,papers_with_code_imagenet_top5,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#papers_with_code_imagenet_top5,,1,,,,,,,,,[],,,,,float,[],87596d558e2e33107e675da91d259c0c,6dda9259f7ebf5d6eeba088e2c7b7d34
736540,Accuracy on Social Sciences knowledge tests - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_social_sciences_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_social_sciences_improved,,1,,,,,,,,,[],,,,,string,[],b992ab3e8f86d8d221143acabd9c2938,dcd398383d6aeddbb86fc35c7df6db5c
736539,Accuracy on other knowledge tests - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_other_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_other_improved,,1,,,,,,,,,[],,,,,string,[],39850a6eb1bff77861a1d84d342db5ab,d7ce4bd85ecb2cc3459df565224c8d30
736538,Accuracy on STEM subjects knowledge tests - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_stem_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_stem_improved,,1,,,,,,,,,[],,,,,string,[],b0654fd329f56dfa613289ba85e9871d,a17db9c89960c16ccb919fd4c0b2f9ba
736537,Top-1 accuracy,%,"The top-1 accuracy measure is used to assess how frequently a model's absolute top prediction matches the correct answer from a given set of options. Here's an example to illustrate what this benchmark tests: Imagine an image classification model that is presented with an image of an animal. The model assigns probabilities to each potential label and generates its highest-confidence prediction. For instance, when analyzing an image, the model might predict ""Cat"" as the most probable label. To evaluate the model's accuracy using the top-1 measure, researchers compare this prediction with the correct label. If the model's top prediction matches the correct label (e.g., if the actual animal in the image is indeed a cat), then the model's prediction is considered correct according to the top-1 accuracy metric. On the other hand, if the model's top prediction does not match the correct label (e.g., if the image shows a dog, but the model predicts a cat), then the model's prediction is considered incorrect based on the top-1 accuracy measure. To calculate the top-1 accuracy, researchers analyze the model's performance on a large dataset where the correct labels are known. They determine the percentage of examples in the dataset where the model's highest-confidence prediction matches the actual label. This measure provides a focused evaluation of the model's ability to make accurate predictions by considering only its absolute top guess.",2023-07-03 14:54:48,2024-07-08 15:20:32,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 0}",0,,,papers_with_code_imagenet_top1,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#papers_with_code_imagenet_top1,,1,,,,,,,,,[],,,,,float,[],469596c8ce37b6a7a480eee72e11ffa6,ed95d530167cd069ffe70e1aea3b8639
736536,With/without extra training data (ImageNet),,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,-2222-1612,6102,29574,,{},0,,,training_data,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#training_data,,1,,,,,,,,,[],,,,,string,[],16ebe6290f249f34d13b6720fab1bbb6,706500cc1f1a1b578abf42d96708f963
736535,Performance on math and problem-solving tasks - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:32,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_math_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_math_improved,,1,,,,,,,,,[],,,,,string,[],d2339daf1dc31d6450499d42ce3cd586,afccc5e233fdbdbc24cf0564a33be5dc
736534,Average accuracy on all knowledge tests - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_language_average_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_language_average_improved,,1,,,,,,,,,[],,,,,string,[],4403269846ee7c80f5af373bb56dbf5b,0797550ec27d98778d0213ad896d0a4f
736533,Coding performance on interviews - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_code_any_interview_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_code_any_interview_improved,,1,,,,,,,,,[],,,,,string,[],c0a0be512d226d8b4cc9ff795927bc66,badc9080ee8d8f301752ae06edcda06c
736532,Accuracy on Humanities subjects - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:32,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_humanities_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_humanities_improved,,1,,,,,,,,,[],,,,,string,[],b992ab3e8f86d8d221143acabd9c2938,65347d31a418002f0d207d7db7b8ae11
736531,Coding performance on competitions - state of the art color code,,,2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,,"{""zeroDay"": ""2019-01-01"", ""yearIsDay"": true}",0,,,performance_code_any_competition_improved,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_code_any_competition_improved,,1,,,,,,,,,[],,,,,string,[],ac9e342e1ebeedaf101000a573fead9c,ad72379039ff7138e494137582c8b6e5
736530,Accuracy on STEM subjects knowledge tests,%,"This benchmark assesses the accuracy of models in STEM subjects knowledge based on the MMLU benchmark. The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_stem,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_stem,,1,,,,,,,,,[],,,,,float,[],513861990b890431367eccbc6ec71d7e,5c9842d32397d9bd8ce9d2998fed12f7
736529,Accuracy on other knowledge tests,%,"This benchmark assesses the average accuracy of models across subjects other than STEM, humanities, and social sciences based on the MMLU benchmark. The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.",2023-07-03 14:54:48,2024-07-08 15:20:31,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_other,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_other,,1,,,,,,,,,[],,,,,float,[],774883813cce879e7ac5d7a67fdc1df0,6432122c3d15ef28b79e639d497dbf8c
736528,Accuracy on Social Sciences knowledge tests,%,"This benchmark assesses the accuracy of models in social sciences knowledge based on the MMLU benchmark. The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_social_sciences,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_social_sciences,,1,,,,,,,,,[],,,,,float,[],a1e53397b333244f23dfd6be697bf314,0e7cfd6ae1bdc23802beb32679306b20
736527,Performance on math and problem-solving tasks,%,"This benchmark assesses the accuracy of models on math and problem-solving tasks based on the MATH benchmark. The MATH benchmark consists of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_math,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_math,,1,,,,,,,,,[],,,,,float,[],a4f2238d9420d82691a35fdb80133198,4c3dffb26fcfc27eaf7a989b1a2fe3ba
736526,Coding performance on interviews,%,"This benchmark assesses the accuracy of models in coding interviews based on the APPS benchmark. The APPS benchmark focuses on coding ability and problem-solving in a natural language context, simulating the evaluation process employed during human programmer interviews. It presents coding problems in unrestricted natural language and evaluates the correctness of solutions. The coding tasks within this benchmark are sourced from open-access coding websites such as Codeforces and Kattis. These tasks cover a spectrum of difficulty levels, ranging from introductory to collegiate competition level. The benchmark measures the accuracy of models in solving programming tasks specifically tailored for coding interviews.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_code_any_interview,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_code_any_interview,,1,,,,,,,,,[],,,,,float,[],fe03953458ce64c74fe18dc40fc1035c,a0f205e9c914f0a51b0d8df8ebf7c0a8
736525,Accuracy on Humanities knowledge tests,%,"This benchmark assesses the accuracy of models in humanities knowledge based on the MMLU benchmark. The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_humanities,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_humanities,,1,,,,,,,,,[],,,,,float,[],2b39edfe3d2594d0629d9ceab4df6ccd,33cc4c0d6b20bf33ea4f7fc9e31aae0e
736524,Average accuracy on all knowledge tests,%,"This benchmark assesses the average accuracy of models across all subjects based on the MMLU benchmark. The MMLU benchmark covers a wide range of 57 subjects, including STEM, humanities, social sciences, and more. It encompasses subjects of varying difficulty levels, spanning from elementary concepts to advanced professional topics. This comprehensive benchmark assesses not only world knowledge but also problem-solving abilities.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_language_average,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_language_average,,1,,,,,,,,,[],,,,,float,[],350d88ac6e544f9a30fdfc2d60f9ca9f,a4e7bac7e8e71ce4627973ea44b60302
736523,Coding performance on competitions,%,"This benchmark measures the accuracy of models in coding competitions based on the APPS benchmark. The APPS benchmark focuses on coding ability and problem-solving in a natural language context. It aims to replicate the evaluation process used for human programmers by presenting coding problems in unrestricted natural language and assessing the correctness of solutions. The coding tasks included in this benchmark are sourced from open-access coding websites such as Codeforces and Kattis. These tasks span a range of difficulty levels, from introductory to collegiate competition level. The benchmark evaluates the accuracy of models in solving programming tasks specifically designed for coding competitions.",2023-07-03 14:54:48,2024-07-08 15:20:30,,,,6102,29574,%,"{""unit"": ""%"", ""zeroDay"": ""2019-01-01"", ""shortUnit"": ""%"", ""yearIsDay"": true, ""numDecimalPlaces"": 1}",0,,,performance_code_any_competition,grapher/artificial_intelligence/2023-06-14/papers_with_code_benchmarks/papers_with_code_benchmarks#performance_code_any_competition,,1,,,,,,,,,[],,,,,float,[],477f6de5c6607f577f8c7992350adfa7,bc2253ceace36d8dbcc5d84fe34cda70