variables: 852592
Data license: CC-BY
This data as json
id | name | unit | description | createdAt | updatedAt | code | coverage | timespan | datasetId | sourceId | shortUnit | display | columnOrder | originalMetadata | grapherConfigAdmin | shortName | catalogPath | dimensions | schemaVersion | processingLevel | processingLog | titlePublic | titleVariant | attributionShort | attribution | descriptionShort | descriptionFromProducer | descriptionKey | descriptionProcessing | licenses | license | grapherConfigETL | type | sort | dataChecksum | metadataChecksum |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
852592 | Test scores of the AI relative to human performance | 2024-04-02 22:09:24 | 2024-07-08 15:19:13 | 1998-2023 | 6457 | {} |
0 | performance | grapher/artificial_intelligence/2024-04-02/dynabench/dynabench#performance | 2 | minor | Human performance, as the benchmark, is set to zero. The capability of each AI system is normalized to an initial performance of -1. | [] |
We mapped the benchmarks to their respective domains based on a review of each benchmark's primary focus and the specific capabilities it tests within AI systems: - MNIST was mapped to "Handwriting recognition", as it tests AI systems' ability to recognize and classify handwritten digits, a fundamental task in the domain of digital image processing. - GLUE was categorized under "Language understanding" due to its assessment of models across a variety of linguistic tasks, highlighting the general capabilities of AI in understanding human language. - ImageNet was categorized as "Image recognition", focusing on the ability of AI systems to accurately identify and categorize images into predefined classes, showcasing the advancements in visual perception. - SQuAD 1.1 and SQuAD 2.0 were distinguished as "Reading comprehension" and "Reading comprehension with unanswerable questions" respectively. While both benchmarks evaluate reading comprehension, SQuAD 2.0 adds an extra layer of complexity with the introduction of unanswerable questions, demanding deeper understanding and reasoning from AI models. - BBH was aligned with "Complex reasoning", as it challenges AI with tasks that require not just logical reasoning but also creative thinking, simulating complex problem-solving scenarios. - Switchboard was associated with "Speech recognition" due to its focus on transcribing and understanding human speech within a conversational context, evaluating AI's ability to process and respond to spoken language. - MMLU was placed in "General knowledge tests", given its assessment across multiple disciplines and topics, requiring a broad and comprehensive understanding of language. - HellaSwag was mapped to "Predictive reasoning" for its evaluation of AI's ability to predict logical continuations within given contexts, testing commonsense reasoning and understanding. - HumanEval was categorized under "Code generation", focusing on AI's capability to understand programming languages and generate code that solves specific problems, highlighting skills in logical thinking and algorithmic problem-solving. - SuperGLUE was designated as "Nuanced language interpretation" due to its advanced set of linguistic tasks that require deep understanding, reasoning, and interpretation of text, pushing the boundaries of what AI can comprehend. - GSK8k was mapped to "Math problem-solving", as it tests AI on solving mathematical problems that involve reasoning and logical deduction, reflecting capabilities in numerical understanding and problem-solving. | float | [] |
a6595b55743a42288a208f1ac3835d7f | d1c49cd1390867c73746c31287476c96 |