Source

NeurIPS Workshop

Date of publication

12/15/2021

Authors

Tatyana Shavrina

Valentin Malykh

How not to Lie with a Benchmark: Rearranging NLP Leaderboards

Abstract

Comparison with human performance is an essential requirement for a benchmark to be a reliable measure of model capabilities. Nevertheless, the methods used for model comparison may have a fundamental flaw: the arithmetic mean of separate metrics is applied across all tasks, regardless of their differing complexity and the differing sizes of their test and training sets.
In this paper, we examine the overall scoring methods of popular NLP benchmarks and rearrange the models by geometric and harmonic mean (both appropriate for averaging rates) according to their reported results. We analyze several popular benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows, for example, that human-level performance on SuperGLUE has still not been reached, and that current models leave room for improvement.
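As a rough illustration of why the choice of mean matters, the sketch below ranks two models under the arithmetic, geometric, and harmonic means. The model names and per-task scores are hypothetical and chosen only to make the effect visible; they are not taken from the paper or from any real leaderboard, and the helper `rank` is an illustrative name rather than the authors' code.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-task scores in (0, 1]; NOT from any real leaderboard.
# model_a is strong on two tasks and weak on one; model_b is uniformly decent.
SCORES = {
    "model_a": [0.95, 0.95, 0.50],
    "model_b": [0.80, 0.80, 0.78],
}


def rank(scores, aggregate):
    """Return model names sorted from best to worst under `aggregate`."""
    return sorted(scores, key=lambda name: aggregate(scores[name]), reverse=True)


# One ranking per averaging method, keyed by the function's name.
RANKINGS = {
    fn.__name__: rank(SCORES, fn)
    for fn in (mean, geometric_mean, harmonic_mean)
}

if __name__ == "__main__":
    for method, order in RANKINGS.items():
        print(f"{method}: {order}")
```

With these made-up numbers, the arithmetic mean places `model_a` first, while the geometric and harmonic means both place `model_b` first, because they penalize the single weak task more heavily.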

Similar publications

mGPT: Few-Shot Learners Go Multilingual
Oleh Shliazhko, Alena Fenogenova, Maria Tikhonova, Anastasia Kozlova, Vladislav Mikhailov, Tatyana Shavrina

DetIE: Multilingual Open Information Extraction Inspired by Object Detection
Michael Vasilkovsky, Anton Alekseev, Valentin Malykh, Ilya Shenbin, Elena Tutubalina, Dmitriy Salikhov, Mikhail Stepnov, Andrey Chertok, Sergey Nikolenko

A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification
Varvara Logacheva, Daryna Dementieva, Irina Krotova, Alena Fenogenova, Irina Nikishina, Tatyana Shavrina, Alexander Panchenko

Vote’n’Rank: Revision of Benchmarking with Social Choice Theory
Mark Rofin, Vladislav Mikhailov, Mikhail Florinskiy, Andrey Kravchenko, Elena Tutubalina, Tatyana Shavrina, Daniel Karabekyan, Ekaterina Artemova

A System for Answering Simple Questions in Multiple Languages
Anton Razzhigaev, Mikhail Salnikov, Valentin Malykh, Pavel Braslavski, Alexander Panchenko

WikiOmnia: generative QA corpus on the whole Russian Wikipedia
Tatyana Shavrina, Dina Pisarevskaya

5q032e@SMM4H’22: Transformer-based classification of premise in tweets related to COVID-19
Vadim Porvatov, Natalia Semenova

Artificial Intelligence Research Institute AIRI

You can ask us a question or suggest a joint project in the field of AI

partner@airi.net

For scientific cooperation and
partnership

pr@airi.net

For journalists and media

people@airi.net

For any questions connected with
employees and employment

© 2024, AIRI
