Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Marco Tulio Ribeiro; Tongshuang Wu; Carlos Guestrin; Sameer Singh

Resources and Evaluation Long Paper

Session 9A: Jul 7 (17:00-18:00 GMT)

Session 10A: Jul 7 (20:00-21:00 GMT)

Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models focus either on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitates comprehensive test ideation, as well as a software tool to quickly generate a large number of diverse test cases. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
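To make the idea of generating many test cases for one capability more concrete, here is a minimal sketch in plain Python. It does not use the CheckList tool itself: `expand_template`, `toy_sentiment`, and `failure_rate` are hypothetical names invented for this illustration, and the keyword-based "model" is a deliberately weak stand-in rather than any system evaluated in the paper. The sketch fills a template to produce test cases for a negation capability and reports the fraction of cases the model gets wrong.

```python
from itertools import product


def expand_template(template, **slots):
    """Fill the template with every combination of the given slot values."""
    keys = list(slots)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(slots[k] for k in keys))
    ]


def toy_sentiment(text):
    """Hypothetical stand-in model: keyword lookup that ignores negation."""
    words = text.lower().rstrip(".").split()
    if any(w in {"good", "great"} for w in words):
        return "positive"
    if any(w in {"bad", "awful"} for w in words):
        return "negative"
    return "neutral"


def failure_rate(model, cases, expected):
    """Return the fraction of cases where the model misses the expected label."""
    failures = [c for c in cases if model(c) != expected]
    return len(failures) / len(cases), failures


# Simple sanity check: plain positive statements should be labeled positive.
plain_cases = expand_template(
    "The {noun} was {adj}.",
    noun=["food", "flight", "service"],
    adj=["good", "great"],
)
plain_rate, _ = failure_rate(toy_sentiment, plain_cases, "positive")

# Negation capability: a negated positive adjective should be labeled negative.
negation_cases = expand_template(
    "The {noun} was not {adj}.",
    noun=["food", "flight", "service"],
    adj=["good", "great"],
)
negation_rate, negation_failures = failure_rate(
    toy_sentiment, negation_cases, "negative"
)

print(f"plain statements: {plain_rate:.0%} failures")
print(f"negation: {negation_rate:.0%} failures")
```

In this toy setup the stand-in model passes every plain statement but fails every negated one, which is the kind of gap a single held-out accuracy number can hide. Reporting a failure rate per capability, rather than one aggregate score, is the design choice the sketch is meant to illustrate.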

Similar Papers

Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling

David Harbecke, Christoph Alt

Do Neural Models Learn Systematicity of Monotonicity Inference in Natural Language?

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui

Machine Learning-Driven Language Assessment

Burr Settles, Masato Hagiwara, Geoffrey T. LaFlair

Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses

Erfan Sadeqi Azer, Daniel Khashabi, Ashish Sabharwal, Dan Roth