Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing

Clara Meister; Elizabeth Salesky; Ryan Cotterell

Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing

Clara Meister, Elizabeth Salesky, Ryan Cotterell

Abstract Paper Share

Machine Learning for NLP Long Paper

Session 12A: Jul 8 (08:00-09:00 GMT)

Session 14B: Jul 8 (18:00-19:00 GMT)

Abstract: Prior work has explored directly regularizing the output distributions of probabilistic models to alleviate peaky (i.e. over-confident) predictions, a common sign of overfitting. This class of techniques, of which label smoothing is one, has a connection to entropy regularization. Despite the consistent success of label smoothing across architectures and data sets in language generation tasks, two problems remain open: (1) there is little understanding of the underlying effects entropy regularizers have on models, and (2) the full space of entropy regularization techniques is largely unexplored. We introduce a parametric family of entropy regularizers, which includes label smoothing as a special case, and use it to gain a better understanding of the relationship between the entropy of a model and its performance on language generation tasks. We also find that variance in model performance can be explained largely by the resulting entropy of the model. Lastly, we find that label smoothing provably does not allow for sparsity in an output distribution, an undesirable property for language generation models, and therefore advise the use of other entropy regularization methods in its place.

You can open the pre-recorded video in a separate window.

NOTE: The SlidesLive video may display a random order of the authors. The correct author list is shown at the top of this webpage.

Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing

Clara Meister, Elizabeth Salesky, Ryan Cotterell

Similar Papers

Effective Estimation of Deep Generative Language Models

Tom Pelsmaeker, Wilker Aziz,

On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond

Chen Wu, Prince Zizhuang Wang, William Yang Wang,

Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network

Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, Ting Liu,

Predicting Performance for Natural Language Processing Tasks

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig,