A Mixture of h - 1 Heads is Better than h Heads

Hao Peng; Roy Schwartz; Dianqi Li; Noah A. Smith

Machine Learning for NLP Long Paper

Session 11B: Jul 8 (06:00-07:00 GMT)

Session 14B: Jul 8 (18:00-19:00 GMT)

Abstract: Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead “reallocate” them—the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. Particularly, on the WMT14 English to German translation dataset, MAE improves over “transformer-base” by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.
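The connection between multi-head attention and a mixture of experts can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes that each "expert" is the sum of all head outputs except one, rescaled by h / (h - 1), and that a softmax gate over a hypothetical matrix `W_gate` supplies the input-dependent responsibilities. The function names, shapes, and random inputs are all illustrative. The alternating (block coordinate descent) training described in the abstract is not shown.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def leave_one_out_experts(head_outputs):
    """Build h experts from h head outputs.

    head_outputs: (h, d) array, one (already projected) output per head.
    Expert i sums every head except head i and rescales by h / (h - 1).
    This construction is an assumption made for illustration.
    """
    h = head_outputs.shape[0]
    total = head_outputs.sum(axis=0, keepdims=True)   # (1, d)
    return (total - head_outputs) * (h / (h - 1))     # (h, d)

def mixture_output(head_outputs, gate_weights):
    """Weighted combination of the experts; gate_weights has shape (h,)."""
    experts = leave_one_out_experts(head_outputs)
    return gate_weights @ experts                     # (d,)

rng = np.random.default_rng(0)
h, d = 8, 16
head_outputs = rng.normal(size=(h, d))

# With uniform responsibilities, the mixture equals the plain sum of heads.
uniform = np.full(h, 1.0 / h)
baseline = mixture_output(head_outputs, uniform)
assert np.allclose(baseline, head_outputs.sum(axis=0))

# An input-dependent gate (hypothetical linear gate) reweights the experts.
x = rng.normal(size=d)
W_gate = rng.normal(size=(h, d))
gate = softmax(W_gate @ x)
gated = mixture_output(head_outputs, gate)
```

Under these assumptions, the uniform gate recovers ordinary multi-head behaviour, while a non-uniform gate lets different inputs emphasise different subsets of heads, which is the "reallocation" idea the abstract describes.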

Similar Papers

Multiscale Collaborative Deep Models for Neural Machine Translation

Xiangpeng Wei, Heng Yu, Yue Hu, Yue Zhang, Rongxiang Weng, Weihua Luo

Adaptive Transformers for Learning Multimodal Representations

Prajjwal Bhargava

Hard-Coded Gaussian Attention for Neural Machine Translation

Weiqiu You, Simeng Sun, Mohit Iyyer

Roles and Utilization of Attention Heads in Transformer-based Neural Language Models

Jae-young Jo, Sung-Hyon Myaeng