EL-Attention: Memory Efficient Lossless Attention for Generation
Yu Yan 1, Jiusheng Chen 1, Weizhen Qi *2, Nikhil Bhendawade *1, Yeyun Gong *3, Nan Duan 3, Ruofei Zhang 4
Abstract

Transformer model with multi-head attention requires caching intermediate ...

... pruning layer (Fan et al., 2019) or training a smaller student model (Shleifer & Rush, 2020). Another way is non-autoregressive generation (Gu et al., 2018; Lee et al., 2018; ...
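To make the caching the abstract refers to concrete, below is a minimal sketch of plain incremental decoding with a key/value cache, not the paper's EL-attention; it assumes a single attention head, and the names (attend_one_step, w_q, w_k, w_v) are illustrative. Each decoding step appends its key and value to a cache so earlier tokens are never re-encoded, at the cost of memory that grows with sequence length (and, in a full model, with the number of layers and heads).

```python
# A minimal sketch, assuming a single head and illustrative weight names;
# this is standard key/value caching, not the paper's EL-attention.
import torch

def attend_one_step(x_t, w_q, w_k, w_v, cache):
    """Attention for one new token x_t of shape (1, d), reusing cached K/V."""
    q = x_t @ w_q                             # query for the new token only
    # Append this step's key/value: the cached intermediate results whose
    # memory cost the paper targets.
    cache["k"] = torch.cat([cache["k"], x_t @ w_k], dim=0)
    cache["v"] = torch.cat([cache["v"], x_t @ w_v], dim=0)
    scores = (q @ cache["k"].T) / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]

d = 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.zeros(0, d), "v": torch.zeros(0, d)}
for _ in range(4):                            # decode four tokens
    out = attend_one_step(torch.randn(1, d), w_q, w_k, w_v, cache)
print(cache["k"].shape)                       # torch.Size([4, 8]): cache grows each step
```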

