Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
UNC Chapel Hill
{jmincho,jielei,haotan,mbansal}@cs.unc.edu
Abstract
Existing methods for vision-and-language learning typically require
designing task-specific architectures and objectives for each task. For
example, a multi-label answer classifier for visual question answering,
a region scorer for ref ...









