ViLT: Vision-and-Language Transformer
Without Convolution or Region Supervision
Wonjae Kim* 1, Bokyung Son* 1, Ildoo Kim 2
Abstract
Vision-and-Language Pre-training (VLP) has im-
...
[Figure: Visual Embedding Schema; panel "Region Feature (ViLBERT, UNITER, ...)" with labels Image, CNN, Region]









