Towers of Babel: Combining Images, Language, and 3D Geometry for
Learning Multimodal Vision
Xiaoshi Wu¹ Hadar Averbuch-Elor² Jin Sun² Noah Snavely²
¹Tsinghua University ²Cornell Tech, Cornell University
Figure 1: Our WikiScenes dataset combines 3D reconstructions, images, and language descriptions for dozens of landmarks, like the
Barcelona and Reims Cathedrals pictured above. WikiScenes enable ...









