TL;DR: We propose a new vision-language representation learning framework that achieves state-of-the-art performance by first aligning the unimodal representations before fusing them.

Vision and language are two of the most fundamental channels through which humans perceive the world. It has been a long-standing goal in AI to build intelligent machines that can jointly understand vision data (images) and language data (texts). Vision-and-language pre-training (VLP) has emerged as an effective approach to this problem. However, existing methods have three major limitations.

Limitation 1: Methods represented by CLIP and ALIGN learn a unimodal image encoder and a unimodal text encoder, and achieve impressive performance on representation learning tasks. However, they lack the ability to model complex interactions between image and text, so they are not good at tasks that require fine-grained image-text understanding.

Limitation 2: Methods represented by UNITER employ a multimodal encoder to jointly model image and text. However, the input to the multimodal transformer contains unaligned region-based image features and word token embeddings. Since the visual features and textual features reside in their own spaces, it is challenging for the multimodal encoder to learn to model their interactions. Furthermore, most of these methods rely on a pre-trained object detector for image feature extraction, which is expensive in both annotation and computation.

Limitation 3: The datasets used for pre-training mostly consist of noisy image-text pairs collected from the Web. Widely used pre-training objectives such as masked language modeling (MLM) are prone to overfitting to the noisy text, which hurts representation learning.
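To make the "align before fuse" idea concrete, below is a minimal PyTorch sketch of the two-stage design described above. The module names, layer sizes, and the use of generic transformer layers are illustrative assumptions, not the actual implementation: unimodal image and text embeddings are first aligned with an image-text contrastive loss, and only then fused by a multimodal encoder in which text tokens cross-attend to image tokens.

```python
# A minimal sketch of "align before fuse" (assumptions: generic transformer
# stand-ins for the vision/text encoders, hypothetical dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignBeforeFuse(nn.Module):
    def __init__(self, dim=256, hidden=768):
        super().__init__()
        # Stand-ins for a vision transformer and a text transformer.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2)
        # Projection heads mapping [CLS] features into a shared embedding space.
        self.vision_proj = nn.Linear(hidden, dim)
        self.text_proj = nn.Linear(hidden, dim)
        # Multimodal encoder: text tokens cross-attend to image tokens.
        self.fusion = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2)
        # Learnable temperature for the contrastive similarity.
        self.temp = nn.Parameter(torch.tensor(0.07))

    def forward(self, image_tokens, text_tokens):
        img = self.visual_encoder(image_tokens)   # (B, N_img, hidden)
        txt = self.text_encoder(text_tokens)      # (B, N_txt, hidden)

        # Step 1 -- align: an image-text contrastive loss over the [CLS]
        # embeddings pulls matched pairs together before any fusion happens,
        # so the two modalities no longer live in disjoint spaces.
        img_emb = F.normalize(self.vision_proj(img[:, 0]), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt[:, 0]), dim=-1)
        sim = img_emb @ txt_emb.t() / self.temp   # (B, B) pairwise similarity
        targets = torch.arange(sim.size(0), device=sim.device)
        loss_itc = (F.cross_entropy(sim, targets) +
                    F.cross_entropy(sim.t(), targets)) / 2

        # Step 2 -- fuse: the multimodal encoder now models cross-modal
        # interactions over features that are already roughly aligned.
        fused = self.fusion(tgt=txt, memory=img)  # (B, N_txt, hidden)
        return loss_itc, fused


# Toy usage with random features standing in for patch/word token inputs.
model = AlignBeforeFuse()
loss, fused = model(torch.randn(4, 16, 768), torch.randn(4, 12, 768))
print(loss.item(), fused.shape)
```

Note the design choice this sketch illustrates: because alignment happens before fusion, the multimodal encoder receives grounded inputs rather than unaligned detector regions and word embeddings, which addresses Limitation 2 directly.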