
By Teaching AI to Make Pictures and Write, Scientists Improve Its Grasp of Vision and Language

Posted on Sep 22 • Originally published at notes.aimodels.fyi

Machine learning models that can understand and generate both images and text have seen rapid progress recently. But how can we get these "multimodal" models to synergize, mutually boosting their abilities to comprehend and create visual and textual content? That's the key question tackled in an interesting new paper from Anthropic, Tsinghua University, Xi'an Jiaotong University, and MEGVII Technology.

Subscribe or follow me on Twitter for more content like this!

Multimodal models could be transformative for how we interact with AI systems. Imagine asking your personal assistant not just to describe a concept, but to generate or edit an image illustrating it. Or searching for media on the internet by describing it instead of using keywords. Enabling fluid joint understanding and generation of vision and language is a stepping stone towards more natural and intuitive human-AI interaction.

The authors propose DREAMLLM, a novel framework for training large multimodal language models (MLLMs) that can both understand and generate images and text. Here are the key elements (toy code sketches of these ideas appear near the end of this post):

- Uses diffusion models for image generation, which create images by gradually refining random noise into the desired output. This avoids compressing images into discrete tokens that lose detail.
- Generates images by score distillation: a pretrained diffusion model like Stable Diffusion guides the training, instead of the MLLM trying to reproduce the diffusion model's internal representations. This prevents information loss.
- Trains the model to generate free-form interleaved documents containing both text and images, modeling all possible combinations of conditioned outputs. This unified approach allows full learning synergy.
- Introduces "dream queries": learnable embeddings that extract multimodal semantics from the MLLM to condition image generation, avoiding tampering with the MLLM's core output space.

Experiments show state-of-the-art results on common multimodal benchmarks, significantly outperforming other MLLMs. DREAMLLM also demonstrates promising zero-shot capabilities in conditional image generation, compositional image editing, and generating coherent interleaved content from prompts. This work moves us closer to AI assistants that can truly understand and generate both visual and textual information.

Key takeaways:

- By training on free-form documents, DREAMLLM learns real-world patterns of interleaving text and images. This helps it develop a joint understanding of vision and language.
- Modeling images as pixels instead of discrete tokens preserves visual details. The dream queries act as an interpreter between modalities.
- Not forcing the model to match CLIP's image representations avoids bottlenecks and allows full knowledge transfer between modalities.
- Strong zero-shot performance shows the model develops a robust general intelligence spanning both images and text.
- Capabilities like conditional image editing hint at future applications in quickly generating customized visual content.

Of course, we're still far from human-level intelligence, and there are concerns around bias, safety, and misuse of generative models. But frameworks like DREAMLLM point the way towards more capable and cooperative AI assistants. The key insight is that jointly training generative abilities in both images and text leads to superior understanding and creativity overall.
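To make these ideas more concrete, here are a few toy sketches. First, interleaved document modeling. This is not the authors' code; the `<dream>` placeholder token and the document schema are my assumptions based on the paper's description of training on free-form interleaved documents:

```python
# A toy sketch of linearizing an interleaved image-text document for
# training. The <dream> placeholder and segment format are illustrative
# assumptions, not the paper's exact implementation.
DREAM_TOKEN = "<dream>"

def linearize(document):
    """Flatten text/image segments into one training sequence.

    Text segments contribute their tokens directly; each image segment
    contributes a single <dream> placeholder telling the model to predict
    an image (via its diffusion decoder) at that position.
    """
    sequence = []
    for segment in document:
        if segment["type"] == "text":
            sequence.extend(segment["text"].split())  # toy whitespace tokenizer
        else:
            sequence.append(DREAM_TOKEN)
    return sequence

doc = [
    {"type": "text", "text": "A corgi wearing a party hat:"},
    {"type": "image", "path": "corgi.png"},
    {"type": "text", "text": "The same corgi at the beach:"},
    {"type": "image", "path": "corgi_beach.png"},
]
print(linearize(doc))
# ['A', 'corgi', 'wearing', 'a', 'party', 'hat:', '<dream>', 'The', ...]
```

Because text and image positions are modeled in one autoregressive sequence, the model learns the real-world patterns of how documents mix the two modalities.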
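Next, the dream queries. A minimal PyTorch sketch, assuming the MLLM exposes its final hidden states; the module name, query count, and dimensions are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DreamQueries(nn.Module):
    """Learnable queries that read multimodal semantics out of the MLLM.

    A minimal sketch: the queries cross-attend over the language model's
    hidden states, so image conditioning is extracted without touching
    the LM's own output space.
    """
    def __init__(self, num_queries=64, d_model=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, mllm_hidden_states):
        # mllm_hidden_states: (batch, seq_len, d_model)
        batch = mllm_hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Queries pull semantics from the LM states via cross-attention.
        cond, _ = self.attn(q, mllm_hidden_states, mllm_hidden_states)
        return cond  # conditioning embeddings for the diffusion image decoder

hidden = torch.randn(2, 128, 768)  # stand-in for MLLM hidden states
cond = DreamQueries()(hidden)      # (2, 64, 768), used like text conditioning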
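Finally, how a frozen diffusion model can supply the learning signal. A sketch under stated assumptions: `unet` stands in for a frozen pretrained denoiser (e.g. Stable Diffusion's UNet) called as `unet(noisy_latents, t, cond)`, and the linear noising schedule is a simplification of real schedulers:

```python
import torch
import torch.nn.functional as F

def denoising_loss(unet, cond, latents, num_timesteps=1000):
    """One training step, sketched: `cond` comes from the dream queries.

    The frozen denoiser's objective supplies the learning signal, so the
    gradient flows back into `cond` (and hence into the MLLM) rather than
    forcing the MLLM to reproduce CLIP's internal image representations.
    """
    t = torch.randint(0, num_timesteps, (latents.size(0),), device=latents.device)
    noise = torch.randn_like(latents)
    # Toy linear noising schedule, for illustration only.
    alpha = 1.0 - t.float().div(num_timesteps).view(-1, 1, 1, 1)
    noisy = alpha.sqrt() * latents + (1.0 - alpha).sqrt() * noise
    pred = unet(noisy, t, cond)     # frozen weights; gradients still pass through
    return F.mse_loss(pred, noise)  # standard noise-prediction loss
```

That, roughly, is the "score distillation" idea the post describes: the pretrained diffusion model guides training through its denoising objective, and the MLLM is never asked to match anyone's intermediate representations.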
As AI continues crossing modalities, finding synergies between perception, reasoning, and creation will pave the path ahead.

Subscribe or follow me on Twitter for more content like this!


