Multimodal Text Input and Output

20d

Z.ai debuts open source GLM-4.6V, a native tool-calling vision model for multimodal reasoning

Chinese AI startup Zhipu AI aka Z.ai has released its GLM-4.6V series, a new generation of open-source vision-language models (VLMs) optimized for multimodal reasoning, frontend automation, and ...

InfoQ

Multi-Modal LLM NExT-GPT Handles Text, Images, Videos, and Audio

A monthly overview of things you need to know as an architect or aspiring architect. Unlock the full InfoQ experience by logging in! Stay updated with your favorite authors and topics, engage with ...

InfoQ

Apple Open-Sources Multimodal AI Model 4M-21

Unlock the full InfoQ experience by logging in! Stay updated with your favorite authors and topics, engage with content, and download exclusive resources. Vivek Yadav, an engineering manager from ...

Forbes

Beyond The Screen: Designing Multimodal Interfaces For A Human-Centered Future

Technology has long promised to bring people closer together, yet so much of our digital life is flattened into a single pane of glass. Screens dominate our work, communication and entertainment. They ...

Forbes

Multimodal AI: A Powerful Leap With Complex Trade-Offs

Artificial intelligence is evolving into a new phase that more closely resembles human perception and interaction with the world. Multimodal AI enables systems to process and generate information ...

Geeky Gadgets

Google Gemini 2.0 Flash Multimodal AI Development Just Got Faster

Google today unveiled Gemini 2.0 Flash Experimental, designed to enable more immersive and interactive applications while introducing new coding agents that enhance workflows by acting directly on ...

Computer Weekly

Cloudinary developer lead: When (and when not) to use multi-modal LLMs for visual content

The latest trends in software development from the Computer Weekly Application Developer Network. This is a guest post by Sanjay Sarathy, VP of developer experience and self-service at Cloudinary, an ...

Ars Technica

Microsoft unveils AI model that understands image content, solves visual puzzles

On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ ...

13d

Guide to Talking Assistants With LangChain : Faster Calls, Smarter Tools, Clear Steps

Build a LangChain voice agent using a sandwich-style pipeline, targeting 250–750 ms replies and VAD, so conversations stay ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results