OpenAI last week demonstrated its GPT-4o multimodal AI model, and Google followed a day later with a demo of its Project Astra (a set of capabilities coming later to Google’s Gemini). Both initiatives use video input (along with audio) to prompt sophisticated, powerful, and natural AI chatbot responses.
Both demos were impressive and groundbreaking, and they showed off similar feats.
OpenAI is either further ahead of Google or less timid (probably both), as the company promised public availability of what it demonstrated within weeks, whereas Google promised something “later this year.” More to the point, OpenAI claims that its new model is twice as fast as, and half the cost of, GPT-4 Turbo. (Google didn’t feel confident enough to brag about the performance or cost of Astra’s features.)
Before these demos, the public knew the word “multimodal” mostly from Meta, which has heavily promoted multimodal features for its Ray-Ban Meta glasses over the past couple of months.
The experience of using the Ray-Ban Meta glasses’ multimodal feature goes something like this: You say, “Hey, Meta, look and tell me what you see.” You hear a click, indicating that a picture is being taken, and, after a few seconds, you hear a spoken answer such as “It’s a building” or some other general description of the objects in the frame.
The Ray-Ban Metas use the integrated camera to capture a single still image, not video, and the result is somewhat unimpressive, especially in light of the multimodal demos by OpenAI and Google.
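To make the distinction concrete, here is a minimal sketch, assuming the OpenAI Python SDK and GPT-4o as the example model, of what a one-shot “look and describe” request over a single still image looks like. The file path, prompt wording, and function name are hypothetical placeholders, not anything Meta or OpenAI has published about the glasses.

```python
# Illustrative sketch only: describe one still image, roughly the kind of
# single-snapshot interaction the glasses perform. Model choice, file path,
# and prompt are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_still_image(path: str) -> str:
    # Encode the captured frame as base64 so it can be sent inline.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Look and tell me what you see."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(describe_still_image("snapshot.jpg"))  # hypothetical captured frame
```

A video-driven assistant like the ones OpenAI and Google demoed works over a continuous stream of frames and audio rather than one snapshot per question, which is what makes those demos feel so much more fluid than the single-picture approach.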
Source: www.computerworld.com