
KOSMOS-2 brings grounding to vision-language models, letting AI pinpoint visual regions based on text. In this blog, I explore how well it performs through real-world experiments and highlight both its promise and limitations in grounding and image understanding.
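For anyone who wants to poke at the model themselves, here is a minimal sketch of grounded captioning with the publicly documented Hugging Face `transformers` API. The checkpoint name `microsoft/kosmos-2-patch14-224`, the `<grounding>` prompt prefix, and the sample image URL come from the public model card; the blog's exact setup may differ.

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

# Load the public KOSMOS-2 checkpoint.
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")

# Any RGB image works; this URL is the model card's sample image.
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

# The <grounding> token asks the model to link noun phrases to image regions.
prompt = "<grounding>An image of"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation strips the location tokens and returns the clean
# caption plus a list of (phrase, character span, normalized bounding boxes).
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```

The `entities` list is what "pinpointing visual regions" means in practice: each grounded phrase comes back paired with normalized box coordinates you can draw over the image.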
21 sats \ 1 reply \ @k00b 23h
VLMs, in contrast, are grounded in visual reality. They are trained on massive datasets containing millions, or even billions, of image-text pairs.
This process forces the model to connect the word "sunset" not just to other words, but to the actual pixels, colors, and shapes that constitute a sunset in a photograph. This grounding in the visual world makes their understanding richer, more concrete, and fundamentally closer to how humans perceive the world.
I remain super curious about the data pipelines of these things.
The model demonstrates impressive capabilities in associating textual phrases with visual regions, especially for straightforward prompts and well-lit, distinct images.
However, it still struggles with contextual complexity and logical reasoning, often producing hallucinated or incorrect outputs when faced with open-ended or abstract questions.
I'm also curious about how wrong models can be and still be useful. Like, how often is something that works sometimes better than nothing?
Like, how often is something that works sometimes better than nothing?
If you're a gambler, all the time. The problem starts when you have standards; then it gets crappy real fast, unless you have a scalable means to judge the output and discard it at scale too.
I think that this is the hardest part of all generation in practice.
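Concretely, "judge and discard at scale" tends to look like the sketch below: sample several outputs and keep only the ones that pass an automatic check. The `generate_candidate` and `score_output` functions here are hypothetical stand-ins, not part of KOSMOS-2 or any library; plug in whatever generator and verifier you actually trust.

```python
import random

def generate_candidate(prompt: str) -> str:
    """Hypothetical wrapper around a model's generate call."""
    return f"candidate grounding for: {prompt}"

def score_output(output: str) -> float:
    """Hypothetical verifier: e.g. IoU against an off-the-shelf detector,
    a rules check, or a judge model. Placeholder returns a random score."""
    return random.random()

def generate_and_filter(prompt: str, n_samples: int = 8, threshold: float = 0.7) -> list[str]:
    # Sample many candidates, score each, and discard anything below threshold.
    candidates = [generate_candidate(prompt) for _ in range(n_samples)]
    return [c for c in candidates if score_output(c) >= threshold]

kept = generate_and_filter("a person walking a dog")
print(f"kept {len(kept)} of 8 candidates")
```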
This is cool for computer vision projects, but at the same time, it's kinda creepy when you think about what people can do with it and what could go wrong.