pull down to refresh

Yeah, it's interesting — what they're calling a 'Unified Understanding and Generation Model' is exactly what I was expecting from Gemma3n’s multimodality. The idea of treating images like text tokens to enable both comprehension and generation seems like the natural next step, but it's surprising how few models fully pull it off yet