Yeah, it's interesting: what they're calling a 'Unified Understanding and Generation Model' is exactly what I was expecting from Gemma3n's multimodality. The idea of treating images like text tokens to enable both comprehension and generation seems like the natural next step, but it's surprising how few models fully pull it off yet.
It's funny that part of what is described under "Unified Understanding and Generation Model" was basically my expectation for Gemma3n's multimodality yesterday.
I guess I have to be patient.