Basically, the focus of this project is its ability to know which part of an image you're referring to. They show a few examples, but the reference can be a single point, a box, or a freeform trace around the object.
The model then converts (grounds) that reference into a specific region of the image, which they visualise as a bounding box around it.
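
To make the three kinds of reference concrete, here is a rough Python sketch of how a point, a box, and a freeform trace could each be represented and reduced to a bounding box. This is not the project's code: the `Point`, `Box`, `Trace` and `enclosing_box` names are made up for illustration, and the real grounding is done by the model rather than by simple geometry like this.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Hypothetical types for the three kinds of reference; not from the project.
@dataclass
class Point:
    x: float
    y: float


@dataclass
class Box:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Trace:
    # Freeform outline drawn around the object, as (x, y) pixel coordinates.
    points: List[Tuple[float, float]]


Region = Union[Point, Box, Trace]


def enclosing_box(region: Region) -> Box:
    """Return the smallest axis-aligned box containing the reference.

    This is plain geometry, used only to show what a bounding box output
    looks like. It does not try to find the object the user meant.
    """
    if isinstance(region, Point):
        # A single point becomes a zero-area box at that location.
        return Box(region.x, region.y, region.x, region.y)
    if isinstance(region, Box):
        return region
    if isinstance(region, Trace):
        if not region.points:
            raise ValueError("a trace needs at least one point")
        xs = [p[0] for p in region.points]
        ys = [p[1] for p in region.points]
        return Box(min(xs), min(ys), max(xs), max(ys))
    raise TypeError(f"unsupported region type: {type(region).__name__}")


# A rough circle drawn around something on a table.
trace = Trace(points=[(120, 80), (180, 70), (200, 130), (140, 150)])
box = enclosing_box(trace)
```

A learned model would go further than this, for example expanding a single point to the whole object it lands on, which simple geometry cannot do.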
Currently, you'd have to ask "What's that thing on the table?" or trust a service like ChatGPT to correctly understand the part of the image you've circled. This is meant to make pointing at a specific part of the image way more accurate.