USA Net: Unified Semantic and Affordance
Representations for Robot Memory

Meta AI  




In order for robots to follow open-ended instructions like "go open the brown cabinet over the sink," they require an understanding of both the geometry and the semantics of their environment. Robotic systems often handle these through separate pipelines, sometimes using very different representation spaces, which can be suboptimal when the two objectives conflict. In this work, we present USA Net, a simple method for constructing a world representation that encodes both the semantics and spatial affordances of a scene in a single differentiable map. This allows us to build a gradient-based planner that can navigate to locations in the scene specified using open-ended vocabulary. We use this planner to consistently generate trajectories that are both 5-10% shorter and 10-30% closer to our goal query in CLIP embedding space than paths from comparable grid-based planners that do not leverage gradient information. To our knowledge, this is the first end-to-end differentiable planner that optimizes for both semantics and affordance in a single implicit map.

Bullet Points

  • Robots often want to navigate to a location in the world specified using open-ended vocabulary

  • A differentiable planner requires a representation of the world which encodes both affordance (so that it doesn't run into things) and semantics (so that it knows where to go to reach the goal)

  • We present a simple world representation which uses an MLP to map each coordinate in a scene to a semantic embedding and the distance to the nearest obstacle

  • This representation allows us to implement various planners on top of it, which we evaluate on a variety of scenes and goals
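The representation described in the bullets above can be sketched as a single MLP whose output packs a semantic embedding together with a scalar SDF value. This is a minimal illustrative sketch, not the paper's exact architecture: the layer sizes, activation, and the 512-dimensional embedding (the size of CLIP ViT-B/32 outputs) are all assumptions.

```python
import numpy as np

EMB_DIM = 512  # assumed CLIP ViT-B/32 embedding size

rng = np.random.default_rng(0)

def init_mlp(in_dim=3, hidden=256, out_dim=EMB_DIM + 1):
    """Two-layer MLP; the output packs [semantic embedding, SDF value]."""
    return {
        "W1": rng.standard_normal((in_dim, hidden)) * 0.1,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, out_dim)) * 0.1,
        "b2": np.zeros(out_dim),
    }

def forward(params, xyz):
    """Map a batch of coordinates (N, 3) to embeddings (N, EMB_DIM)
    and distances to the nearest obstacle (N,)."""
    h = np.tanh(xyz @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    return out[:, :EMB_DIM], out[:, EMB_DIM]

params = init_mlp()
emb, sdf = forward(params, np.zeros((4, 3)))
```

Packing both heads into one network is what makes the downstream planner simple: a single query of the map yields both the semantic similarity signal and the obstacle distance at any continuous coordinate.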

Model Architecture

We use a simple MLP to map each coordinate in a scene to a semantic embedding and the distance to the nearest obstacle. The semantic target for a coordinate is the CLIP embedding of a random crop of the scene, centered at the surface point nearest to that coordinate; the supervision loss on this embedding is weighted by the distance to that point. The affordance target is the SDF value at the coordinate, and we adapt the ground-truth SDF to account for noise from our real-world depth sensor.
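The two supervision signals described above might look like the following sketch. The exponential form of the distance weighting, the `sigma` value, and the SDF truncation distance are illustrative assumptions; the text only states that the semantic loss is distance-weighted and that the ground-truth SDF is adapted for sensor noise.

```python
import numpy as np

def semantic_loss(pred_emb, clip_emb, dist_to_surface, sigma=0.5):
    """Cosine-distance loss on the predicted embedding, down-weighted for
    coordinates far from the surface point whose crop supplies the CLIP
    target (exponential weighting is an assumption)."""
    w = np.exp(-np.asarray(dist_to_surface) / sigma)   # nearer points count more
    pred = pred_emb / np.linalg.norm(pred_emb, axis=-1, keepdims=True)
    tgt = clip_emb / np.linalg.norm(clip_emb, axis=-1, keepdims=True)
    per_point = 1.0 - np.sum(pred * tgt, axis=-1)      # 0 when perfectly aligned
    return float(np.mean(w * per_point))

def truncated_sdf_target(raw_sdf, trunc=0.1):
    """Clamp ground-truth SDF values so that depth-sensor noise far from
    surfaces does not dominate training (truncation value is illustrative)."""
    return np.clip(raw_sdf, -trunc, trunc)

# Identical embeddings at zero distance give zero loss.
e = np.ones((2, 4))
loss = semantic_loss(e, e, np.zeros(2))
```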

Figure: Model architecture.

Path Lengths

The figures below show the path lengths of different planners navigating between two points using the affordance representation. Because the gradient-based planner operates in continuous space, it achieves smoother and shorter paths than the grid-based planners. As a baseline, we compare against a planner built on top of an occupancy map constructed from the raw point cloud.
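The idea behind the gradient-based planner can be sketched as follows: initialize a straight-line path, then run gradient descent on the waypoints, trading off path smoothness against an obstacle penalty driven by the SDF. Here the SDF is a hand-written circle in 2D standing in for the learned map's affordance head, and all weights, margins, and step counts are illustrative choices, not the paper's.

```python
import numpy as np

def sdf(p, center=np.array([0.5, 0.5]), radius=0.2):
    """Toy analytic SDF (a circular obstacle); the real planner would
    instead query the learned MLP's SDF output."""
    return np.linalg.norm(p - center, axis=-1) - radius

def plan(start, goal, n_pts=20, steps=200, lr=0.01, margin=0.05):
    """Optimize waypoints between fixed endpoints by gradient descent."""
    t = np.linspace(0, 1, n_pts)[:, None]
    pts = start + t * (goal - start)          # straight-line initialization
    for _ in range(steps):
        grad = np.zeros_like(pts)
        # Smoothness/shortness: pull each interior point toward its neighbors.
        grad[1:-1] += 2 * pts[1:-1] - pts[:-2] - pts[2:]
        # Obstacle term: push waypoints along the SDF gradient (estimated by
        # finite differences here) whenever they are within `margin` of an
        # obstacle; a learned map would supply this gradient by autodiff.
        eps = 1e-4
        penalty = np.maximum(0.0, margin - sdf(pts))
        for dim in range(2):
            dp = np.zeros(2)
            dp[dim] = eps
            g = (sdf(pts + dp) - sdf(pts - dp)) / (2 * eps)
            grad[:, dim] -= g * penalty * 10.0
        grad[0] = grad[-1] = 0.0              # endpoints stay fixed
        pts -= lr * grad
    return pts

path = plan(np.array([0.0, 0.45]), np.array([1.0, 0.45]))
```

Because every waypoint moves in continuous space along the SDF gradient, the optimized path can hug obstacle boundaries, which is what lets this family of planners beat grid-based planners on path length.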


If you have any questions, please feel free to reach out to Ben Bolte.