Robotic foundation models will absorb Gaussian splats, not kill them
Imagine an apple falling off a tree. A program that forecast the apple accelerating toward the ground at 9.8 m/s² would understand gravity explicitly. The underlying truth of gravity would be directly defined. This is the kind of scientific observation that used to make a person famous.
Times have changed. Now imagine an AI system trained on a gazillion videos of apples falling off trees. You could take any image of an apple on a tree and generate a video of it falling. This video model would represent gravity faithfully because it had seen so many examples of gravity in action. Nowhere would "9.8 m/s²" actually be written out. The system would model gravity implicitly. Had Newton or Galileo pitched it this way, they would have gotten a lot more venture capital funding.
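To make the contrast concrete, here is a toy sketch. Everything in it, including the `video_model` stand-in, is my own illustration rather than any real system: the explicit program writes gravity down as a constant, while the implicit one would have to recover the same curve from the statistics of its training data.

```python
import numpy as np

G = 9.8  # m/s^2 -- the explicit model's entire understanding of gravity

def explicit_fall(height_m: float, t: np.ndarray) -> np.ndarray:
    """Closed-form kinematics: y(t) = y0 - 0.5 * g * t^2, clipped at the ground."""
    return np.maximum(height_m - 0.5 * G * t**2, 0.0)

heights = explicit_fall(3.0, np.linspace(0.0, 1.0, 5))  # apple dropping from a 3 m branch

# The implicit version has no such constant written anywhere:
# frames = video_model.generate(image_of_apple_on_tree)
# Gravity would show up only in the statistics of the generated frames.
```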
Foundation models for robotics are trying to form an implicit understanding of the spatial properties of their environment in order to plan actions. Concepts like form, depth, or collision need only be respected in latent space, as opposed to being explicitly defined the way they would be in a game engine.
As the problem you are trying to model grows to match the complexity of the real world, implicit learning can wrap itself around the problem as it is experienced. The explicit approach requires painstakingly authoring new rules for every new edge case. One approach scales with compute and data. The other scales with human effort. The implicit approach wins out eventually.
The Bitter Lesson
This is a restatement of the bitter lesson, Rich Sutton's famous observation that general methods leveraging computation have ultimately won out over methods that try to bake in human knowledge about the domain. The lesson was bitter because researchers spent careers handcrafting features and rules that got steamrolled by simple models with more data and more compute. Chess, Go, speech, vision. The pattern keeps repeating.
The question I find most interesting right now is what the bitter lesson means for the explicit 3D representations that currently populate the robotics and computer vision landscape. NeRFs, signed distance functions, Gaussian splats, occupancy grids. These are clever, beautiful, painstakingly engineered representations of three-dimensional space. They are also, in the framing of the bitter lesson, explicit. They encode human priors about how 3D geometry works. And if the bitter lesson has taught us anything, it's that explicit priors tend to get absorbed by scale.
The Sitzmann Debate
This was the crux of a fantastic blog post from Vincent Sitzmann that lit up the computer vision community. Sitzmann made an almost teleological argument: computer vision research has always been pointing towards something, and that something is ultimately autonomy via embodied intelligence. On a long enough horizon, robotic action planning will come to be dominated by world models with merely implicit understanding of 3D spatial relations, rather than clever explicit 3D representations sitting in the loop for policy planning.
The bitter lesson applies to representations too.
A lot of researchers in 3D vision seemed amenable to this view, at least in the limit. The Twitter discourse was remarkably candid. People who have spent years on NeRFs and Gaussian splats openly acknowledged that yes, in the long run, end-to-end implicit models will likely subsume the role their representations currently play. The disagreement was never really about whether. It was about when and how we get there.
My Assumption
My working assumption is that implicit 3D representations will devour explicit 3D representations in robotics. But the part I think people gloss over is that they will do so by absorbing the actual information those explicit representations convey. The bitter lesson doesn't say human knowledge is useless. It says human knowledge doesn't win when hardcoded into the system at inference time. That knowledge can still be enormously valuable when it's used to train the system.
So it's still worthwhile to work on explicit 3D representations for robotics. They may not live forever in the inference loop, but they're the best vehicle we have right now for injecting geometric understanding into the models that will eventually replace them.
The Data Reality
The abundance of 3D data (point clouds, meshes, depth maps, multi-view reconstructions) pales in comparison to the abundance of web video. There are billions of hours of video on the internet capturing every conceivable physical interaction. There are comparatively few high-quality 3D reconstructions of robotic environments with action labels.
So pretraining will be dominated by web video, and rightly so. Projects like DreamDojo are already showing what happens when you pretrain a world model on 44,000 hours of egocentric human video. You get a system that understands gravity, collisions, and object permanence without anyone writing out a single physics equation. The implicit approach, scaling on video, doing its thing.
But pretraining is only half the story. The question is what happens after pretraining, when you need to fine-tune these models for specific robotic tasks in specific physical environments. This is where 3D explicits re-enter the picture.
I think the best way for explicit 3D representations to be absorbed by foundation models is through post-training and fine-tuning. Depth maps, surface normals, signed distance fields, occupancy labels. These can serve as auxiliary supervision signals that shape a model's latent representations during fine-tuning without needing to be present at inference time. Jon Barron has gestured at this idea, that the insights of 3D vision are most valuable when they're used to condition and constrain the training of larger models, rather than being deployed as standalone systems.
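Here is a minimal sketch of that mechanism in PyTorch. Everything in it is a placeholder of my own invention (the `backbone`, the single-layer depth head, the 0.1 loss weighting), not anyone's actual training recipe; the structural point is that the depth head and its loss exist only during fine-tuning.

```python
import torch
import torch.nn as nn

class PolicyWithAuxDepth(nn.Module):
    """Hypothetical fine-tuning wrapper: a shared backbone feeds both an
    action head (kept at inference) and a depth head (training-only)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone                  # pretrained video/vision model
        self.action_head = nn.Linear(feat_dim, action_dim)
        self.depth_head = nn.Linear(feat_dim, 1)  # e.g. per-patch depth

    def forward(self, obs):
        feats = self.backbone(obs)                # (batch, patches, feat_dim)
        actions = self.action_head(feats.mean(dim=1))
        depth = self.depth_head(feats).squeeze(-1)
        return actions, depth

def fine_tune_loss(model, obs, target_actions, target_depth, aux_weight=0.1):
    pred_actions, pred_depth = model(obs)
    policy_loss = nn.functional.mse_loss(pred_actions, target_actions)
    # Auxiliary geometric supervision: it shapes the backbone's latents
    # during training and is simply dropped at deployment.
    depth_loss = nn.functional.l1_loss(pred_depth, target_depth)
    return policy_loss + aux_weight * depth_loss
```

At deployment you would call only the action head; the geometric scaffolding has already done its work on the latents.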
The Pipeline Opportunity
The best way to generate those 3D explicits at scale might just be to extract them from web video itself. Monocular depth estimation, structure-from-motion on internet footage, self-supervised multi-view reconstruction. These are all ways to bootstrap geometric supervision from the same ocean of video that's already being used for pretraining. You take the implicit data (video), extract explicit structure (depth, normals, camera poses), and feed that structure back as supervision for fine-tuning. The explicit becomes a waypoint in the implicit's journey. I have some interesting ideas for distilling point clouds of objects from mobile video.
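As a sketch of that extraction step, here is one possible stack: OpenCV for frame sampling and MiDaS (loaded via torch.hub) for monocular depth. The model name and transform follow the MiDaS README as I understand it, so verify them against the current repo; the `depth_targets_from_video` function is my own illustration.

```python
import cv2
import torch

# Assumption: MiDaS small model and transforms, per the intel-isl/MiDaS hub repo.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def depth_targets_from_video(path: str, stride: int = 30):
    """Sample every `stride`-th frame of a video and estimate a depth map
    for it, yielding (frame, depth) pairs as fine-tuning supervision."""
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                depth = midas(transform(rgb)).squeeze(0)  # relative (inverse) depth
            yield rgb, depth
        idx += 1
    cap.release()
```

The same skeleton extends to normals or camera poses by swapping the per-frame extractor; the output feeds straight into the auxiliary-loss setup sketched above.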
As robotics data climbs out of its current scarcity (and it will, between simulation, teleoperation, and human video pipelines), the bottleneck will shift. It won't be about getting enough data. It will be about processing that data effectively. Curating it. Extracting the right geometric signals. Designing the right auxiliary losses. Building the pipelines that turn raw video into structured supervision for fine-tuning.
Explicit 3D representations may not be permanent fixtures in the robotic policy loop, but we can wield them as tools to accelerate the training of the implicit models that will eventually render them unnecessary at inference time. 3D vision's future might look more like absorption than death. It moves from the inference pipeline to the training pipeline, from the policy to the data engine.
Looking Forward
I'm going to monitor the contribution of 3D vision to foundation model fine-tuning very closely. Which geometric supervision signals actually help during post-training? How much does depth or normal prediction as an auxiliary loss improve manipulation performance? Can you get meaningful gains from self-supervised 3D extraction on web video, or do you need ground-truth geometry? At what point does the model internalize enough spatial understanding that the explicit supervision stops helping?
These are empirical questions, and I plan on tinkering with some pipelines, extractions, and fine-tunes of my own, as in this project article, to start forming answers. Just because a technology's shelf life is within view doesn't mean it will expire on its own. Plenty of work still needs to be done to transfer the insights of these representations into the training pipelines that will eventually supersede them.
I'm sure GPT-9 will know how to do just that. But until then, one of these nerds needs to figure it out!