Geometry First
The idea is not to replace everything with one model. It is to separate the job into layers. Diffusion handles what it is good at: surface appearance, texture, lighting, and fine visual detail. The rest should be anchored elsewhere. Geometry should come from predictive representations that persist across time, and motion should come from physics-informed models that constrain how objects are allowed to deform. That is the core split.
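To make the split concrete, here is a minimal sketch of the three layers as Python interfaces. Every name in it is hypothetical, invented for illustration rather than taken from any existing system; the only point is where each responsibility lives.

```python
# A minimal sketch of the three-layer split as Python interfaces.
# All class and method names are hypothetical, not from any existing
# library: what matters is the division of responsibility.
from dataclasses import dataclass
from typing import Protocol
import numpy as np


@dataclass
class GeometryState:
    """Persistent geometry: vertices and topology that survive across frames."""
    vertices: np.ndarray   # (V, 3) positions at the current time step
    faces: np.ndarray      # (F, 3) fixed connectivity; the object's identity


class GeometryLayer(Protocol):
    def predict(self, state: GeometryState, t: float) -> GeometryState:
        """Advance the persistent representation; the object never 'resets'."""
        ...


class MotionLayer(Protocol):
    def constrain(self, state: GeometryState, dt: float) -> GeometryState:
        """Project the predicted update onto physically admissible deformations."""
        ...


class AppearanceLayer(Protocol):
    def render(self, state: GeometryState, prompt: str) -> np.ndarray:
        """Diffusion-style model: texture and lighting conditioned on fixed geometry."""
        ...


def step(geo: GeometryLayer, motion: MotionLayer, look: AppearanceLayer,
         state: GeometryState, t: float, dt: float, prompt: str) -> np.ndarray:
    """One frame: structure first, physics second, appearance last."""
    state = geo.predict(state, t)        # geometry persists across time
    state = motion.constrain(state, dt)  # physics limits how it may deform
    return look.render(state, prompt)    # diffusion only paints the surface
```

The ordering inside `step` is the whole argument in miniature: geometry persists, physics filters, diffusion paints.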
In that framework, objects are not treated as frame-by-frame images but as stable geometric entities. Predictive geometric representations keep surfaces from drifting into implausible shapes and give those objects continuity across time. Physics-informed neural networks then shape motion by embedding material or dynamical constraints directly into the model, so cloth, water, and other deformable structures move coherently instead of jittering from one frame to the next.
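To show what "embedding constraints directly into the model" can mean in practice, here is a minimal physics-informed training step, assuming PyTorch (the text names no framework) and using the 1D wave equation as a stand-in for whatever material law governs cloth or water. The network is penalized for violating the dynamics, not just for mismatching observations.

```python
# A minimal PINN-style sketch (assumption: PyTorch). The network predicts
# a displacement field u(x, t), and the residual of the wave equation
# u_tt = c^2 * u_xx is added to the loss, so the learned motion is
# penalized for breaking the dynamics, not merely for mismatching data.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
c = 1.0  # wave speed; a material constant, chosen arbitrarily here


def physics_residual(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Residual of u_tt - c^2 * u_xx at collocation points (x, t)."""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    # First derivatives via autograd, keeping the graph for second derivatives.
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    # Second derivatives.
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    u_tt = torch.autograd.grad(u_t, t, torch.ones_like(u_t), create_graph=True)[0]
    return u_tt - c**2 * u_xx


# One training step on random collocation points. In a full setup this
# physics term would be weighted against a data-fitting loss; only the
# physics term is shown here for brevity.
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.rand(256, 1)
t = torch.rand(256, 1)
loss = physics_residual(x, t).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```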
What diffusion loses in control is what this stack tries to recover. Instead of asking one model to invent both structure and appearance at once, structure is solved first and appearance is applied after. That makes the system more controllable, more interpretable, and more useful in settings where geometry and motion actually matter, including animation and engineering. Diffusion remains part of the pipeline, but as the finish rather than the foundation.
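A single-frame version of "structure first, appearance after" already exists in the form of geometry-conditioned diffusion, for example depth-conditioned ControlNet in Hugging Face diffusers. The sketch below assumes that library and treats a rendered depth buffer as a stand-in for the solved geometry; it illustrates the layering, not the full architecture described here, since sampling each frame independently still leaves texture coherence unsolved.

```python
# Depth-conditioned diffusion via ControlNet in Hugging Face diffusers:
# the depth map stands in for the solved geometry, and the diffusion
# model only paints appearance on top of it.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# In the proposed stack, depth_frames would be rendered per time step from
# the persistent mesh by the geometry/motion layers (hypothetical here;
# a blank buffer is used as a placeholder).
depth_frames = [Image.fromarray(np.zeros((512, 512), dtype=np.uint8))]

frames = []
for depth in depth_frames:
    out = pipe(
        "a silk cloth draped over a table, studio lighting",
        image=depth.convert("RGB"),
        num_inference_steps=20,
    )
    frames.append(out.images[0])  # appearance only; structure was fixed upstream
```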
The larger claim is that animation should be built on persistent representations, not sampled frames. Once geometry is stable and motion is constrained, appearance can be layered on top without carrying the full burden of coherence. That is the appeal of the approach: not less learning, but a cleaner assignment of responsibilities across models.