writing · 2024-11-22
From text to 3D motion: generating sign language
What I learned building SIGN-LLM: why sign-language generation is really a data problem, and why separating 'how to represent motion' from 'how to produce it' is the trick that makes it work.
My Master’s thesis at DFKI generated 3D sign-language motion from text. People assume the hard part is the model. It isn’t. It is the data, and it is the representation.
Sign language is not just hands
A sign is not a hand shape. It is body, hands, and face moving together, and the meaning lives in the coordination. Any system that models only the hands produces something that looks like signing to a hearing person and reads as nonsense to a Deaf one. So the bar is full-body, expressive 3D motion.
The data problem comes first
There is no neat dataset waiting for you. I built one, with more than 35,000 samples from How2Sign videos, using YOLOv5 to find the signer and the SMPL-X body model to lift the 2D video into full 3D meshes that cover body, hands, and face. Most of the real work happens here, before any generation. If the motion you extract is noisy or drops the meaningful parts, nothing downstream can recover it.
Separate representation from generation
The architecture that worked has two stages. First, a VQ-VAE learns a discrete codebook for motion: a vocabulary of motion tokens that compresses continuous gesture into something a sequence model can handle. Then a GPT-style transformer generates those tokens from text, and the VQ-VAE decoder turns them back into continuous motion.
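To make the "vocabulary of motion tokens" idea concrete, here is a minimal sketch of the quantization step alone: each continuous feature vector is replaced by the index of its nearest codebook entry. Everything in it is an illustrative assumption rather than the thesis code. The function names, the 2-dimensional features, and the hand-written 4-entry codebook are all made up; in a real VQ-VAE the vectors come from a learned encoder and the codebook is learned too.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each feature vector with the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look the tokens back up in the codebook (the input a decoder would see)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 4-entry codebook over 2-D features (a learned one would be far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs for four consecutive motion frames.
frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9], [0.1, 0.7]]

tokens = quantize(frames, codebook)            # discrete motion tokens
reconstructed = dequantize(tokens, codebook)   # quantized vectors, one per frame
print(tokens)
```

The point of the sketch is the interface: once motion is a list of integers, the second stage never has to deal with continuous pose values directly.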
Decoupling how motion is represented from how it is generated from language is the key move. It is the same insight behind a lot of modern generative systems: learn a good discrete latent space first, then let a transformer become fluent in it. With that split, open-vocabulary, realistic synthesis stops being a wall and becomes a sequence-modelling problem.
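Framed as sequence modelling, generation is just next-token prediction over motion tokens, conditioned on the text. The sketch below shows a greedy decoding loop under stated assumptions: `toy_scores` is a hypothetical, hand-written stand-in for a trained transformer, and the `END` token id, the 5-entry vocabulary, and the text ids are invented for illustration. It is not the model from the thesis.

```python
from typing import Callable, List

END = 4  # hypothetical end-of-motion token id, one past four motion codes (0..3)

# A scoring function maps (text ids, motion tokens so far) to one score per token id.
ScoreFn = Callable[[List[int], List[int]], List[float]]


def generate_motion_tokens(text_ids: List[int], score_fn: ScoreFn, max_len: int = 32) -> List[int]:
    """Greedy autoregressive decoding: pick the best-scoring token until END or max_len."""
    motion: List[int] = []
    for _ in range(max_len):
        scores = score_fn(text_ids, motion)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == END:
            break
        motion.append(next_token)
    return motion


def toy_scores(text_ids: List[int], motion: List[int]) -> List[float]:
    """Stand-in for a trained transformer: a fixed rule, not a learned model."""
    scores = [0.0] * 5
    if len(motion) >= 2 * len(text_ids):
        scores[END] = 1.0  # stop after two motion tokens per text token
    else:
        scores[(sum(text_ids) + len(motion)) % 4] = 1.0
    return scores


motion_tokens = generate_motion_tokens([3, 1], toy_scores)
print(motion_tokens)
```

A real system would usually sample from the predicted distribution rather than always taking the top token, and would pass the resulting tokens to the VQ-VAE decoder to recover continuous motion.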
The lesson I carried into everything since: spend your effort on representation and data quality. The model is rarely the bottleneck.