project · 2024 · Thesis · DFKI

SIGN-LLM: Text to 3D Sign Language

Master's thesis at DFKI: an end-to-end system that translates text into expressive 3D sign-language motion (body, hands, and face) using VQ-VAE motion encoding and a GPT-based transformer.

PyTorch · VQ-VAE · GPT · YOLOv5 · SMPL-X · Python

SIGN-LLM (“SignLLM T2M”) translates natural-language text into expressive, realistic 3D sign-language gestures: not just hand shapes, but coordinated body, hand, and facial motion.

The data problem first

Sign-language generation is bottlenecked by data. I built a custom dataset of 35,000+ samples from How2Sign videos: YOLOv5 detects the signer, and SMPL-X extracts and models a full 3D body mesh, which is then reduced to a sign-language-specific representation that keeps only the parts that carry meaning.
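As a rough illustration of what such a reduced representation could look like, the sketch below flattens a subset of one frame's SMPL-X parameters into a single feature vector. The group names and sizes follow the standard SMPL-X parameterisation (axis-angle joint rotations, 10 expression coefficients), but which groups the thesis actually keeps is an assumption here, and `KEPT_GROUPS` and `frame_to_features` are hypothetical names.

```python
# Hypothetical per-frame feature layout. Group names follow SMPL-X
# parameter naming; the choice of which groups to keep is an assumption.
KEPT_GROUPS = [
    ("global_orient", 3),     # root orientation, axis-angle
    ("body_pose", 63),        # 21 body joints x 3 axis-angle values
    ("left_hand_pose", 45),   # 15 hand joints x 3
    ("right_hand_pose", 45),  # 15 hand joints x 3
    ("jaw_pose", 3),
    ("expression", 10),       # facial expression coefficients
]

FEATURE_DIM = sum(size for _, size in KEPT_GROUPS)


def frame_to_features(params):
    """Flatten the kept parameter groups of one frame into a flat list.

    Groups not listed in KEPT_GROUPS (for example body shape) are ignored.
    """
    features = []
    for name, size in KEPT_GROUPS:
        values = list(params[name])
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        features.extend(values)
    return features
```

In this sketch, body shape (`betas`) is dropped on the assumption that a signer's proportions do not change what is being signed, while hands, jaw, and expression are kept.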

A two-stage architecture

The system pairs a VQ-VAE that encodes continuous motion into a discrete codebook with a GPT-style transformer that generates those motion tokens from text. Decoupling “how to represent motion” from “how to produce it from language” is what makes open-vocabulary, realistic synthesis possible.
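The discretisation step at the heart of this split can be shown in a few lines. This is a minimal, dependency-free sketch of nearest-neighbour vector quantisation, not the thesis code: the encoder and decoder networks and the GPT model are omitted, and the tiny 2-D `codebook` and `latents` are made-up toy values.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest code."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens, codebook):
    """Look each token back up in the codebook (the input to a decoder)."""
    return [codebook[t] for t in tokens]


# Toy example: 3 codes, 4 latent frames (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.4], [0.6, 0.7]]

tokens = quantize(latents, codebook)         # [0, 1, 2, 1]
reconstructed = dequantize(tokens, codebook)
```

The integer sequence in `tokens` is what a GPT-style model would learn to predict from text; at inference time, generated tokens go through `dequantize` and then a motion decoder.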

Result

The model reached state-of-the-art results, with strong alignment between the input text and the generated motion: contextually accurate signing rather than averaged, mushy gestures.
