American Sign Language Recognition and Analysis Using Deep Learning
Date
2026-06-19
Authors
Abstract
In this work I build a system that recognizes isolated American Sign Language (ASL) words, and I use it to ask one direct question: when training data is scarce, is it better to look at the video pixels or at the geometry of the signer’s body? To find out, I train two very different models on exactly the same clips. The first is appearance-based: every frame goes through standard preprocessing and a ResNet50 backbone pre-trained on ImageNet, which turns it into a 2048-dimensional feature vector, and a bidirectional LSTM then reads that sequence over time. The second model never sees a pixel. It works only on MediaPipe keypoints, the tracked coordinates of the body and both hands, and feeds them to a Transformer encoder. Both models therefore have to learn the same two things, the shape of the hands in each frame and the way those shapes move across frames, and both are trained, validated, and tested under one identical protocol. Throughout, my goal is a recognizer that is accurate yet light enough to be useful in practice, so that it could eventually make communication a little easier between people who sign and people who do not.
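To make the two input representations concrete, here is a minimal sketch of how one clip might be shaped for the keypoint branch. It is an illustration under stated assumptions, not the dissertation's actual pipeline: it assumes the 33 pose landmarks and 21 landmarks per hand that MediaPipe publishes, assumes x, y, z are kept for each landmark, assumes an undetected hand is zero-filled, and assumes clips are truncated or zero-padded to a fixed length. The names `frame_vector` and `fit_length` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]

# Landmark counts published for MediaPipe Pose and MediaPipe Hands.
POSE_LANDMARKS = 33
HAND_LANDMARKS = 21
COORDS = 3  # assumption: x, y, z are kept for every landmark
KEYPOINT_DIM = (POSE_LANDMARKS + 2 * HAND_LANDMARKS) * COORDS
APPEARANCE_DIM = 2048  # per-frame ResNet50 feature size stated in the abstract


def flatten_part(points: Optional[Sequence[Point]], count: int) -> List[float]:
    """Flatten one body part; an undetected part becomes zeros (assumption)."""
    if points is None:
        return [0.0] * (count * COORDS)
    if len(points) != count:
        raise ValueError(f"expected {count} landmarks, got {len(points)}")
    return [value for point in points for value in point]


def frame_vector(
    pose: Optional[Sequence[Point]],
    left_hand: Optional[Sequence[Point]],
    right_hand: Optional[Sequence[Point]],
) -> List[float]:
    """Concatenate body and both hands into one per-frame feature vector."""
    return (
        flatten_part(pose, POSE_LANDMARKS)
        + flatten_part(left_hand, HAND_LANDMARKS)
        + flatten_part(right_hand, HAND_LANDMARKS)
    )


def fit_length(
    frames: List[List[float]], target_len: int, dim: int
) -> Tuple[List[List[float]], List[bool]]:
    """Truncate or zero-pad a clip; the mask is True for real frames."""
    kept = frames[:target_len]
    missing = target_len - len(kept)
    padded = kept + [[0.0] * dim for _ in range(missing)]
    mask = [True] * len(kept) + [False] * missing
    return padded, mask


# Usage: a 3-frame clip with no left hand detected, padded to 5 frames.
pose = [(0.5, 0.5, 0.0)] * POSE_LANDMARKS
hand = [(0.1, 0.2, 0.0)] * HAND_LANDMARKS
clip = [frame_vector(pose, None, hand) for _ in range(3)]
padded_clip, mask = fit_length(clip, target_len=5, dim=KEYPOINT_DIM)
```

Under these assumptions each frame is a 225-dimensional vector for the keypoint model, against the 2048-dimensional vector the appearance model receives, which is one way to see why the keypoint branch is the lighter of the two inputs. The padding mask is the kind of information a Transformer encoder would need to ignore padded frames.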
Description
This dissertation was completed under the supervision of Prof. Ujjwal Bhattacharya.
Keywords
American Sign Language, ResNet, Transformer
Citation
48p.
