Author: Dragomir Božoki
1. Introduction
During the final year of my Bachelor's studies, once I realized I would pass all my exams on time, I knew it was time to choose a topic for my thesis. Throughout my studies, I gradually became interested in signal processing – a field where you can work with various types of signals to generate new images, analyze text, interpret brain activity, and more.
This interest grew even stronger when AI started booming in late 2022. Topics related to machine learning and data analysis suddenly became the center of attention across the tech industry. That was the moment I knew – this is the field I want to specialize in.
When I spoke with my professors about potential thesis topics, they proposed an idea that instantly caught my attention: teaching a machine to understand language purely through visual input—without any audio—by analyzing only lip movements. The concept sounded absolutely fascinating to me. It was something I had never worked on before, which made it feel like a big challenge and an exciting next step in my career. Without hesitation, I said yes.
2. Introduction to the Project
Lip reading is the art of interpreting spoken language through visual cues alone, without relying on any audio signals. This skill is particularly valuable for those with hearing impairments, as it allows them to follow conversations and understand spoken words by observing the movements of the lips and face. However, mastering this ability is notoriously difficult. Even the most skilled lip readers can correctly interpret only about 50% of spoken words, and this accuracy can drop to just 10-15% without context. This is where technology can make a significant impact, and it inspired me to take on the challenge of building a model to bridge this gap.
To lay the groundwork for this project, it's essential to understand the field of Visual Speech Recognition (VSR). VSR is a multidisciplinary area that combines computer vision and natural language processing to transform silent lip movements into meaningful text. While the concept has been around for decades, progress in this field has historically been slow. However, recent breakthroughs in machine learning, driven by the availability of larger datasets and the power of modern GPUs, have sparked a new wave of innovation. Today, VSR is emerging as a hot topic, with each new model pushing the boundaries of what is possible and delivering markedly better results.
So let's look at how I approached this complex problem and how I decided to solve it.
3. Project Setup and Key Technologies
The entire system for this project was developed using Python v3.9 in a Jupyter environment. For data processing and video frame extraction, I relied on the opencv-python library (v4.9.0.80), and on numpy (v1.26.2) for efficient matrix operations and numerical computations.
For training the neural network and implementing the machine learning models, I used TensorFlow (v2.16.1), a powerful open-source deep learning framework. To visualize the results and evaluate the model's performance, I combined the plotting capabilities of matplotlib (v3.8.2) and seaborn (v0.13.8) with TensorFlow's built-in evaluation functions.
This combination of tools provided a flexible and efficient workflow, allowing me to focus on building and refining my lip-reading model without worrying about low-level optimization details.
For this project, I chose to work with the GRID Corpus – a large, structured database designed specifically for research in speech and visual recognition. It contains recordings of 34 different speakers (18 men and 16 women) each speaking 1,000 sentences in English, resulting in a total of 34,000 audio and video clips.
The sentences in the GRID Corpus follow a fixed structure, making it ideal for training machine learning models. Each sentence is a combination of six components: a command (e.g., "set," "place"), a color (e.g., "red," "blue"), a preposition (e.g., "at," "by"), a letter (A-Z, excluding W), a digit (0-9), and an adverb (e.g., "again," "now"). This structure allows for up to 64,000 unique sentence combinations, such as "set blue by F four please" or "place red at C zero again."
For my project, I used a subset of this dataset, focusing on the first 500 video clips from a single speaker (s1). This approach provides a manageable starting point for training the initial version of my lip-reading model, while still capturing a wide variety of facial movements and speech patterns.
In the next section, I'll walk you through exactly how I tackled this preprocessing step.
3.1 Preprocessing
The first thing I had to do was extract the lip region from the video. For that, I used the dlib library to detect the face and the landmark points around the mouth, and then cropped a 64x64 region around the lips.
# Inside the per-frame loop: when at least one face is detected, locate the
# facial landmarks and crop the lip region from the current frame.
else:
    for face in faces:
        landmarks = predictor(gray_frame, face)
        # in dlib's 68-point model, points 48 and 54 are the outer lip corners,
        # while points 50/51 and 58/59 lie on the upper and lower lip
        lip_left = landmarks.part(48).x
        lip_right = landmarks.part(54).x
        lip_top = min(landmarks.part(50).y, landmarks.part(51).y)
        lip_bottom = max(landmarks.part(58).y, landmarks.part(59).y)
        # crop the lip region, resize it to a fixed size and convert it to grayscale
        lip_frame = frame[lip_top:lip_bottom, lip_left:lip_right]
        lip_frame_resized = cv2.resize(lip_frame, (width, height))
        lip_frame_gray = cv2.cvtColor(lip_frame_resized, cv2.COLOR_BGR2GRAY)
        frames.append(lip_frame_gray)
# the rest of the function can be found on GitHub
return frames
Every lip-region frame is resized and converted to grayscale; the frames are then collected into a list, which is transformed into a tensor. Below is an example of one frame, ready for the neural network.
[Figure 3.2: one frame ready for the neural network]

It's quite remarkable when you think about it: this seemingly simple grayscale frame is all that the neural network needs to interpret spoken words. Despite being just a collection of tiny pixel values, it holds enough visual information for the model to learn and make accurate predictions about the speech content, capturing the subtle dynamics of lip movements.
4. Preprocessing – From Raw Video to Neural Network Input
The very first technical challenge was to extract the lip region from each video frame. To do this, I used the dlib library, which provides pre-trained models for facial landmark detection. Specifically, I used the shape_predictor_68_face_landmarks.dat model, which identifies 68 key facial points, including the outline of the lips.
Using this, I isolated a square region of 64x64 pixels around the lips in every frame of the video; the relevant part of the code is the snippet shown in the Preprocessing section above.
This entire preprocessing step converts each raw video into a standardized tensor, where each frame contains only the grayscale lip region. These are then grouped into sequences of 75 frames per video and normalized (mean-subtracted and scaled by standard deviation).
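A minimal sketch of that normalization step, assuming frames is the list of 64x64 grayscale lip crops produced by the extraction function above (the helper name is mine, not the project's):

import tensorflow as tf

def frames_to_tensor(frames):
    # stack the crops into a (num_frames, 64, 64, 1) float tensor
    video = tf.cast(tf.expand_dims(tf.stack(frames), axis=-1), tf.float32)
    # standardize per video: subtract the mean and divide by the standard deviation
    mean = tf.math.reduce_mean(video)
    std = tf.math.reduce_std(video)
    return (video - mean) / std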
What’s fascinating is that this minimal visual information — just a tiny square of lip movements — contains enough signal for a neural network to learn how to transcribe speech.
5. Label Extraction and Alignment
Once the frames were ready, I needed labels: the corresponding transcriptions of what was said in each video. Fortunately, the GRID corpus provides .align files for each clip, which contain word-level alignments (the start and end time of every spoken word, plus silence markers).
I wrote a parser that reads these alignments and filters out silence tokens (sil). The remaining tokens were mapped to a fixed vocabulary of 41 characters:

abcdefghijklmnopqrstuvwxyz'?!123456789 (space)

Each sentence was transformed into a sequence of numerical tokens using tf.keras.layers.StringLookup, and padded to a fixed length of 40 characters for uniformity.
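A sketch of what such a parser can look like; the function name and the exact padding logic are assumptions, but the StringLookup mapping follows the description above:

import tensorflow as tf

vocab = [c for c in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")

def load_alignments(path):
    # each line of a GRID .align file looks like "<start> <end> <token>"
    words = []
    with open(path) as f:
        for line in f:
            token = line.split()[2]
            if token != "sil":                    # skip silence tokens
                words.append(token)
    sentence = " ".join(words)
    # map every character of the sentence to its integer token
    tokens = char_to_num(tf.strings.unicode_split(sentence, input_encoding="UTF-8"))
    # zero-pad to a fixed length of 40 characters
    pad_len = tf.maximum(0, 40 - tf.shape(tokens)[0])
    return tf.pad(tokens, [[0, pad_len]])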
6. Model Architecture – Seeing Lips, Predicting Words
With the input and output tensors ready, I moved on to building the model. I designed a deep neural network that combines spatial and temporal processing:
- Three 3D Convolutional Layers: these layers extract spatiotemporal features from the input video sequence. Each applies a 3D kernel over the time and space dimensions to capture motion and shape patterns.
- Flattening & TimeDistributed Layer: after the spatial features are extracted, each time step is flattened independently.
- Two Bidirectional GRU Layers: these layers allow the model to learn temporal dependencies in both directions (past and future). GRUs (Gated Recurrent Units) are a lightweight alternative to LSTMs, and I used a dropout of 0.5 after each to reduce overfitting.
- Dense Softmax Output: the final layer outputs a probability distribution over the 41-character vocabulary for every frame in the sequence.
Here’s a simplified view of the architecture:
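In code, this stack looks roughly like the Keras sketch below. It follows the description above (75 frames of 64x64 grayscale lip crops in, a 41-way per-frame softmax out), but the specific filter counts and GRU sizes are illustrative assumptions rather than the project's exact values.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(75, 64, 64, 1)),                 # 75 grayscale lip frames per clip
    # three 3D convolutional blocks extract spatiotemporal features
    layers.Conv3D(128, 3, padding="same", activation="relu"),
    layers.MaxPool3D((1, 2, 2)),
    layers.Conv3D(256, 3, padding="same", activation="relu"),
    layers.MaxPool3D((1, 2, 2)),
    layers.Conv3D(75, 3, padding="same", activation="relu"),
    layers.MaxPool3D((1, 2, 2)),
    # flatten each time step independently
    layers.TimeDistributed(layers.Flatten()),
    # two bidirectional GRUs model temporal context in both directions
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),
    layers.Dropout(0.5),
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),
    layers.Dropout(0.5),
    # per-frame probability distribution over the 41-character vocabulary
    layers.Dense(41, activation="softmax"),
])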
The model is trained using Connectionist Temporal Classification (CTC) Loss, which is ideal for sequence-to-sequence problems where the alignment between input and output is not known.
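As a rough illustration, here is one way such a loss can be wired up with tf.nn.ctc_loss, assuming zero-padded label sequences of length 40 and the 75x41 per-frame softmax output described above; the project's actual implementation may differ.

import tensorflow as tf

def ctc_loss(y_true, y_pred):
    # y_true: (batch, 40) integer labels, zero-padded
    # y_pred: (batch, 75, 41) per-frame softmax probabilities
    logit_len = tf.fill(tf.shape(y_pred)[:1], tf.shape(y_pred)[1])        # 75 time steps per sample
    label_len = tf.reduce_sum(tf.cast(y_true != 0, tf.int32), axis=1)     # true (unpadded) label lengths
    loss = tf.nn.ctc_loss(
        labels=tf.cast(y_true, tf.int32),
        logits=tf.math.log(y_pred + 1e-8),    # convert probabilities to log space
        label_length=label_len,
        logit_length=logit_len,
        logits_time_major=False,
        blank_index=-1,                        # treat the last class as the CTC blank
    )
    return tf.reduce_mean(loss)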
7. Training the Model
I trained the model using the Adam optimizer with a learning rate of 0.0001. To help with convergence, I implemented a learning rate scheduler that keeps the rate constant for the first 50 epochs, then decays it exponentially.
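A minimal sketch of such a schedule, using Keras's LearningRateScheduler callback; the decay factor here is an illustrative assumption.

import math
import tensorflow as tf

def scheduler(epoch, lr):
    # keep the learning rate constant for the first 50 epochs,
    # then decay it exponentially
    if epoch < 50:
        return lr
    return lr * math.exp(-0.1)

lr_callback = tf.keras.callbacks.LearningRateScheduler(scheduler)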
In total, I trained for 500 epochs, using speaker-specific data and rotating speakers every 50 epochs to prevent overfitting.
Custom training callbacks helped monitor progress:
- ProduceExample: shows predicted vs. real text after each epoch.
- SaveHistoryCallback: logs training metrics every 5 epochs.
- ModelCheckpoint: saves model weights at each epoch.
All training was done using TensorFlow 2.16.1 on a local machine with a GPU.
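Putting the pieces together, the training loop boils down to something like the sketch below, reusing the model, ctc_loss, and lr_callback from the earlier snippets; the dataset variable and checkpoint path are assumptions, not the project's exact code.

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    "models/checkpoint.weights.h5",    # hypothetical path
    save_weights_only=True,
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=ctc_loss,
)

model.fit(
    train_dataset,                     # tf.data pipeline of (video, label) pairs
    epochs=500,
    callbacks=[checkpoint_callback, lr_callback],
)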
8. Final Thoughts and Future Work
This project showed me how powerful deep learning can be — not just in terms of results, but in its ability to learn from minimal and noisy input. Despite using only grayscale lip regions, the model was able to correctly transcribe many sentences. The results were far from perfect, but they were remarkably promising.
Looking forward, I see several directions to improve and expand this system:
- ✳️ Test on more speakers and evaluate generalization.
- 🎥 Integrate webcam input for real-time lipreading.
- 🔤 Combine lipreading with language models for better sentence reconstruction.
Most importantly, this project confirmed my passion for working at the intersection of AI, language, and vision — and gave me a solid foundation to move into more advanced work in machine learning and deep learning.