DeepRead
Who: Jerry Lu (clu28), Alex Liang (aliang19), Carson Harrell (charrell), Jefferey Cai (jcai33)
Introduction: What problem are you trying to solve and why?
If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper. If you are doing something new, detail how you arrived at this topic and what motivated you. What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.
Whether through grandparents or friends, we all know someone who struggles with hearing. Trying to understand a conversation without sound is challenging. Inspired by this issue, we want to develop a tool that can generate captions based on the visual cues of a speaker’s mouth movements. Essentially a deep learning approach to lip reading, this is a structured prediction problem: mapping visual features to linguistic representations by combining computer vision and natural language processing techniques.
Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?
Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.
- End-to-end Sentence Lipreading
- Combining Residual Networks with LSTMs for Lipreading
- Lip Reading Using Computer Vision and Deep Learning
The last project above (Lip Reading Using Computer Vision and Deep Learning) provides a strong starting framework for our project. It uses supervised ensemble deep learning models, particularly 1-D and 2-D convolutional neural networks (CNNs), to classify lip movements into phonemes and then stitch the phonemes into words. The dataset consists of images of segmented mouths, each labeled with a phoneme, created from video recordings using Gentle, a forced aligner that can align a video transcript to the corresponding video frames. We may borrow this idea for our preprocessing.
Data: What data are you using (if any)?
If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).
We are going to be using data from YouTube. We will build a web scraper that collects YouTube videos of people talking and strips out the audio track. It will also retrieve each video’s transcript, which will serve as the labels. YouTube has an API, so we will use that; if that doesn’t work, we will fall back to a simple web-scraping tool. There are also many standard datasets on Kaggle that we will explore.
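As a rough illustration of the transcript side of this pipeline, the sketch below pulls a video’s caption track with the third-party youtube-transcript-api package. This is only one possible approach (the official YouTube Data API or a scraper are alternatives), and the function name and placeholder video ID are our own:

```python
# Hedged sketch: fetch a video's caption track with the third-party
# youtube-transcript-api package (pip install youtube-transcript-api),
# assuming its older get_transcript() interface. fetch_transcript and the
# placeholder video ID are illustrative names, not part of any real pipeline.
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id):
    # Each segment is a dict with "text", "start", and "duration" keys,
    # which we can later align against frame timestamps.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return [(s["start"], s["start"] + s["duration"], s["text"]) for s in segments]

if __name__ == "__main__":
    for start, end, text in fetch_transcript("VIDEO_ID_PLACEHOLDER"):
        print(f"[{start:7.2f} - {end:7.2f}] {text}")
```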
How big is it? Will you need to do significant preprocessing?
The dataset will be very large, since it contains both text data and video data. Some preprocessing steps include, but are not limited to, the following (a brief code sketch follows this list):
- Standard NLP preprocessing of the labels (the video transcripts), including replacing rare words with an <unk> token and removing punctuation and capital letters so the data is standardized and ready to use
- Video preprocessing will be a bit more difficult, but generally involves:
- Cropping all videos to a standard size
- Converting all videos to black and white (grayscale)
- Converting frames into NumPy arrays so they are in a format the model can understand and train on
- Lip Reading Using Computer Vision and Deep Learning describes an example preprocessing procedure:
- Obtain video of someone speaking
- Segment the mouth portion of the video
- Process each frame (downsize, transform)
- Label each frame with the matching word/phoneme via Gentle
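A minimal sketch of the preprocessing above, assuming OpenCV for video I/O, a rare-word threshold of 5 occurrences, and a fixed target frame size; the function names and constants are our own, not taken from any of the linked projects:

```python
# Minimal preprocessing sketch (assumed details: OpenCV for video reading,
# an <unk> threshold of 5, and a 100x50 target frame size).
import re
from collections import Counter

import cv2
import numpy as np

def clean_transcript(text):
    # Lowercase, strip punctuation, and split into word tokens.
    return re.sub(r"[^a-z' ]", "", text.lower()).split()

def unk_rare_words(token_lists, min_count=5):
    # Replace words seen fewer than min_count times with an <unk> token.
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[tok if counts[tok] >= min_count else "<unk>" for tok in toks]
            for toks in token_lists]

def video_to_frames(path, size=(100, 50)):
    # Read a video, convert each frame to grayscale, resize it, and stack
    # the frames into a (num_frames, H, W) float array scaled to [0, 1].
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, size))
    cap.release()
    return np.stack(frames).astype(np.float32) / 255.0
```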
Methodology: What is the architecture of your model?
How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here. If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.
Model architecture:
- CNN layer(s) to encode the video frames into features the rest of the model can use
- Transformers to convert the encoded video sequence into text
- LSTMs to take the sequence of frame features and generate text captions (a minimal CNN + LSTM sketch is given after this list)
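Here is a minimal sketch of the CNN-into-LSTM variant in PyTorch; the layer sizes, pooling choices, and class name are illustrative placeholders rather than a published architecture:

```python
# Toy sketch of a per-frame CNN encoder feeding an LSTM; sizes are placeholders.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        # Per-frame CNN encoder: grayscale mouth crops -> feature vectors.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        # LSTM consumes the per-frame features as a sequence.
        self.lstm = nn.LSTM(64 * 4 * 4, hidden, batch_first=True)
        # Per-timestep character/phoneme logits.
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):            # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out)        # (batch, time, vocab_size)
```

The per-timestep logits could feed either a cross-entropy loss on frame-aligned phoneme labels or a CTC-style loss on unaligned character sequences; if the transformer variant works better, the LSTM could be swapped for a transformer encoder over the same per-frame features.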
Some backup ideas include:
- Combining multiple CNNs at once (an ensemble) to produce a result that captures more information
- GRUs instead of transformers to generate the text
To train the model, we will split the data into training, validation, and test sets. We will then define a loss function, likely combining some form of visual loss with a text-based loss, and optimize it with our model architecture. Finally, we will use early stopping and other techniques to minimize overfitting.
We can then evaluate and fine-tune the model.
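A hedged sketch of that training loop, assuming a CTC loss over per-timestep character logits (one common choice in lipreading work) and early stopping on validation loss; the data-loader format, optimizer settings, and patience value are placeholders:

```python
# Hedged training sketch: CTC loss plus early stopping on validation loss.
# Loaders are assumed to yield (frames, targets, input_lengths, target_lengths)
# batches; all hyperparameters are placeholders.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=50, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def batch_loss(frames, targets, in_lens, tgt_lens):
        # Model output is (batch, time, vocab); CTCLoss expects
        # (time, batch, vocab) log-probabilities.
        log_probs = model(frames).log_softmax(-1).transpose(0, 1)
        return ctc(log_probs, targets, in_lens, tgt_lens)

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            loss = batch_loss(*batch)
            opt.zero_grad()
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(batch_loss(*b).item() for b in val_loader) / max(len(val_loader), 1)

        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping
                break
```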
Metrics: What constitutes “success?”
What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
Accuracy still applies, since our goal is to map visemes to phonemes, that is, visual mouth movements to the correct transcript. However, since certain words and sounds in English are more common than others (z versus vowels, for example), our accuracy metric should take some measure of frequency or importance into account. We plan to test whether the generated transcripts match the correct labels, and whether they read at all reasonably.
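One concrete way to score whether a generated transcript matches its label is word error rate (WER), i.e. the word-level edit distance normalized by the reference length. This metric choice is our own illustration rather than something prescribed by the projects above:

```python
# Word error rate via a simple edit-distance DP; lower is better.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Example: wer("place blue at c nine now", "place blue at see nine") == 2/6 ~= 0.33
```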
If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.
In a similar project, the authors had similar goals. However, they only used CNNs and were hoping to achieve greater than 30% accuracy, since that is around the capability of human lip-reading experts.
If you are doing something new, explain how you will assess your model’s performance. What are your base, target, and stretch goals?
We will assess performance based on our accuracy metrics on validation data. We hope to reach a base goal of 30% accuracy to match human capabilities, and we are targeting accuracy near 50%.
Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)
What broader societal issues are relevant to your chosen problem space?
Disability (in particular deafness). More than 13% of U.S. adults suffer from hearing loss.
Why is Deep Learning a good approach to this problem?
Deep learning is a particularly effective approach to this problem because it excels at learning complex patterns and extracting meaningful information from large datasets, such as visual cues from lip movements. By leveraging deep learning techniques, we can develop accurate and efficient solutions for real-time captioning and communication assistance, ultimately enhancing the quality of life for individuals with hearing impairments.
Division of labor: Briefly outline who will be responsible for which part(s) of the project.
Jeff and Jerry will focus on data collection and pre-processing, while Alex and Carson will prioritize building the model.