S.lang AI is a next-generation AI/ML software solution designed to break barriers in sign language translation. Unlike many existing tools, it provides a bidirectional translation experience, taking sign language communication to a whole new level.
Traditionally, sign language translation tools have been limited to a unidirectional approach, converting sign gestures into written or spoken text. While these solutions have undoubtedly been transformative for the deaf and hard-of-hearing communities, a crucial link has been missing: translating spoken or written text back into sign language. S.lang AI bridges this gap.
- Captures Video Frames: The software begins by capturing real-time video frames from an input source, such as a webcam or a pre-recorded video. OpenCV, a popular open-source computer vision library, provides the necessary tools to access video streams and individual frames.
- Hand Detection and Tracking with MediaPipe: MediaPipe, developed by Google, is a powerful framework for building multimodal applied machine learning pipelines. This software utilizes MediaPipe's Hand Tracking module to detect and track hands within each video frame. This process involves identifying key landmarks on the hand, such as fingertips, knuckles, and wrist, using a machine learning-based hand pose estimation model. The model is capable of robustly tracking hands even under various orientations, lighting conditions, and occlusions (a minimal capture-and-tracking sketch follows this list).
- Hand Gesture Classification: Once MediaPipe successfully tracks the hand landmarks, the software extracts relevant hand features and encodes them into a suitable representation for classification. These features may include the spatial positions of fingers, angles between joints, or any other relevant information that characterizes different hand gestures.
- Translation and Output: Once the classifier determines the most probable hand gesture from the extracted features, S.lang translates it into the corresponding letter or symbol in the chosen sign language. The output is then displayed on the screen for the other user to read (a feature-extraction and classification sketch also follows this list).
- Text to Gesture: For the other user, if they would like to communicate back in sign language, they simply input a prompt or text, and the software generates unique gesture images corresponding to that input. What makes this possible is our trained GAN model, which pits two neural networks against each other to generate new, synthetic instances of data. Read more about it in the GAN overview below.
- Real-Time Performance: One of the key strengths of this software is its real-time performance. By leveraging efficient algorithms and optimizations provided by OpenCV and MediaPipe, the application can process video frames rapidly, providing instantaneous hand gesture recognition and translation.
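
The following is a minimal sketch of the capture-and-tracking stage described above, using OpenCV and MediaPipe's Hands solution. It is an illustration rather than the project's exact script; the window name and parameter values are assumptions.

```python
# Sketch: capture webcam frames and track hand landmarks with OpenCV + MediaPipe.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # webcam; a pre-recorded video path also works
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV delivers BGR frames
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # Draw only the hand skeleton (21 landmarks) on the frame,
                # ignoring everything else in the image
                mp_drawing.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("S.lang", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```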
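Building on that loop, here is a hedged sketch of how tracked landmarks might be turned into a feature vector and classified into a letter. The feature layout (wrist-relative x/y coordinates), the pickled model file `model.p`, and the label mapping are illustrative assumptions, not the project's actual implementation.

```python
# Sketch: encode hand landmarks as features and classify them into a letter.
import pickle
import numpy as np

# Hypothetical mapping from class index to sign-language letter
labels = {0: 'A', 1: 'B', 2: 'C'}

# Hypothetical path to a previously trained classifier
with open('model.p', 'rb') as f:
    model = pickle.load(f)

def landmarks_to_features(hand_landmarks):
    """Flatten the 21 (x, y) landmark positions, normalized relative to the
    wrist landmark, into a single feature vector."""
    xs = [lm.x for lm in hand_landmarks.landmark]
    ys = [lm.y for lm in hand_landmarks.landmark]
    base_x, base_y = xs[0], ys[0]  # wrist landmark used as the origin
    feats = []
    for x, y in zip(xs, ys):
        feats.extend([x - base_x, y - base_y])
    return np.asarray(feats).reshape(1, -1)

def classify(hand_landmarks):
    """Predict the most probable gesture and map it to its letter."""
    features = landmarks_to_features(hand_landmarks)
    prediction = model.predict(features)[0]
    return labels[int(prediction)]
```

In the live loop above, `classify(hand)` would be called for each tracked hand and the returned letter rendered onto the frame for the other user to read.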
During training and testing, the results for the first iteration were promising: training accuracy came back at 100% while testing accuracy came back at about 99.22%, indicating a near-perfectly trained model without overfitting.
When running the main script, landmarks are drawn to outline only each finger of the user's hand. This allows S.lang to recognize different gestures without focusing on other objects within the image.
Further trials showed that gesture-to-text translation happened rapidly, with minimal latency issues.
As you can see from this video, we were also able to account for gestures that require movement, such as the letters 'J' and 'Z'. Next, we want to continue training the model on more gestures and engineer it to translate from text to gesture using DALL-E models for text-to-image generation.
A GAN (Generative Adversarial Network) is a type of AI model used to generate new data that resembles a given dataset. Its main components are the generator (which creates new data) and the discriminator (which tries to distinguish real training data from fake data produced by the generator). The two components are trained together in a competitive process until the generator becomes proficient at producing realistic data such as images, video, or speech.
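As a rough illustration of that generator/discriminator interplay, the sketch below outlines one adversarial training step in PyTorch. The fully connected network sizes, the flattened 64x64 grayscale image shape, and the optimizer settings are assumptions made for illustration; the actual S.lang GAN architecture may differ.

```python
# Sketch: one adversarial training step for a minimal GAN.
import torch
import torch.nn as nn

latent_dim = 100
img_dim = 64 * 64  # assumed flattened grayscale gesture image

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):
    """One adversarial round: the discriminator learns to separate real from
    generated images, then the generator learns to fool it."""
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator step: score real images as real, generated images as fake
    noise = torch.randn(batch, latent_dim)
    fake_images = generator(noise)
    d_loss = criterion(discriminator(real_images), real_labels) + \
             criterion(discriminator(fake_images.detach()), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call its output "real"
    g_loss = criterion(discriminator(fake_images), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

In practice, training alternates these two steps over many batches of real gesture images until the generated gestures become convincing.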
When it came to translating from text to gesture, we found that capturing videos and photos of every gesture and storing them for later reference was time- and space-consuming. To combat this, we decided to use a generative model that learns the underlying patterns and can eventually produce new gestures on its own, without the need for supervised training.
Over the following months, we expect to deliver a full-scale model that works on desktop or phone and translates bidirectionally.
To achieve this dual translation, we want to incorporate AI image generation that can create images from a prompt or direct text specified by the user.
In terms of security, we will follow strict protocols to provide a safe and appropriate communication environment for users, along with adding multi-language selection.