Control Robotic Eyes With My Eyes Using AI and Deep Learning
by thomas9363 in Circuits > Computers
The objective of this project is to control the rolling and closing of a pair of robotic eyes using the movement of my own eyes. A computer vision-based Python program built on the MediaPipe framework detects and tracks my eyes in video frames in real time, and the robotic eyes respond and follow. In my initial attempts, I found it challenging to differentiate between upward and downward rolling of the eyes, so I added a deep learning model to the program. This article describes how the approach was implemented and what the outcomes were. All code is available in my GitHub repository.
Setup
Building a Pair of Robotic Eyes:
In the past, I created a pair of robotic eyes that can be controlled using an Android app. The eyeballs are connected to a universal joint, allowing them to move in the x and y directions. The upper and lower eyelids are made from halves of a 40mm lottery ball, hinged together using brass strips for opening and closing. The pushrods connecting the servo horn to the eye are made of 1mm brass wire. Two servos per eye are used to control the eyeball movements in the x and y directions, while one servo handles the closing of the eyelid. Before connecting them, the servo angles are adjusted to 90 degrees. The overall construction is shown above.
Controller and Software:
The controller is a Raspberry Pi 4 with 4GB of RAM, running the Debian 12 Bookworm operating system. To manage the dependencies of my project, I created a virtual environment and installed the required modules, including TensorFlow 2.16.1, OpenCV 4.9.0, MediaPipe 0.10.9, and NumPy 1.26.4. The tracking of human eye movements is performed using Google’s MediaPipe Face Mesh framework via a Logitech C920 USB camera.
Wiring:
The servos are connected to the Pi via a PCA9685 servo driver using its I2C interface. Additionally, I utilized the Adafruit ServoKit library to facilitate servo movement in my code. The wiring configuration is depicted in the connection diagram above.
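As a rough illustration of how the six servos might be addressed through ServoKit, here is a minimal sketch. The channel layout, the `clamp` limits, and the helper names `set_pose` and `neutral_pose` are assumptions made for this example and are not taken from my scripts; only the `kit.servo[n].angle` assignment is the library's own interface.

```python
# Hypothetical channel layout on the PCA9685; adjust to match your wiring.
CHANNELS = {
    "right_x": 0, "right_y": 1, "right_lid": 2,
    "left_x": 3, "left_y": 4, "left_lid": 5,
}
NEUTRAL = 90  # the servos were set to 90 degrees before assembly


def clamp(angle, low=60, high=120):
    """Keep an angle inside an assumed safe mechanical range."""
    return max(low, min(high, angle))


def neutral_pose():
    """Return a pose with every servo at its neutral angle."""
    return {name: NEUTRAL for name in CHANNELS}


def set_pose(kit, pose):
    """Write each angle in `pose` ({name: degrees}) to its servo channel."""
    for name, angle in pose.items():
        kit.servo[CHANNELS[name]].angle = clamp(angle)


# On the Pi (needs the adafruit-circuitpython-servokit package):
#   from adafruit_servokit import ServoKit
#   kit = ServoKit(channels=16)
#   set_pose(kit, neutral_pose())
```

Keeping the hardware object as a parameter means the pose logic can be tried on a laptop with a stand-in object before any servo is connected.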
First Attempt:
MediaPipe Face Mesh is a framework developed by Google that focuses on detecting and tracking facial landmarks in real time. It can track a total of 478 landmark points on a face when iris tracking is enabled, including 468 landmarks for the face mesh and 5 landmarks per eye dedicated to the iris, as shown above. Since my project is about eye control, I am only interested in landmarks around that region.
My initial approach was to track the position of the pupil (point E) relative to points A, B, C, and D to determine the rolling direction of the eyeball. I also calculated the aspect ratio of the eye to determine whether it was open or closed. Since most people's eyes roll in the same direction, I only needed to track the position of one pupil, specifically point E (p[468]) of the right eye. To detect whether the left or right eye was closed, I calculated the aspect ratio of each eye separately. This required four key points from each eye (right eye: p[33], p[133], p[159], p[145] [A, B, C, D]; left eye: p[362], p[263], p[386], p[374]), as defined in the picture above. By calculating the ratios AE/AB and CE/CD, I could determine whether the eyeball was rolling left, right, up, or down. By calculating the ratio CD/AB, I could determine the closing status of the eyes.
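The geometry can be sketched in a few lines. This example assumes the horizontal ratio is the pupil's position along the corner-to-corner line (AE/AB), the vertical ratio is its position between the lids (CE/CD), and the aspect ratio is lid opening over eye width (CD/AB). The function name `eye_ratios` and the fallback value used when the lids touch are illustrative choices, not part of the original scripts.

```python
import math

# Landmark indices for the right eye, as listed in the text.
A, B, C, D, E = 33, 133, 159, 145, 468


def eye_ratios(p):
    """Return (horizontal, vertical, aspect) ratios for the right eye.

    `p` maps a landmark index to an (x, y) pair, for example a dict or
    a list of tuples in pixel coordinates.
    """
    width = math.dist(p[A], p[B])     # AB: corner to corner
    height = math.dist(p[C], p[D])    # CD: upper lid to lower lid
    horizontal = math.dist(p[A], p[E]) / width        # AE / AB
    # Assumed fallback of 0.5 avoids dividing by zero on a closed eye.
    vertical = math.dist(p[C], p[E]) / height if height > 0 else 0.5
    aspect = height / width                           # CD / AB
    return horizontal, vertical, aspect
```

With a pupil sitting exactly in the middle of the eye, both position ratios come out at 0.5, which is why the thresholds are placed symmetrically around that value.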
The program is straightforward. It uses a camera to detect a face, extracts the landmarks using MediaPipe, and converts them into a NumPy array. The x and y coordinates of the nine points mentioned above are used to calculate the necessary distances and ratios. These ratios are then compared against thresholds to determine the direction of eye movement. For example, if the horizontal ratio is greater than 0.6, the eyes are rolling to the left; if it is less than 0.4, the eyes are rolling to the right. When the pupil is centered (between 0.4 and 0.6), the aspect ratios of the eyes are used to assess whether they are open or closed. If the aspect ratio is less than 0.2, the eye is considered closed. Detection and tracking are fast, running at approximately 30 frames per second.
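The threshold logic described above could look roughly like the following. The threshold values are the ones quoted in the text; the function name `classify` and the returned state strings are assumptions for this sketch, and the up/down case is left out because this simple rule handles only left, right, and eyelid state.

```python
LEFT_T, RIGHT_T, CLOSED_T = 0.6, 0.4, 0.2  # thresholds quoted in the text


def classify(horizontal, aspect_right, aspect_left):
    """Map the measured ratios to a state string using fixed thresholds."""
    if horizontal > LEFT_T:
        return "left"
    if horizontal < RIGHT_T:
        return "right"
    # Pupil is centered: use the aspect ratios to check the eyelids.
    right_closed = aspect_right < CLOSED_T
    left_closed = aspect_left < CLOSED_T
    if right_closed and left_closed:
        return "both close"
    if right_closed:
        return "right close"
    if left_closed:
        return "left close"
    return "center"
```

Because every decision is a comparison against a constant, tuning the script for a different face comes down to editing the three values at the top.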
On average, most people blink around 15 to 20 times per minute, with each blink lasting between 0.1 and 0.4 seconds. Therefore, it's crucial to distinguish between intentional closing/opening and involuntary blinking. Otherwise, blinking could cause rapid, unintended eye closures. To prevent this, I introduced a debouncing time of one second, ensuring that uncontrollable blinking does not trigger a closing event.
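One way to implement such a debounce is to report an eye as closed only after it has stayed closed for the full hold time. The sketch below is an assumption about how this could be done, not a copy of my script; the class name `BlinkFilter` and the injectable `clock` parameter are hypothetical.

```python
import time


class BlinkFilter:
    """Report 'closed' only after the eye has stayed closed for `hold` seconds."""

    def __init__(self, hold=1.0, clock=time.monotonic):
        self.hold = hold
        self.clock = clock
        self.closed_since = None  # time the eye was first seen closed

    def update(self, closed):
        """Feed one per-frame reading; return the debounced closed state."""
        if not closed:
            # Any open frame resets the timer, so a short blink never triggers.
            self.closed_since = None
            return False
        now = self.clock()
        if self.closed_since is None:
            self.closed_since = now
        return now - self.closed_since >= self.hold
```

In a frame loop, `update` would be called once per frame with the raw closed/open reading. Reopening takes effect immediately, while closing is delayed by the hold time.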
I have two versions of Python scripts available in my GitHub repository. If you have no robotic eyes but are interested in testing the results, you can run eye_control_ball.py on your Windows machine. You will see a ball moving left or right when you roll your eyes in either direction. You can also close one eye or both eyes to see the color change. If you have built a robotic eye, you can run eye_control_servo.py on your Pi. You can adjust the threshold values to fit your own eyes.
Problem Encountered
The animated GIF at the beginning of this article shows how I used my eyes to control the robotic eyes. The script works well when tracking the opening and closing of the eyes, as well as when the eyes are rolling horizontally. However, it doesn't reliably detect up and down rolling. I adjusted the up-down threshold ratio many times and found that determining whether the eyes are rolling up or down this way is nearly impossible. To see why, I plotted the ratios for left-right movement and up-down movement. The plot shows that left-right movement has three clearly distinct regions: right, center, and left. However, the ratios for up-down movement are all clustered together. Even slightly leaning forward or tilting the head in front of the camera can result in different ratios.
Clearly, a single threshold, which amounts to a straight line through the plot, cannot separate such overlapping data. This is where deep learning comes into play. Ideally, a "squiggly" line can be found to distinguish between "up" and "down" movements. In the next section, I will describe how I use deep learning to find the solution.
AI Approach
In the past, I trained custom hand gestures using MediaPipe Hand to control a pan/tilt mechanism. Using the same algorithm, I aim to train the computer to recognize my eye movements. Since the Raspberry Pi 4 is considered a less powerful edge device, I conducted the training on a 16GB Surface Pro 9 laptop running Windows 11. The trained model is then transferred to the Raspberry Pi for inference to control the robotic eyes.
The software installed on my laptop includes Anaconda and PyCharm. I created a virtual environment using Conda and installed project-specific software, including TensorFlow 2.14, OpenCV 4.8.1, MediaPipe 0.10.8, NumPy 1.26.4, and Jupyter Notebook. Other versions of these modules may also work, but some modules may evolve faster than others. If you encounter any error messages, you can refer to the specified versions to ensure compatibility with my code.
To begin the training, I set up eight classes of eye movement from 0 to 7 as shown above. They are labeled as 'up', 'down', 'right', 'left', 'center', 'both close', 'right close' and 'left close'. You need to remember what each number represents, as they will be used in the detection scripts. The work is divided into three stages: (1) data collection; (2) model training; and (3) model deployment.
Data Collection
The script is 'iris_creat_csv.py'. It uses MediaPipe Face Mesh to detect and extract face landmarks in video frames. Since I am only using the eye region, I extract only the following points:
- right_eye_indices = [33, 7, 163, 144, 145, 153, 154, 155, 133, 173, 157, 158, 159, 160, 161, 246]
- left_eye_indices = [362, 382, 381, 380, 374, 373, 390, 249, 263, 466, 388, 387, 386, 385, 384, 398]
- left_iris_indices = [474, 475, 476, 477, 473]
- right_iris_indices = [469, 470, 471, 472, 468]
I position my face in front of the camera at slightly different angles. When I press one of the keyboard keys 0 to 7, the x and y coordinates of the landmarks in the eye region for that frame are extracted and labeled with that class. The count of each class in the dataset is displayed on the terminal. A center point between the two eyes is calculated, and all extracted points for the frame are normalized to this point and flattened into a NumPy array. The data are sorted at the end of execution. The extracted landmarks, along with their corresponding gesture labels, are stored in a CSV file named iris_gesture_data.csv, ready for training.
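The normalization and flattening step can be sketched as follows. Several details here are assumptions made for the example: the reference point is taken as the midpoint of the two iris centers (landmarks 468 and 473), the label is placed in the first column, and no scaling is applied. The function name `feature_row` is hypothetical.

```python
RIGHT_EYE = [33, 7, 163, 144, 145, 153, 154, 155,
             133, 173, 157, 158, 159, 160, 161, 246]
LEFT_EYE = [362, 382, 381, 380, 374, 373, 390, 249,
            263, 466, 388, 387, 386, 385, 384, 398]
LEFT_IRIS = [474, 475, 476, 477, 473]
RIGHT_IRIS = [469, 470, 471, 472, 468]
EYE_REGION = RIGHT_EYE + LEFT_EYE + LEFT_IRIS + RIGHT_IRIS  # 42 points


def feature_row(p, label):
    """Build one CSV row: the label, then centered and flattened x/y values.

    `p` maps a landmark index to an (x, y) pair.
    """
    # Assumed reference point: midpoint of the two iris centers.
    cx = (p[468][0] + p[473][0]) / 2
    cy = (p[468][1] + p[473][1]) / 2
    row = [label]
    for i in EYE_REGION:
        row.extend((p[i][0] - cx, p[i][1] - cy))
    return row
```

Each row then holds 1 label plus 84 coordinates, and can be appended to the CSV file with `csv.writer(...).writerow(row)` from the standard library. Subtracting a reference point makes the features independent of where the face sits in the frame.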
Model Training
The training script is 'iris_train.ipynb'. I use Jupyter Notebook in PyCharm on my laptop to run the training process. The data from the CSV file is split into training, validation, and test sets, which are then fed into a neural network model built with TensorFlow and Keras. The model layers, including input, hidden, and output layers, are defined using the Keras Sequential API. A simple ReLU activation function is applied to the layers. The neural network is trained on the preprocessed landmark data using TensorFlow's training functionalities. After training and validation, the model is saved in two formats: .h5 and .tflite. The iris_gesture_model.tflite is the trained model that I use, which can be deployed on either Windows OS or a Raspberry Pi.
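The splitting step mentioned above can be illustrated with the standard library alone. The 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are assumptions for this sketch; the notebook may use different fractions or a library helper.

```python
import random


def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # seeded for a repeatable split
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test
```

Shuffling before splitting matters here because the CSV is sorted by class at the end of data collection. Without it, the test set would contain only one or two gestures.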
The training process is fast and can be completed within a minute. The accuracy of my model is approximately 0.967. If you encounter issues with accuracy, adding more data to your eye movement dataset and increasing the number of epochs during training can often help improve performance.
Model Deployment
There are two detection scripts. The first script on the left, iris_detect_tflite_ball.py, uses a graphical interface to test the accuracy of the trained model on my laptop. This script displays a pair of eyes on the screen that follow the movement of your own eyes. The second script on the right, iris_detect_tflite_servo.py, is used to control the robotic eyes. You should copy this script and the model iris_gesture_model.tflite to a directory on your Raspberry Pi and run it within your virtual environment. The gesture labels defined during training are included at the beginning of the scripts. If you use a different sequence or naming convention for your eye movements, you will need to update the script accordingly. You can also add more movements to suit your needs.
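To show how the class numbers tie back to gesture names at inference time, here is a minimal sketch of decoding the model's output vector. The label order follows the list given earlier in this article. The function name `decode` and the `min_confidence` cut-off are assumptions added for illustration and are not necessarily present in the detection scripts.

```python
# Class ids 0 to 7, in the order used during data collection.
LABELS = ["up", "down", "right", "left", "center",
          "both close", "right close", "left close"]


def decode(probabilities, min_confidence=0.6):
    """Return (label, confidence) for the most likely class.

    The label is None when the best score is below `min_confidence`,
    so the caller can keep the previous pose instead of acting on a
    weak prediction.
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    if confidence < min_confidence:
        return None, confidence
    return LABELS[best], confidence
```

If the class order is changed during data collection, only the `LABELS` list needs to change to match.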
Conclusions
The results are demonstrated in the video above. The first part is the computer simulation on my laptop and the second part is running on the Pi to control the robotic eyes. In summary, the scripts successfully detect eye movements and control both the graphical eyes and the robotic eyes. Unlike my initial approach, this method can differentiate between up and down rolling. However, since I only collected a limited amount of data for each eye gesture (between 50 and 75 samples), I sometimes need to move my face and adjust its angle in front of the camera to achieve the desired results. To improve accuracy, you may want to collect more data or include datasets from other individuals for each eye movement.
The procedure described above can be applied to various daily applications, such as controlling electronic devices or software. In my next article, I will demonstrate how to control a software application using eye movements.