Hear the World Using Azure OpenAI and a Raspberry Pi

by marcogerber in Circuits > Raspberry Pi


In my free time, I take a big interest in electronics such as microcomputers (e.g. the Raspberry Pi), microcontrollers (e.g. the ESP32), and many different sensors and circuits. I’m fascinated by how we can combine hardware, physics, and software to build great things, and doing so is more accessible today than ever. I learn best through practical tasks, which is why I keep setting myself challenges. This little project is one of those challenges, and I would like to introduce it to you and show you that not everything is as complex as it seems. I’m no expert in this by any means, so feel free to give feedback.

This project is designed to help visually impaired individuals by recognizing their surroundings and providing audio feedback through a small handheld device. It combines several elements: various hardware components and sensors, audible and haptic feedback, Python code, and Azure AI services.

The project is also explained in greater detail in my blog post: Hear the world using Azure OpenAI and a Raspberry Pi - marcogerber.ch


How it works

The logic of the device is pretty straightforward. We use several hardware components in a circuit, such as a camera, a touch sensor, a vibration motor, an analog speaker, an LED, and a display. The entire logic and orchestration of the hardware components, as well as of the Azure services such as Azure OpenAI and the Speech Service, lives in the Python code.
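
In a nutshell, the loop the device runs looks roughly like this. This is only a minimal sketch with hypothetical helper names; the real implementation follows in the Code section below:

while True:
    if touch_sensor_pressed():                    # hypothetical helper: poll the touch sensor
        photo_path = take_photo()                 # hypothetical helper: capture a snapshot with the camera
        description = describe_image(photo_path)  # hypothetical helper: gpt-4o describes the photo
        speak(description)                        # hypothetical helper: Azure Speech text-to-speech via the speaker
        show_on_display(description)              # hypothetical helper: scroll the text on the OLED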

Supplies


Hardware

  1. Raspberry Pi Zero 2 W
  2. Zero Spy Camera
  3. SSD1306 OLED Display
  4. Vibration Motor
  5. Touch Sensor
  6. MAX98357A Amplifier
  7. Adafruit Mini Oval Speaker
  8. LED and 220 Ω Resistor
  9. Jumper cables
  10. Breadboard for prototyping


Software

  1. Azure OpenAI Service with a gpt-4o model deployed
  2. Azure Speech Service
  3. Python 3.x
  4. Required Python libraries: python-dotenv, requests, RPi.GPIO, gpiozero, openai, Adafruit-SSD1306, adafruit-python-shell, pillow==9.5.0, pygame (os, time, and base64 are part of the Python standard library)
  5. Other libraries and tools: git, curl, libsdl2-mixer-2.0-0, libsdl2-image-2.0-0, libsdl2-2.0-0, libopenjp2-7, libcap-dev, python3-picamera2, i2samp.py

Raspberry Pi Setup

1. Connect to the Raspberry Pi. I use Raspberry Pi OS Lite, which does not include a GUI, and prefer connecting via plain SSH or via Visual Studio Code with the Remote - SSH extension. The latter lets me work with the files and folders on the Pi as if they were on my local machine.
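
For example, assuming the default hostname and your own username (adjust both to your setup):

ssh pi@raspberrypi.local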

2. Enable I2C serial communication protocol in raspi-config:

sudo raspi-config > Interface Options > I2C > Yes > Finish
sudo reboot
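
Optionally, once the display is wired up (see the Circuitry step), you can verify that it is detected on the I2C bus; the SSD1306 typically shows up at address 0x3c:

sudo apt install -y i2c-tools
sudo i2cdetect -y 1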

3. Install missing libraries and tools:

sudo apt-get install git curl libsdl2-mixer-2.0-0 libsdl2-image-2.0-0 libsdl2-2.0-0 libopenjp2-7
sudo apt install -y python3-picamera2 libcap-dev

4. Install I2S amplifier prerequisites:

sudo apt install -y wget
wget https://github.com/adafruit/Raspberry-Pi-Installer-Scripts/raw/main/i2samp.py
sudo -E env PATH=$PATH python3 i2samp.py
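
Once the script has finished (and after rebooting if it asks you to), and with the amplifier and speaker connected, you can optionally test the audio path with ALSA's speaker-test tool:

speaker-test -c 2 -t wav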

5. Create a Python virtual environment:

python3 -m venv --system-site-packages .venv
source .venv/bin/activate

6. Install Python modules:

python3 -m pip install python-dotenv requests RPi.GPIO gpiozero openai Adafruit-SSD1306 adafruit-python-shell pillow==9.5.0 pygame
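
As a quick, optional sanity check that the modules import inside the virtual environment (this one-liner is just a convenience, not part of the project):

python3 -c "import dotenv, gpiozero, openai, pygame, picamera2; print('imports OK')"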

7. Clone my Git repository to the Raspberry Pi:

git clone https://github.com/gerbermarco/hear-the-world.git
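
Then change into the project directory; this is also where the .env file configured later needs to live:

cd hear-the-world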

Prepare the Azure Environment

Our device needs to analyze the photo it takes and then turn the resulting text description into audio. We use two Azure AI services to achieve that:

  • Azure OpenAI with GPT-4o: Analysis of the photo taken
  • Azure Speech Service: Generate audio from text (text-to-speech)

Make a note of the Azure OpenAI endpoint URL, key, gpt-4o deployment name, as well as the key and region of your Azure Speech Service. We need these values in the next step.
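
If you prefer the command line over the Azure portal, the two resources can also be created with the Azure CLI. This is only a sketch with placeholder names, and it assumes your subscription has access to Azure OpenAI; the gpt-4o deployment itself can then be added in the Azure portal or Azure AI Studio:

az cognitiveservices account create --name <openai-name> --resource-group <rg> --kind OpenAI --sku S0 --location <region>
az cognitiveservices account create --name <speech-name> --resource-group <rg> --kind SpeechServices --sku S0 --location <region>
az cognitiveservices account keys list --name <openai-name> --resource-group <rg>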

Update the .env File in the Root Directory of the Project With Your Own Values:

AZURE_OPENAI_ENDPOINT=<your_azure_openai_endpoint>      # The endpoint for your Azure OpenAI service.
AZURE_OPENAI_API_KEY=<your_azure_openai_api_key> # The API key for your Azure OpenAI service.
AZURE_OPENAI_DEPLOYMENT=<your_azure_openai_deployment> # The deployment name for your Azure OpenAI service.
SPEECH_KEY=<your_azure_speech_key> # The key for your Azure Speech service.
SPEECH_REGION=<your_azure_speech_region> # The region for your Azure Speech service.

Circuitry


During the prototyping phase, I learned to handle each component one by one and to control it from the Raspberry Pi with Python code. Using a breadboard, I could easily connect and organize the sensors with simple jumper cables while keeping some overview. Piece by piece, the project slowly took shape.

I created a circuit diagram showing the individual connections. It uses the actual GPIO pin numbers, which must match the pin numbers in the code. For the final product, I simply stacked and screwed the sensors together so that the result looks somewhat like a handheld device.

Code

All the code and associated files can be found in my GitHub repository. The main.py file includes the entire logic of the device. I tried to document the code as best as possible using inline comments.

import os
from time import sleep
import base64
import requests
from dotenv import load_dotenv
import RPi.GPIO as GPIO
from gpiozero import LED, DigitalOutputDevice
from picamera2 import Picamera2, Preview
from openai import AzureOpenAI
import Adafruit_SSD1306
from PIL import Image, ImageDraw, ImageFont
import pygame

load_dotenv()

# Initialize LED
green_led = LED(12)

# Initialize SSD1306 OLED display
RST, DC, SPI_PORT, SPI_DEVICE = 24, 23, 0, 0
disp = Adafruit_SSD1306.SSD1306_128_64(rst=RST)
disp.begin()
disp.clear()
disp.display()

# Create blank image for drawing
width, height = disp.width, disp.height
image = Image.new("1", (width, height))
draw = ImageDraw.Draw(image)

# Set font and padding
font = ImageFont.truetype("includes/fonts/PixelOperator.ttf", 16)
padding, top, bottom, x = -2, -2, height + 2, 0

# Setup touch sensor
touch_pin = 16
GPIO.setmode(GPIO.BCM)
GPIO.setup(touch_pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)

# Camera setup
picam2 = Picamera2()
preview_config = picam2.create_preview_configuration(main={"size": (1024, 768)})
picam2.configure(preview_config)
image_path = "snapshots/snap.jpg"

# Vibration motor setup
vibration_motor = DigitalOutputDevice(25)

# Azure OpenAI setup
oai_api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
oai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
oai_deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT")
oai_api_version = "2023-12-01-preview"

client = AzureOpenAI(
    api_key=oai_api_key,
    api_version=oai_api_version,
    base_url=f"{oai_api_base}/openai/deployments/{oai_deployment_name}",
)

# Azure Speech Service setup
speech_key = os.getenv("SPEECH_KEY")
speech_region = os.getenv("SPEECH_REGION")

# Define audio file paths
audio_file_path_response = "audio/response.mp3"
audio_file_path_device_ready = "includes/audio_snippets/device_ready.mp3"
audio_file_path_analyze_picture = "includes/audio_snippets/analyze_view.mp3"
audio_file_path_hold_still = "includes/audio_snippets/hold_still.mp3"

# Initialize Pygame mixer for audio playback
pygame.mixer.init()


# Helper functions
# Function to update OLED display
def display_screen():
    disp.image(image)
    disp.display()


def scroll_text(display, text):
    # Create blank image for drawing
    width = disp.width
    height = disp.height
    image = Image.new("1", (width, height))

    # Get a drawing context
    draw = ImageDraw.Draw(image)

    # Load a font
    font = ImageFont.truetype("includes/fonts/PixelOperator.ttf", 16)
    font_width, font_height = font.getsize("A")  # Assuming monospace font, get width and height of a character

    # Calculate the maximum number of characters per line
    max_chars_per_line = width // font_width

    # Split the text into lines that fit within the display width
    lines = []
    current_line = ""
    for word in text.split():
        if len(current_line) + len(word) + 1 <= max_chars_per_line:
            current_line += word + " "
        else:
            lines.append(current_line.strip())
            current_line = word + " "
    if current_line:
        lines.append(current_line.strip())

    # Calculate total text height
    total_text_height = (len(lines) * font_height) + 10

    # Initial display of the text
    y = 0
    draw.rectangle((0, 0, width, height), outline=0, fill=0)
    for i, line in enumerate(lines):
        draw.text((0, y + i * font_height), line, font=font, fill=255)
    display_screen()

    if total_text_height > height:
        # If text exceeds screen size, scroll the text
        y = 0
        while y > -total_text_height + height:
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            for i, line in enumerate(lines):
                draw.text((0, y + i * font_height), line, font=font, fill=255)
            disp.image(image)
            disp.display()
            y -= 2.5

    # Clear the display after scrolling is complete
    sleep(2)
    display_screen()


def vibration_pulse():
    vibration_motor.on()
    sleep(0.1)
    vibration_motor.off()


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def play_audio(audio_file_path):
    pygame.mixer.music.load(audio_file_path)
    pygame.mixer.music.play()


def synthesize_speech(text_input):
    url = f"https://{speech_region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": speech_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "User-Agent": "curl",
    }
    data = f"""<speak version='1.0' xml:lang='en-US'>
<voice xml:lang='en-US' xml:gender='Male' name='en-US-ChristopherNeural'>
{text_input}
</voice>
</speak>"""
    response = requests.post(url, headers=headers, data=data)
    with open(audio_file_path_response, "wb") as f:
        f.write(response.content)
    play_audio(audio_file_path_response)


# Play audio "Device is ready"
play_audio(audio_file_path_device_ready)

while True:
    try:
        green_led.on()
        input_state = GPIO.input(touch_pin)

        draw.rectangle((0, 0, width, height), outline=0, fill=0)
        draw.text((x, top + 2), "Device is ready", font=font, fill=255)
        display_screen()

        if input_state == 1:

            play_audio(audio_file_path_hold_still)

            green_led.off()
            sleep(0.1)
            green_led.on()

            vibration_pulse()

            state = 0

            print("Taking photo 📸")
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            draw.text((x, top + 2), "Taking photo ...", font=font, fill=255)
            display_screen()
            picam2.start()
            sleep(1)
            metadata = picam2.capture_file(image_path)
            # picam2.close()
            picam2.stop()

            play_audio(audio_file_path_analyze_picture)
            print("Analysing image ...")
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            draw.text((x, top + 2), "Analysing image ...", font=font, fill=255)
            display_screen()

            # Open the image file and encode it as a base64 string
            base64_image = encode_image(image_path)

            if state == 0:
                green_led.blink(0.1, 0.1)

            response = client.chat.completions.create(
                model=oai_deployment_name,
                messages=[
                    {
                        "role": "system",
                        "content": "You are a device that helps visually impaired people recognize objects. Describe the pictures so that it is as understandable as possible for visually impaired people. Limit your answer to two to three sentences. Only describe the most important part in the image.",
                    },
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Describe this image:"},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/png;base64,{base64_image}"
                                },
                            },
                        ],
                    },
                ],
                max_tokens=2000,
            )

            response = response.choices[0].message.content

            vibration_pulse()
            sleep(0.1)
            vibration_pulse()

            print("Response:")
            print(response)

            synthesize_speech(response)

            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            scroll_text(display_screen, response)

            state = 1
            sleep(5)

    except KeyboardInterrupt:
        draw.rectangle((0, 0, width, height), outline=0, fill=0)
        display_screen()
        print("Interrupted")
        break

    except IOError as e:
        print("Error")
        print(e)

Run the Project

1. Ensure your Raspberry Pi is properly set up with the necessary hardware and software prerequisites.

2. Activate the Python virtual environment (if it is not already active) and run the Python script:

python3 main.py

3. The system will initialize, display “Device is ready” on the OLED screen, and play an audio announcement.

4. Touch the sensor to capture an image, which will be analyzed and described using Azure OpenAI services. The description will be played back via audio.


Watch the video at the top to see the results. This project was enormous fun and showed me that even people with limited expertise can develop exciting projects using today’s technologies. I’m excited to see what the future holds, and I’m already looking forward to new projects. Have fun building it yourself!

Check out my blog for more: All things cloud - marcogerber.ch