In my free time, I have a big interest in electronics like microcomputers (e.g. Raspberry Pi), microcontrollers (e.g. ESP32), and all kinds of sensors and circuits. I’m fascinated by how we can integrate hardware, physics, and software and build great things out of it, and today this is more accessible than ever. I learn best with practical tasks, which is why I always set myself challenges. This little project is one of those challenges, and I would like to introduce it to you and show you that not everything is as complex as it seems. I’m no expert in this by any means, so feel free to give feedback.
As a small teaser, this is what we’re going to build in this blog post:
The challenge
The challenge I set for myself came with the following requirements:
- Use a Raspberry Pi Zero 2 W that was lying around and write the logic in Python, because I wanted to improve my Python skills
- Process real-world data using Azure OpenAI and a GPT model, preferably GPT-4o
- Use various input and output sensors
- Fit everything into a device-like construction
- Provide a real benefit or showcase
The idea
Let’s build a small device that helps visually impaired people understand their surroundings.
The fundamental concept is to develop a device that you can point at your surroundings and receive an audio description of your environment, a point-and-shoot kind of device. As the goal of the device is to help visually impaired people, I have to consider physical aspects like haptic feedback as well.
The final product
I use a Raspberry Pi Zero 2 W as the heart of the device; it runs the code and controls the input/output pins that communicate with the connected sensors. Using it is simple: point the device in the desired direction, press the button, and listen to the audio describing your surroundings. Take a look:
How it works
The logic of the device is pretty straightforward. We use several hardware components in a circuit: a camera, a touch sensor, a vibration motor, an analog speaker, an LED, and a display. The entire logic and the orchestration of the hardware components and the Azure services (Azure OpenAI and the Speech Service) live in the Python code.
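Boiled down to its essence, the flow per button press is: capture a photo, have GPT-4o describe it, turn the description into speech, and play it back. The sketch below illustrates that loop; the helper functions in it are placeholders for illustration only, not the actual functions used in main.py further down:

# Simplified sketch of the device flow (placeholder helpers, not the real main.py)

def touch_sensor_pressed() -> bool: ...      # read the touch sensor via GPIO
def vibrate() -> None: ...                   # short haptic pulse via the vibration motor
def take_photo() -> str: ...                 # capture a frame, return the file path
def describe_image(path: str) -> str: ...    # Azure OpenAI (GPT-4o) vision request
def text_to_speech(text: str) -> str: ...    # Azure Speech Service (TTS), returns an audio file path
def play_audio(path: str) -> None: ...       # analog speaker via the I2S amplifier
def show_on_display(text: str) -> None: ...  # scroll the text on the OLED

while True:
    if touch_sensor_pressed():
        vibrate()                                # confirm the capture haptically
        photo_path = take_photo()
        description = describe_image(photo_path)
        play_audio(text_to_speech(description))
        show_on_display(description)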
Let’s build it
Requirements
Hardware
- Raspberry Pi Zero 2 W
- Zero Spy Camera
- SSD1306 OLED Display
- Vibration Motor
- Touch Sensor
- MAX98357A Amplifier
- Adafruit Mini Oval Speaker
- LED and 220 Ω resistor
- Jumper cables
- Breadboard for prototyping
Software
- Azure OpenAI Service with a gpt-4o model deployed
- Azure Speech Service
- Python 3.x
- Required Python libraries: os, time, base64, requests, python-dotenv, RPi.GPIO, gpiozero, openai, Adafruit-SSD1306, adafruit-python-shell, pillow==9.5.0, pygame
- Other libraries and tools: git, curl, libsdl2-mixer-2.0-0, libsdl2-image-2.0-0, libsdl2-2.0-0, libopenjp2-7, libcap-dev, python3-picamera2, i2samp.py
Raspberry Pi setup
To run the final code, we first need to prepare the Raspberry Pi by installing some libraries and modules. These provide the functionality needed to communicate with and control the connected hardware components.
1. Connect to the Raspberry Pi. I use Raspberry Pi OS Lite, which does not include a GUI. I prefer connecting via plain SSH or Visual Studio Code with the Remote - SSH extension. This lets me connect natively from Visual Studio Code and work with the files and folders on the Pi as if they were on my local machine.
2. Enable I2C serial communication protocol in raspi-config:
sudo raspi-config   # Interface Options > I2C > Yes > Finish
sudo reboot
3. Install missing libraries and tools:
sudo apt-get install git curl libsdl2-mixer-2.0-0 libsdl2-image-2.0-0 libsdl2-2.0-0 libopenjp2-7
sudo apt install -y python3-picamera2 libcap-dev
4. Install I2S amplifier prerequisites:
sudo apt install -y wget
wget https://github.com/adafruit/Raspberry-Pi-Installer-Scripts/raw/main/i2samp.py
sudo -E env PATH=$PATH python3 i2samp.py
5. Create a Python virtual environment:
python3 -m venv --system-site-packages .venv
source .venv/bin/activate
6. Install Python modules:
python3 -m pip install python-dotenv requests RPi.GPIO gpiozero openai Adafruit-SSD1306 adafruit-python-shell pillow==9.5.0 pygame
7. Clone my Git repository to the Raspberry Pi:
git clone https://github.com/gerbermarco/hear-the-world.git
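Before moving on, a quick optional sanity check confirms that the virtual environment can import everything we just installed. The small script below is not part of the repository, just a suggestion; run it inside the activated .venv:

# Optional sanity check: confirm the required modules import cleanly
import importlib

modules = [
    "dotenv", "requests", "RPi.GPIO", "gpiozero", "openai",
    "Adafruit_SSD1306", "PIL", "pygame", "picamera2",
]

for name in modules:
    try:
        importlib.import_module(name)
        print(f"OK    {name}")
    except ImportError as err:
        print(f"FAIL  {name}: {err}")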
Prepare the Azure environment
Our device needs to analyze the photo it takes and turn the resulting text description into audio. We use two Azure AI services to achieve that:
- Azure OpenAI with GPT-4o: Analysis of the photo taken
- Azure Speech Service: Generate audio from text (text-to-speech)
Make a note of the Azure OpenAI endpoint URL, key, gpt-4o deployment name, as well as the key and region of your Azure Speech Service. We need these values in the next step.
Update the .env file in the root directory of the project with your own values:
AZURE_OPENAI_ENDPOINT=<your_azure_openai_endpoint>      # The endpoint for your Azure OpenAI service.
AZURE_OPENAI_API_KEY=<your_azure_openai_api_key>        # The API key for your Azure OpenAI service.
AZURE_OPENAI_DEPLOYMENT=<your_azure_openai_deployment>  # The deployment name for your Azure OpenAI service.
SPEECH_KEY=<your_azure_speech_key>                      # The key for your Azure Speech service.
SPEECH_REGION=<your_azure_speech_region>                # The region for your Azure Speech service.
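To make sure these values are actually picked up, you can run a tiny check with python-dotenv from the project root. This script is just a suggestion and not part of the repository:

# Optional: verify that the .env values are readable (run from the project root)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

for key in (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "SPEECH_KEY",
    "SPEECH_REGION",
):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")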
Circuitry
During the prototyping phase, I got each component working one at a time and learned to control it from the Raspberry Pi with Python code. Using a breadboard, I could easily connect and organize the sensors with simple jumper cables while still keeping an overview. Piece by piece, the project slowly took shape.
I created a circuit diagram showing the individual connections. The GPIO pin numbers in the diagram are the actual ones and match those used in the code.
For the final product, I simply stacked and screwed the sensors together so that the result looks somewhat like a handheld device.
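If you want to verify the wiring before running the full program, a minimal check of the simpler components can be done with gpiozero. This is only a sketch and assumes the same pins as in the diagram and in main.py: LED on GPIO 12, vibration motor on GPIO 25, touch sensor on GPIO 16.

# Quick wiring check (sketch; assumes the GPIO pins used in main.py)
from time import sleep
from gpiozero import LED, DigitalOutputDevice, DigitalInputDevice

green_led = LED(12)                        # status LED
vibration_motor = DigitalOutputDevice(25)  # vibration motor driver pin
touch = DigitalInputDevice(16)             # touch sensor output

green_led.blink(0.2, 0.2)                  # the LED should start blinking in the background

vibration_motor.on()                       # short vibration pulse
sleep(0.2)
vibration_motor.off()

for _ in range(20):                        # touch the sensor: the printed value should change
    print("Touch sensor value:", touch.value)
    sleep(0.5)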
Code
All the code and associated files can be found in my GitHub repository. The main.py file contains the entire logic of the device. I tried to document the code as well as possible using inline comments.
import os
from time import sleep
import base64
import requests
from dotenv import load_dotenv
import RPi.GPIO as GPIO
from gpiozero import LED
from picamera2 import Picamera2, Preview
from openai import AzureOpenAI
import Adafruit_SSD1306
from PIL import Image, ImageDraw, ImageFont
from gpiozero import DigitalOutputDevice
import pygame

load_dotenv()

# Initialize LED
green_led = LED(12)

# Initialize SSD1306 LCD screen
RST, DC, SPI_PORT, SPI_DEVICE = 24, 23, 0, 0
disp = Adafruit_SSD1306.SSD1306_128_64(rst=RST)
disp.begin()
disp.clear()
disp.display()

# Create blank image for drawing
width, height = disp.width, disp.height
image = Image.new("1", (width, height))
draw = ImageDraw.Draw(image)

# Set font and padding
font = ImageFont.truetype("includes/fonts/PixelOperator.ttf", 16)
padding, top, bottom, x = -2, -2, height + 2, 0

# Setup touch sensor
touch_pin = 16
GPIO.setmode(GPIO.BCM)
GPIO.setup(touch_pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)

# Camera setup
picam2 = Picamera2()
preview_config = picam2.create_preview_configuration(main={"size": (1024, 768)})
picam2.configure(preview_config)
image_path = "snapshots/snap.jpg"

# Vibration motor setup
vibration_motor = DigitalOutputDevice(25)

# Azure OpenAI setup
oai_api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
oai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
oai_deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT")
oai_api_version = "2023-12-01-preview"

client = AzureOpenAI(
    api_key=oai_api_key,
    api_version=oai_api_version,
    base_url=f"{oai_api_base}/openai/deployments/{oai_deployment_name}",
)

# Azure Speech Service setup
speech_key = os.getenv("SPEECH_KEY")
speech_region = os.getenv("SPEECH_REGION")

# Define audio file paths
audio_file_path_response = "audio/response.mp3"
audio_file_path_device_ready = "includes/audio_snippets/device_ready.mp3"
audio_file_path_analyze_picture = "includes/audio_snippets/analyze_view.mp3"
audio_file_path_hold_still = "includes/audio_snippets/hold_still.mp3"

# Initialize Pygame mixer for audio playback
pygame.mixer.init()


# Helper functions
# Function to update OLED display
def display_screen():
    disp.image(image)
    disp.display()


def scroll_text(display, text):
    # Create blank image for drawing
    width = disp.width
    height = disp.height
    image = Image.new("1", (width, height))

    # Get a drawing context
    draw = ImageDraw.Draw(image)

    # Load a font
    font = ImageFont.truetype("includes/fonts/PixelOperator.ttf", 16)
    font_width, font_height = font.getsize("A")  # Assuming monospace font, get width and height of a character

    # Calculate the maximum number of characters per line
    max_chars_per_line = width // font_width

    # Split the text into lines that fit within the display width
    lines = []
    current_line = ""
    for word in text.split():
        if len(current_line) + len(word) + 1 <= max_chars_per_line:
            current_line += word + " "
        else:
            lines.append(current_line.strip())
            current_line = word + " "
    if current_line:
        lines.append(current_line.strip())

    # Calculate total text height
    total_text_height = (len(lines) * font_height) + 10

    # Initial display of the text
    y = 0
    draw.rectangle((0, 0, width, height), outline=0, fill=0)
    for i, line in enumerate(lines):
        draw.text((0, y + i * font_height), line, font=font, fill=255)
    display_screen()

    if total_text_height > height:
        # If text exceeds screen size, scroll the text
        y = 0
        while y > -total_text_height + height:
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            for i, line in enumerate(lines):
                draw.text((0, y + i * font_height), line, font=font, fill=255)
            disp.image(image)
            disp.display()
            y -= 2.5

        # Clear the display after scrolling is complete
        sleep(2)
        display_screen()


def vibration_pulse():
    vibration_motor.on()
    sleep(0.1)
    vibration_motor.off()


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def play_audio(audio_file_path):
    pygame.mixer.music.load(audio_file_path)
    pygame.mixer.music.play()


def synthesize_speech(text_input):
    url = f"https://{speech_region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": speech_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "User-Agent": "curl",
    }
    data = f"""<speak version='1.0' xml:lang='en-US'>
        <voice xml:lang='en-US' xml:gender='Male' name='en-US-ChristopherNeural'>
            {text_input}
        </voice>
    </speak>"""
    response = requests.post(url, headers=headers, data=data)
    with open(audio_file_path_response, "wb") as f:
        f.write(response.content)
    play_audio(audio_file_path_response)


# Play audio "Device is ready"
play_audio(audio_file_path_device_ready)

while True:
    try:
        green_led.on()
        input_state = GPIO.input(touch_pin)
        draw.rectangle((0, 0, width, height), outline=0, fill=0)
        draw.text((x, top + 2), "Device is ready", font=font, fill=255)
        display_screen()

        if input_state == 1:
            play_audio(audio_file_path_hold_still)
            green_led.off()
            sleep(0.1)
            green_led.on()
            vibration_pulse()
            state = 0

            print("Taking photo 📸")
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            draw.text((x, top + 2), "Taking photo ...", font=font, fill=255)
            display_screen()

            picam2.start()
            sleep(1)
            metadata = picam2.capture_file(image_path)
            # picam2.close()
            picam2.stop()

            play_audio(audio_file_path_analyze_picture)
            print("Analysing image ...")
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            draw.text((x, top + 2), "Analysing image ...", font=font, fill=255)
            display_screen()

            # Open the image file and encode it as a base64 string
            base64_image = encode_image(image_path)

            if state == 0:
                green_led.blink(0.1, 0.1)
                response = client.chat.completions.create(
                    model=oai_deployment_name,
                    messages=[
                        {
                            "role": "system",
                            "content": "You are a device that helps visually impaired people recognize objects. Describe the pictures so that it is as understandable as possible for visually impaired people. Limit your answer to two to three sentences. Only describe the most important part in the image.",
                        },
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": "Describe this image:"},
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/png;base64,{base64_image}"
                                    },
                                },
                            ],
                        },
                    ],
                    max_tokens=2000,
                )
                response = response.choices[0].message.content
                vibration_pulse()
                sleep(0.1)
                vibration_pulse()
                print("Response:")
                print(response)
                synthesize_speech(response)
                draw.rectangle((0, 0, width, height), outline=0, fill=0)
                scroll_text(display_screen, response)
                state = 1
                sleep(5)

    except KeyboardInterrupt:
        draw.rectangle((0, 0, width, height), outline=0, fill=0)
        display_screen()
        print("Interrupted")
        break

    except IOError:
        print("Error")
        print(IOError)
Run the project
1. Ensure your Raspberry Pi is properly set up with the necessary hardware and software prerequisites.
2. Run the Python script:
python3 main.py
3. The system will initialize, display “Device is ready” on the OLED screen, and play a corresponding audio message.
4. Touch the sensor to capture an image, which will be analyzed and described using Azure OpenAI services. The description will be played back via audio.
Conclusion
This project was enormous fun and showed me that even without deep expertise you can develop exciting projects using today’s technologies. I’m excited to see what the future holds and I’m already looking forward to new projects. Have fun building it yourself!