August 6, 2024 Marco

Hear the world using Azure OpenAI and a Raspberry Pi

In my free time, I have a keen interest in electronics: microcomputers (e.g. Raspberry Pi), microcontrollers (e.g. ESP32), and many different sensors and circuits. I’m fascinated by how we can integrate hardware, physics, and software to build great things, which is more accessible today than ever. I learn best with practical tasks, which is why I always set my own challenges. This little project is one of those challenges, and I would like to introduce it to you and show that not everything is as complex as it seems. I’m no expert in this by any means, so feel free to give feedback.

As a small teaser, this is what we’re going to build in this blog post:

The challenge

My challenge to myself contained the following requirements:

  • Because I wanted to improve my Python skills, I chose a Raspberry Pi Zero 2 W that was lying around
  • Process real-world data using Azure OpenAI and a GPT model, preferably GPT-4o
  • Use various input and output sensors
  • Fit everything into a device-like construction
  • Provide a real benefit or showcase

The idea

Let’s build a small device that helps visually impaired people understand their surroundings.

The fundamental concept is to develop a device that you can point at your surroundings and receive an audio description of your environment, a point-and-shoot kind of device. As the goal of the device is to help visually impaired people, I have to consider physical aspects like haptic feedback as well.

The final product

I use a Raspberry Pi Zero 2 W as the heart of the device. It runs the code and controls the input/output pins that communicate with the connected sensors. The usage is simple: point the device in the desired direction, click a button, and listen to the audio describing your surroundings. Take a look:

How it works

The logic of the device is pretty straightforward. We use several hardware components in a circuit, like a camera, touch sensor, vibration motor, analog speaker, LED, and display. The entire logic and orchestration of the hardware components as well as Azure services like Azure OpenAI or Speech Service lies in the Python code, which looks something like this:
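
The flow described above can be sketched without any hardware. This is a simplified illustration, not the actual main.py: run_cycle and its four callbacks are hypothetical names, and the lambdas stand in for the camera, Azure OpenAI, the Speech Service, and the OLED display.

```python
from typing import Callable, List


def run_cycle(
    capture: Callable[[], bytes],
    describe: Callable[[bytes], str],
    speak: Callable[[str], None],
    notify: Callable[[str], None],
) -> str:
    """Run one point-and-shoot cycle and return the description."""
    notify("Taking photo ...")
    photo = capture()               # camera
    notify("Analysing image ...")
    description = describe(photo)   # Azure OpenAI (GPT-4o)
    speak(description)              # Azure Speech Service and speaker
    notify(description)             # OLED display
    return description


# Stand-ins so the sketch runs without hardware or Azure access
events: List[str] = []
result = run_cycle(
    capture=lambda: b"fake-jpeg-bytes",
    describe=lambda photo: f"A test scene ({len(photo)} bytes).",
    speak=lambda text: events.append(f"speak: {text}"),
    notify=lambda text: events.append(f"display: {text}"),
)
print("\n".join(events))
```

Passing the hardware and cloud calls in as functions keeps the order of the steps easy to read and to test on a laptop before wiring anything up.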

Let’s build it

Requirements

Hardware

  • Raspberry Pi Zero 2 W
  • Zero Spy Camera
  • SSD1306 OLED Display
  • Vibration Motor
  • Touch Sensor
  • MAX98357A Amplifier
  • Adafruit Mini Oval Speaker
  • LED and 220 Ohms Resistor
  • Jumper cables
  • Breadboard for prototyping

Software

  • Azure OpenAI Service with a gpt-4o model deployed
  • Azure Speech Service
  • Python 3.x
  • Required Python libraries: os, time, base64, requests, python-dotenv, RPi.GPIO, gpiozero, openai, Adafruit-SSD1306, adafruit-python-shell, pillow==9.5.0, pygame
  • Other libraries and tools: git, curl, libsdl2-mixer-2.0-0, libsdl2-image-2.0-0, libsdl2-2.0-0, libopenjp2-7, libcap-dev, python3-picamera2, i2samp.py

Raspberry Pi setup

To run the final code, we first need to prepare the Raspberry Pi by installing some libraries and modules. These libraries and modules provide the necessary functionalities to communicate with and control the connected hardware components.

1. Connect to the Raspberry Pi. I use Raspberry Pi OS Lite, which does not include a GUI. I prefer connecting over plain SSH or with Visual Studio Code and the Remote - SSH extension. This lets me connect directly from Visual Studio Code and work with the files and folders as if they were on my local machine.

2. Enable the I2C serial communication protocol in raspi-config. Run the tool, navigate to Interface Options > I2C > Yes > Finish, then reboot:

sudo raspi-config
sudo reboot

3. Install missing libraries and tools:

sudo apt-get install git curl libsdl2-mixer-2.0-0 libsdl2-image-2.0-0 libsdl2-2.0-0 libopenjp2-7
sudo apt install -y python3-picamera2 libcap-dev

4. Install I2S amplifier prerequisites:

sudo apt install -y wget
wget https://github.com/adafruit/Raspberry-Pi-Installer-Scripts/raw/main/i2samp.py
sudo -E env PATH=$PATH python3 i2samp.py

5. Create a Python virtual environment:

python3 -m venv --system-site-packages .venv
source .venv/bin/activate

6. Install Python modules:

python3 -m pip install python-dotenv requests RPi.GPIO gpiozero openai Adafruit-SSD1306 adafruit-python-shell pillow==9.5.0 pygame

7. Clone my Git repository to the Raspberry Pi:

git clone https://github.com/gerbermarco/hear-the-world.git
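
Before running the project, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: missing_modules is a hypothetical helper, and the path /dev/i2c-1 is an assumption about where the I2C bus appears on the Raspberry Pi.

```python
import importlib.util
import os

# Top-level import names used by main.py. Package and import names differ:
# python-dotenv -> dotenv, pillow -> PIL, Adafruit-SSD1306 -> Adafruit_SSD1306
REQUIRED_MODULES = [
    "dotenv", "requests", "RPi", "gpiozero", "picamera2",
    "openai", "Adafruit_SSD1306", "PIL", "pygame",
]

# Assumption: the I2C bus enabled via raspi-config shows up as /dev/i2c-1
I2C_DEVICE = "/dev/i2c-1"


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
print("I2C device present:", os.path.exists(I2C_DEVICE))
```

Run it inside the activated virtual environment so it checks the same packages that main.py will see.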

Prepare the Azure environment

Our device needs to analyze the photo and generate audio from the text description of the photo. We use two Azure AI services to achieve that:

  • Azure OpenAI with GPT-4o: Analysis of the photo taken
  • Azure Speech Service: Generate audio from text (text-to-speech)

Make a note of the Azure OpenAI endpoint URL, key, gpt-4o deployment name, as well as the key and region of your Azure Speech Service. We need these values in the next step.

Update the .env file in the root directory of the project with your own values:

AZURE_OPENAI_ENDPOINT=<your_azure_openai_endpoint>      # The endpoint for your Azure OpenAI service.
AZURE_OPENAI_API_KEY=<your_azure_openai_api_key>        # The API key for your Azure OpenAI service.
AZURE_OPENAI_DEPLOYMENT=<your_azure_openai_deployment>  # The deployment name for your Azure OpenAI service.
SPEECH_KEY=<your_azure_speech_key>                      # The key for your Azure Speech service.
SPEECH_REGION=<your_azure_speech_region>                # The region for your Azure Speech service.
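
A forgotten or still-templated value in the .env file tends to surface only later as an authentication error. As a hedged sketch, a small check like the one below could catch that early; missing_settings is a hypothetical helper and is not part of the repository.

```python
REQUIRED_KEYS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "SPEECH_KEY",
    "SPEECH_REGION",
)


def missing_settings(env):
    """Return required keys that are unset, empty, or still a <placeholder>."""
    missing = []
    for key in REQUIRED_KEYS:
        value = (env.get(key) or "").strip()
        if not value or value.startswith("<"):
            missing.append(key)
    return missing


# Example with made-up values: one key is still a placeholder, one is absent
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_API_KEY": "<your_azure_openai_api_key>",
    "AZURE_OPENAI_DEPLOYMENT": "gpt-4o",
    "SPEECH_KEY": "dummy-key",
}
print(missing_settings(example_env))
```

In the real script this would be called as missing_settings(os.environ) right after load_dotenv().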

Circuitry

During the prototyping phase, I tried to learn to handle each component one by one and to control it using the Raspberry Pi and Python code. Using a breadboard, I could easily connect and organize the sensors using simple jumper cables while keeping somewhat of an overview. One by one the project slowly took shape.

I created a circuit diagram showing all the connections. The pin numbers in the diagram are the actual GPIO (BCM) numbers and should match the code.
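
Since the diagram and the code have to agree, it can be useful to keep the pin assignments in one place. The sketch below lists only the BCM pins that appear in main.py; the I2C and I2S pins used by the display and amplifier are handled by their drivers and are left out. PINS and find_conflicts are hypothetical names for illustration.

```python
# BCM pin numbers as they appear in main.py
PINS = {
    "green_led": 12,
    "touch_sensor": 16,
    "oled_reset": 24,
    "vibration_motor": 25,
}


def find_conflicts(pins):
    """Return {pin: [names]} for every pin assigned more than once."""
    by_pin = {}
    for name, pin in pins.items():
        by_pin.setdefault(pin, []).append(name)
    return {pin: names for pin, names in by_pin.items() if len(names) > 1}


conflicts = find_conflicts(PINS)
print("Pin conflicts:", conflicts if conflicts else "none")
```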

For the final product, I just stacked and screwed together the sensors to look somewhat like a handheld device.

Code

All the code and associated files can be found in my GitHub repository. The main.py file includes the entire logic of the device. I tried to document the code as best as possible using inline comments.

import os
from time import sleep
import base64
import requests
from dotenv import load_dotenv
import RPi.GPIO as GPIO
from gpiozero import LED, DigitalOutputDevice
from picamera2 import Picamera2
from openai import AzureOpenAI
import Adafruit_SSD1306
from PIL import Image, ImageDraw, ImageFont
import pygame

load_dotenv()

# Initialize LED
green_led = LED(12)

# Initialize SSD1306 OLED display (I2C); only the reset pin is passed in
RST = 24
disp = Adafruit_SSD1306.SSD1306_128_64(rst=RST)
disp.begin()
disp.clear()
disp.display()

# Create blank image for drawing
width, height = disp.width, disp.height
image = Image.new("1", (width, height))
draw = ImageDraw.Draw(image)

# Set font and padding
font = ImageFont.truetype("includes/fonts/PixelOperator.ttf", 16)
padding, top, bottom, x = -2, -2, height + 2, 0

# Setup touch sensor
touch_pin = 16
GPIO.setmode(GPIO.BCM)
GPIO.setup(touch_pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)

# Camera setup
picam2 = Picamera2()
preview_config = picam2.create_preview_configuration(main={"size": (1024, 768)})
picam2.configure(preview_config)
image_path = "snapshots/snap.jpg"

# Vibration motor setup
vibration_motor = DigitalOutputDevice(25)

# Azure OpenAI setup
oai_api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
oai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
oai_deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT")
oai_api_version = "2023-12-01-preview"

client = AzureOpenAI(
    api_key=oai_api_key,
    api_version=oai_api_version,
    base_url=f"{oai_api_base}/openai/deployments/{oai_deployment_name}",
)

# Azure Speech Service setup
speech_key = os.getenv("SPEECH_KEY")
speech_region = os.getenv("SPEECH_REGION")

# Define audio file paths
audio_file_path_response = "audio/response.mp3"
audio_file_path_device_ready = "includes/audio_snippets/device_ready.mp3"
audio_file_path_analyze_picture = "includes/audio_snippets/analyze_view.mp3"
audio_file_path_hold_still = "includes/audio_snippets/hold_still.mp3"

# Initialize Pygame mixer for audio playback
pygame.mixer.init()


# Helper functions
# Function to update OLED display
def display_screen():
    disp.image(image)
    disp.display()


def scroll_text(display, text):
    # Create blank image for drawing
    width = disp.width
    height = disp.height
    image = Image.new("1", (width, height))

    # Get a drawing context
    draw = ImageDraw.Draw(image)

    # Load a font
    font = ImageFont.truetype("includes/fonts/PixelOperator.ttf", 16)
    # Assumes a monospace font: one character's size stands in for all.
    # font.getsize() exists only in Pillow < 10 (the setup pins pillow==9.5.0).
    font_width, font_height = font.getsize("A")

    # Calculate the maximum number of characters per line
    max_chars_per_line = width // font_width

    # Split the text into lines that fit within the display width
    lines = []
    current_line = ""
    for word in text.split():
        if len(current_line) + len(word) + 1 <= max_chars_per_line:
            current_line += word + " "
        else:
            lines.append(current_line.strip())
            current_line = word + " "
    if current_line:
        lines.append(current_line.strip())

    # Calculate total text height
    total_text_height = (len(lines) * font_height) + 10

    # Initial display of the text
    y = 0
    draw.rectangle((0, 0, width, height), outline=0, fill=0)
    for i, line in enumerate(lines):
        draw.text((0, y + i * font_height), line, font=font, fill=255)
    # Push the local image; display_screen() would show the global one instead
    disp.image(image)
    disp.display()

    if total_text_height > height:
        # If text exceeds screen size, scroll the text
        y = 0
        while y > -total_text_height + height:
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            for i, line in enumerate(lines):
                draw.text((0, y + i * font_height), line, font=font, fill=255)
            disp.image(image)
            disp.display()
            y -= 2.5

        # Clear the display after scrolling is complete
        sleep(2)
        display_screen()


def vibration_pulse():
    vibration_motor.on()
    sleep(0.1)
    vibration_motor.off()


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def play_audio(audio_file_path):
    pygame.mixer.music.load(audio_file_path)
    pygame.mixer.music.play()


def synthesize_speech(text_input):
    url = f"https://{speech_region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": speech_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "User-Agent": "curl",
    }
    data = f"""<speak version='1.0' xml:lang='en-US'>
        <voice xml:lang='en-US' xml:gender='Male' name='en-US-ChristopherNeural'>
            {text_input}
        </voice>
    </speak>"""
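    # Encode explicitly as UTF-8 so the body does not depend on the HTTP
    # library's default string encoding; the timeout avoids hanging forever
    response = requests.post(url, headers=headers, data=data.encode("utf-8"), timeout=30)
    # Raise on HTTP errors instead of saving an error body as an MP3
    response.raise_for_status()
    with open(audio_file_path_response, "wb") as f:
        f.write(response.content)
    play_audio(audio_file_path_response)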


# Play audio "Device is ready"
play_audio(audio_file_path_device_ready)

while True:
    try:
        green_led.on()
        input_state = GPIO.input(touch_pin)

        draw.rectangle((0, 0, width, height), outline=0, fill=0)
        draw.text((x, top + 2), "Device is ready", font=font, fill=255)
        display_screen()

        if input_state == 1:

            play_audio(audio_file_path_hold_still)

            green_led.off()
            sleep(0.1)
            green_led.on()

            vibration_pulse()

            state = 0

            print("Taking photo 📸")
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            draw.text((x, top + 2), "Taking photo ...", font=font, fill=255)
            display_screen()
            picam2.start()
            sleep(1)
            metadata = picam2.capture_file(image_path)
            # picam2.close()
            picam2.stop()

            play_audio(audio_file_path_analyze_picture)
            print("Analysing image ...")
            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            draw.text((x, top + 2), "Analysing image ...", font=font, fill=255)
            display_screen()

            # Open the image file and encode it as a base64 string
            base64_image = encode_image(image_path)

            if state == 0:
                green_led.blink(0.1, 0.1)

                response = client.chat.completions.create(
                    model=oai_deployment_name,
                    messages=[
                        {
                            "role": "system",
                            "content": "You are a device that helps visually impaired people recognize objects. Describe the pictures so that it is as understandable as possible for visually impaired people. Limit your answer to two to three sentences. Only describe the most important part in the image.",
                        },
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": "Describe this image:"},
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        # The snapshot is saved as a .jpg file
                                        "url": f"data:image/jpeg;base64,{base64_image}"
                                    },
                                },
                            ],
                        },
                    ],
                    max_tokens=2000,
                )

            response = response.choices[0].message.content

            vibration_pulse()
            sleep(0.1)
            vibration_pulse()

            print("Response:")
            print(response)

            synthesize_speech(response)

            draw.rectangle((0, 0, width, height), outline=0, fill=0)
            scroll_text(display_screen, response)

            state = 1
            sleep(5)

    except KeyboardInterrupt:
        draw.rectangle((0, 0, width, height), outline=0, fill=0)
        display_screen()
        print("Interrupted")
        break

    except IOError as err:
        # Print the actual exception instance, not the IOError class
        print(f"Error: {err}")
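
One detail worth considering in synthesize_speech: the model’s answer is placed directly into the SSML. If the answer ever contains characters such as & or <, the XML sent to the Speech Service would be malformed. A minimal sketch of one way to guard against that, using the standard library’s xml.sax.saxutils.escape; build_ssml is a hypothetical helper and not part of the repository:

```python
from xml.sax.saxutils import escape


def build_ssml(text, voice="en-US-ChristopherNeural", lang="en-US"):
    """Return an SSML document with the spoken text XML-escaped."""
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice xml:lang='{lang}' name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )


# Characters that would otherwise break the XML are replaced by entities
ssml = build_ssml("A cat & a dog sit on a <red> sofa.")
print(ssml)
```

The returned string could then be passed as the request body in place of the inline f-string.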

Run the project

1. Ensure your Raspberry Pi is properly set up with the necessary hardware and software prerequisites.

2. Run the Python script:

python3 main.py

3. The system will initialize, display “Device is ready” on the OLED screen, and play a short audio message.

4. Touch the sensor to capture an image, which will be analyzed and described using Azure OpenAI services. The description will be played back via audio.
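
For the analysis step, the captured photo is sent to the model as a base64 data URL. As a hedged sketch, the MIME type in that URL can be derived from the file name so it always matches the file that was saved (snap.jpg is a JPEG). to_data_url is a hypothetical helper, and the demo bytes are not a real image.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(path):
    """Encode a file as a data URL, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Demo with a throwaway file containing only the first three JPEG bytes
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "snap.jpg")
    with open(demo_path, "wb") as demo_file:
        demo_file.write(b"\xff\xd8\xff")
    url = to_data_url(demo_path)
print(url)
```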

Conclusion

This project was enormous fun and showed me that even people with less expertise can develop exciting projects using today’s technologies. I’m excited to see what the future holds and I’m already looking forward to new projects. Have fun building it yourself!

Marco Gerber

Senior Cloud Engineer, keen on Azure and cloud technologies.
