Vision-impaired individuals often face barriers that prevent them from participating in the same experiences as non-vision-impaired individuals. This is particularly evident in activities related to travel and tourism. As a result, their quality of life can be diminished and they may feel a sense of missing out.
Our team developed an accessible mechanism that allows vision-impaired individuals to take part in a safari by enabling participants to learn about their surroundings through sonification and scene description in an easy-to-use interface. The system uses computer vision to identify and track animals (YOLOv8) and to perform scene recognition (LLaVA). MaxMSP receives data from YOLO to produce a sonification conveying real-time information about the animals. Text-to-speech is applied to LLaVA's output to describe the scene at specified intervals. The full system runs in a Streamlit web application with a simple UI.
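As a rough illustration of the interval-based scene description step, the sketch below captions the current frame every so often and speaks the result aloud. The `describe_scene` helper stands in for the LLaVA call, pyttsx3 is just one possible TTS engine, and the interval value is an assumption, not the paper's exact implementation.

```python
import time
import pyttsx3  # offline text-to-speech; any TTS engine could be substituted

DESCRIPTION_INTERVAL_S = 30  # assumed interval between spoken scene descriptions


def describe_scene(frame) -> str:
    """Hypothetical stand-in for the LLaVA scene-recognition call.

    The real pipeline would send the frame to LLaVA and return its description;
    here we return a fixed placeholder so the loop is runnable on its own.
    """
    return "A placeholder scene description."


def narrate_scene_periodically(get_frame):
    """Every DESCRIPTION_INTERVAL_S seconds, caption the latest frame and speak it."""
    tts = pyttsx3.init()
    while True:
        frame = get_frame()              # latest frame from the active video source
        caption = describe_scene(frame)  # scene description (hypothetical helper)
        tts.say(caption)
        tts.runAndWait()                 # block until speech finishes
        time.sleep(DESCRIPTION_INTERVAL_S)
```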
The completed research paper can be accessed here.
Research, Design, Sonification
Below is a simplified diagram of the system. The main application consists of four parts: a web server (input), MaxMSP (output), an object detection pipeline (processing), and a scene description pipeline (processing). The web server is the site the user interacts with. The application handles three types of input: live webcam footage (real-time sensing), recorded video footage (testing and debugging), and direct YouTube footage (testing and debugging).
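A minimal sketch of how the input side of such a Streamlit app could look, letting the user pick one of the three source types before starting the pipelines. The widget choices and the `run_pipeline` helper are illustrative assumptions; live webcam streaming in Streamlit typically requires an extra component (e.g. streamlit-webrtc), which is omitted here.

```python
import streamlit as st


def run_pipeline(source):
    """Hypothetical entry point: feeds frames from `source` to the YOLOv8 and
    LLaVA pipelines and streams detection data to MaxMSP."""
    st.write("Pipeline started for:", source)


st.title("Accessible Safari")

# Choose among the three supported input types
source_type = st.radio(
    "Video source",
    ["Live webcam", "Recorded video", "YouTube URL"],
)

if source_type == "Live webcam":
    # Real-time sensing; a live-streaming component (e.g. streamlit-webrtc)
    # would normally replace this single-snapshot widget.
    source = st.camera_input("Point the camera at the scene")
elif source_type == "Recorded video":
    source = st.file_uploader("Upload a video file", type=["mp4", "mov", "avi"])
else:
    source = st.text_input("Paste a YouTube link")

if source and st.button("Start"):
    run_pipeline(source)
```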
MaxMSP runs a Node.js server that listens for incoming object detection (YOLOv8) results. For each detected object, a new synth instance is generated with a unique harmonic root note. Values from YOLO are mapped to synth parameters and audio effects. Our approach prioritizes information perception and understanding over musicality and aesthetics.
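The sketch below shows one plausible shape for the Python side of that hand-off: run YOLOv8 tracking, reduce each detection to a small JSON message (class, track ID, position, box size), and send it over UDP to the port MaxMSP's server listens on. The port number, message schema, and suggested parameter mappings are assumptions for illustration, not the exact protocol described in the paper.

```python
import json
import socket

from ultralytics import YOLO  # YOLOv8

MAX_HOST, MAX_PORT = "127.0.0.1", 7400  # assumed address of the listener inside Max


def stream_detections(source=0):
    """Track animals with YOLOv8 and forward per-object data to MaxMSP as JSON over UDP."""
    model = YOLO("yolov8n.pt")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # stream=True yields one Results object per frame; persist=True keeps track IDs stable
    for result in model.track(source=source, stream=True, persist=True):
        h, w = result.orig_shape
        for box in result.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            msg = {
                "id": int(box.id) if box.id is not None else -1,  # track ID -> which synth instance
                "class": model.names[int(box.cls)],               # species label (e.g. root-note choice)
                "x": ((x1 + x2) / 2) / w,                         # normalized horizontal position (e.g. pan)
                "size": ((x2 - x1) * (y2 - y1)) / (w * h),        # relative box area (e.g. loudness)
            }
            sock.sendto(json.dumps(msg).encode(), (MAX_HOST, MAX_PORT))


if __name__ == "__main__":
    stream_detections(0)  # 0 = default webcam
```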