The Impact of YOLOv8 and YOLOv9 Neural Network Architectures on the Quality and Speed of Object Detection in Assistive VR/MR Applications for Visually Impaired Users

Piotr Lichograj
European Research Studies Journal, Volume XXVIII, Issue 4, 1044-1057, 2025
DOI: 10.35808/ersj/4157

Abstract:

Purpose: This paper examines the performance and applicability of lightweight deep-learning architectures for real-time object detection in mixed-reality (MR) and virtual-reality (VR) environments designed to assist blind and visually impaired individuals. The primary objective is to determine which model, YOLOv8 or the lightweight YOLOv9-t variant, provides the better balance of accuracy, sensitivity, and processing speed, particularly for on-device deployment on wearable hardware such as the Meta Quest 3 headset. The research addresses the growing need for efficient, edge-based AI solutions that enhance navigation and spatial awareness without relying on cloud computing, thereby ensuring user privacy and low latency.

Design/Methodology/Approach: A custom dataset of 15,000 annotated images was curated from a 30-minute first-person video recorded in a public building. It covers 17 object classes critical for indoor navigation, such as persons, doors (open, closed, glass), stairs (up and down), elevators, furniture (chairs, desks, benches, sofas), and amenities (reception, cafeteria, cloakroom, trash bins, vending machines, computers). The images reflect realistic challenges, including varying lighting conditions, occlusions, and a predominance of small or distant objects, mimicking everyday scenarios faced by visually impaired users. The dataset was split into training and testing subsets to ensure comparability, and augmentations, including random scaling, flipping, and brightness adjustments, were applied to improve model robustness. Both YOLOv8 and YOLOv9-t were trained under identical conditions, using the same hyperparameters, image resolution (640 × 640), and augmentation strategies. Evaluation metrics comprised precision, recall, F1-score, mean Average Precision at an IoU threshold of 0.5 (mAP@0.5) and averaged over IoU thresholds from 0.5 to 0.95 (mAP@[0.5:0.95]), alongside a detailed F1-confidence curve analysis.
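The detection metrics named above can be illustrated with a minimal sketch. The helper below is a toy example, not the paper's evaluation code: it greedily matches detections to ground-truth boxes at IoU ≥ 0.5 and derives precision, recall, and F1 for a single image; the box coordinates are invented for illustration.

```python
# Toy sketch (not the paper's code): precision/recall/F1 from IoU-matched
# detections, the quantities underlying mAP@0.5. Boxes are (x1, y1, x2, y2);
# a detection is a true positive when its IoU with an unmatched ground-truth
# box reaches the threshold (0.5 here).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def prf1(detections, ground_truth, iou_thr=0.5):
    """Greedy one-to-one matching of detections against ground truth."""
    matched = set()
    tp = 0
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(det, gt) >= iou_thr:
                matched.add(i)
                tp += 1
                break
    fp = len(detections) - tp          # detections with no ground-truth match
    fn = len(ground_truth) - tp        # ground-truth objects that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]     # two ground-truth boxes
dets = [(1, 1, 11, 11), (50, 50, 60, 60)]   # one good hit, one stray box
print(prf1(dets, gt))                       # -> (0.5, 0.5, 0.5)
```

Averaging the precision-recall trade-off over classes and IoU thresholds from 0.5 to 0.95 yields the mAP@[0.5:0.95] figure reported in the study.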
Post-training, both models were converted to the ONNX format for cross-platform compatibility and deployed in a Unity Passthrough environment on the Meta Quest 3, leveraging Vulkan and NNAPI for hardware-accelerated inference. This setup enabled real-time testing in an MR context, where detected objects are projected into the 3D scene with bounding boxes, allowing immediate multimodal feedback.

Findings: Results indicate that YOLOv9-t attains overall accuracy comparable to YOLOv8, with mAP@0.5 scores of 0.965 and 0.969, respectively. However, YOLOv9-t reaches its peak F1-score at a lower confidence threshold (approximately 0.40, versus 0.45 for YOLOv8), yielding higher recall and fewer missed detections without a substantial increase in false positives. This behaviour is particularly advantageous in assistive MR systems, where prioritising safety, such as alerting users to hazards like stairs or obstacles, matters more than eliminating minor false alarms. YOLOv9-t also converges faster during training (stabilising after 60 epochs versus 100 for YOLOv8) and shows lower inference latency on mobile hardware, making it more efficient for battery-constrained wearables. These findings position YOLOv9-t as the stronger choice for MR-based assistive navigation tools: its robustness at lower thresholds supports graded multimodal cues (e.g., initial haptic vibrations escalating to audio descriptions), while on-device processing aligns with ethical standards for data privacy. By integrating technical evaluation with usability and ethical considerations, the study contributes to inclusive AI technologies that promote greater independence, mobility, and safety for visually impaired individuals.

Practical implications: As one of the first empirical comparisons of YOLOv8 and YOLOv9-t in authentic MR settings, this work emphasises human-centred design principles.
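The F1-confidence analysis described above can be sketched numerically. The snippet below is a toy illustration with synthetic scores, not the paper's data: it sweeps candidate confidence thresholds and returns the one maximising F1, showing how a model whose F1 peaks at a lower threshold (as reported for YOLOv9-t, about 0.40 versus 0.45) keeps more true positives and thus achieves higher recall.

```python
# Toy sketch (synthetic scores, not the paper's data): choose the operating
# confidence threshold at the peak of the F1-confidence curve. Each detection
# is (confidence, is_correct); n_gt is the number of ground-truth objects.

def f1_at_threshold(dets, n_gt, thr):
    kept = [correct for conf, correct in dets if conf >= thr]
    tp = sum(kept)                     # correct detections kept
    fp = len(kept) - tp                # wrong detections kept
    p = tp / (tp + fp) if kept else 0.0
    r = tp / n_gt if n_gt else 0.0     # missed objects lower recall
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(dets, n_gt, step=0.05):
    """Sweep thresholds in (0, 1) and return the F1-maximising one."""
    thresholds = [round(i * step, 2) for i in range(1, int(1 / step))]
    return max(thresholds, key=lambda t: f1_at_threshold(dets, n_gt, t))

# Synthetic example: four correct detections, two spurious ones, five objects.
dets = [(0.9, True), (0.8, True), (0.45, True), (0.42, True),
        (0.38, False), (0.3, False)]
print(best_threshold(dets, 5))   # -> 0.4
```

In this example the threshold 0.4 keeps all four correct detections while dropping both spurious ones; raising it to 0.45 discards a true positive and lowers recall, which is exactly the safety trade-off the study highlights for assistive use.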
It offers practical insights for developing AI-driven applications in fields such as education (e.g., virtual training simulations), rehabilitation (e.g., mobility therapy), and healthcare (e.g., daily living aids).

Originality/Value: Future research could explore ensemble methods or integrate depth sensing for even more precise spatial mapping, ultimately fostering smarter and more accessible environments in smart buildings and public spaces.
