Diagram descriptions using the latest advancements in generative AI
Introduction
I-Stem, an organization dedicated to inclusivity in education, was engaged by Samarthanam Trust to conduct R&D on making Science, Technology, Engineering, and Mathematics (STEM) learning more accessible to individuals in India who are blind or have low vision. Recognizing the critical role of visual content in STEM education, this project focused on developing a web-based application, complemented by an Application Programming Interface (API), that converts images from textbook pages, scientific diagrams, and other STEM-related visual materials into comprehensive alternative text descriptions. This initiative is a step towards eliminating the barriers faced by the visually impaired community and ensuring equitable access to educational resources and opportunities.
Background
STEM education heavily relies on visual representations such as diagrams, graphs, and images to explain complex concepts. This reliance creates a significant challenge for learners who are blind or have low vision, leading to an accessibility gap in education. Traditional methods of overcoming these barriers, such as braille textbooks and tactile diagrams, while useful, cannot fully address the dynamic and detailed nature of STEM content. The advent of digital technology presents a novel opportunity to bridge this gap through innovative solutions.
Problem Statement
The project was conceived in response to the critical need for accessible STEM learning resources for the blind and low vision community in India. The main challenges identified were:
- The lack of immediate and detailed access to visual STEM content for visually impaired learners
- The scarcity of scalable and cost-effective solutions to convert visual STEM material into accessible formats
- The need for contextual understanding of STEM imagery to provide meaningful descriptions
Solution
The solution developed by I-Stem is a web-based application and API designed to process images of pages containing STEM content and return detailed alternative text descriptions. The system uses the latest vision foundation models to analyze the visual content and generate descriptions that are both accurate and contextually relevant. The application is intended to be user-friendly, allowing educators and learners to upload images and receive near-instant descriptions, facilitating a more inclusive learning environment.
Technical Details
System Architecture
The project comprises two main components: a web-based application and a back-end API. The web-based application serves as the interface through which users upload images, while the API processes the images, analyzes them using vision foundation models, and returns the alternative text.
Image Recognition and Processing
The core of the system uses vision foundation models that have been trained on massive amounts of data. Prompt engineering establishes the STEM learning context in which each image is described.
User Interface
The web application features a simple, accessible design, enabling users to easily navigate and use the service. It includes functionality for file upload, image preview, and displaying the generated alternative text. Accessibility features, such as screen reader compatibility and keyboard navigation, are integral to the design.
Deliverables
The project successfully delivered the following:
- A fully functional web-based application for uploading and processing STEM images
- An API that analyzes uploaded images and returns detailed alternative text descriptions
- Documentation for using the web application and API
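The image-to-description flow of the API (an uploaded image goes in, a STEM-oriented prompt is attached, and alternative text comes back) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the prompt wording, the `DescriptionRequest` type, the `describe_image` function, and the `alt_text` response field are hypothetical and are not taken from the project's actual API. The vision model is passed in as a plain callable so the sketch does not depend on any particular provider.

```python
import base64
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt text; the project's real prompt is not given in this report.
STEM_PROMPT = (
    "You are describing an image from a STEM textbook for a reader who is "
    "blind or has low vision. Describe the diagram's structure, labels, "
    "axes, units, and relationships in clear English."
)


@dataclass
class DescriptionRequest:
    image_b64: str  # base64-encoded image bytes
    prompt: str     # STEM-specific instructions sent along with the image


def build_request(image_bytes: bytes, subject: str = "") -> DescriptionRequest:
    """Package an uploaded image together with a STEM-oriented prompt."""
    prompt = STEM_PROMPT
    if subject:
        # Optional extra context, e.g. "mechanics" or "electronics".
        prompt += f" The subject area is {subject}."
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return DescriptionRequest(image_b64=encoded, prompt=prompt)


def describe_image(
    image_bytes: bytes,
    model: Callable[[DescriptionRequest], str],
    subject: str = "",
) -> dict:
    """Run the injected vision model and return an API-style response."""
    if not image_bytes:
        raise ValueError("empty image upload")
    request = build_request(image_bytes, subject)
    alt_text = model(request).strip()
    return {"alt_text": alt_text}
```

In a real deployment, `model` would wrap a call to whichever vision foundation model the service uses. Passing the subject area as extra context is one possible way to supply the domain-specific information that the limitations below identify as a challenge.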
Limitations
While the project marks a significant advancement in accessible STEM education, certain limitations were identified:
- The accuracy of alternative text can vary depending on the quality and complexity of the image. In particular, domain-specific images relying on a lot of contextual information continue to be challenging to process.
- Currently, the system is optimized for English language content, which may limit its applicability in multilingual settings.
- The system needs more extensive context than an image alone provides in order to describe domain-specific images well.
Next Steps
Building on this project, future directions include:
- AI research to improve descriptions of images in specific domains that are known to be challenging (e.g., mechanics and electronics)
- Developing partnerships with educational institutions to integrate this technology into their learning management systems
- Conducting workshops for educators and learners to maximize the utility of the application and promote inclusive education practices
Facilitating Accessible STEM Learning through Audio for the Blind and Low Vision Community in India
Introduction
In an endeavor to bridge the accessibility gap in STEM education for individuals with visual impairments, Samarthanam Trust engaged I-Stem on a project focused on enhancing the learning experience for the blind and low vision community in India. Recognizing the unique challenges faced by these individuals, particularly in accessing and comprehending mathematical content, the project aimed at developing a web-based application and an accompanying API. This innovative solution converts mathematical documents into natural-sounding audio, adhering to spoken math conventions and practices prevalent in India, thereby facilitating a more inclusive and accessible educational environment for visually impaired learners.
Background
Mathematics plays a pivotal role in STEM education, serving as a foundation for various scientific disciplines. However, the inherent visual nature of mathematical content, including symbols, equations, and graphs, poses significant challenges for learners with visual impairments. Traditional assistive technologies, such as screen readers, often fall short in effectively conveying complex mathematical information, leading to a disparity in educational opportunities for the blind and low vision community.
Problem Statement
The project addressed the critical need for an accessible solution that transcends the limitations of existing technologies in delivering mathematical content to visually impaired learners. Key challenges identified include:
- The complexity of mathematical notation, which is difficult to interpret using standard text-to-speech technologies
- The lack of adherence to localized spoken math conventions in existing solutions, leading to confusion and misinterpretation
- The absence of an intuitive and accessible platform for converting mathematical documents into comprehensible audio formats for the visually impaired
Solution
To address these challenges, I-Stem developed a web-based application coupled with an API designed to transform mathematical documents into audio. This solution uses neural text-to-speech (TTS) technology tailored to the specific needs of the target audience, incorporating the nuances of spoken math conventions and practices in India through a large language model (LLM). By doing so, it ensures that the generated audio is both natural-sounding and educationally effective, providing a valuable tool for learners with visual impairments.
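To make the idea of mapping notation to spoken words concrete, here is a minimal Python sketch. It is a simple rule-based stand-in and not the project's method, which relies on an LLM. The symbol table, the `speak_math` function, and the specific word choices ("into" for multiplication, "by" for division) are assumptions chosen to illustrate how localized conventions might be expressed.

```python
import re

# Illustrative symbol-to-speech table; not the project's actual rules.
SPOKEN = {
    "+": "plus",
    "-": "minus",
    "=": "equals",
    "×": "into",  # assumed classroom usage for multiplication
    "*": "into",
    "/": "by",    # assumed usage for division, as in "a by b"
    "÷": "divided by",
    "^2": "squared",
    "^3": "cubed",
}

# Tokens: exponents such as ^2, words, numbers, or any single symbol.
TOKEN = re.compile(r"\^\d+|[A-Za-z]+|\d+(?:\.\d+)?|\S")


def speak_math(expression: str) -> str:
    """Convert a simple linear math expression into spoken words."""
    words = []
    for token in TOKEN.findall(expression):
        if token in SPOKEN:
            words.append(SPOKEN[token])
        elif token.startswith("^"):
            # Exponents without a dedicated word, e.g. ^5.
            words.append(f"to the power {token[1:]}")
        else:
            # Variables and numbers are passed through unchanged.
            words.append(token)
    return " ".join(words)


print(speak_math("x^2 + 3*y = 10"))  # x squared plus 3 into y equals 10
```

A fixed table like this cannot handle nested fractions, grouping, or ambiguous notation, which is one reason a language model guided by localized prompts is a plausible choice for that step.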
Technical Details
System Architecture
The project consists of two primary components: the web-based application interface for users and the back-end API responsible for processing mathematical documents and converting them into audio.
Text-to-Speech (TTS) Technology
The core of the system employs TTS technology, enhanced with output from LLMs designed to accurately interpret mathematical notation and convert it into spoken words. This involves mapping complex mathematical symbols and expressions to their spoken equivalents, taking into account regional variations in math terminology.
Localization and Customization
A significant focus was placed on ensuring the audio output adheres to Indian spoken math conventions. This was achieved through extensive research and collaboration with local educators and subject matter experts, whose input was translated into effective prompts for the LLM. The result is a highly localized and effective learning tool.
User Interface
The web application features an accessible and user-friendly interface, allowing users to easily upload mathematical documents in various formats (e.g., PDF) and receive instant audio conversions. Accessibility considerations, such as compatibility with screen readers and keyboard navigation, were paramount in the design process.
Deliverables
The project successfully delivered:
- A web-based application for converting mathematical documents into natural-sounding audio
- An API that enables third-party integration, extending the solution’s reach and utility
- Comprehensive documentation and user guides, emphasizing accessibility and ease of use
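For readers thinking about integration, the conversion path described in this report (document text, then LLM rewriting under localized conventions, then TTS) might be wired together roughly as in the Python sketch below. The function names, the prompt wording, and the two example conventions are hypothetical. The real prompts came out of work with educators and are not reproduced here. The LLM and TTS engines are injected as callables, so no specific vendor API is assumed.

```python
from typing import Callable, Sequence

# Hypothetical convention notes, for illustration only.
DEFAULT_CONVENTIONS = (
    "Read fractions as 'numerator by denominator'.",
    "Read multiplication as 'into'.",
)


def build_spoken_math_prompt(
    source_text: str,
    conventions: Sequence[str] = DEFAULT_CONVENTIONS,
) -> str:
    """Assemble an LLM prompt that carries the localized reading rules."""
    rules = "\n".join(f"- {rule}" for rule in conventions)
    return (
        "Rewrite the following mathematical text as it would be read aloud.\n"
        "Follow these spoken-math conventions:\n"
        f"{rules}\n\n"
        f"Text:\n{source_text}"
    )


def document_to_audio(
    source_text: str,
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Rewrite math as speech-ready text, then synthesize audio bytes."""
    if not source_text.strip():
        raise ValueError("document contains no text")
    spoken = llm(build_spoken_math_prompt(source_text))
    return tts(spoken)
```

Keeping the conventions as data rather than hard-coding them into the prompt is one way to support the ongoing updates to spoken math conventions that the limitations below mention.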
Limitations
Despite its achievements, the project encountered several limitations:
- The quality of the audio conversion can vary with the complexity of the mathematical content and the clarity of the original document
- The current version primarily supports English, with limited support for regional languages, which may restrict its applicability across India for now
- Ongoing updates and maintenance are required to ensure the accuracy and relevance of the spoken math conventions
Next Steps
Future directions for the project include:
- Enhancing the system’s capabilities to support additional regional languages, broadening its accessibility and impact
- Continuously updating the prompt to refine the interpretation of mathematical notation and improve audio quality
- Expanding outreach and collaboration with educational institutions to integrate this technology into their teaching methodologies
- Gathering feedback from users to inform iterative improvements and address specific needs of the blind and low vision community
Conclusion
This project represents a significant advancement in making STEM education accessible for the blind and low vision community in India. By providing a solution that converts mathematical documents into natural-sounding audio, we have addressed a critical barrier in STEM learning for visually impaired individuals. Moving forward, continuous innovation, collaboration, and feedback will be key to enhancing the effectiveness of this technology and ensuring that it meets the evolving needs of its users.
http://52.36.237.33/