Approximately 40 million Americans live with some form of disability. These disabilities can get in the way of enjoying life. The NYU Ability Project partnered with Charter Spectrum (formerly Time Warner) and the NYC Media Lab to research universal design for home entertainment across all platforms.
NYU's Ability Project
May - August 2017
I worked as a developer, coder, and researcher. My responsibilities included user testing, prototyping, and designing an Artificial Intelligence system to generate Video Descriptions.
Charter-Spectrum provided the lab with the following goals:
Humanize access issues and our research (less abstraction, more stories)
Consider a range of disabilities
Model a framework for inclusive design
Present ideas across platforms
Address metadata issues
Address institution-wide adoption of accessible design and development principles
To identify issues as users experience them, we performed needfinding and user testing with people living with disabilities, focusing on those with cerebral palsy, cognitive disabilities, hearing impairments, and low visual acuity. To connect with those communities, we partnered with several New York-based organizations. These groups helped us recruit participants who were willing to share their experiences with the lack of universal access in home entertainment and to test Charter-Spectrum products and give feedback.
We evaluated Charter-Spectrum's products in five categories: Graphical User Interface (GUI), Screen-Reader Compatibility, Tangible Interfaces (focusing specifically on remotes), Onboarding, and Descriptive Services. We tested these across five media platforms: Xbox One, Laptop, Roku, iPad, and Set-Top Box/TV.
Our first assessment showed that Charter-Spectrum's products fell short on accessibility. We had 9 participants between the ages of 28 and 64. They all reported watching TV an average of 5 times a week, using a variety of methods. Participants were briefly interviewed about their demographic and ability-related information. They then spent 10 minutes with each platform, narrating their decisions and giving us their thoughts.
After going through all the stations, every single user was frustrated.
Through our testing, we learned a few key things:
Partial access means no access
Users prefer to use their own personal devices already customized for their needs
Users want to find their favorite content quickly and easily
Users want to find content fitting their accessibility needs quickly and easily
HTML elements must be labeled and identifiable (for example, with IDs that labels and ARIA attributes can reference) so screen readers can announce them
Access should be built into the entire customer journey
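The screen-reader finding in the list above can be spot-checked mechanically. The following is a minimal sketch, not the audit we ran: it assumes that a form control counts as "readable" when a label's for attribute points at its ID or when it carries an aria-label or aria-labelledby attribute. The class name LabelAudit, the audit function, and the sample markup are all hypothetical, and the check ignores other valid patterns such as a control wrapped inside its label.

```python
from html.parser import HTMLParser

# Form controls that normally need an accessible name.
FORM_CONTROLS = {"input", "select", "textarea"}


class LabelAudit(HTMLParser):
    """Collects form controls and the IDs that <label for="..."> points at."""

    def __init__(self):
        super().__init__()
        self.label_targets = set()  # IDs referenced by <label for="...">
        self.controls = []          # (tag, attrs dict) for each visible control

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and attrs.get("for"):
            self.label_targets.add(attrs["for"])
        elif tag in FORM_CONTROLS and attrs.get("type") != "hidden":
            self.controls.append((tag, attrs))

    def unlabeled(self):
        """Return (tag, name) for every control with no label or ARIA name."""
        missing = []
        for tag, attrs in self.controls:
            has_label = attrs.get("id") in self.label_targets
            has_aria = bool(attrs.get("aria-label") or attrs.get("aria-labelledby"))
            if not (has_label or has_aria):
                missing.append((tag, attrs.get("name")))
        return missing


def audit(html):
    """Parse an HTML string and list the controls a screen reader cannot name."""
    parser = LabelAudit()
    parser.feed(html)
    parser.close()
    return parser.unlabeled()


# Hypothetical markup: one labeled input, one unlabeled input, one ARIA-labeled select.
sample = """
<label for="search">Search shows</label>
<input id="search" name="q" type="text">
<input name="volume" type="range">
<select name="channel" aria-label="Channel"></select>
"""
problems = audit(sample)
```

Here only the volume control is flagged, because nothing gives it a name a screen reader could announce. A check like this catches missing labels early, but it does not replace testing with real screen-reader users.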
Using these principles, we began developing ideas to fix Charter-Spectrum products so people with disabilities would be able to use them independently and enjoyably.
Our main ideas were:
GUI prototype - a redesign of the HTML so content flows logically and is screen-reader compatible
Remote control prototypes - a remote that is easier to hold for people with physical disabilities such as tremors
Onboarding experience prototype - a quick and easy way to run users through the interface and introduce them to all the features
AI + crowdsourced descriptions prototype - a method to rapidly generate video descriptions so people with vision impairments can hear descriptions of, and therefore understand, visual cues in entertainment
Artificial Intelligence + Crowdsourcing
I developed the AI + Crowdsourced Descriptions prototype on my own. I created two different models (pictured above). The AI-Only Model is the ideal stretch goal, and the Crowdsource/AI Model could be implemented very quickly.
AI-Only Model: To generate video descriptions for existing programs with Artificial Intelligence, three areas of computing need to work in tandem: Image Processing/Computer Vision, Natural Language Processing (NLP), and Data Mining. Each can generate a description of visual data that can then be stored and mined for context. Mining also allows for validation checks of generated interpretations of video content. Research into these techniques is particularly active at the University of Texas at Austin and Stanford.
Unfortunately, this model is not viable yet. Prominent AI researchers like Stanford's Fei-Fei Li are only just learning how to get machines to tag actions (dancing, cooking, etc.) in images and videos. It would be totally unviable for Charter-Spectrum to implement anything like the AI-Only Model without existing APIs.
Crowdsource/AI Model: Crowdsourcing is a quick and easy way to get a large pool of data and input from a wide variety of people. Users on a crowdsourcing site (such as Mechanical Turk) can be incentivized to perform a wide variety of tasks - in this case, watching videos and writing descriptions of what they see. This crowdsourced data could then be run through an AI to make sure the descriptions are concise and free of bias or colloquialism, optimizing the TV watching experience.
IBM’s Watson has technology that makes this combination feasible. Watson’s Tone Analyzer is a precise and customizable way to screen text against preferred specifications. The crowdsourced text would be passed through the Tone Analyzer (which could be built into the backend of an app), and the first description that meets the specifications would be selected. A detailed GitHub repository exists for developers to use.
Through developing a Wizard of Oz prototype, I found immediate and rapid success with this method. IBM's Watson easily identified crowdsourced descriptions with bias, highlighting exactly what types of phrases made the Tone Analyzer notice an overt emotion in the words.
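The selection step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the prototype itself: score_bias is a hypothetical stand-in for a tone-analysis service (it simply counts words from a made-up LOADED_WORDS list and does not call Watson), the thresholds are invented, and the sample descriptions are written for this example rather than taken from the crowdsourced data.

```python
# Hypothetical word list standing in for a real tone-analysis service.
LOADED_WORDS = {"adorable", "awful", "creepy", "gorgeous", "stupid",
                "ugly", "totally", "kinda"}


def score_bias(text):
    """Stand-in scorer: fraction of words that appear in LOADED_WORDS.

    A real system would replace this with a call to a tone analyzer.
    Empty text scores 1.0 so that it is always rejected.
    """
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 1.0
    return sum(w in LOADED_WORDS for w in words) / len(words)


def select_description(descriptions, max_bias=0.0, max_words=20):
    """Return the first description that is neutral and concise, else None."""
    for text in descriptions:
        if score_bias(text) <= max_bias and len(text.split()) <= max_words:
            return text
    return None


# Made-up crowdsourced submissions for a single scene.
crowdsourced = [
    "An adorable dog totally ruins the crime scene.",
    "A dog runs under the police tape.",
    "A dog runs past.",
]
chosen = select_description(crowdsourced)
```

With these inputs the first submission is rejected for its loaded wording and the second is selected. Taking the first acceptable description keeps the check cheap; ranking every acceptable submission would be a natural extension if quality varied widely.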
Final User Testing and Results
In our second testing session, we had 11 participants with ages ranging from 30 to 66. They reported a variety of disabilities including blindness or low visual acuity, deafness or hearing impairments, cognitive issues, and mobility issues. Each user reported watching media an average of 6 times a week across all platforms. Finding content with video descriptions was difficult for participants. Many said that they usually refused to watch content without them because they missed too much when they could not follow the visual cues. The lack of video descriptions excludes a significant audience base.
Screen reader users were asked to watch an original Law and Order video clip, and then to watch it again with the video description my prototype selected (by sorting through a large sample of crowdsourced descriptions with IBM Watson). The clip focuses on a dog running under police tape and relies heavily on visual cues (i.e., actually seeing the dog). After watching the first time, users said they felt “frustrated” and “that I’m missing a lot, I wish I knew what was going on.”
Upon watching the clip a second time with the video description, users had overwhelmingly positive responses about the quality of the descriptions and made comments like “That was so awesome. I wish I could take this home,” “It’s like I’m seeing with my ears,” and “I would have liked to watch the whole thing.” When asked about the quality of the description (the content), all participants rated it highly.
The final Law and Order clip using a video description selected by my prototype can be found below.