VIZWIZ GRAND CHALLENGE WORKSHOP AT CVPR 2022

Daniela Massiceti, Microsoft, dmassiceti@microsoft.com

Samreen Anjum, University of Colorado Boulder, samreen.anjum@colorado.edu

Danna Gurari, University of Colorado Boulder, danna.gurari@colorado.edu

Abstract

Our goal is to educate a broader population about the technological needs and interests of people with vision impairments while encouraging artificial intelligence (AI) researchers to develop new algorithms that can help eliminate accessibility barriers. Towards this goal, we organised the VizWiz Grand Challenge Workshop at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022). The workshop's scope included charting and celebrating progress on accessibility-related AI challenges as well as engaging invited speakers and stakeholders to discuss challenges and opportunities related to designing next-generation assistive technologies. A total of 72 teams participated in our three AI challenges, and the winners received awards sponsored by Microsoft. The facilitated discussions highlighted insights from ten invited speakers whose expertise spanned cutting-edge computer vision research, the development of industry products and services for assisting people with vision impairments, and the perspectives of people with vision impairments who use visual assistance technologies. Finally, nine teams that submitted extended abstracts related to the AI challenges and assistive technologies for people with visual impairments gave spotlight and poster presentations about their research. Links to the content shared at the event can be found on the VizWiz workshop website.

Introduction

A common goal in computer vision research is to build machines that can replicate the human vision system, for example to recognize an object or scene category, locate an object, or read text. Part of the motivation for the computer vision community is to design methods that empower people with vision impairments to independently overcome their daily visual challenges. Tasks that have already been motivated by and embraced for assistive technology applications include optical character recognition (OCR) [1], object recognition [2], and image description [3]. In this article, we describe our work to elevate the computer vision community’s focus on designing methods that improve upon the status quo for people with vision impairments.

As important context, the primary approach for accelerating progress on a problem in the computer vision community is to launch a large-scale, publicly shared dataset challenge associated with a workshop where winners are announced [4, 5]. While many such dataset challenges have emerged, they often focus on contrived data (e.g., [6, 7, 8, 9]); for example, images are typically scraped from the Internet or artificially constructed.

Towards promoting a community that develops computer vision methods directly addressing the interests of people with vision impairments, we built the first dataset challenges originating from this population. In particular, our challenges are based on over 40,000 images and 3,822 videos that blind people captured using their mobile phones. Unlike mainstream datasets, data from this population manifests distinct challenges for computer vision methods, such as poor lighting, focus, and framing of the content of interest.

Towards cultivating a community around our dataset challenges, we organised the VizWiz Grand Challenge workshop at CVPR 2022 this June, building on three previous instalments in 2018, 2020, and 2021. The event brought together a multi-disciplinary community passionate about innovating in computer vision to improve assistive experiences for people with vision impairments. The workshop's scope included charting and celebrating progress on our accessibility-related AI challenges as well as engaging invited speakers and stakeholders to discuss challenges and opportunities for designing next-generation assistive technologies.

Workshop Content

The workshop consisted of three parts: (1) overviews of the dataset challenge competitions and talks from the winners, (2) panel discussions with our invited speakers, and (3) lightning talks and a poster session spotlighting the work from teams who submitted extended abstracts.

The workshop was organised as a full-day, hybrid event. The first part was held in person: it began with the challenge overviews and winner talks and ended with the lightning talks and poster session about the accepted abstracts. The second part was hosted virtually and included six live-streamed panel discussions that engaged nine invited speakers about their experiences with today’s state-of-the-art assistive technologies and their ideas for designing next-generation tools. The speakers included computer vision researchers, industry assistive technology developers, and blind technology advocates. A live Q&A session ran alongside the panel sessions so that the audience could ask the speakers questions. Approximately 25 participants attended the workshop, both in person and virtually.

Dataset Challenges

Figure 1 – An overview of the three dataset challenges showing examples of the data collected by people who are blind/low-vision. The examples of the visual question answering challenge show images and questions collected by blind people and the corresponding answers agreed upon by crowdworkers. The examples of the answer grounding challenge depict the regions in those images that were used to arrive at the answers. The examples of the few-shot object recognition challenge show stills from the short videos taken by blind people to recognise their specific personal objects.

As noted above, a key component of the event was to track progress on new dataset challenges, each of which entailed describing images and videos taken by people who are blind. This year, the workshop featured three dataset challenges, which we briefly summarise below:

The Visual Question Answering (VQA) Challenge [10], the workshop’s founding challenge, invited teams to develop a model that can automatically answer a question about an image captured by a person who is blind/low-vision on their mobile phone. This year, 44 teams participated in the challenge, with the top two winning submissions belonging to the teams from HSSLAB Inspur and Xidian University. The winning team’s solution combined the BLIP model with an answer grounding algorithm to predict answers, improving accuracy by 6 percentage points over the previous year’s winning submission.
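To make the VQA task concrete, below is a minimal sketch of how one might query an off-the-shelf vision-language model such as BLIP on a single image/question pair using the Hugging Face transformers library. The checkpoint name and file path are illustrative placeholders, and this sketch is not the winning team's pipeline, which additionally incorporated answer grounding.

```python
# Minimal sketch: answering a visual question with an off-the-shelf BLIP model.
# The checkpoint name and image path are illustrative; winning challenge entries
# used more elaborate pipelines (e.g., BLIP combined with answer grounding).
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("photo_from_blind_user.jpg").convert("RGB")  # hypothetical path
question = "What is the expiration date on this carton?"

inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```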

The Answer Grounding for VQA Challenge [11], newly introduced this year, builds on the VQA Challenge by inviting teams to build algorithms that can additionally locate where in an image a VQA model ‘looks’ when answering a question about that image. Of the 16 participating teams, the team from ByteDance and Tianjin University secured first place and the team from HSSLAB Inspur took second place.
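Submissions are evaluated by how well the predicted grounding region overlaps with human-annotated regions. As a rough illustration of this kind of region-overlap scoring, the snippet below computes intersection-over-union (IoU) between two binary masks; the mask shapes and values are toy placeholders rather than challenge data.

```python
# Illustrative only: comparing a predicted answer-grounding mask to a
# ground-truth mask with intersection-over-union (IoU), the kind of
# region-overlap measure used to score grounding quality.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape (1 = grounded region)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / union if union > 0 else 1.0

# Toy example: two 4x4 masks that overlap on two pixels.
pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4), dtype=np.uint8); gt[2:4, 1:3] = 1
print(mask_iou(pred, gt))  # 0.333...
```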

The ORBIT Few-Shot Object Recognition Challenge [12], also newly introduced this year, invited teams to build a teachable object recogniser using the ORBIT dataset. Unlike a generic object recogniser, a teachable object recogniser can be ‘taught’ by a user to recognise their specific personal objects from just a few short clips of those objects. Of the 12 submissions, the two winning teams were based at HUAWEI Canada and the Australian National University. Their solutions improved recognition accuracy on users’ personal objects by 8-10 percentage points.
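As a rough illustration of this personalised, few-shot setting, the sketch below follows a simple prototype-based recipe (in the spirit of prototypical networks, a common few-shot baseline): each personal object is represented by the average embedding of frames from the user's 'teaching' clips, and a query clip is labelled with its nearest prototype. The embedding network and function names are placeholders, not the winning teams' methods.

```python
# A minimal prototype-style sketch of a teachable object recogniser. `embed`
# stands in for an arbitrary frame-embedding network; this is an illustration,
# not any winning team's method.
from typing import Dict, List, Tuple
import torch

def build_prototypes(support_frames: Dict[str, torch.Tensor],
                     embed: torch.nn.Module) -> Tuple[torch.Tensor, List[str]]:
    """support_frames maps each object label to a (num_frames, C, H, W) tensor of
    frames sampled from that user's short 'teaching' clips."""
    labels, prototypes = [], []
    for label, frames in support_frames.items():
        with torch.no_grad():
            prototypes.append(embed(frames).mean(dim=0))  # average the frame embeddings
        labels.append(label)
    return torch.stack(prototypes), labels

def classify_clip(query_frames: torch.Tensor, prototypes: torch.Tensor,
                  labels: List[str], embed: torch.nn.Module) -> str:
    """Classify a query clip by the prototype nearest to its mean frame embedding."""
    with torch.no_grad():
        query = embed(query_frames).mean(dim=0)
    distances = torch.cdist(query.unsqueeze(0), prototypes)  # (1, num_objects)
    return labels[int(distances.argmin())]
```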

Winning teams from all three challenges presented their approaches at the workshop in the form of 5-minute video presentations, and all winning teams were awarded financial prizes sponsored by Microsoft. We would like to highlight that the dataset challenges remain open to anyone interested in participating, and new winners will be announced at next year’s workshop.

Panel Discussions

The panel discussions ranged more broadly across the current state of computer vision research and the broader assistive technology ecosystem. Perspectives were drawn from each group of experts (computer vision researchers, assistive technology advocates who are blind, and industry specialists) as well as from interdisciplinary discussions amongst all three.

The computer vision panel featured speakers who conduct relevant research in the computer vision community.

The panel with blind technology advocates consisted of speakers who serve as leaders advocating for people with vision impairments in their organisations and who are themselves blind.

The panel with industry specialists included speakers who are developing today’s state-of-the-art assistive technologies for people with vision impairments.

In the computer vision panel, experts highlighted some of the key technical challenges, including how to deal with the growing size of models on resource-constrained devices like mobile phones. For example, Howard talked about the trade-off between system accuracy and resource usage and shared best practices for designing efficient systems. He also suggested using the end application to guide these design decisions (e.g., choosing an appropriate image resolution). Rohrbach added that efficiency could also be addressed by building models that solve multiple problems rather than just one. The panelists also discussed how to ensure that these models are safe to use, for example by abstaining from answering a question rather than answering incorrectly. Panelists then shared technical details of how their systems work and where they fail; for example, Coughlan described challenges with deploying navigation systems on smartphones, such as accurately identifying landmarks using the phone’s camera. Finally, the panelists highlighted interesting turning points within the vision community in the last decade (e.g., deep learning) and their predictions for how technology will evolve in the next decade (e.g., augmented reality, personalised assistants).

The panel of blind technology advocates started with individuals sharing how they navigated their lives with vision challenges and became instrumental figures in advocating for change for people with vision impairments. This included vignettes of how they use assistive technologies to overcome daily vision-based barriers (e.g., screen readers and apps such as Aira) and the specific features they find useful. Panelists also described strategies they use to capture images on their smartphones when soliciting visual descriptions from apps such as Seeing AI. Enyart mentioned her desire for guidance while taking pictures of her children, such as an alert if all eyes are open and everyone is smiling. Agreeing with this, Christopherson added that current image description applications should be able to correct image quality issues, such as lighting and framing. Another important theme that emerged in the panel was privacy concerns around the datasets used to build products: Kish shared his hesitance to use image description apps because of concerns about how his data will be used. Panelists also described scenarios in which they prefer systems that employ AI versus remote sighted assistants, and shared their hopes for future computer vision technologies. Finally, a recurring theme highlighted by all three panelists was the importance of understanding that one blind person cannot speak for all.

The experts on the industry panel began by sharing how they came to their respective roles building accessibility products and services; in particular, they shared a common interest in helping people understand and learn about their visual surroundings. Panelists also discussed the challenges of deploying accessibility products and services in practice. Anne noted that recently graduated engineers often have not been exposed to accessibility development, so it becomes a skillset they must learn on the job; she argued that accessibility skills need to be more “entrenched in the education systems”. Kannan and Shaikh highlighted the need for more datasets, collected from people who are blind with appropriate privacy requirements, to in turn make products more inclusive. Butler added that ensuring privacy and addressing ethical concerns with these datasets remains a challenge. The panelists also gave their perspectives on the pros and cons of solutions that rely solely on humans versus AI, described how end users are always part of their design process, and offered advice to individuals interested in joining accessibility teams in industry.

Across the three interdisciplinary panels, we heard overall excitement about the past decade of progress in accessibility. This included improved development frameworks for building accessible products as well as better products themselves from companies such as Apple, Microsoft, and Be My Eyes. Panelists also described what they envision for the future of technology, with near-unanimous agreement that we will have wearable technology companions that personalise information to each user's interests. Panelists also noted that a key remaining challenge in developing future technologies is designing datasets representative of people with disabilities.

Posters and Extended Abstract Presentations

To further promote discussion around the latest research and applications, we invited submissions of posters and extended abstracts on topics related to image captioning, visual question answering, visual grounding, and assistive technologies for people with visual impairments. Nine submissions were accepted, and the authors presented their work through lightning talks followed by an open poster session.

Dissemination

To widely disseminate the knowledge that emerged from this workshop, we have publicly shared all materials on the workshop website. The website provides details about the event, challenges, speakers, and organisers. It also includes recorded video presentations from the top winning teams of each dataset challenge as well as presentations of the extended abstract submissions. Finally, all six panel discussions with the invited speakers were recorded and are available in video and audio formats on YouTube and podcasting platforms; links are provided on the workshop website.

Closing Remarks

Through the dataset challenges and insights from the panel discussions, the workshop highlighted some of the key progress that computer vision has made towards technologies for people who are blind or have low vision. We hope to inspire the research community to look ahead, imagine what the next decade of technology might look like, and take on the challenges that lie along the path to realising it.

Acknowledgements

We are grateful to SIGACCESS and Microsoft for sponsoring our workshop. We would like to thank our challenge participants for their contributions to making progress on the challenges, our panelists for sharing their perspectives and experiences, and the EvalAI team for providing an online platform to host our challenges. Finally, we thank all the participants, both in-person and virtual, for attending and making the workshop a success.

References

  1. C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014.
  2. http://taptapseeapp.com/
  3. H. MacLeod, C. L. Bennett, M. R. Morris, and E. Cutrell. Understanding blind people’s experiences with computer-generated captions of social media images. In ACM Conference on Human Factors in Computing Systems (CHI), pages 5988–5999, 2017.
  4. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.
  5. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  6. S. Antol et al. VQA: Visual Question Answering. In IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
  7. J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997, 2017.
  8. J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 39–48, 2016.
  9. H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Advances in Neural Information Processing Systems (NeurIPS), pages 2296–2304, 2015.
  10. D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz Grand Challenge: Answering Visual Questions from Blind People. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3608–3617, 2018.
  11. C. Chen, S. Anjum, and D. Gurari. Grounding Answers for Visual Questions Asked by Visually Impaired People. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 19098–19107, 2022.
  12. D. Massiceti, L. Zintgraf, J. Bronskill, L. Theodorou, M. T. Harris, E. Cutrell, C. Morrison, K. Hofmann, and S. Stumpf. ORBIT: A real-world few-shot dataset for teachable object recognition. arXiv preprint arXiv:2104.03841, 2021.
  13. https://visioneers.org/who-we-are/

About the Authors

Daniela Massiceti is a senior machine learning (ML) researcher in the Teachable AI Experiences team at Microsoft Research Cambridge, UK, where she works at the intersection of ML and human-computer interaction. She is primarily interested in ML systems that learn and evolve with human input, so-called “teachable” systems.

Samreen Anjum is a PhD student in the Computer Science Department at University of Colorado Boulder. Her research interests are primarily focused on computer vision and its applications in the fields of biomedical sciences and assistive technologies. She is also interested in solutions that enable collaborations between humans and machines by leveraging their individual strengths.

Danna Gurari is an Assistant Professor in the Computer Science Department at University of Colorado Boulder. Her group focuses on creating computing systems that enable and accelerate the analysis of visual information. Her research interests span computer vision, machine learning, human computation, crowdsourcing, human computer interaction, accessibility, and (bio)medical image analysis.