Week 8 Homework: Real-time object and sound detection

18 October, 2019




GPUs are becoming smaller and more powerful. That means we can now pack a small device with enough processing power to achieve more than it ever could before. This is very useful for applications that need to process real-time information, such as what you view through AR glasses. This advancement in hardware is pushing toward more lightweight and easier-to-use AR glasses.

Our professor shared a video on real-time object detection with us. You can watch it here. The video demonstrates what real-time object detection would look like. Frankly, while this implementation does a good job of detecting objects, the result is too cluttered. My view was crowded by the bounding boxes and tags drawn over the objects, which made me focus on them rather than on what was going on in the scene and the location.

In the first image below you see a street crowded with vehicles passing by. Having to watch every vehicle classified as a car, bus, or truck just takes away from the scene. A person viewing this through AR glasses would have the street view obstructed by the boxes and tags and would not be able to focus on, or possibly move around safely in, the area. Similarly, in the second image all the bowls and people are highlighted. For an adult, tagging these basic objects is not very useful and could be annoying: a bad user experience. However, it might be useful for teaching very young children. It would be a great way for them to identify objects in the world around them, learn their names, and hear their pronunciation so they learn better.




Screenshots from video depicting object detection

In one of the previous lectures, the professor covered a similar style of augmentation: Google's beta feature that augments navigation directions and street signs onto the camera view as an addition to Google Maps. Such real-time augmentations are relevant and useful in this domain. However, in that video the huge signs blocked the view of other objects, and while Google kept showing reminders to put the phone away, this points to the importance of knowing what to augment and how: not only the scale, but also how many objects we augment and how much they take away from the real world.

As discussed above, while the technology can be very useful, a few improvements are needed to make it more useful than annoying. The most important thing lacking in the implementations discussed above is visualization. We need to develop AR in a way that doesn't block what is in the real world or hinder our ability to move around it safely. For that, we should reduce the number of common, everyday objects that get tagged, such as 'person'. We could experiment with other ways to highlight detected objects rather than blocking the view with boxes and tags, and we could limit the number of detections shown at a time. For example, if only four objects are shown at a time, we could tap our glasses to see the next four, or choose to see all of them; but showing all detections in a crowded room takes away from the scene and could raise safety concerns. Alternatively, we could let the user select, by gesture, voice, or some other means, the specific object they want detected and want more information on.
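
To make the "limit and page through detections" idea concrete, here is a minimal sketch in Python. It assumes detections arrive as (label, confidence, box) records; the COMMON_CLASSES set, the per-page limit of four, and the function names are my own illustrations, not anything shown in the video.

```python
from typing import List, NamedTuple, Tuple

class Detection(NamedTuple):
    label: str                       # e.g. "car", "person"
    confidence: float                # 0.0 - 1.0
    box: Tuple[int, int, int, int]   # x, y, width, height in pixels

# Hypothetical list of everyday classes the wearer probably doesn't need tagged.
COMMON_CLASSES = {"person", "car", "bus", "truck", "bowl", "chair"}

def filter_detections(detections: List[Detection],
                      min_confidence: float = 0.5) -> List[Detection]:
    """Drop common everyday classes and low-confidence detections."""
    return [d for d in detections
            if d.label not in COMMON_CLASSES and d.confidence >= min_confidence]

def page_detections(detections: List[Detection],
                    page: int, per_page: int = 4) -> List[Detection]:
    """Show at most `per_page` tags at once; a tap on the glasses advances `page`."""
    ranked = sorted(detections, key=lambda d: d.confidence, reverse=True)
    start = page * per_page
    return ranked[start:start + per_page]
```

A tap handler on the glasses would then simply increment page and re-render, or switch to a "show all" mode when the wearer explicitly asks for it.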

Placement of these tags is very important as well: they should not obstruct the view of other objects in the real world. Doing so might hide essential information from the user's view and raise safety concerns. For example, a text label or tag should never cover a traffic signal in a way that makes it impossible to see whether the light is green or red.
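
As a rough sketch of that placement rule, a renderer could nudge a tag until it no longer covers a safety-critical region such as a detected traffic signal. The rectangle format and the place_label helper below are assumptions for illustration, not an existing API.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height

def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_label(label_box: Rect, critical_regions: List[Rect],
                step: int = 10, max_tries: int = 20) -> Rect:
    """Shift a tag upward until it no longer covers any critical region
    (e.g. a detected traffic signal)."""
    x, y, w, h = label_box
    for _ in range(max_tries):
        candidate = (x, y, w, h)
        if not any(overlaps(candidate, region) for region in critical_regions):
            return candidate
        y -= step                     # move the tag up and try again
    return label_box                  # caller may choose to hide the tag instead
```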




Screenshots from video depicting object detection

It would be more useful to tag objects with specific names rather than high-level categories. Most people already recognize everyday objects such as cars, people, bowls, dogs, and cats, and do not need them detected. For example, if a user points at a dog to see a detection, it is more likely that the user wants to know the breed of the dog.
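
One way this could work is a two-stage pipeline: a generic detector supplies the coarse class and box, and a fine-grained classifier runs only on the crop the user points at. The sketch below uses hypothetical detector and classifier callables; the video gives no detail about its actual models.

```python
from typing import Callable, Tuple
import numpy as np

# Placeholders for whatever detector/classifier models the glasses would run.
CoarseDetector = Callable[[np.ndarray], Tuple[str, Tuple[int, int, int, int]]]
FineClassifier = Callable[[np.ndarray], str]

def describe_pointed_object(frame: np.ndarray,
                            point: Tuple[int, int],
                            detect: CoarseDetector,
                            classify_fine: FineClassifier) -> str:
    """Return a fine-grained label (e.g. 'Beagle') for the object the user
    points at, falling back to the coarse label (e.g. 'dog')."""
    label, (x, y, w, h) = detect(frame)
    px, py = point
    if not (x <= px <= x + w and y <= py <= y + h):
        return label                  # the pointer isn't on the detected object
    crop = frame[y:y + h, x:x + w]    # classify only the cropped region
    return classify_fine(crop) or label
```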

We've had sound-recognition apps such as Shazam and SoundHound for a long time now. These apps listen to what is playing and tell the user the name of the song, the artist, and other details, along with an option to play the song. They are very simple in design: you hold the button in the app to make it listen, and it quickly provides the information. This would be very useful and easily accessible with AR glasses and would make a great addition. Often, while walking through university lounges or malls, I hear the end of a song, and by the time I get my phone out and open the app, the song has finished playing and is forever lost to me. Tapping my AR glasses to immediately capture the sound would ensure users don't miss it. For travellers in airports and train stations, it would also be easier to keep track of announcements if the glasses detected them and could replay one you didn't hear. We could even set our glasses to detect sounds related to a query and save or replay that information.
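
The replay idea essentially requires the glasses to keep a rolling buffer of the most recent audio, so a tap can play it back or send a snippet to a recognition service. Below is a minimal sketch; the 30-second window and 16 kHz sample rate are arbitrary assumptions, and the class is my own illustration.

```python
from collections import deque
import numpy as np

SAMPLE_RATE = 16_000      # assumed microphone sample rate in Hz
BUFFER_SECONDS = 30       # keep roughly the last 30 seconds of audio

class RollingAudioBuffer:
    """Keeps only the most recent audio so a tap can replay or identify it."""

    def __init__(self, seconds: int = BUFFER_SECONDS, rate: int = SAMPLE_RATE):
        self.rate = rate
        self.samples = deque(maxlen=seconds * rate)

    def push(self, chunk: np.ndarray) -> None:
        """Called for every incoming microphone chunk."""
        self.samples.extend(chunk.tolist())

    def last(self, seconds: int) -> np.ndarray:
        """Return the last `seconds` of audio, e.g. to replay an announcement
        or to send a snippet to a song-recognition service."""
        n = seconds * self.rate
        return np.array(list(self.samples)[-n:], dtype=np.float32)
```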

Users should still be given a choice in how they use these features and should have control of the following (a rough settings sketch follows this list):
  1. Which objects are detected: Since not every object needs to be detected or named by AR glasses, users should be able to control which ones actually get tagged.
    One way is to let each user choose which kinds of objects they would like detected, perhaps using pre-defined categories. The ability to detect unusual objects in a particular scene would be useful as well.
  2. Turning detection on/off: Users should be able to switch object and sound detection on or off easily.
  3. Visualization: Users should be able to choose how an object's name is shown to them: whether the object is highlighted by a box, how the text is displayed, the size of the augmented element, and what it must never overlay.
  4. Sounds: Customize the volume, tone, and sounds that will be available, and turn these alerts off for certain events or times.
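
These controls could be represented as a simple settings object stored on the glasses. The field names and default values below are hypothetical; they only mirror the four points above.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

@dataclass
class DetectionPreferences:
    # 1. Which objects are detected
    enabled_categories: Set[str] = field(default_factory=lambda: {"landmark", "signage"})
    ignore_everyday_objects: bool = True

    # 2. Turning detection on/off
    object_detection_on: bool = True
    sound_detection_on: bool = False

    # 3. Visualization
    show_bounding_boxes: bool = False          # outline or underline instead of boxes
    label_font_scale: float = 0.8
    never_overlay: Set[str] = field(default_factory=lambda: {"traffic light", "face"})

    # 4. Sounds
    alert_volume: float = 0.5
    quiet_hours: Tuple[int, int] = (22, 7)     # mute alerts from 22:00 to 07:00
```
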
Detecting and classifying unknown objects and sounds in real time is a very useful feature, especially for people with poor vision and for people travelling to a country very different from any they have been to. This could be applied to everyday objects, important buildings or landmarks, food items in a grocery store, etc. A very convenient feature would be to point our AR glasses at a particular object and have them detect only that object and provide details about it. For example, pointing at food items in an unfamiliar country could help a user decide what to buy, or scanning packaged foods could show their ingredients. We could also set our glasses to scan the scene and highlight only a specific object we're looking for, for example, searching a crowded room for a specific person or a grocery aisle for noodles. This would speed up shopping, though only when we know exactly what we are looking for.
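
The "highlight only what I'm searching for" idea could be as simple as filtering the frame's detections against the wearer's spoken or typed query. The sketch below assumes the same (label, confidence, box) shape as the earlier example, and the sample data is made up.

```python
from typing import List, Tuple

# (label, confidence, (x, y, w, h)) -- same shape assumed in the earlier sketch
Detection = Tuple[str, float, Tuple[int, int, int, int]]

def search_scene(detections: List[Detection], query: str) -> List[Detection]:
    """Keep only detections matching the wearer's query; everything else
    stays un-highlighted."""
    q = query.strip().lower()
    return [d for d in detections if q in d[0].lower()]

# Made-up example: highlight only noodle packets among this frame's detections.
frame = [("instant noodles", 0.91, (120, 40, 60, 80)),
         ("rice", 0.88, (300, 42, 55, 75)),
         ("person", 0.97, (10, 10, 90, 200))]
print(search_scene(frame, "noodles"))          # -> only the noodle detection
```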

On the downside, this kind of detection can be used to target places or certain kinds of people, and we need to consider the implications of enabling such features. There is also the possibility of malicious attacks that change the alerts, text, or objects augmented on a person's AR glasses or phone. While we work on making this technology more useful, we also need to weigh its legal, ethical, moral, and safety implications.