Building a Voice User Interface
Sai Kambampati - August 31, 2021
One of the core pillars of Airgift is that I want to make it easier than ever to create a 3D experience. With just a few taps, users should be able to create an immersive experience. If it takes only a few seconds to tweet some text or post an image, why shouldn't the same apply to 3D and AR experiences? Unfortunately, this isn't an easy task that can be fixed overnight. In fact, I believe this is the last mountain we have to climb in online content creation: 3D!
Our mission is to advance authentic human connection by using augmented reality to spark creative, immersive and playful experiences.
I like to predict the future by looking at the past. Previously, we had to edit images in Photoshop; today, all we need is a few filters to swipe through on Instagram or a few sliders to toggle in any photo editing app. Or let's take video! From iMovie to TikTok, anyone with a phone is now a professional videographer. History points to what the future holds: creator tools keep pushing forward to be more mobile and accessible than ever before. This doesn't mean that professional, industry tools will go out of business. Far from it! Photoshop and Final Cut are as prevalent as ever. But in the future I see, content creation has to be mobile and accessible to the masses.
📣 Our Response
So how are we approaching this? Well, there are two ways we're aiming to make 3D content creation easier.
Templates are a natural response that stemmed from the problem. If 3D content creation is tough, then let's automate a lot of the process. AR allows pre-existing content (text, images, audio, and video) to be shared in more immersive ways, so I've been designing and building templates around making these types of experiences. When Airgift first releases, the number of templates will be very limited, but over time the library will grow and improve. Let's look at two templates coming to Airgift.
🎙 Voice User Interface
While templates are great, as you can tell, they can be pretty limiting. So our other option is to jump to the other end of the spectrum and let users design a 3D experience from scratch with all the raw tools right on a mobile device. But that leads to a big problem: this would be no different than the default desktop tools we've been using for so long, only with a smaller screen and a much worse UX. Our compromise: users can still design a 3D/AR scene from scratch, but the tools they'll have access to will be very limited. This causes another problem, though. If a user has to tap a lot of buttons and move a lot of sliders, that defeats our goal of making AR content creation easy to use.

Then came the idea of a voice user interface. It makes perfect sense! We want to create virtual objects that we can place and interact with in our world. Rather than having to shuffle our phone around as we add models to a scene, what if we could control the whole world we dare to build with our voice? This would provide a much better UX when starting off on a 3D scene. Take a look at some of the commands below. They would let users quickly and easily create a 3D scene and interact with it without having to worry about terminology like positioning and scaling. All they need to do is see how it looks in their environment.
"Add a pyramid at origin"
"Change the color to a shiny orange"
"Spin it around a sphere"
"Make it look like the Earth"
The goal of this is also to make it easy to get started building a scene. Once the raw primitives and textures are in, users can switch back to manual controls to fine-tune their Airgift. But the problem with many VUIs hasn't been the speech recognition; it's the textual analysis that happens once the speech has been transcribed. A VUI works great when you give it very clear instructions like the ones below.
"Insert a cube at point 0, 0, 0"
"Change the color of the cube to red"
"Rotate the cube around the y-axis"
As humans, though, we want our perfect VUI to understand natural language. This is the biggest hindrance for many VUIs. Taking the above example, we should be able to say the following and get the same results.
"Add a box"
"Make it red"
"Spin it in place"
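The explicit commands and the natural phrasings above should both resolve to the same underlying scene operations. As a minimal sketch of that idea, here's what a tiny intermediate representation could look like in Python; all of the names and the structure here are hypothetical, not Airgift's actual syntax:

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Toy scene graph a VUI could target."""
    objects: list = field(default_factory=list)
    last: object = None  # "it" resolves to the last-touched object

    def add(self, shape, position=(0, 0, 0)):
        # "Insert a cube at point 0, 0, 0" and "Add a box"
        # both land here with the same defaults.
        obj = {"shape": shape, "position": position,
               "color": "white", "spin": None}
        self.objects.append(obj)
        self.last = obj
        return obj

    def set_color(self, color):
        # "Change the color of the cube to red" / "Make it red"
        self.last["color"] = color

    def spin(self, axis="y"):
        # "Rotate the cube around the y-axis" / "Spin it in place"
        self.last["spin"] = axis

scene = Scene()
scene.add("cube")       # "Add a box"
scene.set_color("red")  # "Make it red"
scene.spin("y")         # "Spin it in place"
```

The point of the sketch: once every phrasing funnels into a small, fixed set of operations, the hard part is no longer rendering but translating natural language into those operations.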
🤖 Enter GPT-3
I'll preface this by saying my experience with GPT-3 is limited to about two weeks of hacking. If you're not familiar, GPT-3 is a deep learning model by OpenAI that can understand and produce text eerily similar to that of a human. The basic operation of GPT-3 is to generate text based on an input prompt. It can generate an answer to your question or write an essay based on a title. As the world runs on language, the applications for GPT-3 stretch far and wide. Most recently, OpenAI released its Codex engines, which developers can access through the API. Codex is a more specialized version of GPT-3 that can translate natural human language into computer code. This made it the perfect way to provide the brains for our VUI. Taking the commands a human would give to Airgift, we were able to pass them on to Codex and have it generate a code-like syntax that makes it easy for us to render a 3D scene. The commands from the previous section were tried with this technology and it worked like a charm. You can take a look at the demo below.
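To make the pipeline concrete, here's a rough sketch of how voice commands could be handed to Codex. The few-shot examples, the scene syntax, and the prompt format are all my illustrative assumptions, not Airgift's actual implementation; the commented-out call follows OpenAI's 2021-era Completions API:

```python
# Few-shot examples prime the model to emit a restricted scene syntax
# instead of free-form code. These operations are hypothetical.
FEW_SHOT = """\
# Translate voice commands into scene operations.
Command: Insert a cube at point 0, 0, 0
Code: add(shape="cube", position=(0, 0, 0))
Command: Change the color of the cube to red
Code: set_color("red")
Command: Rotate the cube around the y-axis
Code: spin(axis="y")
"""

def build_prompt(command: str) -> str:
    # Append the new command and leave "Code:" dangling so the model
    # completes it with a scene operation.
    return FEW_SHOT + f"Command: {command}\nCode:"

# The actual call would look roughly like this (requires an API key):
# import openai
# completion = openai.Completion.create(
#     engine="davinci-codex",
#     prompt=build_prompt("Spin it in place"),
#     max_tokens=32,
#     temperature=0,  # reduce run-to-run variation
#     stop=["\n"],
# )
# code = completion.choices[0].text.strip()

prompt = build_prompt("Make it look like the Earth")
```

The generated one-liner can then be matched against the allowed operations and applied to the scene, rather than being executed directly.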
The above demo shows an early, incomplete version of what we call the Creator View in Airgift. But as with most of my Airgift designs, I want to talk through this UI layout. It'll give you (and me, honestly) some insight into the design and layout of each screen.
As with any AR app, the primary focus should be the camera view. Other 2D elements like buttons and text should be as unobtrusive as possible. This leading principle led to the following design choices.
The Recording Button: With the above principle in mind, this button sits all the way in the bottom left corner. To actually record from the device's microphone, the user has to press and hold the button the entire time. This is privacy by design: users know exactly when the device is listening to them and performing speech recognition on their voice.
The Labels: Right next to that, we have a couple of labels that serve various purposes. Whether it's to tell users what action to take, display the query that was recognized, or report what's happening to their query, these labels make sure that users are aware of what the app is doing. AI can be finicky at times, and it's important to keep users informed.
More buttons: While the above demo doesn't have the full scene editing capabilities yet, it's important to note that there will be another row of buttons above the recording button. This lets users quickly switch back to manual editing at any time.
So what's possible with a GPT-3 powered VUI? A lot, actually! The demo above showed simple position and scaling transformations, along with adding primitives and changing their simple materials. But there's actually so much more I'm exploring: animations, textures, the addition of full 3D models. The hardest part, though, is making GPT-3 work consistently. Since its output isn't deterministic, we can't expect today's response to be the same tomorrow, even for the same input. I hope this sparked your curiosity in Airgift and the future of AR. If you're interested in learning more about Airgift or following our progress, reach out on Twitter or Email.
Until next time, Sai