Implementing the Web Speech API for Voice Data Entry

Video in TIB AV-Portal: Implementing the Web Speech API for Voice Data Entry

Formal Metadata

CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

We live in a world where you can schedule a meeting by talking to your watch or turn off your lights by asking Alexa as if she were your roommate. But would voice dictation work for something more intensive, like a web app used for hours of data entry? In this talk, I’ll show you how to implement the Web Speech API in a few simple steps. I’ll also walk through a case study of using the API in a production Rails app. You’ll leave with an understanding of how to implement voice dictation on the web as well as criteria to evaluate if voice is a viable solution to a given problem.
OK, so we're going to go ahead and get started. I know it's the end of the day and everyone's probably pretty tired, but I really appreciate you coming to this talk. I'm going to start out with a short poll. Can you raise your hand for me if you've ever used Siri, Google Voice, or any other type of voice dictation software? OK, great, so almost everyone in the room. Now I want you to raise your hand again if you would characterize that voice dictation software as 100 percent accurate all the time. Anyone? OK, so that's pretty much what I expected, and that was basically my impression of voice dictation as well. For example, when I was back in San Francisco in my apartment putting together my slides, I tried to get Siri to turn on the light above my desk. I had to ask about four times, and in the end she didn't even turn on the correct lights. But that being said, today we are going to talk about voice dictation, specifically with a technology called the Web Speech API.

But first, an introduction. Hi, I'm Cameron, and I build expert-use software for Stitch Fix in Ruby on Rails. You may have also heard expert-use software described as internal tools. What this means is that I build applications for other Stitch Fix employees. I'm going to be spending most of our time today talking about a project I worked on recently at Stitch Fix, so I thought it would be a good idea to give a brief overview of what the company does so that everyone is on the same page.

Stitch Fix is an online personalization company currently focused on men's and women's apparel, accessories, and footwear. The way it works is you go online, you fill out a survey with your size and fit preferences, and then we match you with a personal stylist who puts together a box of 5 items for you. We pick the 5 items from a warehouse and send them to your house, you try them on at home, and you get to keep what you like and send the rest back free of charge.

The previous slide showed a picture of a typical Stitch Fix box, also known as a Fix, and here I want to show the life cycle of one of those items in a Fix. Before the item gets to the client, there are several different steps it goes through. The very beginning involves a choice by the buyer to actually bring in the style to sell to our clients. The buyer places the order for the style to come in on a certain date, and the vendor ships the items of this style to a warehouse on that date. Next, the warehouse receives the items and puts them into inventory, and then they're available for the stylists to send out to the clients. Once the stylist picks the items to go in the client's Fix, we're back at the warehouse, and the warehouse picks the items from inventory, packs them up, and ships them to the client. Then, like I mentioned before, the client can try the items on at home, and the warehouse is going to process anything that the client returns.
We'll come back to that in a second, but now that we've talked a little bit about Stitch Fix, here's a brief overview of what we're going to cover today. First, I'm going to go through a case study featuring data entry by associates in the warehouse. Then I'll show you how you can get started with the Web Speech API to experiment with voice dictation on your own. I'll talk about some voice dictation challenges that we ran into and the solutions that we implemented, and answer the question: is voice the right solution for you?

So, jumping into the case study. Like many retail companies, Stitch Fix takes measurements of the items that we bring into inventory to eventually sell to our clients. This is a diagram of a men's long-sleeve woven shirt, and you can see the marks across the shirt. These are called points of measure: the very specific technical retail measurements that we would take on the shirt. There are actually hundreds of these measurements that can be taken, and they range from something as specific as the width of a button to as generic as the entire width of the shirt. At Stitch Fix, for any given men's shirt, we typically take about 15 to 20 measurements.

The part of the process where we take these measurements, if we go back to the life cycle of an item, is at our warehouse, when we receive the inventory from the vendor.

The way that the measurements are collected is just with a basic sewing measuring tape. Here you can see one shirt laid out flat on a table, and we're measuring across the shoulders with the measuring tape.
When I started working on this project, the goal was to build an application to start capturing the measurements that we were taking at our warehouse. The process was already in place before I started working on the project, so measurements were already being taken and collected, but the team was using Google Sheets to record them. You'll see that it's kind of a recurring theme in internal software that we're taking existing processes and making them more efficient and scalable by building software to support them, and that's exactly what we did in this project.

Throughout this project I got the opportunity to partner with my coworker on the user experience, or UX, design team at Stitch Fix, and we worked together throughout the entire project, from the user research phase to the prototyping phase and the development phase. Here's a picture from our initial user research session, where we went to the warehouse to observe the current measurement process before figuring out what type of tool we were going to build to support it. We had a couple of main takeaways from that first session. The pictures on the left and the right show some handmade props that the warehouse associates made to aid them in the measuring process, and you can see from the diagram a couple of slides ago that we really took inspiration from these props and carried that through into the application. The middle photo shows one of the warehouse associates actually taking these measurements. My main takeaway from that was that they were recording measurements on very small laptop screens, and there was a lot of hunching over and a lot of shifting of body language back and forth between measuring the garment and entering the measurement with the keyboard.

Before I talk about the rest of the process we went through to build this application, I wanted to give you some context so you can think about what we ended up building as we go through the rest of the process. So here is a quick demo of the final solution that we came up with.

[Demo: a series of measurements is dictated aloud, including "9 and 3 quarters", "8 and a half", "4 and 7/8", "2 and 3 over 4", and "16 and 1 half".]

In case you haven't figured it out already, you are at one of the JavaScript talks at RailsConf. We ended up going with voice dictation as the solution. This is a Rails app, but all of the voice dictation is built on the front end. In all honesty, though, this isn't really a talk about JavaScript, or even about voice dictation. It's a story about how to leverage the UX design process in engineering to build the best products for our users.
So how did we do that? Well, let's finish the story. After our initial user research session, we were pretty focused on the fact that the users were hunched over the small laptop and had to switch back and forth between measuring and entering the measurements with the keyboard. So we wanted to test out measuring in pairs. We asked the associates to pair up so that one of them could continuously measure and dictate the measurements aloud, and the other could type into the laptop. Our hypothesis was that if one of the associates could spend 100 percent of the time measuring, they wouldn't have to break their flow or concentration, they wouldn't have to reset their body language or their hand position on the measuring tape, and they would be able to be more efficient.

What we found from this test is that the associates kind of hated the concept of measuring in pairs. The person who was typing on the laptop felt like he was sort of waiting around and wasn't really doing much, and felt like he could be more efficient if he had grabbed another shirt and started measuring himself. But what we did notice that was promising is that the associate who got to focus completely on measuring did seem to be more efficient. She didn't have to break her flow, and she was able to continuously measure without breaking her concentration and without shifting her body language.

Because of that finding, we wanted to move forward with a voice usability study. These two screenshots show the initial prototypes that we brought to the warehouse. The one on the left is the basic keyboard entry, and the one on the right is the voice dictation prototype. They don't look that different. I just want to call out that there's not really supposed to be a difference in the UI, because this isn't so much a UI change as an input change. You can see in the voice dictation prototype there's a "click to speak" button at the top for the associates to press when they're ready to start speaking into the application, but aside from that the interfaces are pretty similar.

In this voice usability study there were three main questions that we were hoping to answer. The first one was around efficiency: would voice entry affect the overall time to measure a garment? The second question was around accuracy. For a little bit of background, our warehouses are pretty noisy environments. The associates often like to play music, they sing along to the music, and they want to talk to their friends during their shifts, so we were wondering if this would work out for voice dictation or if it would be hard to capture the user's input with all of that going on. The last one was a question around culture and workflow: how would the warehouse associates feel about voice entry? For a little bit of context, any associate who's working on measurements is usually doing it in about four-hour shifts, so about half of a work day, with breaks in between. We didn't know whether it would feel exhausting to be talking aloud for hours at a time, or whether they would prefer to be typing on a keyboard instead.

So let's take a look at the results. Here are the results around efficiency. We tested these prototypes with two warehouse associates, and you can see that participant 1 had a pretty dramatic increase in efficiency and shaved about 3 minutes off of his measurement time with voice data entry. Participant 2 also saw a bit of a lift in efficiency, but not quite as dramatic. An interesting thing here is that participant 2 was already the more experienced person doing the measurements, so he was already ridiculously fast at taking measurements, which is why he didn't have quite the increase in efficiency that the less experienced associate did. But we thought these were really promising results, especially since we knew that we would be onboarding new people to take these measurements. There seemed to be huge efficiency gains here.

The next thing that we wanted to take a look at was accuracy. We found that investing in the right headset was really the key here, and we were able to mitigate accuracy issues from the noisy environment. This is the headset that we ended up purchasing for our warehouse associates. The microphone has a pretty narrow input range, and the most important factor is that the microphone can be flipped up into the headset, and it stops recording when it's flipped up. This was important to us in terms of keeping the culture going in the warehouse: the associates can move seamlessly back and forth between measuring and singing along or talking to their friends, and they don't have to feel trapped by this voice dictation device.

The last thing that we wanted to know was how the associates would feel about voice dictation. Here are some photos. The left one shows the keyboard entry prototype and the right one shows the voice dictation prototype. This is participant number 1 in the study, and his main comment was that the voice dictation felt a lot better for his back. You can see that as well: compared with the keyboard picture, in the dictation picture he's standing up definitely straighter, not as hunched over the laptop.
And then this is participant 2, who was the more experienced and already pretty efficient at using the keyboard. His main comment was that he liked that he never had to remove his hand from the measuring tape. You can see in the photo on the left that even when he's using keyboard entry, he has kind of a one-handed approach to typing on the keyboard. Since he's more experienced at measuring, he really capitalized on the fact that if you don't have to completely reset your hand on the measuring tape each time, you can move through the measurements faster. That was his main callout: with voice data entry, he could truly use two hands to do the measurements.

Now that you've seen how we utilize voice dictation with our warehouse associates, I want to talk a little bit about how you can get started with the Web Speech API on your own.
First, here is a bit of CoffeeScript showing how to initialize the Web Speech API. The really cool thing about this API is that there is no external library or anything that you need to pull in. It's available as part of the JavaScript language if you're using the Chrome browser, so it's really just as simple as initializing webkitSpeechRecognition.

On that note, like I mentioned, this is available to use in Chrome with no external libraries, but the flip side is that it's only available in Chrome. That's why internal tools make a really good candidate for using the Web Speech API, because we can fully control our users' browsers. It probably wouldn't be the best solution for something that's customer-facing, where you have to be able to support every browser under the sun. And then below this is just a little code snippet showing that we're only initializing the speech recognition if it's available.
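A minimal sketch of the initialization and feature check described here, written in plain JavaScript rather than reproducing the code on the slide. The helper name `createRecognizer` and the `'en-US'` language tag are assumptions for illustration; `webkitSpeechRecognition` is the prefixed constructor that Chrome exposes.

```javascript
// Hypothetical helper: returns a recognizer, or null when the browser
// does not expose the speech recognition constructor.
function createRecognizer(globalScope) {
  // Chrome exposes the constructor under a vendor prefix; the unprefixed
  // name is checked first in case a browser provides it.
  const Recognition =
    globalScope.SpeechRecognition || globalScope.webkitSpeechRecognition;
  if (!Recognition) {
    return null; // caller falls back to keyboard entry
  }
  const recognition = new Recognition();
  recognition.lang = 'en-US'; // assumed language tag
  return recognition;
}

// Only initialize when the API is actually there.
const recognizer = createRecognizer(globalThis);
if (recognizer === null) {
  console.log('Speech recognition unavailable; keyboard entry only.');
}
```

Passing the global object in as a parameter is only a convenience so the check can be exercised outside a browser.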
Pretty much the only other thing that you have to do is start the recognition and record the voice results. You can also see a bit of code here in the middle where we have logic that restarts the recognition every time it ends. This is so that the associates can continuously measure: they can go through every measurement on the form without having to click a button or actively turn the voice recognition on and off. The way that they move in and out of voice dictation is by flipping the microphone up into the headset, as opposed to clicking anything on the keyboard or in the application at all. The last step is just getting the results back from the API and returning the transcript.
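The start, restart-on-end, and transcript steps could be sketched roughly as follows. This is an illustration under assumptions, not the code from the slide: `latestTranscript` and `listenContinuously` are hypothetical names, while `start()`, `onresult`, `onend`, and the `results[i][0].transcript` shape are part of the Web Speech API.

```javascript
// Pull the text of the most recent result out of a "result" event.
function latestTranscript(event) {
  const results = event.results;
  const last = results[results.length - 1]; // most recent result in the list
  return last[0].transcript.trim(); // first alternative of that result
}

// Hypothetical wiring: keep the recognizer listening by starting it again
// every time a recognition session ends.
function listenContinuously(recognition, onTranscript) {
  recognition.onresult = (event) => onTranscript(latestTranscript(event));
  recognition.onend = () => recognition.start(); // restart after each session
  recognition.start();
}

// Browser-only usage; skipped where the constructor does not exist.
if (typeof webkitSpeechRecognition !== 'undefined') {
  listenContinuously(new webkitSpeechRecognition(), (text) => console.log(text));
}
```

An unconditional restart like this never stops on its own, so a real application would probably also want a flag to end listening when the form is closed.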
So it's a pretty straightforward setup. I now want to go into some of the challenges that we ran into with voice dictation and some of the solutions that we implemented.

The first challenge was around contextual formatting. You may or may not have noticed a couple of slides ago that the results that come back from the Web Speech API come back as an array. What the API is doing is recording transcripts along the way as the user is speaking, and it actually returns snippets of speech along with the final result, which is going to be the last element in the array. Let's look at two basic examples. On the top, you can see the user starts to speak and they say "two", they continue speaking and say "two and a half", and then they finish off the sentence with "two and a half ice creams". So the API determines: OK, this person is speaking in sentence form, there's other context around it, they're talking about ice cream. It returns the words just as they said them, and there's no need for additional formatting. In the second example, the user starts out the same. They say the word "two", they continue on and say "two and a half", and then they stop speaking, which is the case for our application, where the users are only recording numbers. What the API is supposed to do here is infer from the lack of context that the user is speaking a number, and it transforms the text into the numeric version. This is pretty awesome. I thought it was one of the most fascinating things about the API, that you just get contextual formatting out of the box.

Unfortunately, it doesn't really work 100 percent of the time. I think we saw about a 50/50 success rate. What that meant for our users was that they were speaking aloud "two and a half" and seeing the words "two and a half" come onto the screen as opposed to the number. That's really confusing when they're using a measuring tape and they're supposed to be entering data in fractions, and sometimes they get a fraction and sometimes they get words. But we were able to solve this pretty easily, I think because we have such structured data. We were only expecting our users to dictate numbers, so we were able to account for that and do the contextual formatting ourselves. What we did is set up an object that is a mapping between the number words and their numeric counterparts. Every time we got a transcript back from the API, we iterated through the object and checked for matches in the keys, and if we found a match we just replaced it with the value, which is the numeric version.
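The mapping approach described here might look something like the sketch below. The contents of `WORD_TO_NUMERAL`, the function name `formatNumbers`, and the choice to drop the word "and" are assumptions for illustration; the talk only says that an object maps number words to numeric counterparts and that matches are replaced.

```javascript
// Hypothetical mapping. Multi-word phrases come first so that "one half"
// is replaced as a whole before the single word "one" is considered.
const WORD_TO_NUMERAL = {
  'three quarters': '3/4',
  'one half': '1/2',
  'a half': '1/2',
  'one quarter': '1/4',
  'a quarter': '1/4',
  one: '1', two: '2', three: '3', four: '4', five: '5',
  six: '6', seven: '7', eight: '8', nine: '9', ten: '10',
};

// Replace every whole-word match of a key with its numeric value, then
// remove the connecting "and" so the result reads like "2 1/2".
function formatNumbers(transcript) {
  let text = transcript.toLowerCase();
  for (const [words, numeral] of Object.entries(WORD_TO_NUMERAL)) {
    text = text.replace(new RegExp(`\\b${words}\\b`, 'g'), numeral);
  }
  return text.replace(/\s+and\s+/g, ' ').trim();
}

console.log(formatNumbers('two and a half'));
```

A transcript that the API has already formatted as digits passes through unchanged, which matters because the built-in formatting only works some of the time.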
In addition to contextual formatting, another challenge we ran into was dictation errors from our users, which were a little bit harder to solve. Here are two examples that we ran into. In the first one, the user dictated "thirty-five eighths", which came out as 35/8, pretty much just as expected. What the user was actually trying to say was "30 and 5/8", but they just didn't enunciate and physically pronounce the "and". So this is more a matter of training the users on how the API works and how it reports results. In this case it was recording exactly what they were saying; the user just didn't say what they meant to say. The same thing goes for the example on the bottom: "four quarters" returns 4/4, because that is four quarters, but what they actually meant to say was "4 and 1 quarter", and it just didn't come out of their mouth the way they intended. That's a little challenging because both of the results are valid fractions, so it's a little hard to catch these. Luckily, we do have front-end validation in the application that makes sure users reduce the fraction, so both of these examples will be caught as errors, but that's only because they're non-reduced fractions. You can imagine other dictation errors that produce a totally valid fraction that it would not capture. So that one's a little bit harder to plan for.

The last challenge that we ran into with voice dictation was around reliability. If you go to the MDN documentation for the Web Speech API, you'll see a notice at the top that this is an experimental technology, with a bunch of caveats about it not being backwards compatible, possible breaking changes, et cetera.
After a few weeks of using the voice dictation in production on a daily basis, we noticed some unexpected behavior. The main unexpected behavior was that the users would get through about half of a page of measurements and then the recording would just completely stop working altogether. This proved to be pretty challenging to debug, because there wasn't really a difference between those scenarios and the scenarios when it was working. By that I mean there were no errors in the JavaScript console, nothing really indicating that something was wrong, so it was pretty hard to debug and test. We initially turned to hardware as a potential problem. We thought, oh, maybe we made the wrong choice in headset, so we tested a few different headset options, even just regular earbuds, and didn't really find anything there. We also tested different laptops, Mac versus PC, to see if it could potentially be the laptop or the internal mic in the laptop, and we made sure users had up-to-date versions of Chrome, which they all did. So it's a little bit inconclusive. It's something that we're digging into more, but we're not quite sure what's causing the reliability issues right now.

The good thing is that, working with an experimental technology, we knew we had to have a fallback plan from the outset, so we never blocked the users from just typing the data into the form that you saw in the demo. That's what they're using for the most part right now as we work through some of these reliability issues. If you remember, one of the main reasons that we wanted to implement voice dictation was for the users' comfort, their positioning, and their body language. So what we ended up doing was purchasing monitors with large screens that we could stand up in the warehouse, so that the associates can clearly see the form of measurements in front of them. They don't have to hunch over, and it's a much better experience for them.

I want to call out one other challenge that we've run into that's not related to voice dictation but has more to do with users entering data in the form. You can imagine that if a user is typing on a keyboard and intends to type 10, they might slip and type an extra 0, and then we have invalid data: essentially, a measurement of 100 instead of 10. That's difficult to catch because 100 is a valid number; it's not any more invalid than 10. What we had to do was implement suggested ranges for each of our measurements. The way that we did that is, for every single point of measure, and again there are hundreds of these and they differ by the type of item that we're measuring, we added a minimum and maximum value to the database, which we were able to use to implement a front-end warning if a measurement was out of range. Here we show the orange warning, but we're never blocking the users from submitting the form, because it's certainly possible that a measurement could legitimately be out of range. We just want to catch the extreme outliers, like 104 across the shoulder, which would never happen.
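The two front-end checks described in this part of the talk, the reduced-fraction validation and the suggested-range warning, could be sketched as below. Everything here is an assumption made for illustration: the function names, the `across_shoulder` key, the example range, and the reading of "reduced" as a proper fraction in lowest terms with any whole part written separately. In the talk the ranges live in the database per point of measure.

```javascript
// Greatest common divisor, used to test for lowest terms.
function gcd(a, b) {
  while (b !== 0) {
    [a, b] = [b, a % b];
  }
  return a;
}

// Returns an error message, or null when the text is a whole number or a
// mixed number such as "30 5/8" whose fraction part is fully reduced.
function fractionError(text) {
  const trimmed = text.trim();
  const match = /^(?:(\d+)\s+)?(\d+)\/(\d+)$/.exec(trimmed);
  if (!match) {
    return /^\d+$/.test(trimmed) ? null : 'not a number or fraction';
  }
  const numerator = Number(match[2]);
  const denominator = Number(match[3]);
  if (denominator === 0) return 'denominator cannot be zero';
  if (numerator >= denominator || gcd(numerator, denominator) !== 1) {
    return 'fraction is not reduced';
  }
  return null;
}

// Hypothetical suggested ranges, in inches, keyed by point of measure.
const SUGGESTED_RANGES = {
  across_shoulder: { min: 14, max: 24 },
};

// Returns a warning string for out-of-range values, or null. It only
// warns; the caller still lets the user submit the form.
function rangeWarning(pointOfMeasure, value) {
  const range = SUGGESTED_RANGES[pointOfMeasure];
  if (!range) return null; // no suggested range recorded
  if (value < range.min || value > range.max) {
    return `${value} is outside the suggested range ${range.min}-${range.max}`;
  }
  return null;
}
```

Under these assumptions both 35/8 and 4/4 are rejected, while 30 5/8 and 4 1/4 pass. A misheard but well-formed value still slips past the fraction check, which is where a range warning can help.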
So, is voice the right solution for you? A couple of thoughts on that. There are a few things that I would consider if you want to look into voice dictation as a potential solution for your users. The first is browser control. The Web Speech API is only supported in Chrome right now, and I think because we were building this as an internal tool, that allowed us to experiment with it more easily than if we were building a customer-facing tool. I think the fact that we had structured data was also really helpful, particularly for the contextual formatting. I'm not really sure how we would solve that problem if the API was returning unexpected data and we were allowing any words to come through. The fact that we were only expecting numbers, and users were only allowed to input numbers, really helped us solve the problem quickly. And then I think it's important, since this is a pretty experimental technology, that you have a flexible user base and a fallback plan. Building trust with the users, making sure they're willing to experiment with you, making sure they understand that it might not be perfect, especially for the first few iterations, communicating that there's always a fallback plan, and making sure everyone is trained on the fallback is really key.

When I was thinking about my key takeaways and making this slide, I came to the realization that this talk has served as a bit of a postmortem on this project for me. There's a lot to learn here, but when I think about it, there are a couple of things that I'd like you to take away from this story, and those are around UX and engineering collaboration. The first one is that the UX and engineering collaboration allowed us to empathetically build expert-use software. By that I mean that usually I'm working on software that's used by people who are sitting at a desk typing on a keyboard, so this is the first time I had thought about things like body language and the users' comfort level while they were using the app. That's something that I hope to bring to a lot more of my applications and products. The collaboration also allowed us to quickly prototype early on. We were able to iterate and quickly solve problems. The couple of prototypes I showed at the beginning that we tested with users were 100 percent code, a true collaboration between UX and engineering. That was beneficial because we could get a realistic prototype out, user-test it quickly, and make iterations directly in the code, and some of that code ended up in our production version. I think it allowed us to move through the process faster, and it also allowed us to look at the problem from both the engineering and the user experience side.

I want to thank a few people for their collaboration on this project: first and foremost, my coworker on the UX design team, who was with me every step of the way during this project. Everyone else on this list was also instrumental in getting it off the ground. And with that, thank you.

[Audience question: why speech versus something else? Were there any other options?]

The only options we were really considering were speech versus traditional keyboard entry. We haven't really looked into any technology like smart measuring tapes or something like that. That's a cool idea.

[Audience question: are there limits to how long you can be recording for?]

Our impression was no, but the sort of intermittent recording that we're getting might imply otherwise, if that makes sense. It's really hard to duplicate the problem if you're just testing on your laptop, or even in the warehouse. We have noticed that it becomes more of a problem with continuous use, like hours at a time, so you're right, there could be something there, although every time you submit the form it stops recording and starts back up again when they go to a new shirt to measure. But that's a good point. It's something to look into, definitely.

The question is whether there were reasons for using the Web Speech API versus some other speech API, and whether we've played around with any others. Like I said, this is still in the very early stages of the project, and we went with the Web Speech API pretty much because it was available quickly and easily for our initial prototyping. We didn't run into the issues early on, so we were like, well, let's not fix what's not broken. But now we're at the point where we probably have to evaluate other options.

The question was: if the warehouse associates saw something that they didn't expect, maybe a number had an error, and they wanted to go back, did we implement anything to help them do that? Yes. I didn't mention that in the talk, but we did implement some keyword triggers that move the cursor around on the page. The one that was in the demo was "received", which submits the form, and we implemented "back" and "forward" as well, so that if they did make a mistake they don't have to touch the keyboard to go back and fix it.

The question is whether we implemented localization for different languages. We only have warehouses in the US right now, so that wasn't an issue, but you might have seen on the initialization slide that you can set a lang attribute. We haven't tried it yet, but I'm assuming it would work.

Yes, I'm able to share the model of headset that we're using. It's called Jabra, and I think it's a pretty common company for call centers. That's how we found it.

[Audience question: how many garments does the company measure?]

I'm not quite sure, and we're not fully ramped up since this is such an early project. We started with men's clothing because it was a little simpler to wrap our minds around how we would measure it, because men's clothing is primarily based on a box or a square. Not to make any implications, but it's a little bit tougher to measure women's clothing because of the extreme silhouette variation in different styles. So we tested this out with men's, which is why the examples were from men's, and then a small subset of women's. Right now we have a handful of associates measuring for a few hours a day, but I'm not sure how many garments. Thank you, guys.
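The keyword triggers mentioned in the Q&A could be handled by checking each transcript against a small command table before treating it as a measurement. This is only a sketch under assumptions: "back" and "forward" come from the answer above, while the "submit" keyword, the action names, and the `interpret` function are hypothetical.

```javascript
// Hypothetical command table: spoken keyword -> action for the form.
const COMMANDS = {
  back: 'previous-field',
  forward: 'next-field',
  submit: 'submit-form', // assumed keyword, not taken from the talk
};

// Classify a transcript as either a navigation command or a value to
// enter into the current field.
function interpret(transcript) {
  const spoken = transcript.trim().toLowerCase();
  if (Object.prototype.hasOwnProperty.call(COMMANDS, spoken)) {
    return { type: 'command', action: COMMANDS[spoken] };
  }
  return { type: 'value', text: transcript.trim() };
}

console.log(interpret('Back'));
console.log(interpret('2 1/2'));
```

Matching only when the whole transcript equals a keyword keeps an ordinary dictated value from being mistaken for a command.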