Meet Yandex.Station Mini: the big story of a small device

We have just introduced our new device, Yandex.Station Mini. It is a compact smart speaker that can play music, control a smart home, set reminders and much more. It is also the first speaker with Alice that can be controlled with gestures.



Today we will tell Habr readers several stories about how the Station Mini was created, from optical calibration and UX testing to non-obvious details of working with power supplies. You will also learn what a theremin is and what it has to do with a Yandex device.







But first, a short flashback.



Last year, we wrote on Habr about the development of the "big" Yandex.Station (and about the Yandex.IO platform, which we and our partners use). It is our flagship device with Alice, designed to sit in the center of a large room next to the TV. It has powerful 50-watt sound, three active speakers covering a wide frequency range, seven microphones working together as a single array, and even an HDMI output.



We have not stood still over the past year. Alice's voice has become more and more natural. She learned to pronounce many homographs correctly, that is, to place the stress according to context in words that are spelled the same but have different meanings. Her hearing improved too: we recently described how we taught Alice not to respond to other people's names. We have also begun testing the ability to recognize the speaker's owner by voice.



We also launched a smart home platform. You can now control third-party devices by voice and even combine them into scenarios. Replacing remotes and buttons with voice is a key feature of our platform. And for that, Alice has to be nearby.



Besides, a smart speaker is not only about music, radio and video, but also reminders, an alarm clock, weather, factual answers, fairy tales and games for children, and so on. The device can be useful by the bed, in the office, in the kitchen, or in any other corner of the apartment.



So we decided to make another Station, for those who need a simpler and more compact device with Alice.



Shrinking the device



The mini version does not need loud sound, so the large, heavy speakers were replaced with a single three-watt one. That is more than enough for simple tasks. Even so, it can cause trouble with the power supply if you overlook one nuance, but more on that later.



We dropped the TV output. This reduces the load, the heat and, as a result, the requirements for the electronics. The Station's massive metal frame, with its passive heatsink for cooling, also became unnecessary.



Instead of seven microphones we kept four, because loud sound no longer interferes with picking up speech. As in the big Station, the microphones work on the principle of a phased antenna array, that is, as a directional microphone. The device algorithmically searches the surrounding noise for a voice command containing the word "Alice". Then it determines the direction of the voice and cleans the signal of noise, including subtracting the music being played. Only after that is the signal sent to the cloud and recognized.
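
The article does not describe the actual signal processing, so here is only a minimal sketch of the general idea behind a directional microphone array: delay-and-sum beamforming. It assumes a hypothetical linear array in which each microphone hears the sound a fixed number of samples after its neighbor; the function names, the geometry and the test signal are illustrative assumptions, and wake-word detection and music subtraction are left out entirely.

```python
import random


def steering_delays(step, mic_count):
    """Per-microphone delays (in samples) for a hypothetical linear array.

    Microphone m hears the sound step * m samples after microphone 0;
    the result is shifted so that the smallest delay is zero.
    """
    raw = [step * m for m in range(mic_count)]
    offset = min(raw)
    return [d - offset for d in raw]


def delay_and_sum(channels, delays):
    """Undo the given delays and average the aligned channels."""
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]


def mean_power(signal):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def estimate_step(channels, max_step):
    """Return the delay step whose aligned sum carries the most power."""
    mic_count = len(channels)
    return max(
        range(-max_step, max_step + 1),
        key=lambda s: mean_power(
            delay_and_sum(channels, steering_delays(s, mic_count))
        ),
    )


def simulate_array(source, step, mic_count):
    """Build test channels: each one is the source delayed and zero-padded."""
    return [
        [0.0] * d + source[: len(source) - d]
        for d in steering_delays(step, mic_count)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    voice = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for speech
    channels = simulate_array(voice, step=2, mic_count=4)
    print("estimated delay step:", estimate_step(channels, max_step=3))
```

When the candidate delays match the real ones, the channels add up coherently and the averaged signal keeps its power; for any other direction the samples partly cancel. The same aligned sum is also the "cleaned" signal, because sound arriving from other directions is attenuated.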



For speech recognition to be as accurate as possible, the neural network has to be trained on recordings made on this specific device. Simply taking the model from the "big" Station makes little sense, because it would be less effective on the Station Mini.



This problem can be solved in different ways. For example, you can hire people to read phrases to the speaker from a sheet of paper. But that yields few recordings, and they do not resemble real user requests, because real recordings contain unpredictable noise, overlapping voices and much more.



So we did not skimp on quality and immediately ordered several hundred finished speakers from the factory, which we handed out to participants of a closed beta test inside Yandex in exchange for help with training the neural network. And it worked.



By the way, we kept the hardware Mute button, which cuts power to the microphones and switches off Alice's "hearing". It adds no particular complexity to the device and now sits on the side.







But we did away with the remaining buttons. And this is where the fun begins.



Adding magic and a laser



Take a look at the photo below. It is a top view of both of our Stations. We will not discuss design today; instead, try to spot another important difference.







Note that there are no buttons, and no rotary ring for adjusting the volume. When you are building a small, lightweight device whose electronics fit almost entirely on one board, mechanical elements only complicate the design and increase its size.



Voice is the most natural way to control a smart speaker. But sometimes a person is talking on the phone or eating, so a backup is still needed. We found one that is no less natural.



Imagine: you make a hand gesture, and your favorite song gets louder. Or you simply lay your palm on the speaker, and the alarm stops ringing.



So how does the gesture magic work? A depth sensor hidden under the device's cover is responsible for it. Here is how it looks on the board, greatly magnified (in reality it is only 4 mm long and just 1 mm thick):







It is a vertically emitting infrared laser with a wavelength of 940 nm, paired with a receiving photodiode. The beam bounces off an obstacle above the speaker and returns. Since the speed of light is known, the distance to the object can be determined at any moment from the beam's round-trip time.
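
To make the arithmetic concrete, here is a small sketch of the time-of-flight relation. The function names are made up for illustration, and a real sensor measures the delay in hardware rather than in Python; the only physics used is that the beam covers the distance twice.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by definition of the metre


def distance_from_round_trip(round_trip_seconds):
    """Distance to the obstacle in metres; the beam travels there and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2


def round_trip_for_distance(distance_m):
    """Round-trip time in seconds for an obstacle at the given distance."""
    return 2 * distance_m / SPEED_OF_LIGHT_M_PER_S


if __name__ == "__main__":
    delay = round_trip_for_distance(0.15)  # a palm 15 cm above the sensor
    print(f"round trip: {delay * 1e9:.3f} ns")
    print(f"distance:   {distance_from_round_trip(delay) * 100:.1f} cm")
```

For a palm 15 cm above the sensor the round trip takes about one nanosecond, which gives a feeling for the timing resolution such a sensor has to achieve.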







It might seem that you just buy a sensor, connect it to the board, and everything works. But no.



The sensor is hidden inside, and there are openings in the case above it (otherwise it would not work). This means that dust and other debris could distort the measurements.



We need a protective plate that covers the laser and the photodiode and still fits inside the case. Its material is strictly constrained, because not every type of plastic works well in the near-infrared range. If you really want to, you can cut one out of glass, but that is rather difficult, and therefore very expensive.







Moreover, each protective plate is cast and is unique in the literal sense: it is impossible to make two identical plates. So each of them affects the propagation of the beam in its own way. If this is not taken into account, the distance measurement will be off.



Each new Station Mini goes through a sensor calibration step on the assembly line to account for the individual characteristics of its protective plate. Put simply, this ensures that the device perceives an obstacle at a height of 15 cm as being at exactly that height. Calibration goes roughly like this: sheets of a material that resembles photographic paper but is opaque to infrared light are fixed in place at a known height.
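
The article does not say what correction the factory calibration actually stores, so the following is only a sketch under a stated assumption: that the error introduced by the plate can be approximated by a gain and an offset, fitted with ordinary least squares from readings taken at known heights. All names and numbers are hypothetical.

```python
def fit_linear_correction(raw_mm, true_mm):
    """Least-squares fit of true = gain * raw + offset (assumed error model)."""
    n = len(raw_mm)
    mean_raw = sum(raw_mm) / n
    mean_true = sum(true_mm) / n
    variance = sum((r - mean_raw) ** 2 for r in raw_mm)
    covariance = sum(
        (r - mean_raw) * (t - mean_true) for r, t in zip(raw_mm, true_mm)
    )
    gain = covariance / variance
    offset = mean_true - gain * mean_raw
    return gain, offset


def corrected_height_mm(raw_reading_mm, gain, offset):
    """Apply the per-device correction to a raw sensor reading."""
    return gain * raw_reading_mm + offset


if __name__ == "__main__":
    target_heights = [50.0, 150.0, 300.0]   # known positions of the targets
    raw_readings = [67.0, 177.0, 342.0]     # made-up readings of one device
    gain, offset = fit_linear_correction(raw_readings, target_heights)
    print(f"gain={gain:.4f}, offset={offset:.2f} mm")
    print(f"raw 177 mm -> {corrected_height_mm(177.0, gain, offset):.1f} mm")
```

The pair of coefficients would be measured once per device and stored with it, since every plate distorts the beam differently.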



Eventually we reached the stage where the sensor's accuracy had to be tested in the assembled device. But it turned out that no ready-made industrial equipment for this exists. There was nothing for it: we built our own. In the photo below you can see the first prototype in our Moscow office, assembled literally from sheets of plywood, 3D-printed bushings, two motors and a controller to drive them. It automatically moves a platform that imitates a hand above the speaker to evaluate how accurately the sensor measures the distance.







Refined versions of this rig were later sent to the factory.



Stabilizing the power



It is time to return to the power supply, as promised above.



The speaker consumes power. On average not much: less than 5 watts even at high volume. But, unlike many other small household appliances, its consumption is extremely uneven. We noticed this effect on an early prototype when we used the gesture sensor while listening to this track:





Can you guess what is special about it? Sharp transitions to low frequencies. And how do low frequencies differ from high ones? In the amplitude of the speaker diaphragm's oscillation. The larger it is, the more power the device draws.



Add gesture control, voice commands and network traffic, and you get brief but unpredictable moments when consumption jumps so much that simple power supplies cannot keep the voltage stable. Typical smartphone chargers, for example, are not designed for this, because that class of devices has a battery and draws power fairly evenly. If the supply voltage sags even briefly, the speaker may simply reboot.



To avoid this problem, we tested prototypes with a 100 Hz tone, the frequency at which the speaker creates the greatest load. Our bundled power supply may look like a typical 1.5-ampere USB Type-C charger, but it is ready for such situations. We also understand that people may plug in their own power supplies, so during development we replaced the internal power converters (so-called DC-DC converters) with ones that can withstand short-term voltage drops. Of course, third-party power supplies vary; we do not test them and do not recommend them, but replacing the converters helps.



By the way, we also took users' wishes into account: the white Station Mini comes with a white power supply and cable. A small thing, but a nice one.



Inventing the gestures



A stable device and a working sensor are only half the battle. We still had to come up with the gestures themselves. The best way to invent something is to collect as many ideas as possible, then gradually filter and test them. That is exactly what we did: we held an internal hackathon with prizes. Any employee could propose gestures for the device and implement them right away. At Yandex, this approach works well.



There were many options. We filtered them by several criteria, two of which mattered most. First, if a function is popular and frequently needed, its gesture should be simple and easy to reproduce. Second, a good gesture is intuitive. You can write instructions or shoot a tutorial video, but all of that is less effective than good old intuition.



We quickly settled on the gesture for "Alice, stop". Users are already used to simply putting a hand on an alarm clock, phone or smartwatch to silence it.



The volume gesture was less obvious. We had two winning options. Both assumed that the volume is controlled along an imaginary vertical scale above the speaker. But is it enough to simply hold your hand above the speaker, so that the greater the distance, the higher the volume? Or is it better to use a relative scale and move your palm up or down to change the volume smoothly?







UX testing is well suited to answering such questions. Yandex has a dedicated lab for this: we invite people from the street and observe how they use the product. This practice is very useful.



We hoped that one of the two options would clearly win in UX testing. Not this time: people's behavior split roughly in half. So both options had to be tested further. We did that as part of the beta test, and the participants quickly pointed out a significant flaw of the absolute scale: an accidental wave of the hand (or a passing cat) can suddenly turn the volume up to maximum. And that is unpleasant.



The relative scale won, although we made improvements based on feedback from beta users. For example, we added heuristics against accidentally falling objects: for the volume to change, the palm must first freeze for a moment at one height and only then move. We also added audible feedback for the volume levels, so that the person can hear exactly how many steps they changed it by.
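
The "freeze first, then move" heuristic can be sketched as a small state machine. This is not the production logic: the class name, the thresholds and the idea of feeding it one height reading per sensor sample (with `None` when nothing is above the speaker) are all assumptions made for illustration.

```python
class RelativeVolumeGesture:
    """Changes volume only after the hand has hovered at one height."""

    def __init__(self, volume=5, hold_samples=5, tolerance_mm=10,
                 step_mm=20, min_volume=0, max_volume=10):
        self.volume = volume
        self.hold_samples = hold_samples    # readings the hand must stay still
        self.tolerance_mm = tolerance_mm    # allowed jitter while hovering
        self.step_mm = step_mm              # hand travel per volume step
        self.min_volume = min_volume
        self.max_volume = max_volume
        self._reset()

    def _reset(self):
        self._last = None      # previous reading while waiting for a hover
        self._steady = 0       # consecutive steady readings so far
        self._anchor = None    # height that corresponds to the current volume

    def update(self, height_mm):
        """Feed one reading (None = nothing detected); returns the volume."""
        if height_mm is None:
            self._reset()
            return self.volume
        if self._anchor is None:
            # Not armed yet: count how long the object stays at one height.
            steady = (self._last is not None
                      and abs(height_mm - self._last) <= self.tolerance_mm)
            self._steady = self._steady + 1 if steady else 1
            self._last = height_mm
            if self._steady >= self.hold_samples:
                self._anchor = height_mm
            return self.volume
        # Armed: whole steps of travel relative to the anchor change volume.
        steps = int((height_mm - self._anchor) / self.step_mm)
        if steps:
            self.volume = max(self.min_volume,
                              min(self.max_volume, self.volume + steps))
            self._anchor += steps * self.step_mm
        return self.volume


if __name__ == "__main__":
    gesture = RelativeVolumeGesture(hold_samples=3)
    for reading in [400, 300, 200, 100, None]:       # something falls past
        gesture.update(reading)
    print("after falling object:", gesture.volume)
    for reading in [150, 152, 151, 170, 195, None]:  # hover, then raise hand
        gesture.update(reading)
    print("after deliberate gesture:", gesture.volume)
```

A falling object never stays within the tolerance for several readings in a row, so it never arms the gesture, while a palm that pauses and then rises moves the volume one step per `step_mm` of travel.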



The story could have ended here, but the colleagues working on gestures turned out to be big fans of music and of unusual ways of playing it.



Adding a gravitsapa



While working on gestures, we had the following idea: use hand movements not only to adjust the volume but also to make music. Later we remembered that this idea is already used in the theremin. This electronic musical instrument was created in 1920 by the Soviet inventor Lev Sergeyevich Termen (known in the West as Leon Theremin). The theremin works as follows: hand movements change the capacitance of its oscillating circuit and, accordingly, the pitch of the sound. Just listen to the inventor himself:







Lev Termen's classic instrument uses an electromagnetic field and two antennas, one to control the volume and one to control the pitch. We have only one infrared beam, so only one parameter can be controlled. We made the volume constant.
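
As background on the classic instrument (not on the Station Mini), the link between capacitance and pitch can be illustrated with the resonance formula for an LC circuit, f = 1 / (2π√(LC)). The sketch below also assumes the usual textbook description of a theremin, in which the audible tone is the difference between a fixed oscillator and one detuned by the hand; the component values are made up.

```python
import math


def resonant_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC circuit: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def audible_pitch_hz(inductance_h, base_capacitance_f, hand_capacitance_f):
    """Difference between the fixed and the hand-detuned oscillator."""
    fixed = resonant_frequency_hz(inductance_h, base_capacitance_f)
    detuned = resonant_frequency_hz(
        inductance_h, base_capacitance_f + hand_capacitance_f
    )
    return fixed - detuned


if __name__ == "__main__":
    inductance = 1e-3          # 1 mH, illustrative value
    base_capacitance = 100e-12  # 100 pF, illustrative value
    for hand_pf in (0.0, 0.2, 0.5, 1.0):
        pitch = audible_pitch_hz(inductance, base_capacitance, hand_pf * 1e-12)
        print(f"hand adds {hand_pf:.1f} pF -> {pitch:7.1f} Hz")
```

A change of a fraction of a picofarad already shifts the tone by hundreds of hertz, which is why the slightest hand movement is audible on the real instrument.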



Peter Termen, a composer and theremin performer and the great-grandson of Lev Termen, helped us develop the new mode. Experimental musician Anton Maskeliade and the Monoleak studio created instrument styles for the synthesizer, from familiar pianos and guitars to unusual swords and pans. You can even play space music; just say: "Alice, give me the sound of a gravitsapa." The collection already has several dozen instruments, and it will keep growing.



In a theremin, the slightest movement of the hand changes the pitch. You need to be a professional with a steady hand to hit the notes accurately and play something melodic. We wanted everyone to be able to play music on our speaker. So for many instrument styles, the imaginary beam was divided into segments, each assigned a specific sound.
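
Dividing the beam into segments amounts to quantizing the hand height. The sketch below assumes a working range of 50 to 450 mm and a C major scale purely for illustration; the real ranges and sounds are not given in the article. Note numbers follow the MIDI convention, where 69 is the A at 440 Hz.

```python
C_MAJOR_MIDI = [60, 62, 64, 65, 67, 69, 71, 72]  # C4 up to C5


def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note (69 = A4 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def note_for_height(height_mm, low_mm=50, high_mm=450, notes=C_MAJOR_MIDI):
    """Map a hand height to the note of the segment it falls into.

    Returns None when the hand is outside the (assumed) working range.
    """
    if not low_mm <= height_mm < high_mm:
        return None
    position = (height_mm - low_mm) / (high_mm - low_mm)  # 0.0 .. 1.0
    segment = min(int(position * len(notes)), len(notes) - 1)
    return notes[segment]


if __name__ == "__main__":
    for height in (60, 120, 260, 440, 500):
        note = note_for_height(height)
        if note is None:
            print(f"{height} mm -> silence")
        else:
            print(f"{height} mm -> MIDI {note}, {midi_to_hz(note):.1f} Hz")
```

Within one segment the hand can wobble freely without changing the note, which is what makes the instrument forgiving for beginners.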



By the way, the synthesizer mode started out as a personal project of one of our colleagues. But the children we invited to the UX study were delighted with the new mode. So we realized that we should not be shy and should bring personal initiatives into the product.



***



Today we showed that even a small and seemingly simple device hides a whole history and many engineering decisions. Which of these stories would you like to hear in more detail?



We believe that the future belongs to voice control, because in many cases simply saying something is much more convenient and natural than pressing buttons. And the new device is another step in this direction.


