Our developers Philipp Weissensteiner and Florian Limberger are currently working with mood detection AI. If you are interested in the challenges they've faced so far and what's up next, read about their experiences here:
In this post we’ll share how we used TensorFlow’s object detection API to build a custom image annotation service for eyeson. Below you can see an example where Philipp makes the “thinking” 🤔 pose during a meeting, which automatically triggers a GIF reaction.
Background and overview
The idea for this project came up in one of our stand-up meetings. After we introduced GIF reactions to eyeson, we thought about marrying that feature with some machine learning: we wanted to do a rudimentary form of “sentiment analysis” on a frame from a video meeting and display a GIF matching the reaction of the participant.
Stack
We initially played around with a couple of machine learning services like Cloud Vision and Azure Machine Learning. Although these services are very convenient and provide some impressive results, we wanted a bit more flexibility, control and fun, so we decided to try TensorFlow’s object detection API. If you’re interested in playing around with it yourself, follow this guide and read through the documentation on GitHub.
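To give a rough idea of the moving parts, here is a minimal, hypothetical sketch of what the inference side of such a service can look like (the model path, label map path, file names and threshold are placeholders, and the exact export format depends on the TensorFlow version you use):

```python
import numpy as np
import tensorflow as tf
from PIL import Image
from object_detection.utils import label_map_util

# Placeholders: a model exported with the object detection API's exporter
# and the label map that was used during training.
detect_fn = tf.saved_model.load("exported_model/saved_model")
category_index = label_map_util.create_category_index_from_labelmap(
    "label_map.pbtxt", use_display_name=True)

def annotate(image_path, min_score=0.5):
    """Run the detector on one frame and return (label, score) pairs."""
    image = np.array(Image.open(image_path).convert("RGB"))
    input_tensor = tf.convert_to_tensor(image)[tf.newaxis, ...]
    detections = detect_fn(input_tensor)

    scores = detections["detection_scores"][0].numpy()
    classes = detections["detection_classes"][0].numpy().astype(int)
    return [(category_index[c]["name"], float(s))
            for c, s in zip(classes, scores) if s >= min_score]

# A detected "thinking" pose could then trigger the matching GIF reaction.
print(annotate("frame.jpg"))
```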
Training our own model
For our project we wanted to annotate images with the following labels (a sketch of the matching label map follows the list):
- face
- hand
- neutral 😐
- thinking 🤔
- smile 😃
- wave 👋🏻
- ok_hand 👌🏻
- thumbs_up 👍🏻
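Under the hood, the object detection API expects these classes in a label map file in protobuf text format. A sketch of what ours roughly looks like (the ids here are arbitrary, as long as they start at 1 and stay consistent throughout training data and config):

```
item {
  id: 1
  name: "face"
}
item {
  id: 2
  name: "hand"
}
item {
  id: 3
  name: "neutral"
}
# ...one item per remaining label (thinking, smile, wave, ok_hand, thumbs_up)
```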
Gathering good training data was certainly one of our biggest challenges and a weak spot in the current implementation. We do have some snapshots from our staging environments, but those pictures are from eyeson staff and testers. Unfortunately, there is not a lot of variation (skin colour, gestures, camera angles) in these.
To remedy this, we increased the volume of images by taking some of our staging recordings and “exploding” them into their individual frames. Finally, we wrote a script to download a bunch of images from Google Images. These are quite different from the images we would get in real meetings, but ¯\_(ツ)_/¯.
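For the “exploding” step, something along the lines of this OpenCV sketch does the job (file names and the sampling rate are made up for illustration; ffmpeg works just as well):

```python
import cv2  # pip install opencv-python

def explode(video_path, out_dir, every_nth=10):
    """Save every n-th frame of a recording as a JPEG."""
    capture = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_nth == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    capture.release()
    return saved

print(explode("staging_recording.webm", "frames"))
```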
So to summarize, the current set of training data consists of:
- snapshots from our staging environments
- exploded recordings from our staging environments
- Google Images
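Whatever the source, the images and their bounding-box annotations eventually have to be converted into TFRecords, since that is the format the object detection API trains from. A rough sketch of converting a single annotated image, assuming the boxes have already been drawn with some labeling tool (all file names and values below are made up):

```python
import tensorflow as tf

def to_tf_example(jpeg_bytes, width, height, boxes, class_names, class_ids):
    """Build one tf.train.Example in the layout the object detection API expects.

    boxes are (xmin, ymin, xmax, ymax) in normalized [0, 1] coordinates.
    """
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "image/format": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/width": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[width])),
        "image/height": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[height])),
        "image/object/bbox/xmin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[0] for b in boxes])),
        "image/object/bbox/ymin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[1] for b in boxes])),
        "image/object/bbox/xmax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[2] for b in boxes])),
        "image/object/bbox/ymax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[3] for b in boxes])),
        "image/object/class/text": tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[n.encode("utf-8") for n in class_names])),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=class_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Hypothetical usage: one frame with a single "thinking" box (label id 4).
with open("frames/frame_00000.jpg", "rb") as f:
    example = to_tf_example(f.read(), 1280, 720,
                            boxes=[(0.40, 0.25, 0.60, 0.55)],
                            class_names=["thinking"], class_ids=[4])

with tf.io.TFRecordWriter("train.record") as writer:
    writer.write(example.SerializeToString())
```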
... if you want to read more about their experiences, the challenges they've faced, the process itself, and what's next on their agenda, head over to our developers' blog. The results so far look like this:
Pretty nice, right? 😉