AI learning for video conferencing
Lisa · October 16, 2018 · 2 min read

Mood Detection AI to trigger GIFs

Our developers Philipp Weissensteiner and Florian Limberger are currently working with mood detection AI. If you are interested in the challenges they've faced so far and what's up next, read about their experiences here:

In this post we’ll share how we used TensorFlow’s object detection API to build a custom image annotation service for eyeson. Below you can see an example where Philipp makes the “thinking” 🤔 pose during a meeting, which automatically triggers a GIF reaction.


Background and overview

The idea for this project came up in one of our stand-up meetings. After we introduced GIF reactions to eyeson, we thought about marrying that feature with some machine learning: we wanted to do some rudimentary form of “sentiment analysis” on a frame from a video meeting and display a GIF matching the reaction of the participant.
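
To make the idea concrete, here is a minimal sketch of how a frame's detections could be turned into a GIF trigger. It is not the eyeson implementation: the function name, the (label, score) input shape and the 0.8 threshold are all assumptions for illustration, and the label names are borrowed from the list further down.

```python
from typing import Iterable, Optional, Tuple

# Assumption: these labels count as "reactions" worth a GIF;
# face, hand and neutral are deliberately left out.
REACTION_LABELS = {"thinking", "smile", "wave", "ok_hand", "thumbs_up"}


def pick_reaction(
    detections: Iterable[Tuple[str, float]],
    min_score: float = 0.8,  # hypothetical confidence threshold
) -> Optional[str]:
    """Return the most confident reaction label for one frame, or None."""
    best_label = None
    best_score = min_score
    for label, score in detections:
        # Ignore non-reaction labels and anything below the current best.
        if label in REACTION_LABELS and score >= best_score:
            best_label, best_score = label, score
    return best_label


frame = [("face", 0.99), ("thinking", 0.91), ("smile", 0.85)]
print(pick_reaction(frame))
```

Returning None for frames without a confident reaction lets the caller simply skip the GIF instead of showing a wrong one.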


We initially played around with a couple of machine learning services like Cloud Vision and Azure Machine Learning. Although these services are very convenient and provide some impressive results, we wanted a bit more flexibility, control and fun, so we decided to try TensorFlow’s object detection API. If you’re interested in playing around with it yourself, follow this guide and read through the documentation on GitHub.

Training our own model

For our project we wanted to annotate images with the following labels:
    • face
    • hand
    • neutral 😐
    • thinking 🤔
    • smile 😃
    • wave 👋🏻
    • ok_hand 👌🏻
    • thumbs_up 👍🏻
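
The object detection API reads its class names from a label map in protobuf text format, with ids starting at 1 because 0 is reserved for the background class. A sketch of generating such a file from the labels above could look like this; the id order and the helper name are our assumptions, not necessarily what the original project used.

```python
# Labels from the list above; the order (and therefore the ids) is an assumption.
LABELS = [
    "face", "hand", "neutral", "thinking",
    "smile", "wave", "ok_hand", "thumbs_up",
]


def build_label_map(labels):
    """Render labels as label map text with ids starting at 1."""
    items = []
    for class_id, name in enumerate(labels, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(items)


# The result would typically be saved as a .pbtxt file next to the training config.
print(build_label_map(LABELS))
```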

Gathering good training data was certainly one of our biggest challenges, and it remains a weak spot in the current implementation. We do have some snapshots from our staging environments, but those pictures are of eyeson staff and testers. Unfortunately, there is not a lot of variation (skin colour, gestures, camera angles) in these.

To remedy this, we increased the volume of images by using some of our staging recordings and “exploding” those into their frames. Finally, we wrote a script to download a bunch of images from Google Images. These are quite different from the actual images we would get in actual meetings, but ¯\_(ツ)_/¯.
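
For readers who want to try the “exploding” step themselves, here is one possible sketch. The post does not say which tool was used; this version assumes the ffmpeg command line tool is installed and uses its fps filter to sample frames. The function names, file names and sampling rate are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(recording, out_dir, fps=1):
    """Build an ffmpeg call that writes `fps` frames per second as JPEGs."""
    pattern = str(Path(out_dir) / "frame_%05d.jpg")
    return ["ffmpeg", "-i", str(recording), "-vf", f"fps={fps}", pattern]


def explode_recording(recording, out_dir, fps=1):
    """Create the output directory and run ffmpeg (assumed to be on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(recording, out_dir, fps), check=True)


# Example with hypothetical paths:
# explode_recording("staging-meeting.webm", "frames/staging-meeting", fps=1)
```

Sampling one frame per second rather than every frame is a guess at a sensible default: consecutive video frames are nearly identical, so keeping all of them mostly adds duplicates to the training set.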

So to summarize, the current set of training data consists of:

  • snapshots from our staging environments
  • exploded recordings from our staging environments
  • Google Images

... If you want to read more about their experiences, the challenges they've faced, the process itself and what's on their agenda next, go to our developers' blog. The results so far look like this:


Pretty nice, right? 😉