Dynamic Hand Gesture Recognition Using 3DCNN and LSTM with
FSM Context-Aware Model

Noorkholis Luthfil Hakim	Timothy K. Shih*	S. P. Kasthuri Arachchi	Wisnu Aditya	Yi-Cheng Chen	Chih Yang Lin
National Central University, Taiwan	National Central University, Taiwan	National Central University, Taiwan	National Central University, Taiwan	National Central University, Taiwan	Yuan Ze University

Abstract

Smart TV technology currently demands unique and useful applications, since a smart TV can serve multiple purposes without a connection to a satellite service or a TV antenna. Inspired by these benefits, we propose a gesture-based system for a smart TV-like environment that consists of six applications: movie recommendation, social media platform, call a friend, weather checking, chatting, and tourism. All applications are integrated into a single system controlled by a natural gesture controller for easy use and natural interaction. As one of its main contributions, this work focuses on recognizing the recorded gestures with a multimodal deep learning architecture in order to implement the proposed system. The dataset currently contains 24 gestures, 13 static and 11 dynamic, recorded in RGB and depth format. The proposed model uses a 3DCNN architecture followed by an LSTM network to extract both short-term and long-term temporal features. The classification results are then combined with a Finite State Machine (FSM) that communicates with the model to constrain the class decision according to the application context. The classification results show that combining depth and RGB data achieves an accuracy of 97.8%, while the FSM improves the recognition rate from 89% to 91% with real-time performance.

Gestures Design

The images below show samples of the 8 gestures used in the real-time gesture recognition system for the smart TV environment. The gestures were chosen according to several requirements. First, the gestures need to have characteristics that differ clearly from one another. Second, the gestures must carry no cultural or religious meaning. Finally, the gestures were selected for convenience of use, based on user preference.

Below is the full set of 24 gestures collected for the smart TV environment.

Proposed Method

As seen in the figure below, there are two main modules in the system. The classification module extracts features from the input data and predicts the gesture from them. The FSM control module filters out gestures based on the context of the system, which helps the classification module predict the gesture.

The following figure shows the classification part, which uses 3DCNN and LSTM to recognize hand gesture actions.
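To make the hand-off between the two networks concrete, the sketch below traces how a clip's shape changes through a few 3D convolution and pooling layers before the remaining time steps are passed to the LSTM as a sequence. The clip size, layer settings, and channel count are illustrative assumptions, not the configuration used in the paper, and the helper names are hypothetical.

```python
def conv3d_out(shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Output (frames, height, width) of one 3D convolution or pooling layer."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


def lstm_input_shape(clip_shape, layers, channels):
    """Apply each layer in turn, then flatten channels and space per time step.

    Returns (sequence_length, features_per_step) as seen by the LSTM.
    """
    shape = clip_shape
    for kernel, stride, padding in layers:
        shape = conv3d_out(shape, kernel, stride, padding)
    frames, height, width = shape
    return frames, channels * height * width


# Assumed values for illustration only: a 16-frame clip of 112x112 images.
CLIP = (16, 112, 112)
LAYERS = [
    ((3, 3, 3), (1, 1, 1), (1, 1, 1)),  # 3D conv, keeps the size
    ((1, 2, 2), (1, 2, 2), (0, 0, 0)),  # spatial-only pooling
    ((3, 3, 3), (1, 1, 1), (1, 1, 1)),  # 3D conv, keeps the size
    ((2, 2, 2), (2, 2, 2), (0, 0, 0)),  # pooling over time and space
]

seq_len, features = lstm_input_shape(CLIP, LAYERS, channels=64)
print(seq_len, features)
```

Under these assumptions, each 3D kernel mixes a few neighbouring frames (short-term motion), and the LSTM then models the resulting shorter sequence (long-term dependencies).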

The proposed classification module was evaluated with three multimodal data fusion schemes in the experimental design: (a) early fusion, (b) middle fusion, and (c) late fusion. The results suggest that late fusion produces the best result.
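As a rough illustration of late fusion, the sketch below lets an RGB branch and a depth branch each produce class scores and combines them only at the decision stage. Combining by a weighted average of softmax probabilities is an assumption made for this example; the page does not state which combination rule the authors used, and the function names and `rgb_weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, depth_logits, rgb_weight=0.5):
    """Weighted average of the per-modality class probabilities (assumed rule)."""
    if len(rgb_logits) != len(depth_logits):
        raise ValueError("both branches must score the same set of classes")
    p_rgb = softmax(rgb_logits)
    p_depth = softmax(depth_logits)
    return [
        rgb_weight * a + (1.0 - rgb_weight) * b
        for a, b in zip(p_rgb, p_depth)
    ]


def predict(rgb_logits, depth_logits, rgb_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = late_fusion(rgb_logits, depth_logits, rgb_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for three classes: the branches disagree, fusion decides.
rgb_scores = [2.0, 1.0, 0.1]
depth_scores = [0.5, 2.5, 0.3]
print(predict(rgb_scores, depth_scores))
```

By contrast, early fusion would combine the RGB and depth data before the network, and middle fusion would combine intermediate features; only the point at which the modalities merge differs.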

Finally, the FSM control helps to narrow the decision making of the classification module based on the context of the application system. The following figure shows an example in which the "watch movie" application narrows the gesture decision in one of its states.
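The idea of context-based narrowing can be sketched as follows: the current FSM state defines which gestures are valid, the classifier's scores for all other gestures are discarded, and the best remaining gesture may trigger a state transition. The state names, gesture labels, and transitions below are hypothetical placeholders, not the ones defined in IC4You.

```python
# Hypothetical states and gesture labels, for illustration only.
ALLOWED = {
    "main_menu": {"swipe_left", "swipe_right", "select"},
    "watch_movie": {"play", "pause", "volume_up", "volume_down", "back"},
}

# Hypothetical transitions: (current state, gesture) -> next state.
TRANSITIONS = {
    ("main_menu", "select"): "watch_movie",
    ("watch_movie", "back"): "main_menu",
}


def filter_by_context(probs, state):
    """Keep only gestures valid in `state` and renormalize their scores."""
    allowed = ALLOWED[state]
    masked = {g: p for g, p in probs.items() if g in allowed}
    total = sum(masked.values())
    if total == 0:
        return None  # the classifier gave no weight to any valid gesture
    return {g: p / total for g, p in masked.items()}


def step(probs, state):
    """Return (predicted gesture, next state) for one classifier output."""
    filtered = filter_by_context(probs, state)
    if filtered is None:
        return None, state
    gesture = max(filtered, key=filtered.get)
    return gesture, TRANSITIONS.get((state, gesture), state)


# Made-up classifier output: the top-scoring class is not valid in this state.
scores = {"swipe_left": 0.40, "pause": 0.35, "play": 0.25}
print(step(scores, "watch_movie"))
```

In this example the unconstrained classifier would pick "swipe_left", but in the "watch_movie" state only "pause" and "play" remain as candidates, so the context filter selects "pause".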

Experimental Demo

The following images and videos show the experimental demo of our gesture recognition model in our implemented system, called IC4You.

3DCNN + LSTM + Context-Aware Proposed Method

This video shows the experimental results of our proposed method in the implemented system.

First Demo Video

Second Demo Video

Demo in PAIR Taiwan 2018

The system was tested at the MOST Artificial Intelligence Demo Conference in Taiwan, 2018.

Live Demo

Conclusions

In this paper, we presented a multimodal deep learning architecture to solve the gesture recognition problem in real-time applications. We used a combination of RGB and depth data as the input to the proposed model, and the 3DCNN and LSTM architectures together extract the spatio-temporal features of the gesture sequence, which matters especially for dynamic gestures. In a real-time application, adding the FSM controller narrows the gesture classification task into smaller parts, which makes the model work more efficiently and improves its accuracy. To test the proposed model, we designed 24 static and dynamic gestures associated with a smart TV-like environment. For real-time testing, however, we used only eight gestures to examine the robustness of our work. The results show that the FSM controller can improve accuracy in real-time applications. In future work, we would like to use transfer learning, training the model on large datasets such as the Sports-1M dataset or the ChaLearn gesture dataset, to improve model accuracy. We also plan to compare our model with other similar works that consider only gesture recognition.

Dataset Download

You can download part of our IC4You gesture dataset using the links below. The dataset is split into RGB and depth data, in four parts. Please note that the dataset is for non-commercial use only.

RGB Dataset Part 1: RGB-1-8 (12 GB)

RGB Dataset Part 2: RGB-9-15 (9.7 GB)

Depth Dataset Part 1: Depth-1-8 (2.3 GB)

Depth Dataset Part 2: Depth-9-15 (1.9 GB)

Information: Readme.txt

Citation

Please cite the following paper if you use the IC4You gesture dataset in your publications.

Hakim, N.L.; Shih, T.K.; Arachchi, S.P.K.; Aditya, W.; Chen, Y.C.; Lin, C.Y. Dynamic Hand Gesture Recognition Using 3DCNN and LSTM with FSM Context-Aware Model. Sensors, Year: 2019, Volume: 19, Issue: 24, 5429.

End Note

Timothy K. Shih

Noorkholis Luthfil Hakim

Copyright 2019 - Timothy K. Shih and MINE Lab., National Central University