3D motion tracking is the process of determining the position and movement of an object in three-dimensional space. It is crucial in fields such as AI fitness, robotics, and computer vision. However, accurate 3D motion tracking from a single camera is challenging because labeled data is scarce. According to one study, only 5% of the data collected for 3D pose estimation is labeled. This shortage makes it difficult to train models tailored to a target domain. Researchers have proposed several methods to tackle this problem, including weakly supervised learning and Generative Adversarial Networks (GANs).
One approach is weakly supervised learning, which combines 3D supervised training with unlabeled data from the target domain. The key idea is to leverage the labeled 3D data available in a source domain, while the unlabeled target-domain data is used for a smaller auxiliary learning task that also helps the model with the general task of 3D pose tracking. A single neural network is trained on both kinds of data, so it learns the characteristics of the target domain and estimates 3D poses from a single camera more accurately. The approach can also support real-time motion tracking, making it suitable for interactive applications such as virtual reality and gaming.
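To make the idea concrete, here is a minimal sketch of a combined training objective. It assumes the auxiliary task is 2D keypoint reprojection, which is only one possible choice and is not specified by the description above. The function names, the orthographic camera model, and the `weight_2d` parameter are all hypothetical and chosen for illustration.

```python
import math

def mpjpe_3d(pred_3d, target_3d):
    """Mean Euclidean distance between predicted and labeled 3D joints."""
    return sum(math.dist(p, t) for p, t in zip(pred_3d, target_3d)) / len(pred_3d)

def project_orthographic(joints_3d):
    """Assumed camera model: drop the depth (z) to get 2D image coordinates."""
    return [(x, y) for x, y, _ in joints_3d]

def reprojection_2d(pred_3d, keypoints_2d):
    """Auxiliary task: the projected 3D prediction should match 2D keypoints."""
    projected = project_orthographic(pred_3d)
    return sum(math.dist(p, k) for p, k in zip(projected, keypoints_2d)) / len(projected)

def mixed_loss(source_batch, target_batch, weight_2d=0.5):
    """Supervised 3D loss on source data plus weighted auxiliary loss on target data.

    source_batch: list of (predicted 3D joints, labeled 3D joints) pairs
    target_batch: list of (predicted 3D joints, 2D keypoints) pairs, no 3D labels
    """
    supervised = sum(mpjpe_3d(p, t) for p, t in source_batch) / len(source_batch)
    auxiliary = sum(reprojection_2d(p, k) for p, k in target_batch) / len(target_batch)
    return supervised + weight_2d * auxiliary

# Toy example with two joints per pose.
source = [([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)])]
target = [([(0, 0, 5), (3, 4, 2)], [(0, 0), (0, 0)])]
loss = mixed_loss(source, target)
```

In a real training loop the predictions would come from the network and the loss would be minimized with a gradient-based optimizer. The point of the sketch is only that target-domain samples contribute a training signal without needing any 3D labels.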
Another approach is to use Generative Adversarial Networks (GANs) to generate synthetic 3D data for training the model. A GAN consists of two neural networks: a generator that creates synthetic data and a discriminator that decides whether a given sample is real or synthetic. The generator is trained to create data that resembles real data, while the discriminator is trained to distinguish between the two. Training them against each other pushes the generator toward synthetic data that is increasingly hard to tell apart from real data. This approach has been used in various applications, such as image synthesis and 3D pose estimation.
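The two competing objectives can be sketched as follows. This is a minimal illustration using the standard binary cross-entropy formulation with the commonly used non-saturating generator loss. It is not a description of any particular production model, and the function names are hypothetical. The inputs are discriminator outputs, interpreted as the probability that a sample is real.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants to output 1 on real data and 0 on synthetic data."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants the discriminator to output 1 on synthetic data."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)

# A generator that fools the discriminator (output 0.9) has a lower loss
# than one that is easily detected (output 0.1).
fooling = generator_loss([0.9])
detected = generator_loss([0.1])
```

During training the two losses are minimized in alternating steps, each with respect to its own network's parameters.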
At Sency, 3D pose tracking is widely used to determine the motion of a person in front of a mobile phone’s camera. The device is therefore positioned at an unusual viewpoint relative to the user. Neural network models trained on large-scale 3D datasets proved ineffective for Sency’s domain, because they generalize their 3D vision abilities poorly to a significantly different viewpoint.
To overcome this limitation, we combined a large amount of in-house 2D data from the target domain with 3D labeled data in semi-supervised training. We also used adversarial networks to refine the 3D predictions. After tuning this combined model, we found its predictions to be much more accurate than before, and the model successfully generalized to a new domain for which it had never seen any 3D data.