Creating an AI-Powered Taekwondo Trainer

Mansi Katarey
May 22, 2021

Did I mention that it’s personalized?

Has the pandemic had an impact on your ability to work out and train? I know for me it has. I enjoy doing taekwondo in my free time, so when the pandemic hit and everything closed, I lost my ability to train and learn. Besides a considerable decrease in stamina and power, I noticed my techniques getting sloppier, my kicks getting lower when they should have been getting higher, and of course, I started forgetting my patterns.

Now, you may think there’s a very simple solution to the latter: just practice your patterns…duh! And believe me when I tell you I tried. I watched countless YouTube videos, even the ones that walk you through the patterns step-by-step. But as a brand new green belt at the time, I was oblivious to my own mistakes. Without someone telling me I had made a mistake, I didn’t know I was making one.

This led to another problem: practicing the wrong things. Michael Jordan pretty much sums it up: “You can practice shooting eight hours a day, but if your technique is wrong, then all you become is very good at shooting the wrong way.” I could relate.

Many people think taekwondo is about power, force and strength, and for the most part, it is. But it also requires a lot of attention to detail. The stances, the angle at which you kick and punch and the exact ratio of your body weight on each leg. It all matters. Every tiny detail. So when I started practicing the wrong things…well, it was wrong.

This is the foot position for a “Walking” or Front Stance

Anyway, fast forward to a few months later and I started working out using my old Xbox. It had a workout DVD that guided you through different exercises and tracked your movements. As an example, let’s take sit-ups. If you didn’t bring yourself all the way up, or your posture was off, let’s say your knees were straight instead of bent, the computer wouldn’t count that sit-up. Instead, it would tell you what you did wrong.

The green person at the bottom is the user. The Xbox sensor/camera is measuring the angle between the arms and the midsection and the distance between both feet.
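To make that kind of check concrete, here is a minimal sketch of how an angle at a joint could be measured from three tracked points. This is not how the Xbox software actually works internally; the function names, the sample coordinates, and the 120 degree cutoff are all assumptions made up for illustration.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at point b, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("the three points must be distinct")
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def knee_is_bent(hip, knee, ankle, max_angle=120.0):
    """Hypothetical rule: the knee counts as bent if its angle is at most max_angle."""
    return joint_angle(hip, knee, ankle) <= max_angle


# Made-up (x, y) positions for a bent leg and a straight leg.
bent_leg = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
straight_leg = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(knee_is_bent(*bent_leg), knee_is_bent(*straight_leg))
```

A sit-up counter could run a check like this on every frame and only count the rep when the knees stay bent the whole way up.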

Inspired by that, I set out to build my own taekwondo trainer. A trainer that would tell me when I did something wrong.

Sounds Cool, But How Does it Work?

See, the Xbox DVD was built on 3D human pose estimation. It’s a computer vision technique that detects and tracks the position of a person or object and analyzes their posture. It can also be defined as the search for a specific pose in the space of all articulated poses.

This is done by looking at a combination of the pose and the orientation of a given person/object. Through pose estimation, we can identify, locate, and track a number of key points on a given object or person. For humans, these key points represent major joints like an elbow or knee. Pose estimation allows for higher-level reasoning in the context of human-computer interaction and activity recognition, and it’s also one of the basic building blocks for marker-less motion capture (MoCap) technology.

The key points in this image are represented with a blue circle.

There are 3 different types of human pose estimation:

  1. Skeleton-Based Model: This type only analyzes sets of joints, for example knees, shoulders, and elbows. As the name implies, it looks at the skeletal structure of the human body.
  2. Contour-Based Model: This one looks at a person’s silhouette. It takes into account the width of the arms, legs, and torso, to name a few.
  3. Volume-Based Model: This one is cool and a little more advanced. It models the 3D shape of the body, including the muscle structure, and is typically created from 3D body scans.
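As a rough illustration of the skeleton-based idea, a pose can be stored as a handful of named joints plus a list of which joints are connected. The joint names, the `BONES` list, and the `bone_lengths` helper below are hypothetical and only cover a few joints; real models define their own, longer keypoint sets.

```python
import math
from dataclasses import dataclass


@dataclass
class Keypoint:
    name: str
    x: float
    y: float


# Hypothetical subset of "bones": pairs of joints that are physically connected.
BONES = [
    ("shoulder", "elbow"),
    ("elbow", "wrist"),
    ("hip", "knee"),
    ("knee", "ankle"),
]


def bone_lengths(keypoints, bones=BONES):
    """Return the length of every bone whose two joints were both detected."""
    by_name = {k.name: k for k in keypoints}
    lengths = {}
    for start, end in bones:
        if start in by_name and end in by_name:
            a, b = by_name[start], by_name[end]
            lengths[(start, end)] = math.hypot(a.x - b.x, a.y - b.y)
    return lengths


# Made-up coordinates for part of one arm.
pose = [Keypoint("shoulder", 0.0, 0.0), Keypoint("elbow", 3.0, 4.0)]
print(bone_lengths(pose))
```

Everything a skeleton-based model knows about the body lives in a structure like this: points and the connections between them, with no information about width or volume.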

The DVD used 3D Human Pose Estimation vs. 2D…What’s the Difference?

2D pose estimation estimates the X and Y coordinates of each key point in the image. 3D pose estimation adds a z-dimension (depth) to the prediction. See, 2D pose estimation only tells you where the joints appear in the image, while 3D also captures how the joints are arranged in space. Often, 3D models will first predict the key points in 2D from an image or video and then incorporate the z-dimension to make it 3D.
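One simple way to picture the jump from 2D to 3D is the case where a depth value is already known for a key point, for example from a depth camera. The sketch below uses the standard pinhole camera model to turn a pixel position plus depth into a 3D point. The camera parameters (`fx`, `fy`, `cx`, `cy`) and the sample numbers are assumptions for illustration; many 3D pose models instead learn to predict depth from the image itself.

```python
def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a known depth into camera coordinates.

    fx, fy are the focal lengths in pixels and (cx, cy) is the principal
    point. All values used below are made up for the example.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# A 2D key point is just (u, v); its 3D version gains a z-coordinate.
keypoint_2d = (420, 240)
keypoint_3d = lift_to_3d(*keypoint_2d, depth=2.0, fx=500, fy=500, cx=320, cy=240)
print(keypoint_2d, keypoint_3d)
```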

2D vs. 3D Pose Estimation

Let’s Take a Deeper Look into Pose Estimation

There are two different approaches when it comes to pose estimation: the bottom-up approach and the top-down approach.

The bottom-up approach first detects every instance of a specific joint (e.g. every right elbow) for everyone in the image and then groups the joints that belong to each person using joint association techniques.

The top-down approach is the opposite. It first draws a box around the object or person, confining itself to a certain area, and then estimates the possible locations of the key points within each region.
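To give a feel for the "joint association" step in the bottom-up approach, here is a toy sketch that pairs each detected shoulder with the closest unused elbow. Plain distance is a stand-in chosen for simplicity; real bottom-up systems rely on learned association scores, and the `group_joints` function and sample coordinates are hypothetical.

```python
import math


def group_joints(shoulders, elbows):
    """Greedily pair each shoulder with the nearest elbow not yet taken.

    Returns a dict mapping shoulder index -> elbow index.
    """
    # Every possible (distance, shoulder index, elbow index), closest first.
    candidates = sorted(
        (math.dist(s, e), i, j)
        for i, s in enumerate(shoulders)
        for j, e in enumerate(elbows)
    )
    used_shoulders, used_elbows, pairs = set(), set(), {}
    for _, i, j in candidates:
        if i not in used_shoulders and j not in used_elbows:
            pairs[i] = j
            used_shoulders.add(i)
            used_elbows.add(j)
    return pairs


# Two people in the frame; the elbows were detected in a different order.
shoulders = [(0, 0), (10, 0)]
elbows = [(11, 1), (1, 1)]
print(group_joints(shoulders, elbows))
```

In a top-down pipeline this grouping step is not needed, because each bounding box is assumed to contain exactly one person.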

The Process:

Let’s walk you through what it takes to find the key points.


First, the computer needs to remove the background. Sometimes, the background can be distracting and has the potential to make the model inaccurate. To prevent this, the background is removed.
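A very basic way to remove a background is to compare each frame against a reference picture of the empty scene and keep only the pixels that changed. The sketch below assumes a fixed camera, grayscale images stored as lists of rows, and an arbitrary threshold of 30; the function names are made up, and production systems generally use more robust methods.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark a pixel True when it differs from the background by more than threshold."""
    return [
        [abs(pixel - bg) > threshold for pixel, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def remove_background(frame, background, threshold=30, fill=0):
    """Replace every background pixel with `fill`, keeping only the foreground."""
    mask = foreground_mask(frame, background, threshold)
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]


# Tiny made-up grayscale images: the bright middle column is the "person".
background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 200, 12], [11, 180, 10]]
print(remove_background(frame, background))
```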

Sometimes, however, removing the background is not optimal. It may take too long or the computer may not have the ability to do so. In this case, a bounding box is drawn around every human in the frame. From there, only what is within the boxes is evaluated.

A bounding box is created around the person in the left image.
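Working inside a bounding box comes down to two small operations: cropping the image to the box, and translating any key point found in the crop back into full-frame coordinates. This is a minimal sketch with hypothetical helper names, assuming the box is given as `(x0, y0, x1, y1)` with the right and bottom edges excluded.

```python
def crop(image, box):
    """Return the part of the image inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def to_frame_coords(keypoint, box):
    """Convert a key point found inside the crop back to full-frame coordinates."""
    x0, y0, _, _ = box
    return (keypoint[0] + x0, keypoint[1] + y0)


# A made-up 4x5 image where each pixel value encodes its own row and column.
image = [[row * 10 + col for col in range(5)] for row in range(4)]
box = (1, 1, 4, 3)

person = crop(image, box)
print(person)
print(to_frame_coords((2, 1), box))
```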

Extracting the Key Features

We want to take the important information from the images, which can later be used to train the learning algorithm. There are two types of key feature extraction: explicit and implicit.

HoG Image on the right

Explicit features are your standard hand-crafted computer vision features, like Histogram of Oriented Gradients (HoG) and Scale-Invariant Feature Transform (SIFT). These features are calculated explicitly before feeding the information to the learning algorithm.
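To show what "calculated explicitly" means, here is a heavily simplified version of the core HoG idea: measure the direction of the brightness change at each pixel and add it to a histogram of directions. This sketch treats the whole image as a single cell and uses plain Python lists; a full HoG descriptor also splits the image into cells, normalizes over blocks, and interpolates between bins. The function name and test image are made up.

```python
import math


def orientation_histogram(image, bins=9):
    """Histogram of gradient directions (0-180 degrees), weighted by gradient strength."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Skip the border so every pixel has a neighbour on each side.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation: opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Made-up image with a vertical edge: dark on the left, bright on the right.
vertical_edge = [[0, 0, 255, 255] for _ in range(4)]
print(orientation_histogram(vertical_edge))
```

For this image all of the gradient energy lands in the first bin, because brightness only changes horizontally. The histogram itself is the feature that gets handed to the learning algorithm.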

Implicit features refer to deep learning-based feature maps, like the outputs of the layers in a Deep Convolutional Neural Network (CNN). These feature maps are never designed by hand; the network learns them during training.
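A feature map is simply what comes out when a small filter slides over the image. The sketch below applies one 3x3 filter followed by a ReLU, which is the basic operation a convolutional layer repeats many times. One big assumption to flag: the filter weights here are picked by hand to respond to vertical edges, purely so the output is easy to read. In a real CNN those weights are learned, which is exactly what makes the features implicit.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hand-picked weights for illustration only; a trained network learns its own.
VERTICAL_EDGE = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

dark_to_bright = [[0, 0, 1, 1] for _ in range(3)]
print(relu(conv2d(dark_to_bright, VERTICAL_EDGE)))
```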

Predicting the Key Points

Often, the computer will create confidence maps as a way to predict the location of every joint. A confidence map is a probability distribution over the image: for each pixel, it gives the probability that a given joint is located there. This is where the computer will either use a top-down approach or a bottom-up approach.

Simple key point detection
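Reading a key point out of a confidence map can be as simple as finding the pixel with the highest value. The sketch below fakes a confidence map with a Gaussian bump, since there is no trained model here, and then decodes it. The function names, the map size, and the `sigma` value are all assumptions for illustration.

```python
import math


def gaussian_heatmap(width, height, center, sigma=1.5):
    """Build a fake confidence map that peaks at `center` = (x, y)."""
    cx, cy = center
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def decode_keypoint(heatmap):
    """Return ((x, y), confidence) for the most confident pixel."""
    confidence, x, y = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return (x, y), confidence


# Pretend the model thinks the right elbow is at x=5, y=2.
elbow_map = gaussian_heatmap(width=8, height=6, center=(5, 2))
print(decode_keypoint(elbow_map))
```

A model that tracks many joints would produce one map like this per joint and decode each of them the same way.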

Post Processing

There are lots of post-processing algorithms that are used to ensure that the outcome is realistic. See, sometimes, the computer is a little tired and distracted and it gives us inaccurate outputs.

Here’s how it works: every predicted pose passes through a post-processing algorithm, which scores the pose based on its likelihood. Poses that score below the average are then ignored going forward.
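The filtering step can be sketched in a few lines. How a pose's likelihood is actually calculated depends on the algorithm, so the scoring rule here (the mean confidence of a pose's key points) is an assumption, as are the function names and the sample numbers.

```python
def pose_score(pose):
    """Assumed scoring rule: average confidence over the pose's key points.

    Each key point is a tuple (x, y, confidence).
    """
    return sum(confidence for _, _, confidence in pose) / len(pose)


def keep_likely_poses(poses):
    """Drop every pose that scores below the average score of all poses."""
    if not poses:
        return []
    scores = [pose_score(pose) for pose in poses]
    average = sum(scores) / len(scores)
    return [pose for pose, score in zip(poses, scores) if score >= average]


# Three made-up candidate poses with two key points each.
candidates = [
    [(0, 0, 0.9), (1, 1, 0.8)],  # confident
    [(0, 0, 0.2), (1, 1, 0.1)],  # probably noise
    [(0, 0, 0.7), (1, 1, 0.9)],  # confident
]
print(len(keep_likely_poses(candidates)))
```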

Pose estimation in action!

Using these same principles, I created a basis for my taekwondo trainer.

Application of Pose Estimation

Pose estimation can help revolutionize so many things. Some things it can be used for:

  • Workplace activity monitoring
  • Measuring the crowd and foot traffic
  • Robotics (to make robots more advanced)
  • Animation (making it quick and easy to create movement-based animations)
  • Augmented Reality (e.g. virtually testing out pieces of furniture)
Pose estimation can be used for many things.

It’s important to remember that pose estimation is still very much evolving. As the technology advances, it could change how we live our daily lives.

This article is part one of two about the AI trainer. In this article, we went over the basic steps and ideas to get the trainer working. Stay tuned for my next article that goes much more in-depth about the trainer!

Thanks for reading! If you have any questions or want to chat further, contact me through LinkedIn.

To keep up with everything I’m doing, subscribe to my newsletter!
