Looking for a convenient method of “do as I do” motion transfer? Ralabs’ CEO Andrew Yasynyshyn explains a new framework created by Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros from the University of California, Berkeley. The technology makes it possible to produce a video in which any person appears to move like a professional dancer or perform martial-arts kicks.

Step 1. Background

Before talking about the new method, let’s see what has already been done in the motion transfer area. Previous methods mostly created new images or videos by rearranging existing content. Video Rewrite is a classic example: to make a specific person say entirely new words or phrases, the system takes any available video of that person, finds the needed mouth positions in different frames, and recombines them, like a puzzle, into new words. Here is the first benefit of the new approach: the Berkeley researchers synthesize new motion instead of rearranging old footage.

Modern techniques make it possible to generate high-quality images of a human in novel poses, and even to produce such images as temporally coherent video. Motion transfer between faces, and from poses to bodies, has already been learned by frameworks such as Recycle-GAN and vid2vid. The new approach, however, not only generates poses but also preserves small details such as the eyes or a smile.

Generative Adversarial Networks (GANs), which approximate generative models, are used for many purposes. Image generation is one of them, because these networks can create images with high-quality details, and advanced GANs can condition their output on a structured input. There have also been studies of image-to-image translation; to solve these mappings, frameworks such as pix2pix, Cascaded Refinement Networks, CoGAN, DiscoGAN, and CycleGAN rely on GANs. The Berkeley research team adapted some of these frameworks for their new approach to motion transfer.

Step 2. Methodology

The method proposed by the Berkeley researchers contrasts with the approaches of the last two decades. To transfer motion from one human subject to another, they suggest an end-to-end pixel-based pipeline. What does that mean? Let’s take a look. The framework works with two videos: the first shows a target person whose actions will be synthesized; the second shows a source subject whose motion will be imposed onto the target person. To learn the transition, the researchers did not use two subjects performing the same motions. Why? Even if two subjects perform exactly the same moves, a clean frame-to-frame body-pose correspondence is still impossible: every subject has unique differences, such as body shape. So the researchers looked for the best mapping between the videos of the two people, or, basically, a way to do an image-to-image translation.

The pose-detection step in this framework will be familiar to anyone who knows computer vision and data science. Modern computer vision is based on convolutional neural networks (CNNs). In our case, a CNN is trained to detect keypoints that represent the positions of different parts of the human body: feet, knees, pelvis, shoulders, elbows, hands, nose, eyes, ears, and so on. These keypoints are then connected by lines into a pose stick figure, and that figure plays the role of an intermediate representation of the subject. The Ralabs team used exactly the same technique in our surveillance-camera project: recognizing people by the way they walk. For the Berkeley project, the pose stick figure preserves the motion signature while discarding all unnecessary details of the subject. Look at this example:
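The idea of connecting keypoints into a stick figure can be sketched in a few lines of Python. The keypoint names and skeleton edges below are illustrative placeholders, not the exact set used by OpenPose or by the Berkeley paper:

```python
# Illustrative skeleton: which pairs of detected keypoints get a line.
SKELETON = [
    ("nose", "neck"),
    ("neck", "right_shoulder"), ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_hand"),
    ("neck", "left_shoulder"), ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_hand"),
    ("neck", "pelvis"),
    ("pelvis", "right_knee"), ("right_knee", "right_foot"),
    ("pelvis", "left_knee"), ("left_knee", "left_foot"),
]

def stick_figure(keypoints):
    """Connect keypoints (name -> (x, y)) into line segments.

    Segments whose endpoints were not detected are skipped, which is
    how noisy frames simply lose some limbs in the stick figure.
    """
    segments = []
    for a, b in SKELETON:
        if a in keypoints and b in keypoints:
            segments.append((keypoints[a], keypoints[b]))
    return segments

# A toy frame where only part of the upper body was detected.
frame = {"nose": (50, 10), "neck": (50, 30), "right_shoulder": (35, 32)}
print(stick_figure(frame))  # two segments: nose-neck and neck-right_shoulder
```

The resulting list of segments is all the generator needs to see: the motion signature survives, while appearance details of the source person are gone.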
In this way, the researchers obtained a definite pose from each frame of the subject video. That makes the image-to-image translation possible and lets them train the model to produce videos of a specific subject. To make the target subject move the same way as the source, they only need to feed the pose figures into the trained model. You may also notice the realism of the generated videos; Berkeley’s team achieved it with two small components. For temporal smoothness, they conditioned the prediction at each frame on the previous time step, and to make faces more realistic, they trained a specialized adversarial network to generate the target’s face. Having finalized the method, they started creating videos. In Figure 3 you can observe the whole process, which we will discuss below.
Figure: structure of the pipeline.
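At a high level, the transfer stage is a per-frame loop: detect the source pose, normalize it to the target, and render the target person. The function names below are hypothetical stand-ins, not the paper’s code:

```python
def transfer(source_frames, pose_detector, normalize, generator):
    """Hypothetical sketch of the per-frame transfer loop.

    pose_detector: frame -> keypoints (stands in for a detector like OpenPose)
    normalize:     keypoints -> keypoints in the target's scale and position
    generator:     pose representation -> synthesized image of the target
    """
    outputs = []
    for frame in source_frames:
        keypoints = pose_detector(frame)       # 1. detect the source pose
        keypoints = normalize(keypoints)       # 2. adapt it to the target
        outputs.append(generator(keypoints))   # 3. render the target person
    return outputs

# Toy stand-ins so the loop runs end to end.
frames = ["f0", "f1", "f2"]
result = transfer(frames, pose_detector=len, normalize=lambda k: k * 2,
                  generator=lambda k: f"img({k})")
print(result)  # ['img(4)', 'img(4)', 'img(4)']
```

The real system replaces each stand-in with a trained network, but the data flow is the same.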

Step 3. Pose detection and normalization

As we’ve already said, the pose figures in this framework represent the source body position. To create these figures, we first need a source video. The source video does not have to be high quality, because we only need to detect poses from it, so any acceptable video from the internet will do. The target video, on the other hand, has some strict requirements: it must cover a sufficient range of motion and contain frames with minimal blur. The researchers filmed targets wearing clothing with minimal wrinkling for around 20 minutes at 120 frames per second.

To get pose figures that encode body position, they used a pre-trained state-of-the-art pose detector P, which estimates x, y joint coordinates. In Figure 2, you can see keypoints connected by lines to create a pose stick figure. During training, these figures are the basis for generating new images, but they need to be normalized because the source and target differ within every frame. The normalized coordinates become the inputs for the generator G.

What do they mean by normalization? Subjects may have different body shapes, limb proportions, positions within the frame, and so on. The researchers therefore had to transform the source pose keypoints to match the position and proportions of the target person (see Figure 3, Transfer). To solve this problem, they used the closest and farthest ankle positions in both videos: a linear mapping between ankle keypoints gives the scale and translation for each frame based on its corresponding pose detection.
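A simplified version of that ankle-based normalization might look like this. The paper interpolates the scale per frame; this sketch fits a single linear map on the vertical coordinate, with made-up numbers, purely as an illustration:

```python
def ankle_linear_map(src_close, src_far, tgt_close, tgt_far):
    """Fit y' = scale * y + shift so the source's closest and farthest
    ankle heights land on the target's. A simplified stand-in for the
    per-frame scale/translation described in the paper.
    """
    scale = (tgt_far - tgt_close) / (src_far - src_close)
    shift = tgt_close - scale * src_close
    return lambda y: scale * y + shift

# Source ankles span y = 400..700 in its frame; target ankles span 500..650.
to_target = ankle_linear_map(400, 700, 500, 650)
print(to_target(400))  # 500.0 (closest position maps to closest)
print(to_target(700))  # 650.0 (farthest maps to farthest)
print(to_target(550))  # 575.0 (midpoint maps to midpoint)
```

After this mapping, a tall source and a short target produce stick figures of the same on-screen size, so the generator never sees out-of-distribution poses.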

Step 4. Training

The training method is based on the pix2pixHD algorithm, developed in 2017 by researchers from NVIDIA and UC Berkeley. How does this algorithm work? The generator network G plays against multi-scale discriminators D = (D1, D2, D3) that try to distinguish real images from generator-made images. So the generator must produce images of such high quality that the discriminators cannot recognize them as fake. Both networks are trained at the same time so that each improves the other. The base pix2pixHD objective has the form

min_G max_D L_GAN(G, D) + λ L_VGG(G(x), y)

Here, L_GAN(G, D) is the adversarial loss, and L_VGG(G(x), y) is the perceptual reconstruction loss.
For this framework, the researchers modified the original single-image pix2pixHD setup so that adjacent frames are enforced to be temporally coherent (see Figure 4). The system predicts two consecutive frames: the first, G(x_{t-1}), is conditioned on its corresponding pose stick figure x_{t-1} and a zero image, and the second, G(x_t), on its corresponding pose stick figure x_t and the first output. So now the discriminator has to judge both the realism of the individual images and the temporal coherence between the fake and real frame sequences. The GAN objective then looks like:

L_smooth(G, D) = E[log D(x_{t-1}, x_t, y_{t-1}, y_t)] + E[log(1 - D(x_{t-1}, x_t, G(x_{t-1}), G(x_t)))]
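The recurrence at generation time is easy to sketch: each frame is conditioned on the frame generated just before it, and the first frame on a zero image. The generator here is a toy callable, not the real network:

```python
def generate_sequence(poses, G, zero_image=0):
    """Condition each predicted frame on the previous output.

    poses: pose stick figures x_1 .. x_T
    G:     callable (pose, previous_output) -> frame; a stand-in for the
           temporally conditioned pix2pixHD generator.
    """
    frames, prev = [], zero_image
    for pose in poses:
        prev = G(pose, prev)  # first step uses the zero image as "previous"
        frames.append(prev)
    return frames

# Toy generator that just accumulates values, to make the recurrence visible.
out = generate_sequence([1, 2, 3], G=lambda pose, prev: pose + prev)
print(out)  # [1, 3, 6]
```

Because every output depends on the one before it, jitter in one frame is discouraged from contradicting the next, which is exactly what the temporal discriminator penalizes.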
To make the faces in generated videos more realistic, the researchers add another GAN to the system. After the generator G produces the full image, they take a small image section around the face region, together with the input pose stick figure for the same section (x_F), and feed them to another generator G_f. This gives an output residual r = G_f(x_F, G(x)_F), and the final output is the combination of this residual and the original face region. The discriminator then tries to discern real pairs (the face region of the input pose stick figure and the face of the target-person image) from their fake counterparts. As for the network architecture, they use different models for different stages of the process:
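Composing the residual back into the full frame can be sketched with NumPy. The face-box coordinates and image sizes are made up for illustration:

```python
import numpy as np

def add_face_residual(full_image, residual, box):
    """Add the face GAN's predicted residual r to the face region of the
    generated image: final_face = G(x)_F + r.

    box: (top, left, height, width) of the face section; illustrative values.
    """
    top, left, h, w = box
    out = full_image.copy()
    out[top:top + h, left:left + w] += residual
    return out

full = np.zeros((8, 8))           # toy "generated frame"
residual = 0.5 * np.ones((2, 2))  # toy face residual from G_f
final = add_face_residual(full, residual, box=(1, 3, 2, 2))
print(final.sum())  # 2.0: only the 2x2 face region changed
```

Predicting a residual rather than a whole new face means the face GAN only has to correct details the main generator got wrong, which is a smaller, easier target.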
  • The state-of-the-art pose detector OpenPose for the definition of pose keypoints (body, face, and hands);
  • The pix2pixHD model of Wang et al. for image translation;
  • The global generator of pix2pixHD for face-residual prediction;
  • A 70x70 PatchGAN discriminator for the face discriminator;
  • The LSGAN objective (as in pix2pixHD) for both the full-image and face GANs.
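For reference, the least-squares (LSGAN) losses that these discriminators and generators minimize can be sketched in NumPy; `d_real` and `d_fake` stand for discriminator scores on real and generated images:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator pushes real scores toward 1 and fake scores toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator pushes the discriminator's fake scores toward 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# A perfect discriminator (real = 1, fake = 0) pays zero loss,
# while the generator then pays the maximum penalty on those fakes.
print(lsgan_d_loss(np.array([1.0]), np.array([0.0])))  # 0.0
print(lsgan_g_loss(np.array([0.0])))                   # 0.5
```

Compared to the standard log-loss GAN, the least-squares form gives smoother gradients when the discriminator is confident, which helps stabilize training of high-resolution generators like pix2pixHD.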
More about the studies and data calculations can be found here.

Step 5. Pros and Cons

Overall, this model can produce quite long videos of acceptable quality that preserve many details. However, the outputs have some drawbacks, for several reasons. The quality of the results depends on the input pose stick figures: incorrect keypoints caused by noise lead to errors in the inputs, even with the temporal-smoothing setup and temporal coherence. The most common errors visible in the transfer videos include differences in movement speed; in a transfer of dance moves from the source video to the target, we can still see some shakiness and jittering. The researchers explain this by the unique body structure of each subject: they believe motion depends on identity, which persists through the transfer process. The system also does not take into account different limb lengths or camera positions and angles, and these differences influence the final result as well. Finally, 2D coordinates have their own drawbacks: they limit motion retargeting between subjects, whereas 3D could align locations far better. To address these limitations, the system needs improvements in temporally coherent video generation and in motion representations. But even with these open questions, the approach allows creating compelling videos from a wide variety of inputs.

To get a broader sense of what artificial intelligence can do today, you can watch the short movie “Zone Out”. It is quite weird, but from beginning to end it was made by Benjamin, which is basically an AI. Benjamin spent 48 hours making the movie, using thousands of frames from old films and green-screen footage of professional actors. The idea and its realization belong to director Oscar Sharp and AI researcher Ross Goodwin. Read more about “Zone Out” here.