Researchers from Yuncong Technology (CloudWalk) and Shanghai Jiao Tong University recently proposed a new framework, DenseBody, which directly estimates 3D human pose and shape from a single color photo. The study designs an efficient 3D human pose and shape representation that requires no intermediate representations or tasks, enabling end-to-end generation of a 3D human mesh from a single image.
Estimating the pose and shape of a human body from a single image has been a research problem underlying many applications for years. Researchers have proposed different approaches that address the problem partially or as a whole. This article introduces an end-to-end approach that uses a CNN to reconstruct full 3D human geometry directly from a single color image.
Early research in this field used iterative optimization to estimate human pose and shape from 2D images, generally by repeatedly refining an estimated 3D human model until it fits some 2D observations, such as 2D keypoints or silhouettes.
With the rise of deep learning, many studies have attempted to solve this problem end to end with CNNs, and some have achieved better accuracy and faster running speed. However, directly predicting the full body mesh with a CNN is not straightforward, since training such a model requires a large amount of 3D annotated data.
Most recent studies incorporate a parametric human body model, such as SMPL, and predict the parameters of that model instead. Some of them [22,27] improve performance with the help of joint or segmentation outputs. This model-based 3D representation confines the 3D human shape to a low-dimensional linear space, which makes it easier for CNN models to learn, but its performance may be suboptimal because of the limitations of linear models.
Another study proposes a volumetric representation for estimating human body shape and shows that it has certain advantages; in that approach, the predicted 3D joint positions are output as intermediate results.
While there are multiple options for 3D representations, most recent CNN-based methods rely on some intermediate 2D representation and loss function to guide the training process.
In these methods, the problem of mapping a single RGB image to a 3D human mesh is decomposed into two steps: first obtain some type of 2D representation, such as joint heatmaps, masks or 2D segmentation; then predict the 3D representation based on these intermediate results [16,5]. The intermediate representations chosen by these studies, and the quality of the outputs of the neural networks that solve these subtasks, greatly influence their final performance.
This study by Yuncong Technology presents an efficient method to directly obtain a complete 3D human body mesh from a single RGB image.
This method differs from other studies in two main respects: first, the proposed network does not incorporate any parametric human model, so its output is not restricted to a low-dimensional space; second, the prediction process is a single step that does not rely on intermediate tasks or results to predict the 3D human body. The study evaluates the method on multiple 3D human datasets and compares it with methods from previous studies.
The evaluation results show that the method outperforms these previous methods and runs faster.
The main contributions of this study are as follows:
An end-to-end method is proposed to directly derive a 3D human mesh from a single color image. To this end, the researchers developed a new 3D human body mesh representation that encodes the pose and shape information of the complete human body as a 2D image, without relying on any parametric human model.
It reduces the complexity of 3D human body estimation from two steps to one. This study trains an encoder-decoder network to directly map input RGB images to 3D representations without solving any intermediate tasks such as segmentation or 2D pose estimation.
Multiple experiments are conducted to evaluate the performance of the proposed method and compare it with state-of-the-art methods. The results show that, on multiple 3D datasets, the method achieves significant performance gains and runs faster.
Figure 1: Example results.
Paper: DenseBody: Directly Regressing Dense 3D Human Pose and Shape From a Single Color Image
Paper address: https://arxiv.org/pdf/1903.10153.pdf
Abstract: Deriving 3D human pose and shape from 2D images is a difficult problem due to the high complexity and flexibility of the human body and the relatively small amount of 3D annotated data. Previous approaches largely relied on predicting intermediate results, such as body segmentation, 2D/3D joints, and contour masks, to decompose the problem into multiple subtasks, either exploiting more 2D labels or incorporating parametric human body models in a low-dimensional linear space to simplify the problem.
In this paper, we propose to use Convolutional Neural Networks (CNNs) to derive 3D human body meshes directly from a single color image. We design an efficient 3D human pose and shape representation that can be learned by neural networks with encoder-decoder architecture. Experiments show that our model achieves state-of-the-art performance on multiple 3D human datasets while running faster. The datasets include Human3.6m, SURREAL and UP-3D.
3. The method proposed in this paper
3.1 3D Human Representation
Previous studies typically represented 3D human geometry with deformable models, such as SCAPE and SMPL, or with voxels. The method proposed in this paper uses a UV position map to represent 3D human body geometry, which has the following advantages: first, it preserves the spatial adjacency information between points, which is more critical for accurate reconstruction than what a one-dimensional vector representation offers; second, it has lower dimensionality than a 3D representation such as voxels, where the large number of points that do not lie on the surface are of little use; finally, it is a 2D representation, so off-the-shelf CNNs such as ResNet and VGG can be used directly, taking advantage of the latest advances in computer vision.
In the field of human reconstruction, UV maps are often used to render texture maps as a way of describing the surface of an object. In this paper, we instead use UV maps to represent the geometric features of the human body surface. The 3D annotations provided by most 3D human datasets are based on the SMPL model, which itself provides a built-in UV map that divides the human body into 10 regions.
DensePose provides another way of human body segmentation, and provides a UV map that divides the human body into 24 areas. We experimented with two segmentation methods, and the UV map of SMPL obtained better experimental results. Therefore, in our method, we adopt this UV map to store the 3D position information of the whole body surface.
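To make the idea of storing 3D positions in a UV map concrete, the following is a minimal sketch rather than the paper's implementation. It assumes each vertex comes with a UV coordinate in [0, 1] and simply writes the vertex position into its nearest texel; a real pipeline would rasterize the mesh triangles to fill the texels in between. The function names and the toy mesh are hypothetical.

```python
import numpy as np

def build_uv_position_map(vertices, uv_coords, resolution=256):
    """Scatter per-vertex 3D positions into a (resolution, resolution, 3) map.

    vertices:  (N, 3) array of 3D vertex positions.
    uv_coords: (N, 2) array of UV coordinates in [0, 1] (assumed input).
    Simplification: nearest-texel scatter, no triangle rasterization.
    """
    pos_map = np.zeros((resolution, resolution, 3), dtype=np.float32)
    filled = np.zeros((resolution, resolution), dtype=bool)
    # Convert continuous UV coordinates to integer texel indices.
    texels = np.rint(uv_coords * (resolution - 1)).astype(int)
    texels = np.clip(texels, 0, resolution - 1)
    cols, rows = texels[:, 0], texels[:, 1]
    pos_map[rows, cols] = vertices
    filled[rows, cols] = True
    return pos_map, filled

def sample_vertices(pos_map, uv_coords):
    """Read 3D positions back from the map at the given UV coordinates."""
    resolution = pos_map.shape[0]
    texels = np.rint(uv_coords * (resolution - 1)).astype(int)
    texels = np.clip(texels, 0, resolution - 1)
    return pos_map[texels[:, 1], texels[:, 0]]

# Hypothetical toy "mesh": four vertices with distinct UV coordinates.
toy_vertices = np.array([[0.0, 0.0, 0.5],
                         [1.0, 0.0, 0.25],
                         [0.0, 1.0, 0.75],
                         [0.5, 0.5, 1.0]])
toy_uv = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

pos_map, filled = build_uv_position_map(toy_vertices, toy_uv)
recovered = sample_vertices(pos_map, toy_uv)
print(pos_map.shape, int(filled.sum()))
```

Because the map is an ordinary three-channel image, neighboring texels correspond to neighboring points on the body surface, which is the adjacency property the text refers to.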
Figure 2 shows the errors introduced by vertex warping and resampling of UV position maps at different resolutions. Considering that the surface error (whole-body accuracy error) and joint error of current state-of-the-art methods are on the order of tens of millimeters, we choose a resolution of 256, which introduces a negligible whole-body error of about 1 mm. In addition, a UV map at resolution 256 can represent more than 60,000 vertices, far more than the number of vertices in SMPL.
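The vertex-count claim can be checked with simple arithmetic. The sketch below treats every texel as a potential vertex, which is an upper bound because texels outside the body's UV regions carry no surface point. The SMPL vertex count of 6,890 is the size of the standard SMPL mesh and is not stated in this article; the helper name is hypothetical.

```python
SMPL_VERTEX_COUNT = 6890  # standard SMPL mesh size (not from this article)

def uv_map_capacity(resolution):
    """Upper bound on the number of surface points a square UV map can store."""
    return resolution * resolution

for resolution in (64, 128, 256):
    capacity = uv_map_capacity(resolution)
    ratio = capacity / SMPL_VERTEX_COUNT
    print(f"{resolution}x{resolution}: {capacity} texels, {ratio:.2f}x SMPL vertices")
```

At resolution 256 the map has 65,536 texels, which is consistent with the "more than 60,000 vertices" figure above.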
Figure 2: Full-body accuracy error and joint accuracy error due to warping and resampling at different UV position map resolutions, in millimeters.
3.2 Network and Loss Function
Our network adopts an encoder-decoder structure: the input is a 256×256 color image, and the output is a 256×256 UV position map. The encoder uses ResNet-18, and the decoder consists of four layers of upsampling and convolutional layers.
Unlike previous approaches that require careful design and fusion of multiple different loss functions, we supervise the predicted UV position map directly and define the loss function on it (see Table 1). To balance the influence of different body regions on training, we use a weight mask to adjust the loss function. In addition, points near the joints are given extra weight.
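A weighted loss of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: the per-texel error is taken to be an absolute (L1) difference, and the region weights, joint boost factor, and neighborhood radius are all hypothetical values chosen for the example.

```python
import numpy as np

def make_weight_mask(region_ids, region_weights, joint_texels,
                     joint_boost=2.0, radius=1):
    """Build a per-texel weight mask (all parameters are illustrative).

    region_ids:     (H, W) integer map of body-region labels, 0 = background.
    region_weights: 1D array, weight for each region label.
    joint_texels:   list of (row, col) texels where joints are located.
    """
    mask = region_weights[region_ids].astype(np.float32)
    height, width = mask.shape
    for row, col in joint_texels:
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        mask[r0:r1, c0:c1] *= joint_boost  # emphasize texels near the joint
    return mask

def weighted_position_loss(pred_map, gt_map, weight_mask):
    """Weighted mean absolute error between two (H, W, 3) UV position maps."""
    per_texel = np.abs(pred_map - gt_map).sum(axis=-1)
    return float((weight_mask * per_texel).sum() / weight_mask.sum())

# Toy 8x8 map: a 4x4 body region surrounded by zero-weight background.
region_ids = np.zeros((8, 8), dtype=int)
region_ids[2:6, 2:6] = 1
region_weights = np.array([0.0, 1.0])
weight_mask = make_weight_mask(region_ids, region_weights, joint_texels=[(3, 3)])

gt_map = np.zeros((8, 8, 3))
pred_near_joint = gt_map.copy()
pred_near_joint[3, 3, 0] = 1.0    # unit error at the joint texel
pred_far_from_joint = gt_map.copy()
pred_far_from_joint[5, 5, 0] = 1.0  # same error away from the joint
pred_background = gt_map.copy()
pred_background[0, 0, 0] = 1.0    # error in the background only

loss_near = weighted_position_loss(pred_near_joint, gt_map, weight_mask)
loss_far = weighted_position_loss(pred_far_from_joint, gt_map, weight_mask)
loss_bg = weighted_position_loss(pred_background, gt_map, weight_mask)
print(loss_near, loss_far, loss_bg)
```

In this toy setup the same unit error costs twice as much near the joint as elsewhere on the body, and errors in the background do not contribute at all.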
Table 1: Loss functions employed in different methods.
Figure 3: Framework of different approaches compared to DenseBody.
3.3 Implementation Details
All images are first aligned so that the person is centered. They are then cropped and scaled to 256×256 so that the tight bounding box sits at a moderate distance from the image edges. Images undergo random translation, rotation, flipping, and color jittering. Note that most of these data augmentation operations are not trivial, because the corresponding ground-truth data must be transformed in the same way.
When the randomly transformed human body extends beyond the 256×256 canvas, the augmentation is treated as invalid. We use orthographic projection to obtain the xy coordinates of the position map, which avoids propagating errors from the depth information. The depth values of the ground-truth data are scaled appropriately to match the range of the sigmoid output.
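The following sketch illustrates why augmentation has to touch the ground truth as well. It assumes that, under orthographic projection, the xy channels of the position map are image-plane pixel coordinates, so an in-plane rotation and translation of the image applies to them directly while depth is unchanged. The validity check and the linear depth scaling into [0, 1] are assumptions made for illustration; the article does not give the exact rules, and all function names are hypothetical.

```python
import numpy as np

CANVAS = 256  # image side length in pixels

def rotate_translate_xy(points_xy, angle_deg, shift_xy, canvas=CANVAS):
    """Rotate (N, 2) pixel coordinates about the canvas center, then shift."""
    theta = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta), np.cos(theta)]])
    center = np.array([canvas / 2.0, canvas / 2.0])
    return (points_xy - center) @ rotation.T + center + np.asarray(shift_xy)

def augmentation_is_valid(points_xy, canvas=CANVAS):
    """Reject the augmentation if any body point leaves the canvas."""
    return bool(np.all((points_xy >= 0) & (points_xy <= canvas - 1)))

def scale_depth(depth, depth_min, depth_max):
    """Linearly map depth into [0, 1], the output range of a sigmoid."""
    return (depth - depth_min) / (depth_max - depth_min)

# Two hypothetical ground-truth body points in pixel coordinates.
gt_xy = np.array([[100.0, 100.0], [156.0, 156.0]])

rotated = rotate_translate_xy(gt_xy, angle_deg=90.0, shift_xy=(0.0, 0.0))
shifted_out = rotate_translate_xy(gt_xy, angle_deg=0.0, shift_xy=(200.0, 0.0))

print(augmentation_is_valid(rotated))      # points stay inside the canvas
print(augmentation_is_valid(shifted_out))  # points leave the canvas
print(scale_depth(np.array([-1.0, 0.0, 1.0]), -1.0, 1.0))
```

The same rotation and shift would be applied to the input image itself; an augmentation that fails the validity check would simply be redrawn or skipped.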
We use the Adam optimizer with a learning rate of 1e-4 and a mini-batch size of 64, and train until convergence (about 20 epochs). Training takes about 20 hours on a single GTX 1080Ti GPU. The code is implemented in PyTorch.
Table 2: Experimental results on SURREAL, full body accuracy error and joint accuracy error in millimeters.
Table 4: Experimental results on UP-3D. Whole body accuracy error and joint accuracy error are in millimeters.
Table 5: Forward runtime on a single GTX 1080Ti, in milliseconds. Entries marked with 1 were run on a TITAN X GPU.