Improving Erase.bg with Synthetic Data

Enhancing deep learning-based image background removal with automated synthetic data generation in Blender

Hemant Singh
Building Fynd

--

Introduction

In this blog, we’ll discuss the challenges we faced and the approaches we took to improve our AI-based background removal tool, Erase.bg, currently used by over 80k users per day 🚀 for image editing, e-commerce product photography, passport photo creation, and more. To learn more about the basics of this tool, read this blog by my colleague.

For such a tool to be able to perform reliable and efficient background removal for complex and diverse scenarios, it needs to be trained on huge amounts of high-quality image segmentation datasets. Preparing such huge datasets with high-quality segmentation masks is a very time-consuming task as it involves manually annotating the images at a pixel level. It can take over 20 minutes of human annotation for a single image, and even more for complex foregrounds and backgrounds.

To deal with this issue of high-quality data scarcity, we introduced a novel large-scale synthetic data generation pipeline, with the help of the open-source 3D creation suite, Blender, and its integrated Python scripting capabilities. Using this synthetic data creation pipeline, we can generate huge amounts of high-quality image segmentation data that can be readily utilized in the model training process. Let’s dive into how we used Blender to create more than 60,000 high-quality images along with their segmentation masks, and how this data has helped improve the performance of Erase.bg.

Why Blender x Python?

So why did we choose Blender with Python scripting for this task?

Blender is already an important and established player in the world of 3D modeling and rendering. Not only can it produce highly realistic renders, but because it is open-source, it also comes with a high level of flexibility.

One of the most useful features of Blender is its ability to work headlessly through command-line rendering. This means we can unleash its rendering power on servers to render massive amounts of data without needing a graphical interface. Rendering a high-resolution frame along with its segmentation map can take 30 seconds to a minute, so being able to run our program on servers is essential for speeding up both our experimentation cycles and the generation process.
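For instance, a driver script on the server can invoke Blender in background mode once per scene file. Here is a minimal sketch, where the scene files, the automation script generate_samples.py, and the output directory are hypothetical placeholders rather than our actual setup:

```python
# Minimal sketch: driving headless Blender renders from a server-side script.
import subprocess

SCENES = ["scene_001.blend", "scene_002.blend"]  # hypothetical template files

for scene in SCENES:
    # "--background" runs Blender without a GUI; "--python" executes our
    # automation script inside that Blender instance. Arguments after "--"
    # are forwarded to the script via sys.argv.
    subprocess.run(
        [
            "blender",
            "--background", scene,
            "--python", "generate_samples.py",   # hypothetical automation script
            "--", "--output-dir", "/data/renders",
        ],
        check=True,
    )
```

Because each invocation is independent, such a loop is easy to parallelize across multiple server instances.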

Blender also has a seamless integration with Python, which opens up a world of possibilities for automation. With Python scripting, we can automate most aspects of the 3D modeling and rendering process, from setting up scenes to adjusting parameters and everything in between. This saves us countless hours of manual tinkering and transforms Blender into a true powerhouse for diverse synthetic data generation.
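To give a flavor of what this scripting looks like, here is a small bpy sketch that varies the lighting, sets the render resolution, and renders a frame. The object name "Sun" and all values are assumptions for illustration, not our production configuration:

```python
# Illustrative bpy snippet (run inside Blender): automating basic scene setup.
import bpy
import random

scene = bpy.context.scene

# Randomize the sun lamp's strength and angle to vary illumination,
# assuming the template contains a light object named "Sun".
sun = bpy.data.objects["Sun"]
sun.data.energy = random.uniform(2.0, 6.0)
sun.rotation_euler[0] = random.uniform(0.6, 1.2)

# Use Cycles at a high resolution for realistic output.
scene.render.engine = "CYCLES"
scene.render.resolution_x = 2048
scene.render.resolution_y = 2048

# Render the current frame to disk.
scene.render.filepath = "/tmp/render_0001.png"
bpy.ops.render.render(write_still=True)
```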

Overall Bulk Data Generation Stages

Our ML models typically have varied data requirements that Blender can serve. To ease the process of creating these custom pipelines, we follow the generalized workflow shown below.

Stages of Blender-based bulk data generation for training EraseBG

Creating the Virtual Environment

The most important part is setting up the environment, since we aim to recreate real-life situations as closely as possible. We take our time placing the 3D objects, adding textures to bring them to life, and setting up lighting to make the scene realistic. Finally, we render once everything is where it should be. Consider the images below, which show the Blender workspace as well as the individual steps involved in creating the virtual environment.

Blender workspace, where we gather and organize all the 3D assets
3D Modelling: detailed 3D models are used for the subjects (humans), whereas simpler 3D models are used for background objects, ensuring the generation of precise data
Lighting and Shadows: the template file includes a universal lighting setup that mimics real-world illuminated environments, enhancing realism by incorporating authentic shadows
Texturing: detailed textures are applied to the subjects (humans), while background objects utilize textures with lower to medium detail
Rendering: the final rendering is executed once all 3D objects are correctly positioned, achieving a realistic scene appearance based on their placement

Once we understand how a virtual environment is built, we can start thinking about dealing with multiple 3D objects in a single environment. Building a template file is quite useful here, as it makes it easy to assemble and render many 3D objects at once. We maintain a large library of human models and poses, which lets us reuse the same background while swapping human models to create different scenes.
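A hedged sketch of how a human model might be appended from such a library into the template scene with bpy; the library path, collection layout, and random placement are illustrative:

```python
# Sketch: pulling a human model from a hypothetical asset library into the
# current template scene, so the same background can be reused across renders.
import bpy
import random

LIBRARY_PATH = "/assets/humans.blend"   # hypothetical library of human models

# Load object data blocks from the library without opening it in the UI.
with bpy.data.libraries.load(LIBRARY_PATH, link=False) as (data_from, data_to):
    # Pick one human model at random from whatever the library contains.
    data_to.objects = [random.choice(data_from.objects)]

# Link the appended object into the active scene and drop it into the
# area of the template reserved for human subjects.
for obj in data_to.objects:
    if obj is not None:
        bpy.context.scene.collection.objects.link(obj)
        obj.location = (random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0), 0.0)
```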

We start by setting up a virtual scene in Blender that looks like real life and then placing people in different positions within it. Then, we add random objects to the background, like plants, walls, fences, and furniture. This careful setup helps the model learn the difference between the people and the diverse backgrounds against which they appear. The figure below shows an example of such a template.

3D Objects and Camera Placement: in the template file, objects are thoughtfully positioned, ensuring humans have their designated area and the background is in its designated place. The camera is set up to capture all elements effectively

We also build a huge dataset of varied poses, meticulously crafted for all human models to introduce diversity and precision into the final render. A crucial aspect of this process is ensuring that the foot placement of each pose remains grounded, maintaining realism and stability throughout the generated poses. This attention to detail not only enhances the variety of the rendered scenes but also ensures accuracy in depicting human movement and interaction with the environment. The demo video below shows how different poses change the underlying synthetic 3D human model.

Pose data: a vast library is curated with a range of poses in every direction, enriching the 3D model with lifelike characteristics
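To show how a stored pose can be applied programmatically, here is an illustrative bpy sketch; the armature and action names are hypothetical:

```python
# Sketch: applying a randomly chosen pose (stored as an action) to a human rig.
import bpy
import random

armature = bpy.data.objects["Human_Armature"]      # hypothetical rig name
poses = [a for a in bpy.data.actions if a.name.startswith("Pose_")]

# Assign one pose action to the rig for this render.
if armature.animation_data is None:
    armature.animation_data_create()
armature.animation_data.action = random.choice(poses)

# Keep the feet grounded by snapping the rig back onto the floor plane (z = 0).
armature.location.z = 0.0
```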

A camera setup is implemented where the camera smoothly pans from one end to the other, ensuring that the focus remains on the human subjects throughout the movement. This dynamic process significantly contributes to enriching the dataset by introducing additional variation. By capturing the human subjects from different angles and perspectives, we add depth and realism to the rendered scenes. The demo video below shows the same environment from different camera viewpoints.

Camera movement: the template file includes a predefined camera angle that captures the subjects from various perspectives, contributing to a diverse dataset
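A minimal sketch of how such a pan could be keyframed with bpy; the frame range and coordinates are placeholders:

```python
# Sketch: keyframing a simple camera pan so each frame captures the subjects
# from a slightly different viewpoint.
import bpy

cam = bpy.data.objects["Camera"]
scene = bpy.context.scene
scene.frame_start, scene.frame_end = 1, 48

# Start position at frame 1.
cam.location = (-4.0, -8.0, 1.7)
cam.keyframe_insert(data_path="location", frame=1)

# End position at frame 48; Blender interpolates the pan in between.
cam.location = (4.0, -8.0, 1.7)
cam.keyframe_insert(data_path="location", frame=48)

# A "Track To" constraint (set up in the template) keeps the camera aimed
# at the human subjects while it moves.
```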

The outputs from different stages of the rendering process are depicted in the images below, where at first only the 3D objects are shown without any texture or HDRIs. Then, step by step, we add the texture details as well as the HDRI, which gradually adds more realism to the samples.

Scene with complex objects: incorporating complex elements like nets, plants, and fences enriches the variety in the background and thus forces the model to learn more complex scenarios
The final scene is rendered without HDRI solely to demonstrate the complexity of the template file
Final render with HDRI: the scene is finalized by adding an HDRI, which envelops objects located farther from the main scene, bridging the gap between the complex elements and the background

Following this process, we can generate diverse scenes with a variety of positioning and arrangement of the 3D human models. Here are a few different synthetic samples generated using this approach.

Variety of 3D humans with variations in animation, clothing, background assets, and lighting

Rendering Process

Once we can generate a diverse set of images for training the Erase.bg model, we also need to extract high-quality alpha masks to serve as the ground-truth segmentation. This is achieved by rendering with separate layers, keeping the background and foreground regions apart so that the appropriate alpha masks can be obtained.

Through Python scripting, we automate the process of loading an environment, loading a human model, applying a pose to it, and rendering the separate layers, all in around 30 seconds on a single-GPU (RTX 3090) system. Here’s the rendered output for each layer.

Rendering with separate layers: this method ensures the clear separation of background and foreground data, aiding the machine in identifying the subject and effectively removing the background
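A simplified sketch of one way to obtain the composite image and a subject-only alpha mask, assuming the template groups objects into "Humans" and "Background" collections; the names and paths are illustrative, and the production pipeline uses a more elaborate layered setup:

```python
# Sketch: two render passes per sample; the alpha channel of the second pass
# acts as the ground-truth mask for the composite from the first pass.
import bpy

scene = bpy.context.scene

# Pass 1: full composite image with the background visible.
bpy.data.collections["Background"].hide_render = False
scene.render.filepath = "/data/renders/0001_composite.png"
bpy.ops.render.render(write_still=True)

# Pass 2: hide the background and render the subjects over a transparent film.
bpy.data.collections["Background"].hide_render = True
scene.render.film_transparent = True
scene.render.image_settings.file_format = "PNG"
scene.render.image_settings.color_mode = "RGBA"   # keep the alpha channel
scene.render.filepath = "/data/renders/0001_subject_rgba.png"
bpy.ops.render.render(write_still=True)
```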

Segmentation Data for Deep Learning

Final render with layers: the final rendering includes separate layers for the subject and background, ensuring distinct segmentation

After the layered rendering is complete, we obtain the composite images and alpha masks, which form the paired training samples that serve as the backbone for training the new Erase.bg model.
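As an illustration of how these pairs can be consumed downstream, here is a minimal PyTorch-style dataset sketch; the directory layout, file naming, and framework choice are assumptions for the example, not a description of our training code:

```python
# Illustrative dataset pairing each rendered composite with its alpha mask.
import os
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class SyntheticMattingDataset(Dataset):
    def __init__(self, root):
        self.root = root
        self.ids = sorted(
            f.split("_composite")[0]
            for f in os.listdir(root) if f.endswith("_composite.png")
        )

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sample_id = self.ids[idx]
        image = Image.open(
            os.path.join(self.root, f"{sample_id}_composite.png")
        ).convert("RGB")
        # The mask is the alpha channel of the subject-only render.
        rgba = Image.open(os.path.join(self.root, f"{sample_id}_subject_rgba.png"))
        mask = rgba.getchannel("A")
        return TF.to_tensor(image), TF.to_tensor(mask)
```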

Blender Python Scripting Based Automation Stages

Using the above workflow, we implement render pipelines in stages; these stages provide a solid framework on which to build custom data generation pipelines.

Training the Erase.bg Model

This segmentation data is then used to train Erase.bg. The model learns to understand and differentiate between subjects and backgrounds, improving its ability to perform seamless background removal.

Generic segmentation model training based on the synthetic dataset.

The data generated via the Blender-based approach can in fact be used to train any generic segmentation model.

Real-Life Scenario Testing

Once the training process is complete, the Erase.bg model undergoes real-life testing. This final phase ensures that the model can adeptly handle unforeseen challenges and perform seamlessly in practical applications. Below are some examples of our output.

Testing results on a group of real human images, using the trained model

From the comparison above, it is quite clear that the inclusion of synthetic data-based training can improve background removal for humans. The new model is better at understanding human form and picks up less background in noisy scenes.

Conclusion

With the help of the proposed synthetic data generation pipeline, we have created over 60,000 samples for training and improving our Erase.bg model. Without such automatic bulk data generation, it would have taken 20,000 work-hours to create such data. The Blender-based approach achieves this in 500 GPU-hours (on a single-GPU RTX 3090 machine), which is a 40x improvement in terms of data generation time. Moreover, the high-quality training data generated through the Blender-based pipeline ensures that the newly trained Erase.bg model is capable of much better background removal in complex and diverse sets of images.

For a better image background removal experience, try Erase.bg or PixelBin.io.

Trying out background removal on Erase.bg

The PixelBin team at Fynd is constantly working on impactful ML problems and actively hiring interns and full-time ML Researchers. Send in your applications here. To learn more or to send us your feedback, please write to research@pixelbin.io. Explore our ongoing research projects at fynd.com/research.

Special thanks to our team: Arnab Mishra, Hemant Singh, Calvin D’Souza, Mihir Botle, and Rahul Deora.

Some Useful YouTube Links

  1. On Blender composition and render layers: Introduction to Compositing in Blender
  2. On Blender render passes: Render Passes in Blender 2.81 — what are they and why even use them?
  3. On Shadow Catcher: The Best Way To Work With Blenders Shadow Catcher Pass
  4. On Blender layer workflow: Blender and Fusion Multi-Layer Workflow — Render Passes and Compositing for Beginners
  5. On Blender composition workflow: Exr-IO Blender Photoshop Compositing Tutorial
  6. On Blender collections and ray visibility: Blender 2.9 for Production — 03 RenderLayers, Collections & Ray Visibility
