Improving Erase.bg with Synthetic Data
Enhancing deep learning-based image background removal with automated synthetic data generation in Blender
Introduction
In this blog, we’ll discuss our challenges and approaches to improving our AI-based background removal tool, Erase.bg, currently used by over 80k users per day 🚀 for image editing, e-commerce product photography, passport photo creation, and more. To learn more about the basics of this tool, read this blog by my colleague.
For such a tool to perform reliable and efficient background removal across complex and diverse scenarios, it needs to be trained on large volumes of high-quality image segmentation data. Preparing such datasets is very time-consuming, as it involves manually annotating images at the pixel level. A single image can take over 20 minutes of human annotation, and even longer for complex foregrounds and backgrounds.
To deal with this issue of high-quality data scarcity, we introduced a novel large-scale synthetic data generation pipeline, with the help of the open-source 3D creation suite, Blender, and its integrated Python scripting capabilities. Using this synthetic data creation pipeline, we can generate huge amounts of high-quality image segmentation data that can be readily utilized in the model training process. Let’s dive into how we used Blender to create more than 60,000 high-quality images along with their segmentation masks, and how this data has helped improve the performance of Erase.bg.
Why Blender x Python?
So why did we choose Blender with Python scripting for this task?
Blender is already an important and established player in the world of 3D modeling and rendering. Not only can it produce highly realistic renders, but because it is open-source, it also comes with a high level of flexibility.
One of the most useful features of Blender is its ability to work headlessly through command-line rendering. This means we can unleash its rendering power on servers to produce massive amounts of data without needing a graphical interface. Rendering a single high-resolution frame along with its segmentation map can take 30 seconds to 1 minute, so running our pipeline on servers is essential to speed up both our experimentation cycles and the generation process.
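As an illustration, a headless render can be driven entirely from the command line. The sketch below (plain Python, with illustrative file names) builds such a Blender invocation; `--background`, `--python`, `--render-output`, and `--render-frame` are standard Blender CLI flags, but the scene and script paths here are assumptions, not our actual project layout.

```python
def build_render_cmd(blend_file, script_path, output_dir, frame=1):
    """Build a headless Blender render command as an argument list.

    Paths are illustrative; Blender processes options in order, so the
    output path must come before the frame to render.
    """
    return [
        "blender",
        "--background", blend_file,       # run without the GUI
        "--python", script_path,          # scene-setup / automation script
        "--render-output", f"{output_dir}/frame_####",
        "--render-frame", str(frame),     # render a single frame
    ]

cmd = build_render_cmd("scene.blend", "generate.py", "/tmp/renders")
print(" ".join(cmd))
# To actually launch the render: subprocess.run(cmd, check=True)
```

On a render server, a job scheduler can fan such commands out across machines, one frame (or frame range) per process.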
Blender also has a seamless integration with Python, which opens up a world of possibilities for automation. With Python scripting, we can automate most aspects of the 3D modeling and rendering process, from setting up scenes to adjusting parameters and everything in between. This saves us countless hours of manual tinkering and transforms Blender into a true powerhouse for diverse synthetic data generation.
Overall Bulk Data Generation Stages
Our ML models typically have varied requirements for the data Blender can provide. To ease the process of creating these custom pipelines, we follow the generalized workflow shown below.
Creating the Virtual Environment
The most important step is setting up the environment, since we aim to recreate real-life situations as faithfully as possible. We take our time placing the 3D objects, adding textures to bring them to life, and setting up lighting for maximum realism. Finally, once everything is where it should be, we do the final render. Consider the images below showing the Blender workspace as well as the individual steps involved in creating the virtual environment.
Once we understand how a virtual environment is built, we can start thinking about handling multiple 3D objects in a single environment. Building a template file is often quite useful here, as it makes it easy to assemble and render many 3D objects at once. We maintain a large library of human models and poses, so the same background can be reused while swapping in different human models to create different scenes.
We start by setting up a virtual scene in Blender that looks like real life and then placing people at different positions in the scene. Then, we add random objects to the background, like plants, walls, fences, and furniture. This careful setup helps the model learn to distinguish people from the diverse backgrounds they appear against. The figure below shows an example of such a template.
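To give a flavour of what a template-driven setup script might sample, here is a minimal pure-Python sketch. The asset names, counts, and placement bounds are all illustrative, not our actual library; inside Blender, each sampled entry would be mapped to `bpy` calls that append the model, apply the pose, and set its transform.

```python
import random

# Illustrative asset libraries (placeholders for the real collections)
ENVIRONMENTS = ["studio", "garden", "street"]
HUMAN_MODELS = [f"human_{i:03d}" for i in range(50)]
POSES        = [f"pose_{i:03d}" for i in range(200)]
DISTRACTORS  = ["plant", "wall", "fence", "chair", "table"]

def sample_scene(rng, n_humans=3, n_distractors=4):
    """Sample one randomized scene description for the template file."""
    return {
        "environment": rng.choice(ENVIRONMENTS),
        "humans": [
            {
                "model": rng.choice(HUMAN_MODELS),
                "pose": rng.choice(POSES),
                # ground-plane position in metres, within the template's bounds
                "location": (rng.uniform(-3, 3), rng.uniform(-3, 3), 0.0),
                "rotation_z": rng.uniform(0, 360),
            }
            for _ in range(n_humans)
        ],
        "distractors": rng.sample(DISTRACTORS, k=n_distractors),
    }

rng = random.Random(42)        # fixed seed => reproducible scene layouts
scene = sample_scene(rng)
```

Seeding the generator makes every rendered scene reproducible, which is handy when a training sample needs to be regenerated or debugged.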
We also build a large dataset of poses, carefully crafted for all human models, to introduce diversity and precision into the final render. A crucial aspect of this process is ensuring that the foot placement of each pose remains grounded, maintaining realism and stability across the generated poses. This attention to detail not only increases the variety of the rendered scenes but also keeps the depiction of human movement and interaction with the environment accurate. The demo video below shows how different poses change the underlying synthetic 3D human model.
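The foot-grounding constraint can be expressed as a simple translation: shift the whole pose so that its lowest joint lands exactly on the ground plane. A minimal sketch (the joint names and coordinates are made up for illustration; in practice the offset would be applied to the armature's root in Blender):

```python
def ground_pose(joints):
    """Translate a pose so its lowest joint rests on the ground plane (z = 0).

    `joints` maps joint name -> (x, y, z) in metres.
    """
    lowest = min(z for (_, _, z) in joints.values())
    return {name: (x, y, z - lowest) for name, (x, y, z) in joints.items()}

pose = {
    "head":       (0.0, 0.0, 1.70),
    "left_foot":  (0.1, 0.0, 0.05),   # floats 5 cm above the floor
    "right_foot": (-0.1, 0.0, 0.12),
}
grounded = ground_pose(pose)
# left_foot now sits exactly at z = 0; every other joint shifts down by 0.05
```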
We also implement a camera setup in which the camera smoothly pans from one end of the scene to the other while keeping the focus on the human subjects throughout the movement. Capturing the subjects from different angles and perspectives adds further variation, depth, and realism to the rendered scenes. The demo video below shows the same environment from different camera viewpoints.
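Conceptually, such a pan amounts to interpolating the camera position between two endpoints while re-aiming it at a fixed target every frame (in Blender this is usually done with position keyframes plus a Track To constraint). A small sketch with made-up coordinates:

```python
import math

def pan_camera(t, start=(-4.0, -6.0, 1.6), end=(4.0, -6.0, 1.6),
               target=(0.0, 0.0, 1.0)):
    """Pan the camera linearly from `start` to `end` (t in [0, 1]) while
    keeping it aimed at `target` (roughly chest height of the subjects).

    Returns the camera position and a unit view-direction vector.
    """
    pos = tuple(a + t * (b - a) for a, b in zip(start, end))
    d = tuple(c - p for p, c in zip(pos, target))
    norm = math.sqrt(sum(v * v for v in d))
    return pos, tuple(v / norm for v in d)

for t in (0.0, 0.5, 1.0):          # three keyframes along the pan
    pos, view_dir = pan_camera(t)
```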
The outputs from different stages of the rendering process are shown in the images below: first, only the untextured 3D objects are rendered, without any HDRI (high dynamic range image) environment lighting. Then, step by step, we add the texture details as well as the HDRI, which gradually adds more realism to the samples.
Following this process, we can generate diverse scenes with a variety of positioning and arrangement of the 3D human models. Here are a few different synthetic samples generated using this approach.
Rendering Process
Once we can generate a diverse set of images for training the Erase.bg model, we also need to extract high-quality alpha masks to serve as ground-truth segmentation. This is achieved by rendering with separate layers: each layer separates the background and foreground regions so that the appropriate alpha masks can be obtained.
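The per-pixel math behind combining these layers is the standard "over" operator, and the foreground layer's alpha channel doubles as the ground-truth mask. A toy sketch on nested pixel lists (a real pipeline would use Blender's compositor or NumPy arrays, but the arithmetic is the same):

```python
def composite_pair(fg_rgba, bg_rgb):
    """Combine a foreground render layer (with alpha) and a background layer
    into one training pair: the composited image plus its alpha mask."""
    image, mask = [], []
    for fg_row, bg_row in zip(fg_rgba, bg_rgb):
        img_row, mask_row = [], []
        for (fr, fgc, fb, fa), (br, bgc, bb) in zip(fg_row, bg_row):
            # standard "over" operator: out = fg * a + bg * (1 - a)
            img_row.append((
                fr * fa + br * (1 - fa),
                fgc * fa + bgc * (1 - fa),
                fb * fa + bb * (1 - fa),
            ))
            mask_row.append(fa)  # the alpha channel *is* the ground-truth mask
        image.append(img_row)
        mask.append(mask_row)
    return image, mask

fg = [[(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0)]]  # opaque red px, empty px
bg = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]            # solid blue background
image, mask = composite_pair(fg, bg)
# image -> [[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]], mask -> [[1.0, 0.0]]
```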
Through Python scripting, we automate loading an environment, loading a human model, applying a pose to it, and rendering the separate layers, all in around 30 seconds on a single-GPU (RTX 3090) system. Here’s a layered rendering output for each layer.
Segmentation Data for Deep Learning
After the layered rendering is complete, we get the composite images and alpha masks to form paired training samples, which serve as the backbone for training the new Erase.bg model.
Blender Python Scripting Based Automation Stages
Using the above workflow, we implement render pipelines in stages. The stages described provide a solid framework to build custom pipelines for data generation.
Training the Erase.bg Model
This segmentation data is then used to train Erase.bg. The model learns to understand and differentiate between subjects and backgrounds, improving its ability to perform seamless background removal.
The data generated via the Blender-based approach can in fact be used to train any generic segmentation model.
Real-Life Scenario Testing
Once the training process is complete, the Erase.bg model undergoes real-life testing. This final phase ensures that the model can adeptly handle unforeseen challenges and perform seamlessly in practical applications. Below are some examples of our output.
From the comparison above, it is quite clear that the inclusion of synthetic data-based training can improve background removal for humans. The new model is better at understanding human form and picks up less background in noisy scenes.
Conclusion
With the help of the proposed synthetic data generation pipeline, we have created over 60,000 samples for training and improving our Erase.bg model. Without such automated bulk data generation, creating this data manually would have taken around 20,000 work-hours (60,000 images at roughly 20 minutes each). The Blender-based approach achieves this in about 500 GPU-hours (on a single-GPU RTX 3090 machine), a 40x improvement in data generation time. Moreover, the high-quality training data generated through the Blender-based pipeline ensures that the newly trained Erase.bg model is capable of much better background removal across complex and diverse sets of images.
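The time savings follow directly from the per-image numbers quoted earlier (about 20 minutes of manual annotation versus about 30 seconds of rendering per image):

```python
# Back-of-the-envelope check of the quoted savings.
images = 60_000
manual_hours = images * 20 / 60      # ~20 min of annotation per image
render_hours = images * 30 / 3600    # ~30 s per layered render (RTX 3090)
speedup = manual_hours / render_hours
print(manual_hours, render_hours, speedup)   # 20000.0 500.0 40.0
```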
For a better image background removal experience, try Erase.bg or PixelBin.io.
The PixelBin team at Fynd is constantly working on impactful ML problems and actively hiring interns and full-time ML Researchers. Send in your applications here. To learn more or to send us your feedback, please write to research@pixelbin.io. Explore our ongoing research projects at fynd.com/research.
Special thanks to our team: Arnab Mishra, Hemant Singh, Calvin D’Souza, Mihir Botle, and Rahul Deora.
Some Useful YouTube Links
- On Blender composition and render layers: Introduction to Compositing in Blender
- On Blender render passes: Render Passes in Blender 2.81 — what are they and why even use them?
- On Shadow Catcher: The Best Way To Work With Blenders Shadow Catcher Pass
- On Blender layer workflow: Blender and Fusion Multi-Layer Workflow — Render Passes and Compositing for Beginners
- On Blender composition workflow: Exr-IO Blender Photoshop Compositing Tutorial
- On Blender collections and ray visibility: Blender 2.9 for Production — 03 RenderLayers, Collections & Ray Visibility