AI Background Generator: Leveraging diffusion models for ecommerce and retail

Rajiv Kumar · Published in Building Fynd · 10 min read · Feb 21, 2024


Introduction

The advent of text-to-image generative models such as Stable Diffusion, DALL-E, and Midjourney has democratized the creation of artistic, expressive, and photo-realistic images, bridging the gap between professional artists and novices. In this landscape, background scene generation is one such application for creating a realistic scene around an object or a product.

At Fynd Research, we are at the forefront of integrating open-source advancements with our extensive, high-quality datasets, focusing on applications in e-commerce and retail. This blog explores how we’ve adapted the Stable Diffusion XL (SDXL) model for background scene generation in product photography.

Image Inpainting

Image inpainting and outpainting are sophisticated generative image editing techniques designed to fill in missing or concealed areas of an image, thereby rendering it both realistic and convincing. This task typically involves integrating one or more foreground objects — such as an e-commerce item or a human model — into a background scene described by a text prompt.

While it shares similarities with standard image generation from text prompts, inpainting and outpainting present unique challenges. They not only demand photorealistic and high-resolution outcomes but also meticulous attention to detail. A high-fidelity model must adeptly handle intricate aspects of image creation, including blending, harmonization, shadow and reflection generation, as well as accurate object placement and occlusion management, to ensure a seamless and lifelike scene.

The following images show the inputs and outputs of the inpainting task: an input image, a foreground object mask, and the generated image. The foreground object mask can be obtained from a background removal tool such as erase.bg. The text prompt describing the scene is:

Text prompt: Picturesque vineyard on a sunlit afternoon, rows of grapevines under a clear sky. Gentle breeze and a sense of tranquility. High Quality

From left to right: Input Image, foreground object mask and generated image.
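The post uses erase.bg for background removal; as an illustration, an open-source alternative such as rembg can produce a comparable foreground mask. The file names below are placeholders.

```python
# A minimal sketch of deriving a foreground mask with the open-source
# rembg library (an alternative to erase.bg); pip install rembg pillow.
from PIL import Image
from rembg import remove

product = Image.open("product.jpg").convert("RGB")

# rembg returns an RGBA image whose alpha channel is the foreground matte.
cutout = remove(product)
alpha = cutout.split()[-1]

# Binarize the matte into a hard mask: white = foreground, black = background.
mask = alpha.point(lambda p: 255 if p > 127 else 0)
mask.save("foreground_mask.png")
```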

Let’s consider the application of generative models in fashion photography. While traditional photography involves extensive time, effort, and resources, including travel, setup, and post-production, generative models present an innovative alternative. Through outpainting, a fashion model is placed as the foreground subject against a generated background that mimics a real-life shoot location. The generative model intricately crafts the scene, considering lighting, shadows, and other environmental aspects based on user-provided text prompts. This method is also applicable in e-commerce, where products are virtually placed in attractive settings to enhance appeal. This approach provides a cost-effective and efficient solution, eliminating the need for expensive location shoots while delivering high-quality, creative imagery.

Approaches

During the inpainting or outpainting task, the model fills in new content based on the text prompt, assigning new pixels that are consistent with the surrounding details. Since the goal is to make the inpainted area indistinguishable from the rest of the image, the pixel values around the foreground boundary are blended with the surrounding area, which often ends up extending the foreground object beyond its original outline. Here are some examples of foreground object extension when using off-the-shelf SDXL inpainting models.

Text prompt-1: Cozy cabin interior with a crackling fireplace, plush sofas, and soft candlelight. Ideal for a relaxing evening indoors.

Text prompt-2: Mountain range in the distance at dusk, bathed in warm hues. Misty atmosphere with a serene glow. High Quality.

From left to right: Input Image, SDXL 1.0-Inpainting-0.1 model results on text prompt-1 and text-prompt-2.
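For reference, here is a minimal sketch of how the off-the-shelf SDXL 1.0-Inpainting-0.1 checkpoint can be run with the Hugging Face diffusers library; the guidance scale, step count, and file names are illustrative, not the exact settings used for the images above.

```python
# A minimal sketch (not Fynd's production code) of running the off-the-shelf
# SDXL inpainting checkpoint with Hugging Face diffusers.
import torch
from PIL import Image, ImageOps
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("product.jpg").convert("RGB").resize((1024, 1024))
foreground_mask = Image.open("foreground_mask.png").convert("L").resize((1024, 1024))

# diffusers repaints the white region of mask_image, so the background to be
# generated is the inverse of the foreground object mask.
background_mask = ImageOps.invert(foreground_mask)

prompt = (
    "Cozy cabin interior with a crackling fireplace, plush sofas, "
    "and soft candlelight. Ideal for a relaxing evening indoors."
)

result = pipe(
    prompt=prompt,
    image=image,
    mask_image=background_mask,
    guidance_scale=7.5,       # illustrative values
    num_inference_steps=30,
    strength=0.99,
).images[0]
result.save("generated_scene.png")
```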

As seen in the generated images, off-the-shelf models produce unrealistic generations along with outgrowth of the product, which is undesirable in an e-commerce setting. The outgrowth stems from the generic inpainting training procedure of Stable Diffusion models, which masks random regions of the training data (see the figure below) and thereby teaches the model to extend objects beyond their boundaries.

Pairs of input and ground truth images used in generic inpainting task training. Credits: NVIDIA
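For intuition, a toy version of such random masking is sketched below; real training pipelines typically use irregular free-form masks rather than simple rectangles.

```python
# A toy sketch of the random masking used in generic inpainting training:
# a random region is hidden and the model learns to reconstruct it, which
# encourages extending whatever object touches the mask boundary.
import random
import numpy as np

def random_rectangle_mask(height: int, width: int) -> np.ndarray:
    """Return a binary mask (1 = region to inpaint) covering a random box."""
    mask = np.zeros((height, width), dtype=np.uint8)
    box_h = random.randint(height // 8, height // 2)
    box_w = random.randint(width // 8, width // 2)
    top = random.randint(0, height - box_h)
    left = random.randint(0, width - box_w)
    mask[top:top + box_h, left:left + box_w] = 1
    return mask
```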

We can try another approach where we leave a pixel gap at the mask boundary to prevent outgrowth. However, this leads to other artifacts and inconsistent generations, as shown below:

Text prompt-1: Cozy cabin interior with a crackling fireplace, plush sofas, and soft candlelight. Ideal for a relaxing evening indoors.

Text prompt-2: Mountain range in the distance at dusk, bathed in warm hues. Misty atmosphere with a serene glow. High Quality.

From left to right: Input Image, SD 2.0-Inpainting results on text prompt-1 and text-prompt-2 based on pixel gap.
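One way to implement the pixel-gap idea is to dilate the foreground mask by a few pixels before inverting it into the inpainting mask; the sketch below uses OpenCV, and the gap width is an assumption.

```python
# A sketch of the "pixel gap" idea: dilate the foreground mask by a few
# pixels before inverting it, so the inpainted background never touches the
# object boundary directly. The kernel size here is illustrative.
import cv2
import numpy as np

foreground_mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)

gap_pixels = 8
kernel = np.ones((2 * gap_pixels + 1, 2 * gap_pixels + 1), dtype=np.uint8)
dilated_foreground = cv2.dilate(foreground_mask, kernel, iterations=1)

# White = region the model is allowed to repaint; the dilation leaves a thin
# ring of untouched pixels around the object, which avoids outgrowth but can
# introduce visible seams and inconsistent blending.
background_mask = cv2.bitwise_not(dilated_foreground)
cv2.imwrite("background_mask_with_gap.png", background_mask)
```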

One way to tackle the issue of outgrowth is to use ControlNet. ControlNet is a method that introduces precise spatial control over the generated image by training with extra conditioning inputs such as edge maps, normal maps, segmentation maps, and depth maps. Conditioning on the canny edges of the input image produces outputs like the ones below.

From left to right: Canny edge image, image generated using SD with ControlNet.
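A minimal sketch of canny-conditioned generation with diffusers is shown below; the Stable Diffusion and ControlNet checkpoints, canny thresholds, and prompt are illustrative.

```python
# A minimal sketch of canny-edge ControlNet conditioning with diffusers
# (model IDs and thresholds are illustrative, not necessarily those used here).
import cv2
import torch
import numpy as np
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract canny edges from the input product image to use as the condition.
image = cv2.imread("product.jpg")
edges = cv2.Canny(image, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

result = pipe(
    prompt="Cozy cabin interior with a crackling fireplace, plush sofas, "
           "and soft candlelight.",
    image=edge_image,
    controlnet_conditioning_scale=1.0,
    num_inference_steps=30,
).images[0]
result.save("controlnet_canny_result.png")
```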

Applying inpainting with ControlNet for our task results in:

Text prompt-1: Cozy cabin interior with a crackling fireplace, plush sofas, and soft candlelight. Ideal for a relaxing evening indoors.

Text prompt-2: Mountain range in the distance at dusk, bathed in warm hues. Misty atmosphere with a serene glow. High Quality.

From left to right: Input image, Image outpainting results using Stable Diffusion with ControlNet
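Combining inpainting with ControlNet conditioning can be sketched as follows; the SD 1.5-based checkpoints and input files below are illustrative stand-ins, not necessarily the exact models behind the images above.

```python
# A sketch of combining inpainting with canny ControlNet conditioning so the
# object outline constrains generation (checkpoints below are illustrative).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("product.jpg").convert("RGB").resize((512, 512))
background_mask = Image.open("background_mask.png").convert("L").resize((512, 512))
edge_image = Image.open("canny_edges.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="Mountain range in the distance at dusk, bathed in warm hues.",
    image=image,
    mask_image=background_mask,
    control_image=edge_image,
    num_inference_steps=30,
).images[0]
result.save("controlnet_inpaint_result.png")
```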

As can be seen, applying ControlNet limits the outgrowth but severely reduces image generation quality and diversity, which is a known issue with ControlNet.

Thus, challenges like the ones seen above and issues such as poor prompt adherence, poor photo-realism, and the lack of details and diversity in the generated images led us to curate a new dataset and train our own model.

Our Solution

Layout diagram showing the interaction between the text prompt, foreground mask, input image, and the generated output image

Our solution to the above issues blends data-centric and learning-based techniques: curating a dedicated dataset and fine-tuning the model with a modified loss function. For the dataset, we curated around 60,000 samples as triplets of images, foreground masks, and text captions. The images were a mix of fully masked and foreground-object-masked images; full masks were used for scenery images or images without a salient object.
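As a hypothetical illustration (not our internal pipeline), such image–mask–caption triplets could be organized behind a simple dataset class like the one below; the manifest format and field names are assumptions.

```python
# A hypothetical sketch of how (image, foreground mask, caption) triplets
# could be organized for fine-tuning; file layout and field names are assumed.
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class BackgroundGenTriplets(Dataset):
    """Each record points to an image, its foreground mask, and a caption.
    For scenery images without a salient object, the mask covers the full frame."""

    def __init__(self, manifest_path: str, root: str):
        self.root = Path(root)
        with open(manifest_path) as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.root / rec["image"]).convert("RGB")
        mask = Image.open(self.root / rec["mask"]).convert("L")
        return {"image": image, "mask": mask, "caption": rec["caption"]}
```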

The text captions for all input images were generated by an image captioning model. Before captioning the foreground-object-masked images, the foreground objects were inpainted out within a bounding-box region so that the resulting background scene contains no salient objects; fully masked images were captioned as usual.
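The captioning model is not named in this post; as an illustration, an open-source captioner such as BLIP could be used along these lines, with the foreground already inpainted out of the object-masked images.

```python
# An illustrative captioning step with the open-source BLIP model; the post
# does not specify which captioning model was actually used.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

# For object-masked samples, caption the version with the foreground already
# inpainted out, so the text describes only the background scene.
scene = Image.open("background_only.jpg").convert("RGB")
inputs = processor(images=scene, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)
```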

Our candidates for fine-tuning were the SDXL base model, the SDXL inpainting model, and a few community-trained models such as RealVis-XL, Nightvision-XL, and CrystalClear-XL. One of our experiments involved fine-tuning the SDXL inpainting model with a modified boundary loss that penalizes extensions around the foreground objects and reduces outgrowth. Another experiment focused on adapting the text-to-image model (RealVis-XL, in our case), which takes 4 input channels, for inpainting with 9 input channels (4 for the noised image latents, 4 for the masked-image latents, and 1 for the mask), and then training this UNet, which has the same architecture as the SDXL-base inpainting model.
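Neither the exact boundary loss nor the channel-adaptation code is published here, but the two ideas can be sketched roughly as follows: zero-initializing the five extra input channels of the UNet's first convolution, and up-weighting the noise-prediction error in a thin band around the foreground mask. The band width and loss weight below are assumptions, not our production values.

```python
# Two hypothetical sketches related to this paragraph (not Fynd's exact code):
# (1) expanding a 4-channel text-to-image UNet to 9 input channels for
#     inpainting, and (2) a boundary-weighted diffusion loss that penalizes
#     errors in a thin band around the foreground mask.
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel

def expand_unet_to_9_channels(unet: UNet2DConditionModel) -> UNet2DConditionModel:
    """Copy the pretrained 4-channel conv_in weights and zero-init the extra
    5 channels (4 masked-image latents + 1 mask), as is commonly done when
    turning a text-to-image UNet into an inpainting UNet."""
    old_conv = unet.conv_in
    new_conv = torch.nn.Conv2d(
        9, old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        stride=old_conv.stride,
        padding=old_conv.padding,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :4] = old_conv.weight
        new_conv.bias.copy_(old_conv.bias)
    unet.conv_in = new_conv
    unet.register_to_config(in_channels=9)
    return unet

def boundary_weighted_loss(noise_pred, noise_target, fg_mask_latent, band=3, weight=5.0):
    """Standard noise-prediction MSE, up-weighted in a thin band just outside
    the foreground boundary to discourage the model from growing the object."""
    dilated = F.max_pool2d(fg_mask_latent, kernel_size=2 * band + 1, stride=1, padding=band)
    boundary_band = dilated - fg_mask_latent          # ring just outside the object
    weights = 1.0 + weight * boundary_band
    return (weights * (noise_pred - noise_target) ** 2).mean()
```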

Results

After qualitative comparisons of the results from the experiments, we finalized the inpainting model fine-tuned from RealVis-XL, based on aesthetics and prompt adherence capabilities. As a result of fine-tuning, we alleviated issues like object outgrowths and extensions, and improved the overall scene generation quality in terms of image composition.

Text prompt-1: Cozy cabin interior with a crackling fireplace, plush sofas, and soft candlelight. Ideal for a relaxing evening indoors.

Text prompt-2: Mountain range in the distance at dusk, bathed in warm hues. Misty atmosphere with a serene glow. High Quality.

From left to right: Input Image, SDXL outpainting results on text prompt-1 and text-prompt-2

Conclusion

In conclusion, our solution to background scene generation for product photography yields significant improvements over earlier methods. By curating a unique dataset and fine-tuning the model with a specialized loss function, we successfully addressed the common challenges of object outgrowth and poor prompt adherence. The generated images are significantly better in quality than those from the other methods, though a small amount of room for improvement remains. You can test our AI background generator for yourself by following the steps in the GIF below.

Using AI background generator on pixelbin.io playground

Discover more about our AI background generation capabilities in our comprehensive documentation. To experience this cutting-edge feature, simply log in to PixelBin, select or create your organization, navigate to the Playground, and search for “AI background generator” in the transformations search bar. Apply it to your uploaded image along with a desired text prompt, and wait for the model to generate a magical image for you.

The PixelBin team at Fynd is constantly working on impactful ML problems and actively hiring interns and full-time ML researchers. Send in your applications here.

Special thanks to Tauhid Khan and Rahul Deora for their guidance.

To learn more or to send us your feedback, please write to research@pixelbin.io.

Explore our ongoing research projects at fynd.com/research.
