Scaling E-Commerce: The Power of AI and Automation in Product Tagging

Prasanna Kumar · Published in Building Fynd · 6 min read · Feb 1, 2024


Introduction

Product Tagging is an essential aspect of retail and e-commerce, playing a pivotal role in enhancing customer experience and driving business success. Online product catalogs are vast and diverse, so the ability to accurately tag and categorize products is crucial. Product Tagging involves assigning relevant labels or keywords to products, which makes search and discovery easier for customers.

This approach streamlines the buyer’s experience, allowing them to quickly find suitable products, and also improves recommendation accuracy, which collectively boosts e-commerce conversion rates. Additionally, every major e-commerce platform, such as Amazon, Myntra, or Flipkart, requires product metadata before a product can go live for sale. Creating this metadata is labor-intensive, requires skilled personnel, and often leads to a delay of 2–3 months, during which the physical goods remain unsold while their metadata is still being prepared.

PixelBin’s AI Product Tagging addresses this challenge by streamlining the creation of product catalogs. It lets you create detailed fashion metadata for your product in a matter of seconds. Currently, our AI Product Tagging supports 50+ attribute types such as gender, sub-category, article type, color, pattern, sleeve length, neck type, and collar type, producing very fine-grained metadata covering 700+ labels!

In this blog, we cover some of the technical challenges we faced and describe how we designed our neural network architecture to tackle this problem.

Challenges

Multi-Task Learning: The primary challenge in developing AI Product Tagging was building a single model that could perform multiple tasks instead of maintaining a dedicated model for each task. Building a separate model for each attribute would have required us to build 50+ models, which would lead to high deployment and maintenance costs.

Multi-Model Learning and Multi-Task Learning

This led us to employ multi-task learning, where a single shared backbone has separate task heads attached to it. Each task head acts as a classifier for one attribute, such as gender, color, or pattern. Tasks like predicting sleeve length and sleeve styling may learn similar representations and therefore benefit from multi-task learning.
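As a rough sketch of this setup in PyTorch (the backbone choice, attribute names, and label counts below are illustrative assumptions, not PixelBin's actual implementation):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskTagger(nn.Module):
    """One shared backbone with a separate lightweight head per attribute task."""

    def __init__(self, task_num_labels: dict):
        super().__init__()
        # Shared feature extractor (backbone choice here is illustrative)
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # One classification head per attribute (gender, color, pattern, ...)
        self.heads = nn.ModuleDict({
            task: nn.Linear(feat_dim, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, images):
        features = self.backbone(images)  # (batch, feat_dim)
        # Every task head sees the same shared representation
        return {task: head(features) for task, head in self.heads.items()}


# Example with three of the 50+ attribute tasks (label counts are made up)
model = MultiTaskTagger({"gender": 3, "color": 25, "sleeve_length": 6})
logits = model(torch.randn(4, 3, 224, 224))  # dict: task name -> per-task logits
```

Compared to 50+ independent models, only the small linear heads are task-specific, which keeps deployment and maintenance costs close to that of a single model.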

Task Imbalance: The multi-task learning paradigm introduced its own set of challenges due to data imbalance across tasks. We observed severe imbalance between tasks: gender and color have labels for 100% of the products, whereas bottom length has labels for only 0.5%, an imbalance ratio as high as 200. This presented further challenges in training a balanced model.

Label Imbalance: Imbalances within each task’s labels and a substantial presence of noisy labels in the dataset added further complexity. For instance, in the color attribute, the label black occurs for 15% of the products, whereas only 2% of the products are orange. Similar colors, like teal and blue, teal and green, or violet, purple, and lavender, were used interchangeably by annotators, which increased the number of noisy labels.

Multi-Image Model: Most models built for product tagging, whether uni-modal or multi-modal, are fed only a single image as input. This single image is assumed to contain all the information required to accurately predict the labels, but this assumption does not always hold.

Different views of a product

A product on an e-commerce platform is displayed with many different images, each showing a different view of the product, and each view may be associated with a different set of attributes. Certain attributes like slit detail, sleeve styling, top length, bottom length, neck type, and collar type can only be accurately predicted by exploiting specific views. For instance,

  • Sleeve length can be accurately predicted from the front/back view.
  • Neck type can be accurately predicted from a close-up view, which helps the model distinguish between similar labels like boat neck and round neck.
  • Slit detail requires a side view to be predicted accurately.
  • Pattern is best predicted from a close-up view.

A single-image model cannot fully capture attribute information that is spread across the different product views. We therefore devised a model that simultaneously captures fine-grained attribute information from all the available views.

Solution

The backbone of AI Product Tagging is our in-house JIT (Joint Image Transformer), a powerful model capable of processing multiple image views of the same product simultaneously. Here’s a high-level overview:

Model Architecture

High-level model architecture of JIT
  • Visual Encoder Block: Utilizing a shared ResNet-101 backbone with Channel Attention and Spatial Attention, this block processes each image input independently, producing K feature maps for K image views. The K feature maps are flattened and undergo a linear transformation, so each image view is then represented by a single feature vector.
  • Transformer Encoder: These feature vectors, now called visual tokens, are fed into the transformer encoder along with a CLS token. Self-attention computes the similarity between the CLS token and the visual tokens, and across the stacked transformer encoder blocks the CLS token gathers the features relevant to the attribute task heads.
  • Classification Heads: The enriched CLS token (Joint Image Task Embedding) is then passed through each of the attribute task heads.

The JIT model operates on a set of image views. A shared vision backbone extracts visual feature maps, which are fed to the transformer encoder. The transformer encoder computes the joint image task embedding, which aggregates features from all the different views. This joint image task embedding is then used to classify the product into the respective labels in each attribute task head.
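The JIT implementation itself isn't published in this post, but the description above maps roughly onto the following PyTorch sketch (the channel/spatial attention blocks inside the backbone are omitted, and all module choices, dimensions, and task names are simplified assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JITSketch(nn.Module):
    """Simplified Joint-Image-Transformer-style model: a shared CNN encodes K
    views into visual tokens, a transformer encoder with a CLS token fuses
    them, and per-attribute heads classify from the joint embedding."""

    def __init__(self, task_num_labels: dict, d_model: int = 512,
                 n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        # Shared visual encoder (ResNet-101 per the post; the attention
        # blocks described above are omitted in this sketch)
        backbone = models.resnet101(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.project = nn.Linear(feat_dim, d_model)  # one visual token per view
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.heads = nn.ModuleDict({
            task: nn.Linear(d_model, n) for task, n in task_num_labels.items()
        })

    def forward(self, views):
        # views: (batch, K, 3, H, W) -- K image views of the same product
        b, k = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))   # (batch*K, feat_dim)
        tokens = self.project(feats).view(b, k, -1)  # (batch, K, d_model)
        cls = self.cls_token.expand(b, -1, -1)       # (batch, 1, d_model)
        encoded = self.encoder(torch.cat([cls, tokens], dim=1))
        joint = encoded[:, 0]                        # joint image task embedding
        return {task: head(joint) for task, head in self.heads.items()}


# Example: a batch of 4 products with 5 views each (task names are illustrative)
model = JITSketch({"gender": 3, "neck_type": 12, "slit_detail": 4})
logits = model(torch.randn(4, 5, 3, 224, 224))
```

Because all views attend to the CLS token through the same encoder, the model can draw on whichever view carries the signal for a given attribute, without any explicit view-selection logic.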

To tackle label imbalance and task imbalance, we weight the loss using the inverse square root of label frequency. We train the model with the Adam optimizer and a cosine annealing learning-rate scheduler.
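As a rough illustration of that weighting and training setup (the label counts and hyperparameters below are placeholders, not values from our training runs):

```python
import math
import torch
import torch.nn as nn

# Per-label weights proportional to 1 / sqrt(label frequency), so rare labels
# (e.g. "orange") contribute more to the loss than frequent ones (e.g. "black").
label_counts = {"black": 15000, "blue": 9000, "orange": 2000}  # placeholder counts
weights = torch.tensor([1.0 / math.sqrt(c) for c in label_counts.values()])
weights = weights / weights.mean()                 # normalize to keep the loss scale stable
color_loss = nn.CrossEntropyLoss(weight=weights)   # weighted loss for one task head

# Adam optimizer with a cosine annealing learning-rate schedule
model = nn.Linear(512, len(label_counts))          # stand-in for a single task head
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```

The same inverse-square-root idea can be applied at the task level, up-weighting sparsely labeled tasks such as bottom length relative to fully labeled ones such as gender.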

Results

Designed as a multi-image model, JIT addresses the challenge of selecting the best image for prediction by implicitly treating it as an optimal-subset problem. This is achieved by employing a differentiable attention mechanism that focuses on the relevant parts of the image views when predicting each attribute.

JIT inference time scales linearly with the number of views. The optimal number of views is around 5 with respect to the compute-accuracy tradeoff.

The figures above show the performance comparison across varying numbers of images. Including more than one product image view improves performance, reflected in an improvement of at least 5% in macro F1-score across all the attributes, while inference time increases only linearly with the number of views.

Conclusion

In conclusion, our approach to AI Product Tagging achieves significantly better results for our use case. By overcoming these challenges and the limitations of single-image models, we have introduced the Joint Image Transformer (JIT), a model that processes multiple images simultaneously and addresses the diverse nature of product views in e-commerce.

Discover more about our AI Product Tagging capabilities in our comprehensive documentation. To experience this cutting-edge feature, simply log in to PixelBin, select or create your organization, navigate to the Playground, and search for “AI Product Tagging” in the transformations search bar. Apply it to your uploaded image, and then explore the generated tags by clicking on the Context tab below.

Steps to try AI Product Tagging Transformation on PixelBin

The PixelBin team at Fynd is constantly working on impactful ML problems and actively hiring interns and full-time ML Researchers. Send in your applications here.

Special thanks to Rahul Deora and Rahul Bishain for their guidance.

To learn more or to send us your feedback, please write to research@pixelbin.io.

Explore our ongoing research projects at fynd.com/research.
