
Exploring the latest innovations in Computer Vision

Insights from Fynd’s visit to The Indian Conference on Computer Vision, Graphics & Image Processing 2022

Shashank Vasisht
Published in Building Fynd · Mar 28, 2023

The Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) is India’s premier conference in computer vision, graphics, image processing, and related fields.

ICVGIP 2022 took place at IIT Gandhinagar, and the Computer Vision Research team at Fynd got a chance to attend! The three-day programme included tutorials, paper presentations, industry sessions, plenary talks, and Vision India. Each day also featured poster presentations and demo sessions by independent researchers and industry members, offering opportunities for engaging discussions about their work.

Learnings from Tutorial Sessions

Two tutorial sessions were conducted in parallel: Physics-based Rendering in the Service of Computational Imaging and Designing and Optimizing Computational Imaging Systems with End-to-End Learning.

The first leaned towards computer graphics and rendering, while the second focused on incorporating end-to-end deep learning into imaging systems. Since the latter was closer to our field of interest, we chose to attend it.

Designing and Optimising Computational Imaging Systems with End-to-End Learning

The speakers for this session were Dr Vivek Boominathan, Dr Evan Y. Peng and Dr Chris Metzler. Computational imaging systems combine optics and algorithms to perform imaging and computer vision tasks more effectively than conventional imaging systems. More recently, end-to-end learning has emerged as a new system design paradigm in which both the optics and the algorithms are designed automatically using training data and machine learning.

Advantages of end-to-end learning algorithms

The tutorial presented an end-to-end learning method that integrates optical models into the learning pipeline. Traditional optical lenses function by focusing light to a single point, mimicking human vision, and are commonly used in camera systems to capture visual information. However, this approach is not optimal for every imaging task, such as monocular depth estimation and super-resolution. These tasks can benefit greatly from modified lenses, or from additional information captured alongside standard camera lenses.

Typically, in camera systems, the optical design is established first, and then the image processing algorithm’s parameters are adjusted to achieve high-quality image reproduction. In contrast to this sequential design approach, the authors jointly optimize the optical system (such as the physical shape of the lens) and the reconstruction algorithm’s parameters. They developed a fully differentiable simulation model that optimizes both sets of parameters to minimize the deviation between the true and reconstructed images.

They published their ideas in a paper called End-to-end Optimization of Optics and Image Processing for Achromatic Extended Depth of Field and Super-resolution Imaging.

The broad idea is to use a simulation module that can simulate the point spread functions (PSFs) of differently shaped lenses and learn the best-tuned parameters for the optics model (and hence the best-shaped lens) for the task at hand, which in this case was super-resolution.

Differentiable PSF Simulation
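To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of joint optimisation (not the authors’ code): a learnable blur kernel stands in for the lens-induced PSF, and a single optimiser updates both the optics parameters and a small reconstruction network, so the image loss back-propagates through the optics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a differentiable optics model: a learnable
# blur kernel plays the role of the lens-induced PSF.
class LearnablePSF(nn.Module):
    def __init__(self, size=9):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(1, 1, size, size))

    def forward(self, img):
        psf = torch.softmax(self.logits.flatten(), dim=0).view_as(self.logits)
        return F.conv2d(img, psf, padding=self.logits.shape[-1] // 2)

# Tiny reconstruction network standing in for the image-processing stage.
recon = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

optics = LearnablePSF()
# A single optimiser updates BOTH the optics and the reconstruction parameters.
opt = torch.optim.Adam(list(optics.parameters()) + list(recon.parameters()), lr=1e-3)

for step in range(100):
    gt = torch.rand(8, 1, 64, 64)       # placeholder batch of "true" images
    measured = optics(gt)               # simulate what the sensor records
    restored = recon(measured)          # reconstruct the image
    loss = F.mse_loss(restored, gt)     # deviation between true and reconstructed
    opt.zero_grad()
    loss.backward()                     # gradients flow into the PSF parameters too
    opt.step()
```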

Applications of end-to-end learning

The same idea can be applied to other applications like monocular depth estimation which has been demonstrated in the paper Depth from Defocus with Learned Optics for Imaging and Occlusion-aware Depth Estimation. The authors jointly optimize a deep CNN (a U-Net-like model) along with a fully differentiable optics model to produce high-quality depth maps using a single camera.

Deep CNN Model

Carrying this idea forward, the authors went on to showcase an innovative lensless camera, proposing to eliminate the lens entirely and use a very thin phase mask instead. Phase masks are essentially transparent materials with different heights at different locations.

This causes phase modulation of the incoming wavefront, and the resulting wave interference produces the PSF at the sensor plane. Their proposed phase-mask framework takes as input the target PSF and the desired device geometry (which, as stated above, can be learnt using a fully differentiable simulated optics model) and outputs an optimized phase-mask design.

Using FlatNet to enhance the output quality of phase masks

Most raw phase-mask measurements cannot be interpreted by humans. In another paper, FlatNet: Towards Photorealistic Scene Reconstruction from Lensless Measurements, the authors first train a model to learn to invert the forward operation of the lensless camera model. This allows them to obtain an intermediate representation with local structures intact.

Once they obtain the output of the trainable inversion stage, which has the same dimensions as the natural image they want to recover, they use a fully convolutional network, specifically a U-Net, to map this intermediate reconstruction to the final perceptually enhanced image.

Reconstruction Of Scenes From Lensless Cameras
Architecture of the Flatnet
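As a rough sketch of this two-stage pipeline (a simplified, hypothetical example rather than the official FlatNet code), a trainable inversion followed by a small encoder-decoder refiner looks something like this:

```python
import torch
import torch.nn as nn

class TrainableInversion(nn.Module):
    """Learned (approximate) inverse of the lensless forward model. Here a
    single transposed convolution stands in for the learned inversion step."""
    def __init__(self):
        super().__init__()
        self.invert = nn.ConvTranspose2d(3, 3, kernel_size=5, padding=2)

    def forward(self, meas):
        return self.invert(meas)   # intermediate image with local structure intact

class Refiner(nn.Module):
    """Small encoder-decoder (U-Net-like) that maps the intermediate
    reconstruction to a perceptually enhanced image."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1)

    def forward(self, x):
        return self.up(self.down(x))

measurement = torch.rand(1, 3, 128, 128)          # placeholder lensless measurement
intermediate = TrainableInversion()(measurement)  # stage 1: learned inversion
enhanced = Refiner()(intermediate)                # stage 2: perceptual enhancement
```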

Key Takeaways from Plenary Talks

The plenary talks were led by top-notch researchers and held during the final two days of the conference. The talks covered a diverse range of topics, including multi-sensory perception, efficient networks for graphics and rendering, transformers and more, offering attendees invaluable insights into the latest trends and advancements in the field.

Instant NGP: Neural Networks in High-Performance Graphics

This session was hosted by Dr Thomas Müller, a principal research scientist at NVIDIA. The talk was a case study on how the research team at NVIDIA was able to train a Neural Radiance Field (NeRF) model in a matter of minutes or even less. They call it Instant NGP (Instant Neural Graphics Primitives). But let’s back up a bit and understand what NeRF is first.

A neural radiance field (NeRF) is a fully-connected neural network that can generate novel views of complex 3D scenes, based on a partial set of 2D images. It is trained to use a rendering loss to reproduce input views of a scene. It works by taking input images representing a scene and interpolating between them to render one complete scene. NeRF is a highly effective way to generate images for synthetic data. A NeRF network is trained to map directly from viewing direction and spatial location (5D input) to opacity and colour (4D output), using volume rendering to render new views. You can read more about it in their paper.

NeRF forward pass
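For intuition, here is a toy sketch of that 5D-to-4D mapping as a plain PyTorch MLP (the real NeRF adds positional encoding, a deeper network and volume rendering on top):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Maps (x, y, z, theta, phi) -> (r, g, b, sigma). A toy version of the
    NeRF MLP; the actual model encodes its inputs and is much deeper."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x):
        out = self.net(x)
        rgb = torch.sigmoid(out[..., :3])    # colour in [0, 1]
        sigma = torch.relu(out[..., 3:])     # non-negative volume density
        return rgb, sigma

# One batch of queries: 3D sample locations plus 2D viewing directions.
rgb, sigma = TinyNeRF()(torch.rand(1024, 5))
```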

NeRF is a computationally-intensive algorithm, and rendering complex scenes can take hours or even days. Despite their reputation for being computationally expensive, neural networks can be trained and run efficiently for high-performance tasks. With the use of the appropriate data structures and algorithms, neural networks can run in the inner loops of real-time renderers and 3D reconstruction, resulting in an “instant NeRF.” Details about this approach can be found in their paper titled Instant Neural Graphics Primitives with a Multiresolution Hash Encoding.

The speaker credits their success to the three pillars of Neural High-Performance Graphics:

  • Small Neural Networks
  • Hybrid Data Structures
  • Task Specific GPU implementations
Pillars of Neural High-Performance Graphics

The importance of special priors

Smaller and more efficient neural networks can significantly reduce computing time, but this can sometimes compromise the accuracy and quality of the output. To address this issue, special priors such as positional encodings can help a smaller NeRF model produce results comparable to the original, larger model when generating a 3D scene. The speaker argued that this would not have been possible without the input positional encodings, emphasizing the importance of smaller networks aided by priors. Read more about it here.

The Effectiveness of Input Encoding
Fourier Feature Input Results vs. Standard Input Results
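As a small illustrative sketch (not the paper’s code), a NeRF-style positional encoding simply lifts raw coordinates into sines and cosines of increasing frequencies before they reach the MLP:

```python
import torch

def positional_encoding(x, num_freqs=10):
    """NeRF-style encoding: map each coordinate to sines and cosines of
    exponentially growing frequencies so a small MLP can represent
    high-frequency detail."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi    # 2^k * pi
    angles = x[..., None] * freqs                        # (..., dims, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., dims * 2 * num_freqs)

coords = torch.rand(4096, 3)             # raw 3D sample positions
features = positional_encoding(coords)   # shape: (4096, 3 * 2 * 10) = (4096, 60)
```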

Usually, a lot of time is wasted in I/O read-and-write operations. The authors state that, to improve speed, the data structures need to be tailored to the task at hand. For mapping 2D observations to 3D scenes, they designed the multiresolution hash encoding.

Multiresolution Hash Encoding
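The sketch below is a heavily simplified, hypothetical 2D version of the idea: each resolution level hashes the corners of the containing grid cell into a small learnable table and bilinearly blends their features (the real Instant-NGP implementation works in 3D, is fully fused and is far faster):

```python
import torch
import torch.nn as nn

class HashEncoding2D(nn.Module):
    """Simplified 2D multiresolution hash encoding: each level hashes grid-cell
    corners into a learnable table and bilinearly interpolates their features."""
    PRIMES = (1, 2654435761)   # spatial-hash primes used in the paper

    def __init__(self, levels=4, table_size=2**14, feat_dim=2, base_res=16, growth=2.0):
        super().__init__()
        self.levels, self.table_size = levels, table_size
        self.res = [int(base_res * growth**l) for l in range(levels)]
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(table_size, feat_dim) * 1e-4) for _ in range(levels)]
        )

    def _hash(self, ij):
        h = ij[..., 0] * self.PRIMES[0] ^ ij[..., 1] * self.PRIMES[1]
        return h % self.table_size

    def forward(self, xy):                        # xy in [0, 1]^2, shape (N, 2)
        feats = []
        for level, res in enumerate(self.res):
            scaled = xy * (res - 1)
            lo = scaled.floor().long()
            frac = scaled - lo                    # position inside the grid cell
            f = 0.0
            for dx in (0, 1):                     # bilinear blend of the 4 corners
                for dy in (0, 1):
                    corner = lo + torch.tensor([dx, dy])
                    w = ((1 - frac[:, 0]) if dx == 0 else frac[:, 0]) * \
                        ((1 - frac[:, 1]) if dy == 0 else frac[:, 1])
                    f = f + w[:, None] * self.tables[level][self._hash(corner)]
            feats.append(f)
        return torch.cat(feats, dim=-1)           # concatenated features feed a tiny MLP

features = HashEncoding2D()(torch.rand(1024, 2))  # shape: (1024, levels * feat_dim)
```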

Finally, they stated that current deep learning frameworks do not exploit the speed of GPUs to the fullest. They instead proposed a new framework for training neural networks called tiny-cuda-nn. It is a small, self-contained framework for training and querying neural networks. It contains a lightning-fast “fully fused” multi-layer perceptron (paper), a versatile multiresolution hash encoding (paper), as well as support for various other input encodings, losses, and optimizers.

Strong Interpretable Priors Are All We Need

This session was led by Dr Tali Dekel, a research scientist at Google. Computer vision has recently made exciting progress, with new architectures and self-supervised learning paradigms rapidly improving. As computing power increases, models scale in size and training data, resulting in “foundation models” — billion-parameter neural networks trained in a self-supervised manner on massive amounts of unlabelled imagery.

Such models learn extraordinary priors about our visual world, as evidenced by their breakthrough results in a plethora of visual inference and synthesis tasks. Nevertheless, their knowledge is buried and hidden in the vast space of the network’s weights.

The speaker presented a series of works that aim to investigate the internal representations learned by large-scale models. By studying their priors and utilizing them in classical and new visual tasks, the research covers co-segmenting two images into coherent object parts and using text to modify the appearance of moving objects in real-world videos.

There has been a constant evolution in visual descriptors used for Computer Vision tasks. Starting from the early hand-crafted features (SIFT, HOG, SURF, ORB), people eventually moved towards Deep CNN-based features with the rise of the Deep Learning era. However, with the latest developments in the field of Transformers, specifically Vision Transformers, is it time to move towards Deep ViT-based Features?

Exciting innovations in AI: Self-Supervised Learning & Transformers

The speaker mentioned how many of the most exciting new AI breakthroughs have come from two recent innovations: self-supervised learning, which allows machines to learn from random, unlabelled examples; and Transformers, which enable AI models to selectively focus on certain parts of their input and thus reason more effectively. A recent work called Self-Supervised ViT — DINO is a great example of this. Interestingly, the acronym DINO comes from self-distillation with no labels.

Training ViT with the DINO algorithm

By training ViT with the DINO algorithm, the authors observed that the model automatically learns an interpretable representation and separates the main object from the background clutter. It learns to segment objects without any human-generated annotation or any form of dedicated dense pixel-level loss.

The core components of Vision Transformers are the self-attention layers. In this model, each spatial location builds its representation by “attending” to the other locations. That way, by “looking” at other, potentially distant pieces of the image, the network builds a rich, high-level understanding of the scene. When visualizing the local attention maps in the network, it is apparent that they correspond to coherent semantic regions in the image.

DINO Attention Heads

How does DINO work?

DINO works by interpreting self-supervision as a special case of self-distillation, where no labels are used at all. It trains a student network by simply matching the output of a teacher network over different views of the same image.

The authors of this paper identified two components from previous self-supervised approaches that are particularly important for strong performance on ViT, the momentum teacher and multi-crop training, and integrated them into their framework.
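Here is a bare-bones, hypothetical sketch of that recipe: an EMA “momentum” teacher, two augmented views standing in for multi-crop, and a centred, sharpened cross-entropy between the two output distributions. It is meant to show the moving parts, not to reproduce the official DINO code.

```python
import copy
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Any backbone works for the sketch; DINO's headline results use ViTs.
student = resnet18(num_classes=256)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)            # the teacher is never updated by gradients

opt = torch.optim.SGD(student.parameters(), lr=0.03)
center = torch.zeros(256)              # running center, used to avoid collapse
t_student, t_teacher, m = 0.1, 0.04, 0.996

def two_views(batch):
    """Stand-in for DINO's multi-crop augmentation: two noisy resized views."""
    aug = lambda x: F.interpolate(x + 0.1 * torch.randn_like(x), size=(224, 224))
    return aug(batch), aug(batch)

for step in range(10):
    images = torch.rand(8, 3, 256, 256)            # placeholder image batch
    v1, v2 = two_views(images)
    s_out = student(v1)
    with torch.no_grad():
        t_out = teacher(v2)
    # Student distribution matches the (centered, sharpened) teacher distribution.
    loss = torch.sum(
        -F.softmax((t_out - center) / t_teacher, dim=-1)
        * F.log_softmax(s_out / t_student, dim=-1), dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                          # EMA updates: momentum teacher + center
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)
        center = 0.9 * center + 0.1 * t_out.mean(dim=0)
```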

In the image below, you can see the difference between the feature map representations of supervised and self-supervised variants of DINO ViT and ResNet. It is evident that the deeper layers of the self-supervised DINO ViT produce more semantically coherent features and can even identify similar objects.

Feature maps of Supervised and Self-supervised DINO ViT & ResNet

These DINO ViT features can be used for a plethora of applications such as Zero-shot Co-segmentation and Part Co-segmentation. In all cases, lightweight methodologies are designed that leverage the universal knowledge learned by large-scale models through new visual descriptors and perceptual losses. The methods are “zero-shot”: they require no training data and are self-supervised, requiring no manual labels, and can thus be applied across different domains and tasks for which training data is scarce.

Zero-shot Co-segmentation and Part-Cosegmentation
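As a toy illustration of how far these descriptors go, co-segmentation can be roughly approximated by clustering the patch features of two images jointly. This sketch assumes the publicly released DINO ViT-S/16 checkpoint is reachable via torch.hub and exposes a get_intermediate_layers helper, and it uses plain k-means rather than the methods presented in the talk:

```python
import torch
from sklearn.cluster import KMeans

# Assumes the public DINO ViT-S/16 weights are available through torch.hub and
# that the model exposes get_intermediate_layers(); verify against the repo.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

images = torch.rand(2, 3, 224, 224)             # two images to co-segment (placeholders)
with torch.no_grad():
    tokens = model.get_intermediate_layers(images, n=1)[0]   # (2, 1 + 14*14, 384)
patches = tokens[:, 1:, :]                      # drop the CLS token, keep patch descriptors

# Cluster the patch descriptors of BOTH images together: patches belonging to
# the common object should land in the same cluster across the two images.
flat = patches.reshape(-1, patches.shape[-1]).numpy()
labels = KMeans(n_clusters=3, n_init=10).fit_predict(flat)
masks = labels.reshape(2, 14, 14)               # coarse per-image segmentation maps
```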

Insights from Paper Presentations:

FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing

Multi-object multi-part scene parsing is a challenging task which requires detecting multiple object classes in a scene and segmenting the semantic parts within each object.

Multi-object Multi-part Segmentation
Ground Truth Label Map Changes

For this, the authors modify the monolithic object label maps to encode additional attributes, such as front/back, left/right, and animate/inanimate, for the parts of each object. They use this information to create the Pascal-Part201 dataset. They propose the following model to solve the multi-object multi-part scene parsing challenge. The model consists of an encoder-decoder-style architecture with separate decoders for object-level segmentation and for the front/back, left/right, and animate/inanimate attributes. Finally, the feature maps are merged, and an Inference-time Zoom Refinement (IZR) module is used to produce the final output.

The FLOAT model
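A rough sketch of that factorized design (a hypothetical toy, not the authors’ FLOAT code, with the IZR module omitted) could look like this: one shared encoder feeding an object decoder plus one decoder per attribute, whose outputs are merged downstream.

```python
import torch
import torch.nn as nn

class FactorizedSegmenter(nn.Module):
    """Shared encoder with separate decoder heads for object classes and for
    factorized attributes (front/back, left/right, animacy), merged later."""
    def __init__(self, num_objects=21, heads=("front_back", "left_right", "animacy")):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        def decoder(out_ch):
            return nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1)
        self.object_head = decoder(num_objects)
        self.attr_heads = nn.ModuleDict({name: decoder(2) for name in heads})

    def forward(self, x):
        feat = self.encoder(x)
        out = {"objects": self.object_head(feat)}
        out.update({name: head(feat) for name, head in self.attr_heads.items()})
        return out    # per-pixel logits, later combined into part labels

preds = FactorizedSegmenter()(torch.rand(1, 3, 256, 256))
print({k: tuple(v.shape) for k, v in preds.items()})
```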

Can you even tell left from right? Presenting a new challenge for VQA

Visual Question Answering (VQA) research aims to create a computer system that can answer questions using both an image and natural language. VQA needs a means of evaluating the strengths and weaknesses of models. One such measure is compositional generalisation: the ability of a model to answer well on scenes whose setups differ from those seen in the training set. For this, we need datasets whose train and test sets differ significantly in composition.

This study introduces quantitative measures of compositional separation and shows that current VQA datasets are inadequate for such evaluation. To address this, the authors present Uncommon Objects in Unseen Configurations (UOUC), a synthetic dataset for VQA. UOUC is fairly complex while also being compositionally well-separated. It contains 380 object classes drawn from 528 characters of the Dungeons and Dragons game, with 200,000 scenes in the train set and 30,000 in the test set.

To study compositional generalisation, simple reasoning, and memorisation, each scene of UOUC is annotated with up to 10 novel questions. These deal with spatial relationships, hypothetical changes to scenes, counting, comparison, memorisation and memory-based reasoning. In total, UOUC presents over 2 million questions. UOUC also proves to be a strong challenge for well-performing VQA models. Read the full paper here.

Visual Question Answering datasets

Learning compositional structures for deep learning: Why routing-by-agreement is necessary

A formal description of the compositionality of neural networks is associated directly with the formal grammar structure of the objects they seek to represent. This formal grammar structure specifies the kinds of components that make up an object, as well as the configurations they are allowed to be in. In other words, an object can be described as a parse tree of its components, a structure that can be seen as a candidate for building connection patterns among neurons in neural networks. The authors present a formal grammar description of convolutional neural networks and capsule networks that shows how capsule networks can enforce such parse-tree structures, while CNNs do not. Read the full paper here.

Change in Compositionality
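The routing-by-agreement mechanism the paper builds on can be sketched in a few lines, following the spirit of Sabour et al.'s dynamic routing for capsule networks (a generic illustration, not the paper's formal-grammar construction):

```python
import torch

def squash(v, dim=-1):
    """Capsule non-linearity: keeps the direction, squashes the length into [0, 1)."""
    norm_sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1 + norm_sq)) * v / (norm_sq.sqrt() + 1e-8)

def routing_by_agreement(u_hat, iterations=3):
    """Dynamic routing: child-capsule predictions u_hat of shape
    (batch, num_in, num_out, dim) are assigned to parent capsules
    according to how well they agree with the emerging consensus."""
    b = torch.zeros(u_hat.shape[:3])                   # routing logits
    for _ in range(iterations):
        c = torch.softmax(b, dim=2)                    # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)       # weighted sum per parent
        v = squash(s)                                  # parent capsule outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)   # agreement update
    return v

# 32 child capsules voting for 10 parent capsules, each a 16-D pose vector.
u_hat = torch.randn(4, 32, 10, 16)
parents = routing_by_agreement(u_hat)                  # shape: (4, 10, 16)
```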

Learnings from Industry Sessions:

The industry sessions featured interesting research work published by various competitors in the market. They had set up posters and demo booths where attendees could discuss the work in detail and see real-time demos. The sessions were from companies such as Qualcomm, Adobe, Samsung R&D, and L&T.

Samsung R&D

Researchers showcased various image editing features that have been incorporated into their latest smartphones, notably shadow removal, photo remastering, image in-painting, and portrait mode.

Samsung’s new image editing features

Under Display Camera

Apart from this, their major contribution was the development of the under-display camera. An Under Display Camera (UDC) is a breakthrough innovation that enables an uninterrupted viewing experience on a mobile device by hiding the camera under the display and dedicating the whole screen to users while applications are running. It not only requires hardware innovation by placing a camera under a display panel but also requires algorithm innovation for restoring image quality — one of the most complex image restoration problems.

As the camera is placed underneath the display, the Under Display Camera can suffer from poor image quality caused by diffraction artefacts, which results in flare, saturation, blur and haze. Therefore, while the Under Display Camera brings a better display experience, it also affects camera image quality and other downstream vision tasks. These complex and diverse distortions make restoring Under Display Camera images extremely challenging.

In this talk, the speaker discussed some of the challenges with the Under Display Camera system and presented their work on image restoration for Under Display Camera images.
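To give a flavour of the restoration problem, here is a classical Wiener-filter stand-in with a made-up PSF (not Samsung's learned pipeline): it shows why a known diffraction-like blur can, in principle, be inverted in the frequency domain, and why regularisation is needed.

```python
import numpy as np

def wiener_deconvolve(blurred, psf, snr=100.0):
    """Classical Wiener deconvolution: invert a known blur kernel in the
    frequency domain with regularisation. Real UDC pipelines use learned
    restoration networks, but this shows the basic inverse problem."""
    H = np.fft.fft2(psf, s=blurred.shape)
    G = np.fft.fft2(blurred)
    W = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)   # regularised inverse filter
    return np.real(np.fft.ifft2(W * G))

# Toy example: blur an image with a small diffraction-like PSF, then restore it.
rng = np.random.default_rng(0)
image = rng.random((128, 128))
psf = np.outer(np.hanning(9), np.hanning(9))
psf /= psf.sum()
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(psf, s=image.shape)))
restored = wiener_deconvolve(blurred, psf)
```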

Under Display Camera

Adobe

In this talk, the speakers provided an overview of Adobe Research and the key areas they are working on. They gave us a peek at some of their recent work on video generation, image out-painting and graphic design harmonization. Their recent work on image out-painting and animating still images shows how expressing visual data via intermediate representations, and manipulating those representations, produces better outputs than direct pixel-level manipulation.

The authors propose a method to interactively control the animation of fluid elements in still images to generate cinemagraphs. Specifically, they focus on the animation of fluid elements like water, smoke, and fire, which have the properties of repeating textures and continuous fluid motion. They represent the motion of such fluid elements in the image in the form of a constant 2D optical flow map. The user can provide any number of arrow directions and their associated speeds, along with a mask of the regions they want to animate. The user-provided arrow directions, their corresponding speed values, and the mask are then converted into a dense map representing a constant optical flow (FD).

Animation of fluid elements

The authors observe that FD, obtained using simple exponential operations, can closely approximate the plausible motion of elements in the image. They further refine the computed dense optical flow map FD using a generative adversarial network (GAN) to obtain a more realistic flow map. A novel U-Net-based architecture is proposed to auto-regressively generate future frames, using the refined optical flow map to forward-warp the input image features at different resolutions.
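A toy version of that pipeline (hypothetical names, no GAN refinement, and simple backward sampling of pixels instead of the paper's feature-level forward warping) might look like this: build a constant flow map from the user's mask and direction, then advance it over time to warp the still frame.

```python
import torch
import torch.nn.functional as F

def constant_flow_map(mask, direction, speed):
    """Dense 2D flow that equals speed * direction inside the user mask and
    zero elsewhere (a simple stand-in for FD)."""
    d = torch.tensor(direction, dtype=torch.float32)
    d = d / d.norm()
    return mask.unsqueeze(0) * (speed * d).view(2, 1, 1)      # (2, H, W)

def warp(frame, flow):
    """Warp a frame (1, C, H, W) by a pixel-space flow, implemented with
    backward sampling via grid_sample for simplicity."""
    _, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs - flow[0]) / (W - 1) * 2 - 1                  # normalise to [-1, 1]
    grid_y = (ys - flow[1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)
    return F.grid_sample(frame, grid, align_corners=True)

frame = torch.rand(1, 3, 128, 128)                             # still image
mask = torch.zeros(128, 128)
mask[64:, :] = 1.0                                             # region to animate (e.g. water)
flow = constant_flow_map(mask, direction=(1.0, 0.0), speed=2.0)

frames = [frame]
for t in range(1, 8):                                          # constant flow => displacement grows with t
    frames.append(warp(frame, t * flow))
```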

Some of the other research showcased:

Design Understand & Generation (top), Image Understanding & Generation (bottom)

TCS Research

Draping a 3D human mesh has garnered broad interest due to its wide applicability in virtual try-on, animations, etc. The 3D garments produced by existing methods are often inconsistent with the body shape, pose, and measurements. This paper proposes a single unified learning-based framework (DeepDraper) to predict garment deformation as a function of body shape, pose, measurements, and garment style. The authors train DeepDraper with coupled geometric and multi-view perceptual losses.

GarSim
DeepDraper

Unlike existing methods, they additionally model garment deformations as a function of standard body measurements, which a buyer or a designer generally uses to buy or design perfectly fitting clothes. In addition, the authors claim that DeepDraper is 10 times smaller and 23 times faster than the closest state-of-the-art method (TailorNet), which favours its use in real-time applications with limited computational power.

DeepDraper training & Inference Pipeline

Final thoughts and Key Takeaways:

Our team had an amazing three-day experience full of learning opportunities! Witnessing the growth of the computer vision community in India was nothing short of exhilarating. We were thrilled to see the groundbreaking research being done by these talented individuals, and we couldn’t wait to learn more.

The key points that I’d like to take away with me are:

  • Problem-specific priors that aid learning in deep networks tend to work better than blind input-to-output mapping.
  • In-depth knowledge of the latest developments across different problem statements helps in translating ideas between domains of AI.
  • Full-stack expertise across the technology often pays off when designing a powerful product.
The Fynd Research Team at IIT Gandhinagar (L-R): Bipin Gaikwad, Arnab Mishra, Prasanna Kumar, Shashank Vasisht, Vignesh Prajapati

It was incredibly inspiring to see the strides being made in this field, and we feel grateful to have been a part of it. We can’t wait to attend more conferences like this and hopefully even present the groundbreaking work we’re doing here at Fynd too!

