On Zero-Shot Recognition of Unseen State-Object Composition
Date
2024-09
Authors
Publisher
Indian Statistical Institute, Kolkata
Abstract
Compositional Zero-Shot Learning (CZSL) attempts to recognise images of new
(unseen) compositions of states and objects, when images of only a subset of state-object
compositions are available as training data. Thus a CZSL model should
recognise a young dog when the model has seen images of the state-object compositions
young bear, old bear and old dog. Solving the CZSL problem poses multiple
challenges. It is difficult to disentangle the visual features of the object dog and
its state young from the compositional image young dog. The visual features of a
state are also observed to vary widely across compositions. For
example, the state sliced has different visual features in compositions sliced apple
and sliced tomato. In the second chapter of the thesis, we attempt to disentangle
the visual features of state and object using a two-stage sequential recognition approach.
In the third chapter of the thesis, we work on the open-world CZSL problem
where no prior information about the feasibility of a state-object composition is
available. We use a Graph Convolutional Network based architecture along with a
frequency-based feasibility prediction approach for the open-world CZSL problem.
Another challenge in CZSL lies in the fact that the extent of association between
the features of a state and an object varies significantly across different images of the
same composition. For example, in different images of peeled orange, the oranges
may be peeled to a different extent. Thus the visual features of images of peeled
orange may vary. In the fourth chapter, a novel Knowledge-guided Transformer
Network is proposed to better process the partial association between the visual
features of state and object. In the fifth chapter, we attempt the partially supervised
CZSL (pCZSL) problem, where for each state-object compositional image,
either the state or the object annotation is available. For this setting, we propose
a novel vision-transformer-based architecture with a Locality Preserving Neighbourhood
Aggregation approach. Effective identification of the discriminative
features of state and object often depends on the scale of the object in the image.
For example, in the images of the two compositions, young bear and old bear, the
identification of the states young and old may depend on recognising the scale
(or size) of the object bear in the image. In the sixth chapter, we leverage a Vision
Language Model (VLM) to estimate scale-aware features in CZSL. Extensive
experiments on C-GQA, MIT-States and UT-Zappos50k datasets demonstrate
the effectiveness of the approaches in this thesis when compared to the state-of-the-art
in the closed-world CZSL, open-world CZSL and pCZSL settings. As
concluding remarks, we discuss the future scope of research in CZSL.
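
The label spaces described in the abstract can be illustrated with a minimal sketch. It uses only the young/old, bear/dog example given above. The variable names (seen, open_world, unseen) are illustrative assumptions and are not taken from the thesis, and the code shows only the problem setup, not any of the proposed models.

```python
from itertools import product

# Seen (training) compositions, taken from the abstract's example.
seen = {("young", "bear"), ("old", "bear"), ("old", "dog")}

# State and object vocabularies observed in the training compositions.
states = sorted({state for state, _ in seen})
objects = sorted({obj for _, obj in seen})

# Open-world setting: with no feasibility prior, every state-object
# pair is a candidate label at test time.
open_world = set(product(states, objects))

# Compositions the model must recognise without any training images.
unseen = open_world - seen
print(sorted(unseen))  # [('young', 'dog')]
```

In the closed-world setting the set of unseen test compositions is assumed to be known in advance, whereas in the open-world setting the candidate space is the full product, which grows with the number of states times the number of objects.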
Description
This thesis was carried out under the supervision of Prof. Dipti Prasad Mukherjee.
Keywords
CZSL, disentanglement, State-object composition, Knowledge Sharing
Citation
157p.
