On Zero-Shot Recognition of Unseen State-Object Composition

Panda, Aditya

Files

Thesis_Aditya Panda-16-4-25.pdf (47.57 MB)

Form17-Aditya Panda-16-4-25.pdf (366.56 KB)

Date

2024-09

Authors

Panda, Aditya

Publisher

Indian Statistical Institute, Kolkata

Abstract

Compositional Zero-Shot Learning (CZSL) attempts to recognise images of new (unseen) compositions of states and objects when images of only a subset of state-object compositions are available as training data. Thus, a CZSL model should recognise a young dog when it has seen images of the state-object compositions young bear, old bear and old dog. Solving the CZSL problem poses multiple challenges. It is difficult to disentangle the visual features of the object dog and its state young from the compositional image young dog. The visual features of a state are also observed to vary widely across compositions. For example, the state sliced has different visual features in the compositions sliced apple and sliced tomato. In the second chapter of the thesis, we attempt to disentangle the visual features of state and object using a two-stage sequential recognition approach. In the third chapter, we work on the open-world CZSL problem, where no prior information about the feasibility of a state-object composition is available. We use a Graph Convolutional Network based architecture along with a frequency-based feasibility prediction approach for the open-world CZSL problem. Another challenge in CZSL is that the extent of association between the features of a state and an object varies significantly across different images of the same composition. For example, in different images of peeled orange, the oranges may be peeled to different extents, so the visual features of these images may vary. In the fourth chapter, a novel Knowledge-guided Transformer Network is proposed to better process the partial association between the visual features of state and object. In the fifth chapter, we attempt the partially supervised CZSL (pCZSL) problem, where for each state-object compositional image, either the state or the object annotation is available. For this problem, we propose a novel vision transformer based architecture with a Locality Preserving Neighbourhood Aggregation approach. Effective identification of the discriminative features of state and object often depends on the scale of the object in the image. For example, in images of the two compositions young bear and old bear, the identification of the states young and old may depend on recognising the scale (or size) of the object bear in the image. In the sixth chapter, we leverage a Vision Language Model (VLM) to estimate scale-aware features in CZSL. Extensive experiments on the C-GQA, MIT-States and UT-Zappos50k datasets demonstrate the effectiveness of the approaches in this thesis when compared to the state-of-the-art in the closed-world CZSL, open-world CZSL and pCZSL settings. As concluding remarks, we discuss the future scope of research in CZSL.

Description

This thesis was carried out under the supervision of Prof. Dipti Prasad Mukherjee.

Keywords

CZSL, disentanglement, State-object composition, Knowledge Sharing

Citation

157p.

URI

http://hdl.handle.net/10263/7550

Collections

Theses
