
Albert Gatt talk

Grounds for groundlessness?

On the sensitivity of vision-language systems to perceptual input

Albert Gatt, University of Malta

Recent years have seen an explosion in NLP applications which lie at the interface between language and perception. Natural Language Generation (NLG) -- the task of generating text from non-linguistic input -- is no exception. To take one well-established example, many models have been proposed which generate text from image or video data. To what extent can such systems be said to be properly grounded in perceptual information? It turns out that this is an old question in NLG. NLG systems vary considerably in the type of input they work with, and judgments of the "correctness" of their output are subject to considerations such as (a) whether the most relevant elements of the input are actually covered (the system's ability to select appropriate content) and (b) the extent to which the selected information is accurately conveyed (the system's ability to interpret and abstract over the input and render it fluently in natural language). In the case of perceptually grounded NLG, the same sorts of challenges often rear their heads.

This talk will first survey the landscape in NLG, proposing a high-level taxonomy of NLG systems based on a characterisation of their inputs and outputs. Perceptually grounded NLG will be introduced and situated within this broad taxonomy, with a focus on the architectural properties of neural vision-to-language systems and their relationship to data-to-text NLG more generally. The question of grounding will be addressed through experiments on two related but subtly different NLG tasks: (a) generating descriptive captions of images and (b) generating entailments from the combination of an image and a textual premise.

Results from ablation studies and sensitivity analyses demonstrate that, while such systems do capture some perceptual information, several kinds of information appear to be missed, suggesting that these systems are only partially and selectively sensitive to their input. This turns out to be due to a combination of factors, including biases in the training data and the high predictability of linguistic sequences, which allow such models to rely on linguistic regularities at the expense of input sensitivity.
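To get an intuition for the kind of input-sensitivity probe mentioned above, the sketch below captions an image and then a blanked-out version of it with an off-the-shelf vision-language model. This is not the experimental setup used in the talk; the choice of model (a BLIP checkpoint via the HuggingFace transformers library), the image sizes, and the blank-image ablation are all illustrative assumptions.

```python
# Toy input-sensitivity probe: caption an image, then caption an ablated
# (blanked) version and compare. If the two captions are near-identical,
# the decoder is leaning on linguistic regularities rather than the pixels.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL = "Salesforce/blip-image-captioning-base"  # any HF captioning checkpoint would do
processor = BlipProcessor.from_pretrained(MODEL)
model = BlipForConditionalGeneration.from_pretrained(MODEL)
model.eval()

def caption(image: Image.Image) -> str:
    """Generate a single caption for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Replace `original` with a real photograph; here we synthesise a placeholder
# so the snippet is self-contained.
original = Image.new("RGB", (384, 384), color=(120, 180, 90))
ablated = Image.new("RGB", (384, 384), color=(0, 0, 0))  # perceptual signal removed

print("original:", caption(original))
print("ablated: ", caption(ablated))
```

A fuller version of this idea would run the comparison over a whole test set and measure how much the output distribution shifts under different ablations, which is closer in spirit to the sensitivity analyses the abstract describes.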