Mirella Lapata's Q&A Session
How much is the quality of the algorithms affected by the pre-processing steps? E.g. can we expect higher quality by doing more than simply selecting content words? Will stemming or baseform reduction be helpful (or hurt the performance) for highly inflected languages? etc.
The quality of the algorithms is affected mainly by the image preprocessing. The images at hand are small and their resolution is low, which makes it difficult to extract accurate features. Furthermore, the BBC images are very diverse, covering a broad range of topics, which poses additional challenges. Regarding text preprocessing, the words are lemmatized and this helps somewhat. We could of course do more, such as identifying synonyms (e.g., using WordNet), named entities, or multi-word expressions (e.g., "world war"). Stemming could potentially hurt performance for highly inflected languages, especially in the caption generation application.
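To illustrate the difference between lemmatization and stemming mentioned above, here is a minimal sketch using NLTK as an example library (this is illustrative only, not the preprocessing pipeline used in the talk; it assumes NLTK and its WordNet data are installed):

```python
# Contrast lemmatization (dictionary base forms) with stemming (heuristic suffix
# stripping). Illustrative sketch only; not the pipeline from the talk.
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

words = ["players", "images", "countries", "waves"]

for w in words:
    # Lemmatization: "countries" -> "country"; stemming: "countries" -> "countri".
    print(w, lemmatizer.lemmatize(w, pos="n"), stemmer.stem(w))
```

For English captions the two often give similar content words, but for highly inflected languages aggressive stemming can conflate forms that a caption generator would need to keep distinct.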
Not sure how realistic some of the queries are; e.g., if I am looking for a blue car and some sky I would not type 'blue car sky' but rather '"blue car" sky'.
One could type this; however, the query is still ambiguous, and the search engine could return a blue car and something that is not a sky, simply because the word "sky" happened to co-occur with the image. If one were to generate a natural language description for the image, such as "blue car against sky", then one could guarantee that the image does indeed depict a blue car and some sky.
Comment: Google image search is more than simply looking at surrounding text; they use a lot of clickthrough data as well (and very likely results obtained from Google Image Labeler, which is the commercial version of the ESP game).
Indeed, but imagine how much better the search engine would be if it actually exploited the synergy of visual and textual information. Furthermore, someone has to play the Google Image Labeler game, and it is unrealistic to assume that there will be human annotations for all images available on the web.
With regard to the slide (near the start of the presentation) with two scenic pictures at sunset, you mention that these would typically be described with descriptors such as 'birds sea sun waves', and that the images and their annotations are similar. However, it is also possible for these to have annotations such as 'holiday'. There are broader issues around tags and human information needs worth considering. Comment?
Indeed, this is a general problem with image annotation: people may disagree as to what the picture depicts. As far as I am aware, there are no publicly available agreement figures for the Corel images. However, annotators were instructed to describe the objects in the image, so this probably prevented them from coming up with more general labels such as 'holiday'. This is also an issue with the captions we use to train our models. Captions can be denotative (describing the objects the image depicts) but also connotative (describing sociological, political, or economic attitudes reflected in the image). Importantly, our images are not standalone; they come with news articles whose content is shared with the image. So, by processing the accompanying document, we can effectively learn about the image and reduce the effect of noise due to the approximate nature of the caption labels.
There is variability in the results when using latent semantic techniques. This makes us think about the empirical validation of the models we build upon PLSA/LDA/... How should we choose a priori the correct (i.e., best-performing) model?
Previous latent variable models have mostly been evaluated on the Corel database or similar datasets. This is important for comparing different approaches, but it does not entail that these models will perform well or scale to datasets of a different nature. For the tasks I presented here, one could possibly rule out PLSA and CorrLDA from first principles. PLSA is not a well-defined generative model of documents: there is no natural way to use it to assign probabilities to previously unseen documents, and this is critical in our applications. In CorrLDA, word topic assignments are drawn from the image regions, which are in turn drawn from a Gaussian distribution. This modeling choice places a lot of weight on the image preprocessing. The latter may be of higher quality for the Corel dataset, but our images are noisier and more complex. Moreover, CorrLDA assumes that annotation keywords must correspond to image regions. This assumption is too restrictive in our setting, where a single keyword may refer to many objects or persons in an image (e.g., the word "badminton" is used to collectively describe an image depicting players, shuttlecocks, and rackets).
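To make the contrast concrete, here is a small sketch of how a proper generative model such as LDA can be queried for a topic distribution over a previously unseen document, which is exactly what PLSA lacks a natural mechanism for. It uses gensim and a tiny hypothetical toy corpus purely for illustration; this is not the model or data from the talk.

```python
# Illustrative sketch: LDA's generative story lets us infer topic proportions
# for documents never seen at training time. Uses gensim as an example library.
from gensim import corpora, models

# Tiny toy corpus of caption-like token lists (hypothetical data).
train_docs = [
    ["players", "shuttlecock", "racket", "court"],
    ["badminton", "players", "match", "court"],
    ["sea", "sun", "waves", "birds"],
    ["sunset", "sea", "birds", "sky"],
]
dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

# A previously unseen caption: LDA infers a topic mixture for it directly,
# which is the property needed for annotation and retrieval of new documents.
unseen = dictionary.doc2bow(["birds", "sea", "court"])
print(lda.get_document_topics(unseen, minimum_probability=0.0))
```

Under PLSA, the document index is itself a parameter of the model, so handling a new document requires an ad hoc folding-in step rather than inference under the original generative process.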