145 — Moments in Time Dataset: one million videos for event understanding

Monfort et al. (arXiv:1801.03150)

Read on 12 January 2018
#moments  #dataset  #mturk  #video  #audio  #labels  #annotation 

This paper announces the release of the Moments in Time dataset, a manually annotated collection of videos that each run about three seconds. The videos capture human-recognizable "moments": an object falling, people shaking hands, a dog barking, and so on.

The dataset contains one million of these videos, each paired with a label describing what it depicts. Some labels are simple, like "falling"; others are more abstract, like "saving" or "ruining".
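As a loose sketch, the dataset can be thought of as a big table pairing each short clip with a single verb label. The file name and column layout below are assumptions for illustration, not the dataset's official format:

```python
import csv
from collections import Counter

def load_index(path):
    """Load a hypothetical index file where each row is: video_file, label."""
    clips = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            video_file, label = row[0], row[1]
            clips.append((video_file, label))
    return clips

clips = load_index("moments_index.csv")   # hypothetical index file
print(len(clips))                         # ~1,000,000 clips
print(Counter(label for _, label in clips).most_common(5))  # e.g. "falling", "opening", ...
```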

The videos were first annotated with free-form tags, which were then validated on Mechanical Turk with yes/no questions along the lines of "Do you see or hear someone cooking in this clip?"
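A minimal sketch of how such binary worker responses could be aggregated into accepted clip labels, assuming a simple majority-vote rule (the aggregation rule, thresholds, and data structures are my own illustration, not details from the paper):

```python
from collections import defaultdict

def validate_labels(responses, min_votes=3):
    """responses: iterable of (clip_id, candidate_label, answer), where answer
    is True for "yes, I see/hear this action" and False otherwise.
    Accept a (clip, label) pair when enough workers answered and most said yes."""
    votes = defaultdict(list)
    for clip_id, label, answer in responses:
        votes[(clip_id, label)].append(answer)

    accepted = []
    for (clip_id, label), answers in votes.items():
        if len(answers) >= min_votes and sum(answers) > len(answers) / 2:
            accepted.append((clip_id, label))
    return accepted

responses = [
    ("vid_001", "cooking", True),
    ("vid_001", "cooking", True),
    ("vid_001", "cooking", False),
]
print(validate_labels(responses))  # [('vid_001', 'cooking')]
```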

This dataset has obvious applications for deep-learning or context-aware machine-learning systems that need to understand the relationship between moments and the emotions or signals they convey. The authors also point out cases where the order of occurrence matters: seeing a person with their hand on the refrigerator door is ambiguous, since it could represent either "opening" or "closing", and which one is not clear until the next few frames play in sequence.