71 — Deep Spatial Regression Model for Image Crowd Counting

Yao et al (1710.09757)

Read on 31 October 2017
#computer-vision  #LSTM  #CNN  #DSRM  #neural-network 

This paper is the largest paper to have ever been written, period.

This is the biggest paper and it’s also — and you people have to understand, I have read a lot of papers — no one’s read more papers than I have, that’s for sure — and don’t, because if you’re including Princeton, lot of great people at Princeton, trust me, a good friend of mine went to Princeton and, anyway — but it was a record, you know? It was a record. Biggest paper. Ever. Period.


How do you estimate the number of people in a crowd? It’s difficult because of occlusion (objects passing in front of people, blocking out parts or even groups of humans) and density distribution (the people at the back of a concert will be less densely packed than the ones right up front).

Rather than segmenting and counting individual humans, this CNN instead approximates density for a smaller set of superpixels, and then from patch densities estimates total count as a function of density.

One advantage of this method is that it is agnostic to camera position; the camera can be placed aerially or at ground-level. The image does not need to be segmented, so crowd attendees can be different sizes due to perspective, or at different rotations.

Though the method does not always deliver an exact count, the CNN + LSTM is more reliable than many existing methods, and paves the way for further development.

…if you can even trust it to CNN these days.