86 — Web Robot Detection in Academic Publishing

Lagopoulos et al. (arXiv:1711.05098)

Read on 15 November 2017
#internet  #crawlers 

This work focuses on the detection of web-crawling robots on academic publishing sites. Robots, the paper explains, make up around half of internet traffic to these websites, and so it is important to exclude crawlers when determining the “organic” popularity of a website or web asset.

This paper applies several machine-learning techniques to "learn" which users are crawlers and which are real humans, using web-log data from both real servers and experimentally defined ones.
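As a rough illustration of this kind of setup (not the authors' actual pipeline), one could train a session-level classifier over features derived from the logs. The feature names and the choice of scikit-learn's GradientBoostingClassifier below are my assumptions, purely for illustration.

```python
# Illustrative sketch only: session-level robot/human classification over
# features derived from web-server logs. The feature set and classifier
# choice are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Each row is one user session; columns are hypothetical log-derived features:
# request count, mean inter-request time (s), fraction of 4xx responses,
# fraction of requests with an empty referrer.
X = np.array([
    [120, 0.4, 0.30, 0.95],   # fast, referrer-less, error-prone -> robot-like
    [  8, 35.0, 0.00, 0.10],  # slow, link-driven browsing -> human-like
    [300, 0.1, 0.25, 0.99],
    [ 15, 20.0, 0.05, 0.20],
])
y = np.array([1, 0, 1, 0])  # 1 = robot, 0 = human

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=2)
print("cross-validated accuracy:", scores.mean())
```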

The authors employ a few main techniques for making these determinations. First, they inspect traffic patterns, using information about a user's flow around the website, and to and from neighboring websites, to determine whether that user is a robot; a small sketch of this idea follows.
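To make the traffic-pattern idea concrete, here is a minimal sketch of turning one session's ordered log entries into such features. The specific features (inter-request timing, crawl breadth, referrer usage) are common robot-detection signals and are assumptions on my part, not necessarily the paper's exact feature set.

```python
# Illustrative sketch: derive simple traffic-pattern features from one
# session's ordered log entries. Feature choices are assumptions.
from statistics import mean

def traffic_features(entries):
    """entries: list of dicts with 'timestamp' (float, seconds),
    'path' (str), and 'referrer' (str or None)."""
    timestamps = [e["timestamp"] for e in entries]
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        # Request pacing: robots often fire requests in rapid bursts.
        "mean_gap_seconds": mean(gaps) if gaps else 0.0,
        # Crawl breadth: robots tend to touch many distinct pages.
        "unique_paths": len({e["path"] for e in entries}),
        # Navigation flow: humans usually arrive via links, so the referrer is set.
        "frac_no_referrer": sum(1 for e in entries if not e["referrer"]) / len(entries),
    }

session = [
    {"timestamp": 0.0, "path": "/article/1", "referrer": None},
    {"timestamp": 0.3, "path": "/article/2", "referrer": None},
    {"timestamp": 0.5, "path": "/article/3", "referrer": None},
]
print(traffic_features(session))
```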

They also use page content to calibrate this determination, which is a novel contribution of this work. The authors use the content of the crawled pages to determine whether the user is pursuing a topic of interest, as a human reader might, or is indiscriminately clicking links and exploring pages, as a robot would.
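A rough sketch of this content-based signal, under my own simplifying assumptions: represent each requested page's text with TF-IDF and measure how similar the pages within a session are to one another. A focused human reader should produce high pairwise similarity; an indiscriminate crawler should not. The paper's actual semantic features may well be computed differently.

```python
# Illustrative sketch: score a session's topical consistency from the text
# of the pages it requested. TF-IDF + cosine similarity is an assumption;
# the paper's semantic features may differ.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topic_consistency(page_texts):
    """Mean pairwise cosine similarity of TF-IDF vectors of the pages
    visited in one session (higher = more topically focused)."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(page_texts)
    pairs = list(combinations(range(len(page_texts)), 2))
    sims = [cosine_similarity(tfidf[i], tfidf[j])[0, 0] for i, j in pairs]
    return sum(sims) / len(sims)

focused = [
    "graph neural networks for molecule property prediction",
    "message passing neural networks on molecular graphs",
    "molecular property prediction with graph convolutions",
]
scattered = [
    "graph neural networks for molecule property prediction",
    "byzantine fault tolerance in distributed consensus",
    "soil moisture estimation from satellite imagery",
]
print("focused session:  ", topic_consistency(focused))
print("scattered session:", topic_consistency(scattered))
```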

This work makes it possible to better distinguish robots from human users, improving the accuracy of viewership statistics in academic publishing and likely helping to prevent artificial inflation of "view" or "interest" counts by automated readers.