93 — Gene name errors are widespread in the scientific literature
Ziemann et al (10.1186/s13059-016-1044-7)
Read on 21 November 2017
The gene for Membrane-Associated Ring Finger (C3HC4) 1, often abbreviated MARCH1, looks one hell of a lot like the date March 1. But, dearest reader, do not be misled!
Ziemann, Eren, and El-Osta screened over thirty-five thousand supplementary files from nearly 20 journals published between 2005 and 2015 for tabulated data files (such as Excel, CSV, or other common data formats). Of these, 7467 contained lists of genes.
Of the journals with gene-list supplements, on average, 20% of the files contained a gene name that was erroneously converted to a date.
Nature had the highest error rate of any journal surveyed in this study, with over 30% of gene-listing papers affected. The journals with the lowest affected-paper rates were usually genetics/genomics journals, where presumably either the editors vetted these documents with this error in mind (the first paper to bring this issue to light was Zeeberg & Riss’s 2004 BMC Bioinformatics paper, which likely tipped off the genomics community) or the authors were more conscientious with gene-name formatting.
Either way, nearly a decade and a half after that 2004 paper, these mistakes in gene-name formatting are still extremely widespread. In the least destructive case, these errors will fail to parse correctly in further reviews and scientists will be forced to rewrite their parsers to accommodate this failure mode. It is, however, very unlikely that these errors will be detected in review, and as a result, survey studies may report skewed results for these ‘mistakable’ genes.
The code used in this study is now available online. I hope journal editors might consider automating this parsing process on all incoming supplemental datasets to ensure that the published versions reflect the import of the research, and not the flaws in the spreadsheet software.
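To make the idea of an automated check concrete, here is a minimal Python sketch of what such a screen might look like. It is not the code from the study: the `DATE_LIKE` pattern and the `find_date_like_cells` helper are hypothetical names of my own, and the pattern assumes the mangled symbols show up as day-month or month-day text such as `1-Mar` or `Sep-02`. Cells that a spreadsheet has stored as true dates in another format (for example `2016-03-01`) would need additional patterns.

```python
import csv
import io
import re

# Assumed forms of a date-converted gene symbol: "1-Mar", "2-Sep", "Mar-01".
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)


def find_date_like_cells(rows):
    """Return (row_index, column_index, value) for each cell that looks like a date."""
    hits = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if DATE_LIKE.match(cell.strip()):
                hits.append((r, c, cell))
    return hits


# A made-up supplementary table in which two gene symbols were converted.
sample = "gene,score\nTP53,0.91\n1-Mar,0.42\nSep-02,0.13\nBRCA1,0.77\n"
rows = list(csv.reader(io.StringIO(sample)))
hits = find_date_like_cells(rows)

for r, c, value in hits:
    print(f"row {r}, column {c}: suspicious value {value!r}")
```

A check like this only flags candidates. Recovering the original symbol still needs a human or the source data, since `1-Mar` could have started life as MARCH1 or MAR1.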