Focusing on big data: is seeing believing?

Both honest mistakes and the deliberate manipulation of data can affect the quality of published research. In a recent Forbes article, Kalev Leetaru delves into how bad data practice impacts scientific publishing.

The research and publishing communities prioritise new discoveries, but this can be at the expense of full data documentation and validation. Leetaru suggests that this is particularly the case in the age of ‘big data’, where large datasets can be misunderstood in the race to a breakthrough. He classifies the current status of bad data practice under five broad themes and suggests possible solutions:

  • Honest statistical/computing error. Even a simple calculation error in a spreadsheet can drastically alter the understanding of a particular dataset. Statistical review processes may identify such errors, but only full disclosure of raw data, software and workflows can ensure they become known.
  • Honest misunderstanding of data. This can include a failure to understand the limitations of particular data sources, such as solely utilising English language Western-origin news sources to study global trends. The conclusions being drawn from such data may be statistically sound, yet largely irrelevant to the question being posed.
  • Honest misapplication of methods. Powerful statistical and analytical software packages may be freely available, but if used by researchers unfamiliar with their applications and limitations, the output may be unreliable. Only full documentation of the specific tools, algorithms and parameters can allow such errors to be identified.
  • Honest failure to normalise. This can be an issue in media analyses; for example, reporting changes in the number of news articles published on a specific topic over time is meaningless without also reporting changes in the total number of published articles over the same time period.
  • Malicious manipulation. Image doctoring and deliberate data falsification are two particularly egregious examples of alleged data fraud, and highlight the need for journals to be more vigilant for the possibility of manipulated data.
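
To make the normalisation point concrete, here is a minimal Python sketch. The yearly figures and the function name `topic_share` are invented purely for illustration; they do not come from Leetaru's article. The sketch divides the number of articles on a topic by the total number of articles published in the same period.

```python
# Hypothetical yearly counts, invented purely for illustration.
topic_articles = {2014: 1200, 2015: 1800, 2016: 2400}
total_articles = {2014: 100_000, 2015: 150_000, 2016: 200_000}


def topic_share(topic_counts, total_counts):
    """Return the topic's share of all published articles for each period."""
    shares = {}
    for period, count in topic_counts.items():
        total = total_counts[period]
        if total <= 0:
            raise ValueError(f"total for {period} must be positive")
        shares[period] = count / total
    return shares


shares = topic_share(topic_articles, total_articles)

# Print the raw count next to the normalised share for each year.
for year in sorted(shares):
    print(year, topic_articles[year], f"{shares[year]:.2%}")
```

In this made-up example the raw count doubles between 2014 and 2016, yet the topic's share of all coverage stays the same, so the apparent surge of interest simply reflects growth in the total number of articles.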

Leetaru notes that errors can also propagate through scientific publishing as authors copy and paste incorrect information from one paper into another. As poor data practice can result in the broad acceptance of questionable conclusions as fact, Leetaru appeals to journals to take action and adopt dedicated data review processes to eliminate these (mostly unintentional) errors.


Summary by Julia Draper, DPhil

Julia Draper is a biomedical researcher and freelance writer. Her postdoctoral research background is in leukaemia biology and developmental haematopoiesis. Julia is open to being contacted regarding career opportunities in medical communications at
