What to do when batch effects get in the way of the biology you want to study

Even with the best laboratory practices, many genomic measurements are affected more by “nuisance factors” than by variation in the biology of interest.

These so-called “batch effects” can arise from differences between labs, sequencing machines, reagents and technicians, often showing up as day-to-day variation.

Thus, a classic way to get a lively reaction from a biostatistician is to announce proudly that you made sure all the normal subjects were sequenced on a Monday and all the diseased subjects were done on a Tuesday.

This gets an A+ for simplicity but an F for science, because there is no way to tell whether differences between measurements on diseased and normal subjects are due to genuine biological differences (the ones you are interested in) or to day-to-day variation in the sequencing process (one of many nuisance factors).

If this is a revelation to you, get thee to an experimental design course!

But even with the best experimental designs, batch effects are still a headache. Strategies that statistically remove batch effects run the risk of throwing away genuine biological variation.

This concern is at the heart of Harman, a method for risk-conscious correction of batch effects developed by Yalchin Oytam, Fariborz Sobhanmanesh, Jason Ross, Megan Osmond-McLeod and Konsta Duesing, as part of a project supported by Transformational Biology and the former CFNS and CAFHS Divisions.

“In Turkish and Persian, Harman refers to a threshing yard where wheat is separated from chaff,” says Yalchin.

But how can a method do this when the “wheat” and “chaff” live in 10,000 or more dimensions?

“Ideally, and in a well-designed experiment, the data topographies of different batches should all be similar,” says Yalchin. “In essence, Harman allows us to adjust the data from each batch so their distributions become more similar, right up to the point where the topographies of different batches are identical.”

“Of course, at the extreme, it is easy to overcorrect and remove the interesting biological signal along with the batch effect noise—therein lies the risk. But Harman makes that risk explicit and adjustable by the analyst.”
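To make the idea a little more concrete, here is a minimal, purely illustrative sketch in Python. It is not the Harman algorithm, and none of the names come from the Harman package; it simply shows the general notion of nudging each batch's centre toward a common centre in principal-component space by an adjustable amount, where a coefficient of zero leaves the data untouched and a coefficient of one makes the batch centres coincide.

    # Purely illustrative sketch (not the Harman algorithm): shrink each
    # batch's centre toward the overall centre in principal-component
    # space by an adjustable amount. All names here are hypothetical.
    import numpy as np

    def toy_batch_adjust(data, batches, strength=0.5):
        """data: samples x features; batches: one label per sample;
        strength: 0 = no correction, 1 = batch centres made identical."""
        data = np.asarray(data, dtype=float)
        feature_means = data.mean(axis=0)
        centred = data - feature_means
        # Principal-component scores via singular value decomposition.
        u, s, vt = np.linalg.svd(centred, full_matrices=False)
        scores = u * s                      # samples x components
        adjusted = scores.copy()
        for b in np.unique(batches):
            mask = np.asarray(batches) == b
            batch_mean = scores[mask].mean(axis=0)
            # Move this batch's scores toward the overall mean (zero).
            adjusted[mask] -= strength * batch_mean
        # Map back to the original feature space.
        return adjusted @ vt + feature_means

    # Example: two batches of five samples, 100 features, with an
    # artificial shift added to the second batch.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(10, 100))
    data[5:] += 2.0
    batches = [0] * 5 + [1] * 5
    corrected = toy_batch_adjust(data, batches, strength=0.8)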

Experiments have already shown that Harman can improve on ComBat, the most popular and best-performing batch-correction method to date. “Essentially, with Harman, you have a performance curve reflecting the (variable) trade-off between signal preservation and batch noise rejection. The user can choose to be anywhere on this curve by adjusting the trade-off coefficient. At the outset, we expected ComBat to fall somewhere on the curve, except that this point might not always be the optimal one for any given application. As it turned out, in all three datasets we explored, ComBat fell below the performance curve of Harman, meaning that there was always a trade-off setting for Harman that gave better noise rejection and better signal preservation at the same time. And in a world of compromises, that is remarkable.”
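Continuing the hypothetical sketch above, one way to picture such a performance curve is to sweep the trade-off coefficient and record two stand-in metrics, one for batch-noise rejection and one for signal preservation. Both proxies are invented for illustration and are not the measures used in the Harman evaluation; the snippet reuses toy_batch_adjust, data and batches from the earlier sketch and assumes exactly two batches.

    # Continuing the toy sketch above: sweep the trade-off coefficient and
    # record two crude, hypothetical proxies: distance between the two
    # batch centroids (lower = more batch noise removed) and correlation
    # with the uncorrected data (higher = more original structure kept).
    import numpy as np

    def batch_separation(values, batches):
        groups = [values[np.asarray(batches) == b] for b in np.unique(batches)]
        centroids = [g.mean(axis=0) for g in groups]
        return float(np.linalg.norm(centroids[0] - centroids[1]))

    def signal_preservation(original, corrected):
        return float(np.corrcoef(original.ravel(), corrected.ravel())[0, 1])

    for strength in np.linspace(0.0, 1.0, 6):
        adjusted = toy_batch_adjust(data, batches, strength=strength)
        print(f"strength={strength:.1f}  "
              f"batch separation={batch_separation(adjusted, batches):.2f}  "
              f"signal preserved={signal_preservation(data, adjusted):.3f}")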

So far, Harman has been applied to transcriptome microarray, DNA methylation and mass spectrometry data, but it is suitable for any matrix of high-dimensional data.

A free executable software package for Harman is available for download.