Monday, February 15, 2016

The Hidden Data: let's share & recycle it




Riding on the success of Open Access, the demand for Open Data is gaining momentum. Scientists do experiments, collect data, analyse them, and draw conclusions. However, a scientific publication reflects all these processes only in brief. Methods of experiments and data collection get the least space in a paper. Often, this leaves most of the crucial steps of an experiment to the imagination of readers. Raw data metamorphose into graphs and images. Statistical analyses prove their existence only in a few stars, somewhere on the graphs.

The main focus of a scientific article is storytelling. Just like a film director, a scientist directs the readers through the paper in a chosen way. You learn the story that the author wants to tell you.

No, I am not questioning the integrity of authors. They must have valid and honest stories to share. But that does not exclude the possibility of multiple other stories hidden in the data. Those can be unearthed only when you allow everyone to look into the data; when you allow everyone to think over your observations. After all, science is communal.

Scientific data, particularly in Biology, come in different forms. They can be images, videos, numerical values recorded in spreadsheets, sounds collected from the field, and even living organisms. With that diversity comes the volume of the data. Obviously, you cannot share such a multitude of information through print media.

Thanks to digital media, we can now store and share different types of data easily (the only exception being living organisms). With the enormous advances in data storage and cloud computing, we have no excuse to hide our data in lab notebooks and desktops.

As the demand for openness has increased, journals like PLoS, Science, and Nature are now promoting data sharing to different extents. Some funding agencies have also mandated such data sharing. However, there is no consensus across the board, and many are raising apprehensions about misappropriation of raw data.

However, the debate mostly revolves around disclosure of data from large-scale studies, like clinical trials. But small-scale experiments, performed every day by most labs, meet the same fate. The observations are cherry-picked, arranged, and then packaged in suitable graphical forms to present to peers.

Say you want to show that a drug inhibits insulin signaling. For this, you need to identify the correct doses of insulin and of the drug, and the required treatment time. Therefore, experiments are performed to identify the optimal doses and time. Eventually, you will perform an experiment at those optimum conditions. Observations of this experiment would be presented in a graphical form to substantiate your claim. For your story, the elaborate dose- and time-dependent observations are not crucial, so they are lost somewhere in your lab records.

But those hidden data may be crucial for someone working on the kinetics of insulin signaling. Although you have already done the experiment, they have to do it all over again.

Even when such data are published, they are mostly in graphical form. Graphs are good for communicating ideas and conclusions, but they are not suitable for data reuse. I cannot get the exact numerical values of the measured variables from a graph. Often, raw data are transformed before plotting (remember % cell viability in an MTT experiment). Without adequate information, it is impossible to get back to the original values. The data are out in the open, but we cannot reuse them.
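To see why a transformed value cannot be traced back, here is a minimal sketch in Python. It assumes one common way of computing % cell viability from absorbance readings (blank-subtracted treated signal divided by blank-subtracted control signal); the function name and all the numbers are hypothetical, made up only for illustration.

```python
def percent_viability(a_treated, a_control, a_blank=0.0):
    """Percent viability relative to the untreated control, after blank subtraction.

    Assumed convention, not taken from any particular protocol.
    """
    return 100.0 * (a_treated - a_blank) / (a_control - a_blank)


# Two hypothetical plates with quite different raw absorbance readings.
plate_a = percent_viability(a_treated=0.45, a_control=0.85, a_blank=0.05)
plate_b = percent_viability(a_treated=0.70, a_control=1.30, a_blank=0.10)

# Both plates end up at the same percentage on the graph.
print(round(plate_a, 1), round(plate_b, 1))
```

Both plates come out at 50% viability, although their raw readings differ. Given only the plotted percentage, there is no way to tell which set of absorbances produced it.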

This is a frequent problem faced by people in mathematical biology. Experimental observations are abundant in Biology, but most of the published data are not suitable for use in mathematical modeling. There are several free tools, like DataThief, Graph Data Extractor, and WebPlotDigitizer, that can extract numerical data from graphs. These are very easy to use. But the quality of the extracted data depends on the quality, resolution, and size of the graph images. Even at their best, these extraction tools can give you only approximate values.
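The reason for the approximation is easy to sketch. What follows is not how any of those tools is actually implemented; it is a minimal Python illustration of the general idea, assuming a linear axis calibrated from two known tick marks. The function name and the pixel positions are hypothetical.

```python
def pixel_to_data(pixel, pixel_ref, data_ref):
    """Map a pixel coordinate to a data value on a linear axis.

    pixel_ref: pixel positions of two known axis ticks.
    data_ref:  the data values printed at those two ticks.
    """
    (p0, p1), (d0, d1) = pixel_ref, data_ref
    return d0 + (pixel - p0) * (d1 - d0) / (p1 - p0)


# Hypothetical y-axis: the tick for 0 sits at pixel row 400, the tick for 100 at row 100.
y_pixels = (400, 100)
y_values = (0.0, 100.0)

# A data point that the user clicks at pixel row 250.
value = pixel_to_data(250, y_pixels, y_values)

# How much the recovered value shifts if the click is off by a single pixel.
units_per_pixel = abs((y_values[1] - y_values[0]) / (y_pixels[1] - y_pixels[0]))

print(value, round(units_per_pixel, 3))
```

In this made-up image, one pixel corresponds to about a third of a unit on the y-axis, so a click that is off by a pixel or two already changes the recovered value. A smaller or blurrier image makes each pixel cover a wider range of values.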

Data extractors are useful, but they are not a solution to the problem of hidden and lost data. The best solution is to store all of our observations in freely accessible repositories.

There exist many databases for storing and sharing structured data, like sequence information, protein crystallographic data, and microarray data. However, most of us do not use these services religiously.

The trouble is greater for unstructured data, say all of our western blot images or the cytotoxicity data of drugs that we are testing on mammalian cells. Nobody shares the raw data of those experiments. Thankfully, several web services, like Figshare, have started to store unstructured data too. Best of all, they provide a DOI for everything stored there. The data are therefore identifiable and citable, so that you get due credit for them.

As individual scientists, we may start with baby steps. Once a paper is published, one can share the data behind the published results, along with unpublished background results, through such cloud services. It would definitely require time and effort to clean and structure the data before sharing. As a community, we have to encourage and appreciate such efforts.
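As a rough idea of what "clean and structure" could mean for a small experiment, here is a minimal Python sketch: one row per measurement in a plain CSV table, plus a small metadata note describing each column and its units. The column names, the metadata fields, and every number are hypothetical; this is one possible layout, not a format required by any repository.

```python
import csv
import io
import json

# Hypothetical dose-response readings; every value here is made up for illustration.
rows = [
    {"drug_dose_uM": 0.0, "time_min": 10, "replicate": 1, "signal_au": 1.00},
    {"drug_dose_uM": 1.0, "time_min": 10, "replicate": 1, "signal_au": 0.82},
    {"drug_dose_uM": 10.0, "time_min": 10, "replicate": 1, "signal_au": 0.41},
]

# Metadata kept alongside the table: what each column means and in which units.
metadata = {
    "description": "Example dose-response readings (illustrative only)",
    "columns": {
        "drug_dose_uM": "drug concentration, micromolar",
        "time_min": "treatment time, minutes",
        "replicate": "replicate number",
        "signal_au": "raw, untransformed signal, arbitrary units",
    },
}


def to_csv(records):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
metadata_text = json.dumps(metadata, indent=2)

print(csv_text)
print(metadata_text)
```

The two pieces of text can be saved as files and uploaded together. Keeping the raw, untransformed signal in the table, and stating the units in the metadata, is what lets someone else reuse the numbers without guessing.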

However, I wonder how this model of cloud storage of data can be sustained without financial support from funding agencies and academic institutions. Experimental biologists are churning out enormous amounts of data at every moment. To store those for eternity, in a publicly accessible repository, requires enormous financial support.

When I get a grant, it pays for my reagents and instruments. It also pays for lab stationery like lab notebooks, where I record my observations. Such grants should also cover the cost of storing those observations for the future.

Hence the need for public digital repositories for data. I am sure that, as the campaign for Open Data spreads, major funding agencies across the globe will chip in for such public repositories. Beyond science, it makes economic sense too.

In India, the idea and ideals of Open Access are slowly seeping in. DST and DBT have created repositories for papers published through their funding. Many institutions, like the IITs, have created publicly available digital repositories for theses and similar documents. Recently, work has started on a National Digital Library to integrate all such repositories. I hope that the scientific community and policy makers will soon realise the importance of data repositories.

Till then, let us share our data, code, and software through whatever means we have. Let us reuse and recycle every bit of information.