Monday, February 15, 2016

The Hidden Data: let's share & recycle it




Riding on the success of Open Access, the demand for Open Data is gaining momentum. Scientists do experiments, collect data, analyse them and draw conclusions. However, a scientific publication reflects all these processes only in brief. Methods of experiments and data collection get the least space in a paper. Often, this leaves most of the crucial steps of an experiment to the imagination of readers. Raw data metamorphose into graphs and images. Statistical analyses prove their existence only as a few stars, somewhere on the graphs.

The main focus of a scientific article is storytelling. Just like a film director, a scientist directs the readers through the paper in a chosen way. You learn the story that the author wants to tell you.

No, I am not questioning the integrity of authors. They must have valid and honest stories to share. But that does not exclude the possibility of multiple other stories hidden in the data. Those can be unearthed only when you allow everyone to look into the data; when you allow everyone to think over your observations. After all, science is communal.

Scientific data, particularly in Biology, come in different forms. They can be images, videos, numerical values recorded in spreadsheets, sounds collected in the field, and even living organisms. With that diversity comes the sheer volume of the data. Obviously, you cannot share such a multitude of information through print media.

Thanks to digital media, we can now store and share different types of data easily (the only exception being living data). With enormous developments in data storage and cloud computing, we have no excuse to hide our data in lab notebooks and desktops.

As the demand for openness has increased, journals like PLoS, Science and Nature are now promoting data sharing to different extents. Some funding agencies have also mandated such data sharing. However, there is no consensus across the board, and many are raising apprehensions about misappropriation of raw data.

However, the debate mostly revolves around disclosure of data from large-scale studies, like clinical trials. But small-scale experiments, performed every day by most labs, suffer the same fate. The observations are cherry-picked, arranged, and then packaged in suitable graphical forms to present to peers.

Say you want to show that a drug inhibits insulin signaling. For this, you need to identify the correct doses of insulin and of the drug, and the required treatment time. Therefore, experiments are performed to identify the optimal doses and time. Eventually, you perform an experiment at those optimum conditions. Observations from this experiment are presented in a graphical form to substantiate your claim. For your story, the elaborate dose- and time-dependent observations are not crucial, and those are lost somewhere in your lab records.

But those hidden data may be crucial for someone working on the kinetics of insulin signaling. Although you have already done the experiment, they have to do it again.

Even when such data are published, they are mostly in graphical form. Graphs are good for communicating ideas and conclusions. But they are not suitable for data reuse. I cannot get the exact numerical values of the measured variables from a graph. Often, raw data are transformed before plotting (remember the % cell viability of an MTT experiment). Without adequate information, it is impossible to get back to the original values. The data exist out in the open, but we cannot reuse them.
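To see why a transformed value cannot be traced back to the raw reading, here is a minimal Python sketch. It assumes one common convention for MTT data, where % viability is the blank-corrected absorbance of a treated well relative to the blank-corrected untreated control. The function name and all absorbance numbers are made up for illustration.

```python
def percent_viability(sample_abs, control_abs, blank_abs):
    """Blank-corrected absorbance of a treated well as % of the untreated control.

    This is an assumed convention, not taken from any particular paper.
    """
    return 100.0 * (sample_abs - blank_abs) / (control_abs - blank_abs)


# Two hypothetical experiments with clearly different raw absorbance readings.
experiment_1 = {"sample_abs": 0.45, "control_abs": 0.85, "blank_abs": 0.05}
experiment_2 = {"sample_abs": 0.70, "control_abs": 1.30, "blank_abs": 0.10}

viability_1 = percent_viability(**experiment_1)
viability_2 = percent_viability(**experiment_2)

# Both experiments report practically the same % viability.
print(viability_1, viability_2)
```

Both sets of raw readings collapse to the same percentage. Unless the control and blank readings are shared along with the graph, nobody can recover the absorbance values from the plotted number.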

This is a frequent problem faced by people in mathematical biology. Experimental observations are abundant in Biology. But most of the published data are not suitable for use in mathematical modeling. There are several free tools, like DataThief, Graph Data Extractor, and WebPlotDigitizer, that can extract numerical data from graphs. These are very easy to use. But the quality of the extracted data depends on the quality, resolution and size of the images of the graphs. Even at their best, these extraction tools can give you only approximate values.

Data extractors are useful but are not a solution to the problem of hidden and lost data. The best solution is to store all of our observations in freely accessible repositories.

There exist many databases for storage and sharing of structured data, like sequence information, protein crystallographic data and microarray data. However, most of us do not use these services religiously.

The trouble is greater for unstructured data, say all of our western blot images or the cytotoxicity data of drugs that we are testing on mammalian cells. Nobody shares the raw data of those experiments. Thankfully, several web services, like Figshare, have started to store unstructured data too. Best of all, they provide a DOI for everything stored there. Therefore, the data are identifiable and citable, so that you get due credit for them.

As individual scientists, we may start with baby steps. Once a paper is published, one can share the data behind the published results, and the unpublished background results, through such cloud services. It would definitely require time and effort to clean and structure the data before sharing. As a community, we have to encourage and appreciate such efforts.

However, I wonder how this model of cloud storage of data can be sustained without financial support from funding agencies and academic institutions. Experimental biologists are churning out enormous amounts of data every moment. To store those for eternity, in a publicly accessible repository, requires enormous financial support.

When I get a grant, it pays for my reagents and instruments. It also pays for lab stationery like lab notebooks, where I record my observations. Such grants should also cover the cost of storing those observations for the future.

Hence the need for public digital repositories for data. I am sure that, as the campaign for Open Data spreads, major funding agencies across the globe will chip in for such public repositories. Beyond science, it makes economic sense too.

In India, the idea and ideals of Open Access are slowly seeping in. DST and DBT have created repositories for papers published through their funding. Many institutions, like the IITs, have created publicly available digital repositories for theses and similar documents. Recently, work has started on a National Digital Library to integrate all such repositories. I hope that the scientific community and policy makers will soon realise the importance of data repositories.

Till then, let us share our data, code and software through whatever means we have. Let us reuse and recycle every bit of information.





Sunday, February 07, 2016

Refresh MathBio 101 with Zika

By now, you must have been introduced to Zika. You may also have heard the heart-wrenching stories of children born with small heads. They call it microcephaly. It is probably connected to Zika virus infection of pregnant mothers. The epidemic of Zika virus is causing havoc in some places in South America. WHO has recently declared it a Public Health Emergency of International Concern (PHEIC).



Doctors, scientists and public health workers across the globe are working hard to contain the disease and to develop vaccines and drugs against it. The conspiracy theorists are also working hard. I am sure you have read the articles flooding social media that connect the Zika epidemic with greedy pharma companies. Conspiracy or not, you must have wondered how, all of a sudden, this virus is causing such havoc. It seems as if it appeared from thin air and is spreading like an avalanche.

But there is nothing unusual about it. Epidemics spread like that, and mathematical models of epidemics explain such avalanches. Let us check one of the simple mathematical models of epidemics to understand the Zika epidemic. This model is taught in the introductory course in Mathematical Biology. Let's refresh MathBio 101.


Image source: BBC


An infectious disease spreads through contact between an infected and an uninfected person. In some cases, the contact may not be direct. For Zika, the contact is through mosquitoes. Whatever the mode of transfer, for the disease to spread there must be some infected people in the community. The size of that infected population may be very small, but an epidemic cannot start from zero.

Zika was first reported in a paper published in 1952. It was the first report of the isolation of this virus, from a rhesus monkey caged in the canopy of the Zika Forest of Uganda. There were subsequent sporadic reports of human infection by this virus across the globe. In 2015, there were reports of Zika infection in Brazil, the center of the current crisis. So, there was already an infected population with the potential to spread it to susceptible people.

Let us call the fraction of the population that is infected I. Let the fraction of susceptible people, who are not yet infected, be S. The disease spreads from I to S, and the size of the infected population (I) increases. With time, I can decrease, as some of the infected people recover, develop immunity or (sadly) die. Let us call the fraction of the population that has recovered or died R.

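As the infection spreads, the sizes of these three populations change with time. We can write three Ordinary Differential Equations (ODEs) to capture these population dynamics:

dS/dt = -a.S.I          (eq. 1)
dI/dt = a.S.I - b.I     (eq. 2)
dR/dt = b.I             (eq. 3)

Here, a and b are constants.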




Remember, here, S + I + R = 1 and people who have recovered (R) do not get infected again.

This is called the SIR model of epidemics. It is a generic model. We have not considered any particular mechanism of spread of infection or any particular means of disease remission. It relies on the simple idea that infection spreads by interaction between susceptible and infected people, and that some infected people either die or recover.

We can simulate this model by numerical integration of these three ODEs. For that, we require numerical values for the constants a and b. In this model, a roughly represents how frequently a susceptible person gets infected when he/she comes in contact with an infected one. For Zika, this would depend upon the mosquitoes, their numbers and their behavior, and also on the behavior of the virus.

The other constant, b, represents how frequently infected people either die or recover. Again, this will depend upon the health of individuals, the condition of their immune systems, the existence of healthcare facilities and also on the virus.

For simulation, let us take a = 0.2 and b = 0.05. We also have to specify the values of S, I and R at the beginning (t = 0). Say those are S = 0.999, I = 0.001 (very few people are infected) and R = 0.

We have simulated the SIR model with these values. The results are shown here in the figure. 


Initially, most of the people are uninfected. With time, the number of infected people increases exponentially. Some of the infected people either die or recover. Therefore, the size of the infected population, I, cannot increase forever. It reaches a peak and then starts falling. Remember, those who die or recover do not get infected again. As R increases, there are fewer and fewer people left to be infected, and the disease stops spreading further.
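The simulation described above can be reproduced, roughly, with a few lines of Python. This is only a minimal sketch: it assumes the standard SIR equations (dS/dt = -a.S.I, dI/dt = a.S.I - b.I, dR/dt = b.I), time measured in days, a simulation length of 200 days and a hand-written fourth-order Runge-Kutta integrator. The function names are my own, and this is not necessarily how the original figure was made.

```python
def sir_derivatives(state, a, b):
    """Right-hand side of the SIR model for state = (S, I, R)."""
    s, i, r = state
    return (-a * s * i, a * s * i - b * i, b * i)


def rk4_step(state, dt, a, b):
    """Advance the state by one time step dt with the classical RK4 scheme."""
    def shifted(slope, factor):
        return tuple(x + factor * k for x, k in zip(state, slope))

    k1 = sir_derivatives(state, a, b)
    k2 = sir_derivatives(shifted(k1, dt / 2), a, b)
    k3 = sir_derivatives(shifted(k2, dt / 2), a, b)
    k4 = sir_derivatives(shifted(k3, dt), a, b)
    return tuple(
        x + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
        for x, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(s0, i0, r0, a, b, days, dt=0.1):
    """Return a list of (t, S, I, R) tuples from t = 0 to t = days."""
    state = (s0, i0, r0)
    history = [(0.0, *state)]
    for step in range(1, int(round(days / dt)) + 1):
        state = rk4_step(state, dt, a, b)
        history.append((step * dt, *state))
    return history


# Constants and initial values from the text; 200 days is an assumed duration.
history = simulate_sir(s0=0.999, i0=0.001, r0=0.0, a=0.2, b=0.05, days=200)

# Find the row where the infected fraction I is largest.
peak_time, _, peak_i, _ = max(history, key=lambda row: row[2])
print(f"I peaks at about {peak_i:.2f} around day {peak_time:.0f}")
```

Plotting the S, I and R columns of history against time should show the rise, peak and fall of I described above.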

Check the second ODE (eq. 2) carefully. It represents the rate of change of I with time.
When a = b/S,
dI/dt = (b/S).S.I - b.I = 0.
That means, when a = b/S, the size of the infected population does not change and the infection will not spread.

Suppose some people are infected with the virus, but the size of the infected population is very small. Say I = 0.001 and S = 0.999. As in the previous simulation, consider b = 0.05. So, b/S = 0.05005.

For some reason (maybe the weather), the constant a is very low. Say a = 0.05005. This makes a = b/S. Therefore, dI/dt will be zero and the size of the infected population will not change with time. Though the infection is circulating in the community, it will not become an epidemic.
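The arithmetic above is easy to check. The short Python sketch below evaluates dI/dt = a.S.I - b.I for the low value of a used here and, for comparison, for a larger value (a = 0.3, chosen only as an example of a above the threshold). The function name is my own.

```python
def infected_growth_rate(a, b, s, i):
    """dI/dt from the second ODE of the SIR model: a*S*I - b*I."""
    return a * s * i - b * i


b, s, i = 0.05, 0.999, 0.001
threshold = b / s  # the value of a at which dI/dt becomes zero

rates = {}
for a in (0.05005, 0.3):
    rates[a] = infected_growth_rate(a, b, s, i)
    print(f"a = {a}: b/S = {threshold:.5f}, dI/dt = {rates[a]:.2e}")
```

Because 0.05005 is b/S rounded to five decimal places, the first rate comes out extremely close to zero rather than exactly zero. For the larger a, the rate is clearly positive, so the infected population grows.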

Suppose that, after 100 days, something happens; say the mosquito population increases enormously due to a change in weather. This will change the constant a. Now, say a = 0.3. As a becomes greater than b/S, the infection will start spreading very fast and the size of the infected population, I, will increase exponentially.

Now we have a full-blown epidemic. Eventually, it will recede as people recover, become immune or die. These dynamics of the sudden appearance of an epidemic are shown in the following figure. Here, till 100 days (shown by the arrow), a = 0.05005. After that, a is changed to 0.3.
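This scenario can also be sketched in Python. The snippet below assumes the standard SIR equations, time in days, a total duration of 300 days and a hand-written fourth-order Runge-Kutta integrator. The constant a is held fixed within each small time step and switched on day 100. All names are my own, and this is an illustration rather than the code behind the figure.

```python
B = 0.05            # constant b, as in the text
SWITCH_DAY = 100.0  # day on which the constant a changes


def contact_constant(t):
    """Value of a at time t: 0.05005 before day 100, 0.3 from day 100 on."""
    return 0.05005 if t < SWITCH_DAY else 0.3


def derivatives(state, a):
    """Right-hand side of the SIR model for state = (S, I, R)."""
    s, i, r = state
    return (-a * s * i, a * s * i - B * i, B * i)


def rk4_step(state, a, dt):
    """One classical RK4 step with a held constant during the step."""
    def shifted(slope, factor):
        return tuple(x + factor * k for x, k in zip(state, slope))

    k1 = derivatives(state, a)
    k2 = derivatives(shifted(k1, dt / 2), a)
    k3 = derivatives(shifted(k2, dt / 2), a)
    k4 = derivatives(shifted(k3, dt), a)
    return tuple(
        x + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
        for x, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )


dt = 0.1
state = (0.999, 0.001, 0.0)  # S, I, R at t = 0
history = [(0.0, *state)]
for step in range(3000):     # 3000 steps of 0.1 day = 300 days (assumed)
    state = rk4_step(state, contact_constant(step * dt), dt)
    history.append(((step + 1) * dt, *state))

infected_at_switch = history[int(round(SWITCH_DAY / dt))][2]
peak_time, _, peak_i, _ = max(history, key=lambda row: row[2])
print(f"I on day 100: {infected_at_switch:.5f}")
print(f"I peaks at about {peak_i:.2f} around day {peak_time:.0f}")
```

For the first 100 days, I stays practically at its starting value. Only after a is raised does the infected fraction take off and then recede.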


This is one of the simple models for epidemics. This may not correctly explain the current Zika crisis. There are many complicated models for epidemics. Some are disease specific and consider finer details. 

Even then, this simple model explains how, all of a sudden, an epidemic can start like an avalanche. It also explains how common precautions, like using mosquito nets or vaccination, reduce the chance of an epidemic. All these steps reduce the value of the constant a. As long as we keep a less than or equal to b/S, we are safe.

Update: Very recently, a paper was published that models the transmission dynamics of Zika virus in French Polynesia. There was an outbreak of Zika in these islands in 2013-14.

They have used a mathematical model very similar to the SIR model, only a bit more elaborate.

The model includes the dynamics of the mosquito population. There is a susceptible mosquito population (Sv) that can get the virus from infectious people (IH). Once they have the virus, we call them exposed (Ev). Some of these exposed mosquitoes become infectious (Iv) and infect susceptible humans (SH).


They have also included a human population (EH) that has been exposed to Zika through mosquito bites but in which the infection is still in the latent stage.

This is called a susceptible-exposed-infectious-removed (SEIR) model. Just like in the SIR model, ODEs are used to model the dynamics of all the populations. Here, we have seven different populations, so they have used seven ODEs. They have one additional ODE, for the cumulative number of infected people.
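To give a feel for how such a model is put together, here is a generic textbook-style sketch in Python of a human-mosquito SEIR system with seven populations and one cumulative count. It is not the model from the paper: the equations are my own simplified assumption, every parameter value is hypothetical, and the cumulative count here tracks people who have become infectious. For the actual equations and fitted values, refer to the paper.

```python
# Hypothetical parameters (per day); none of these come from the paper.
PARAMS = {
    "beta_h": 0.3,         # infectious mosquito -> susceptible human
    "beta_v": 0.3,         # infectious human -> susceptible mosquito
    "alpha_h": 1.0 / 6.0,  # exposed humans becoming infectious
    "gamma": 1.0 / 5.0,    # infectious humans being removed
    "alpha_v": 1.0 / 10.0, # exposed mosquitoes becoming infectious
    "mu": 1.0 / 8.0,       # mosquito birth rate = death rate
}


def derivatives(state, p):
    """Right-hand side for (SH, EH, IH, RH, Sv, Ev, Iv, cumulative)."""
    sh, eh, ih, rh, sv, ev, iv, cumulative = state
    new_human = p["beta_h"] * sh * iv      # new human exposures
    new_mosquito = p["beta_v"] * sv * ih   # new mosquito exposures
    return (
        -new_human,
        new_human - p["alpha_h"] * eh,
        p["alpha_h"] * eh - p["gamma"] * ih,
        p["gamma"] * ih,
        p["mu"] - new_mosquito - p["mu"] * sv,
        new_mosquito - (p["alpha_v"] + p["mu"]) * ev,
        p["alpha_v"] * ev - p["mu"] * iv,
        p["alpha_h"] * eh,                 # people who became infectious
    )


def rk4_step(state, dt, p):
    """Advance all eight quantities by one classical RK4 step."""
    def shifted(slope, factor):
        return tuple(x + factor * k for x, k in zip(state, slope))

    k1 = derivatives(state, p)
    k2 = derivatives(shifted(k1, dt / 2), p)
    k3 = derivatives(shifted(k2, dt / 2), p)
    k4 = derivatives(shifted(k3, dt), p)
    return tuple(
        x + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
        for x, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )


# Fractions at t = 0: a few infectious people, all mosquitoes susceptible.
state = (0.999, 0.0, 0.001, 0.0, 1.0, 0.0, 0.0, 0.0)
cumulative_history = [state[7]]
for _ in range(3650):  # 3650 steps of 0.1 day = 365 days (assumed)
    state = rk4_step(state, 0.1, PARAMS)
    cumulative_history.append(state[7])

final_state = state
print("fraction of people infectious at some point:", round(final_state[7], 3))
```

The structure is the same as in the SIR model: each population gets one ODE, and each arrow between two populations appears once with a minus sign and once with a plus sign.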

For details, look into the paper. It is freely available at bioRxiv.

By fitting the model to the population data of the outbreak, they have made an interesting prediction. Suppose people who recover from infection get life-long natural immunity to Zika. In that case, the model predicts that it would take at least a decade before a re-invasion of Zika in this island population. Some relief for the health workers!!