Showing posts with label Open Access. Show all posts

Monday, February 15, 2016

The Hidden Data: let's share & recycle it




Riding on the success of Open Access, the demand for Open Data is gaining momentum. Scientists do experiments, collect data, analyse them and draw conclusions. However, a scientific publication reflects all these processes only in brief. Methods of experiments and data collection get the least space in a paper. Often, this leaves most of the crucial steps of an experiment to the imagination of readers. Raw data metamorphose into graphs and images. Statistical analyses show their existence only as a few stars, somewhere on the graphs.

The main focus of a scientific article is storytelling. Just like a film director, a scientist directs the readers through the paper in a chosen way. You learn the story that the author wants to tell you.

No, I am not questioning the integrity of authors. They must have valid and honest stories to share. But that does not exclude the possibility of multiple other stories hidden in the data. Those can be unearthed only when you allow everyone to look into the data; when you allow everyone to think over your observations. After all, science is communal.

Scientific data, particularly in Biology, come in different forms. They can be images, videos, numerical values recorded in spreadsheets, sounds collected in the field, and even living organisms. With that diversity comes the volume of the data. Obviously, you cannot share such a multitude of information through print media.

Thanks to digital media, we can now store and share different types of data easily (the only exception being living organisms). With the enormous developments in data storage and cloud computing, we have no excuse to hide our data in lab notebooks and desktops.

As the demand for openness has increased, journals like PLoS, Science and Nature are now promoting data sharing to different extents. Some funding agencies have also mandated such data sharing. However, there is no consensus across the board, and many are raising apprehensions about misappropriation of raw data.

However, the debate mostly revolves around disclosure of data from large-scale studies, like clinical trials. But small-scale experiments, performed every day by most labs, meet the same fate. The observations are cherry-picked, arranged, and then packaged in suitable graphical forms to present to peers.

Say you want to show that a drug inhibits Insulin signaling. For this, you need to identify the correct doses of insulin and of the drug, and the required treatment time. Therefore, experiments are performed to identify the optimal doses and time. Eventually, you will perform an experiment at those optimal conditions. Observations of this experiment would be presented in a graphical form to substantiate your claim. For your story, the elaborate dose- and time-dependent observations are not crucial, and those are lost somewhere in your lab records.

But that hidden data may be crucial for someone working on the kinetics of Insulin signaling. Although you have already done the experiment, they have to do it again.

Even when such data are published, they are mostly in graphical form. Graphs are good for communicating ideas and conclusions. But they are not suitable for data reuse. I cannot get the exact numerical values of measured variables from a graph. Often, raw data are transformed before plotting (remember the % cell viability of an MTT experiment). Without adequate information, it is impossible to get back to the original values. The data exist, wide open, but we cannot reuse them.
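
To make the point concrete, here is a minimal sketch in Python. The function name and the absorbance readings are made up for illustration, and real MTT protocols differ in how blanks and replicates are handled. It converts raw absorbance into % viability relative to the untreated control, and shows that two plates with quite different raw readings collapse into the same plotted values.

```python
def percent_viability(treated, control, blank=0.0):
    """Convert raw absorbance readings to % viability relative to control.

    Hypothetical helper: assumes a simple mean of control wells and an
    optional blank subtraction.
    """
    control_mean = sum(control) / len(control) - blank
    return [100.0 * ((a - blank) / control_mean) for a in treated]


# Two invented plates with different raw absorbance values.
plate_a = percent_viability([0.40, 0.20], control=[0.80, 0.80])
plate_b = percent_viability([0.60, 0.30], control=[1.20, 1.20])

print(plate_a)
print(plate_b)
```

Both plates give the same percentages, so a reader who sees only the % viability graph cannot tell which raw readings produced it, unless the control absorbance is reported too.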

This is a frequent problem faced by people in mathematical biology. Experimental observations are abundant in Biology. But most of the published data are not suitable for use in mathematical modeling. There are several free tools, like DataThief, Graph Data Extractor, and WebPlotDigitizer, that can extract numerical data from graphs. These are very easy to use. But the quality of the extracted data depends on the quality, resolution and size of the images of the graphs. Even at their best, these extraction tools can give you only approximate values.
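
Why only approximate? Here is a rough sketch of the general idea of calibrating a linear axis from two known tick marks. This is my own illustration, not the actual implementation of any of the tools named above, and the pixel positions and axis values are invented.

```python
def make_axis_map(pixel_1, value_1, pixel_2, value_2):
    """Return a function mapping a pixel position to a data value.

    Assumes a linear axis and two reference ticks whose pixel positions
    and data values are known (hypothetical helper).
    """
    scale = (value_2 - value_1) / (pixel_2 - pixel_1)

    def to_value(pixel):
        return value_1 + (pixel - pixel_1) * scale

    return to_value


# Invented calibration: x = 0 sits at pixel 100 and x = 10 at pixel 500.
x_map = make_axis_map(100, 0.0, 500, 10.0)

x_mid = x_map(300)                             # data value at pixel 300
x_per_pixel = abs(x_map(101) - x_map(100))     # data units covered by one pixel

print(x_mid, x_per_pixel)
```

Whatever the tool, a point on the image cannot be located more finely than about one pixel, so the value of `x_per_pixel` sets a floor on the error. A smaller or lower-resolution image means fewer pixels per axis unit and coarser extracted values.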

Data extractors are useful but are not a solution to the problem of hidden and lost data. The best solution is to store all of our observations in freely accessible repositories.

There exist many databases for storage and sharing of structured data, like sequence information, protein crystallographic data and microarray data. However, most of us do not use these services religiously.

The trouble is greater for unstructured data, say all of our western blot images or the cytotoxicity data of drugs that we are testing on mammalian cells. Nobody shares the raw data of those experiments. Thankfully, several web services, like Figshare, have started to store unstructured data too. Best of all, they provide a DOI for everything stored there. Therefore, the data are identifiable and citable, so that you get due credit for them.

As individual scientists, we may start with baby steps. Once a paper is published, one can share the data behind the published results, and the unpublished background results, through such cloud services. Definitely, it would require time and effort to clean and structure the data before sharing. As a community, we have to encourage and appreciate such efforts.

However, I wonder how this model of cloud storage of data can be sustained without financial support from funding agencies and academic institutions. Experimental biologists are churning out enormous amounts of data at every moment. Storing those for eternity, in a publicly accessible repository, requires enormous financial support.

When I get a grant, it pays for my reagents and instruments. It also pays for lab stationery, like the lab notebooks where I record my observations. Such grants should also cover the cost of storing those observations for the future.

Hence the need for public digital repositories for data. I am sure that, as the campaign for Open Data spreads, major funding agencies across the globe will chip in for such public repositories. Beyond science, it makes economic sense too.

In India, the idea and ideals of Open Access are slowly seeping in. DST and DBT have created repositories for papers published through their funding. Many institutions, like the IITs, have created publicly available digital repositories for theses and similar documents. Recently, work has started on a National Digital Library to integrate all such repositories. I hope that the scientific community and policy makers will soon realise the importance of data repositories.

Till then, let us share our data, code and software through whatever means we have. Let us reuse and recycle every bit of information.





Thursday, October 23, 2014

Trajectories in Open Access

Scientific research is a social endeavor, and throughout the world it is funded primarily by public money. Therefore, there should be no barrier to the knowledge developed through such research, and it should be accessible to everyone. The primary sources of such knowledge are the articles published in scientific journals. So, to spread knowledge, these articles should be freely available to everyone. The Open Access movement, which is spreading through the academic world, preaches this philosophy.


In the current model of publication, researchers submit their articles to journals, and the journals publish the selected few after peer review. In this process, the author voluntarily transfers the copyright of the article to the publisher. The publisher does not pay the author. But the reader, whether a researcher or a layman, has to subscribe to the journal to read it. That is what most of our libraries spend money on. Over the years, subscription fees have ballooned to such an extent that even libraries in the developed world are falling short of budget.


The Open Access movement strikes at this very issue. It promotes two models to achieve open access to published work. In the first, anybody is free to access the journal over the Web. Such journals are called open access journals. To publish in such a journal, authors have to pay an article-processing fee. The costs of editorial manpower, formatting, typesetting and server management are covered by that fee collected from the authors. Over the years, the number of open access journals has grown rapidly, and many of them have a doubtful reputation. Even so, several open access journals are well respected for consistently publishing high-quality work. Though such journals are promoting open science, article-processing charges are often very high and prohibitive for researchers working in developing countries.


The other model for open access is the creation of publicly funded open access digital archives for scientific papers. PubMed Central, developed by the U.S. National Institutes of Health's National Library of Medicine, is one such digital archive, where publishers or authors voluntarily submit a copy of their articles. In fact, such submission has been made mandatory for every work funded by the NIH. Some other funding agencies are also promoting the same model. Anyone with access to the Web can read all the papers stored in such archives. Such archiving does not violate the copyrights of most publishers, as authors submit only their own copy of the final peer-reviewed draft, without any editing or formatting by the publisher. ResearchGate, a social networking site for scientists, also follows a similar model for sharing scientific articles.


Very recently, DBT and DST, funding bodies for scientific research in India, have released the second draft of their open access policy. They have proposed the establishment of open access digital archives, in different institutions as well as centrally. Any publication coming out of work funded by these agencies must be deposited in such archives. As in other such archives, authors will deposit only their own copy of the final peer-reviewed draft. Interestingly, the proposed policy explicitly discourages the author-paying model of open access journals and has made it clear that the agencies would not provide financial support for article-processing fees. This makes sense, as the article-processing fees of many journals are exorbitantly high for most Indian labs. Additionally, this will also discourage the spread of predatory journals, many of which are published from India.


It will be interesting to follow how India's open access policy eventually takes shape. But for the time being, let me imagine the evolution of the publishing industry in the age of open access. Open access digital archiving makes sense for every country, and most of the major players in science would eventually move to this model. But that may cause trouble. A publisher takes care of peer review, editing, formatting and publishing, and charges a fee to the reader or the author to cover the expenses of this process. The final published articles are usually smartly edited and are eye candy for readers. The author's copy of the final draft deposited in an open access archive is not as well formatted but contains all the scientific content. Therefore, though reading such a draft is a bit cumbersome, that does not affect the science. In fact, every scientist is well trained to read such drafts.


If we get such a final draft without any cost, why should we pay to read a well-formatted copy of the same, published by the journal? Obviously, we would not. Eventually this will reduce the number of subscribers to such journals. So journals running on a subscription-based model will not survive in the world of open access. In fact, all major publishers of science journals are offering some form of open access for their journals and testing the waters.


But we need publishers to manage the whole process of publishing scientific articles through peer review. At the least, someone has to run the peer-review process, and that also has a cost even though the reviewers do not get paid. If the reader does not pay, the cost has to be covered by the author. And that brings us to the author-paying model of current open access journals. Therefore, institutionalized discouragement of such journals may not be a good idea, as we do not have an alternative model that will be sustainable in the long run.


There is another model of science publishing. It involves post-publication peer review. In this, authors deposit articles that are published online without any peer review. Readers can access those articles freely and can comment on them. In a variant of this model, the journal invites peers to review once the paper is published online. arXiv, an open access server, publishes articles without peer review and without any charges. However, it does not allow commenting or review by readers, and in essence it is merely a repository. Even so, this server is quite successful, and authors regularly submit high-quality articles, particularly in physics and allied subjects. The recently established PeerJ PrePrints archive is an attempt to replicate that for the biomedical sciences. It allows readers to review and comment on the articles archived there. Most online journals also allow readers to comment on published peer-reviewed work.


However, till now peer review by readers has not caught on in the scientific world, and very few readers express their opinions online. Even when they comment, the comments are not as detailed as a thorough peer review. Currently, post-publication peer review by readers does not seem to be a viable option. Over-reliance on readers' opinions may also bring the vices of social media into science publishing. In essence, the current practice of organized peer review managed by editors is still the gold standard. In the age of open access, quality peer review can be sustained only through the author-paying model. Therefore, rather than rejecting the author-paying model, we need to develop technologies that reduce the cost of running the show, and to establish some peer-based mechanism that regulates this industry and maintains high standards.