I’m off to Amsterdam tomorrow for the Digital Curation Centre’s annual conference, IDCC ’13. The program brings together a diverse mix of top thinkers on digital curation, data sharing, standards and information management. I’m delighted to be joining such an all-star group, speaking Tuesday in the Innovation/Applications track on our work at Digital Science and, more generally, on making research more efficient.
At the tail end of last year, the organisers asked if I’d be interested in engaging in an email interview leading up to the event. Below is an excerpted version of the interview. For the full post, visit their website. You can also find the program here.
Your presentation will focus on Infrastructure. Are there any specific messages you would like people to take away from your talk?
It’s easy to think that we’ve worked out most of the kinks in research when we look at some of the latest advances in astronomy, genomics, and high-energy physics in the news, from the work at the LHC to the ENCODE project. But there are still a number of baseline assumptions in research that need rethinking and, in many cases, fixing. That’s what Digital Science was created to address: some of the oft-overlooked roadblocks in things like search in the sciences, information management, and the dated incentive system that is keeping us from fully updating our practices in the lab.
We address three areas in our call this year – Infrastructure, Intelligence and Innovation. What do you see as the most pressing challenges across these?
Having worked on infrastructure issues in research for the last six years, I’d say one of the main challenges remains making the right design decisions. Whether that’s an open platform that operates on the backbone of the web or a lightweight software application for use in a research setting, design decisions are key and, in my experience, are often not thought through to the extent warranted.
There’s a reason why inefficiency still exists in modern research labs, and it’s not a shortage of tools. Part of it lies in how the systems are crafted for the individual user, and part in how they speak to other systems.
Also, the age-old incentive problem is still keeping us from reaching our full potential, as we continue to largely measure impact as papers produced. Not only does that skew researchers’ incentives away from better managing and making available, for instance, the data accompanying their research or the code needed to execute the experiment, but it also presents issues for other specialists whose main output may be software, not scholarly papers.
We need to rethink how we measure and reward research so that it better reflects a researcher’s contribution to his/her community, and give the system a hard refresh.
And in terms of opportunities, do you see potential in data science as a new discipline?
Absolutely … though it’s not a “new” discipline, necessarily. There is an increasing understanding of the power in bringing together skillsets such as mathematics, machine learning, statistics, computer science and domain expertise (though the last is not always necessary). That is helping us redefine how we think of hypothesis-driven research as it becomes more data-driven.
What I find particularly fascinating is the spotlight it’s putting on how we teach science undergraduates – making sure they not only have the practical skills for working in a lab or conducting an experiment, but also the statistical literacy and analytical reasoning to understand the information they’re producing and collecting.
The conference theme recognises that the term ‘data’ can be applied to all manner of content. Do you also apply such a broad definition or are you less convinced that all data are equal?
I’m an equal-opportunity data fan (and open purist, carried over from my time at Creative Commons). Too often, I feel, we get caught up in debates about the “worthiness” or “value” of particular data sets, a legacy from the publication world where only the most polished, interesting data counts. That mindset is pervasive, and it is keeping us from doing more robust, reproducible work. I am a strong proponent of not cutting oneself off from as-yet-unknown opportunities, and unfortunately classifications such as “junk data” are not only increasingly silly in the digital age, but borderline harmful.