Tuesday 8 March 2011

Pelagios on Big Data

Recently JISC commissioned a reporter to attend the O'Reilly Strata (Big Data) Conference in California. The report, which can be accessed here, raises a number of interesting questions that are pertinent to our own project. We sketch out some initial responses here, but we’d welcome further comments from within the team or from any of our readers.

Given the opening questions asked about Big Data—What information is out there? How can we find it? And how can it be accessed?—it is perhaps no surprise to find that the Pelagios team is sympathetic to much of the discussion. From a brief reading of this document, three things strike us as being of particular relevance:

1. Open access. The concluding point, ‘if data is less open, it is less useful; limiting access limits value’, is one with which we are in total accord. The importance of data being open is particularly acute in an academic environment, where cutting-edge research depends fundamentally on the ability to access datasets of all different kinds. Although the argument is geared somewhat more towards a science model of research, open access could nevertheless transform the Humanities for the better, making it incumbent on researchers to show their working: just what is the evidence for this or that interpretation, and so on. But access is not the only issue; data must also be reusable, which presents its own range of technological and intellectual challenges. Furthermore, care must be taken if we are to avoid a tyranny of openness, by virtue of which research that (for whatever reason) is less open (whatever that might mean) is passed over or fails to gain funding or prestige. Or, to put that differently, does the mere fact that something is open make it worthwhile research?

2. Infrastructure. This issue seemed to be the keynote of the blog. It is raised by (among others) Mike Olson (Cloudera), Rod Smith (IBM) and Abhishek Mehta (Tresata), and by Steve Midgley (U.S. Department of Education). How do we find and identify valuable resources? And how can they be connected up? Just such an issue has also been a concern at the European Science Foundation (http://www.esf.org/research-areas/humanities/strategic-activities/research-infrastructures-in-the-humanities.html), where the practice to be guarded against is termed ‘data silos’. But, given the importance of ‘infrastructure to store and share data within sectors’, how can it be achieved? Various ‘top-down’ approaches (whether commercial or educational) have thus far fallen flat, either because of insufficient user uptake or because the data, tools, etc. could not be reused. Accordingly, Pelagios supports a ‘bottom-up’ Linked Data approach, by which a variety of user groups work together to connect their resources in an open and transparent way, which others can assess and to which they may add their own. We certainly support the idea that resources should be shared wherever possible and that research (particularly in the Humanities) needs to move beyond a narrow competitive mentality.

3. Users. This is something touched upon in our previous response and implied in much of the blog, but its fundamental importance is not really addressed. The focus seems to be rather on the producers and how they can provide context for the data or an expert ability to disseminate knowledge and understanding. But what about the user end of the communication (something to which the Humanities teach us to pay attention)? Who are the users? Why do they use some Big Data and not other data? And how can they be empowered to bring together different kinds of information for themselves? This is the critical challenge facing anyone working in the digital medium. In a commercial context it is clear that there is increasing demand for ‘Data Scientists’ capable of making sense of it all. Arguably that has always been the role of academics, yet few of us have both the technical skills and the domain knowledge demanded by these new resources. Pelagios is a consortium project for this very reason, and we suspect that multidisciplinary research groups will, almost of necessity, be the organizational structure best suited to exploiting academic Big Data.

1 comment:

  1. Point 1. ELOGeo (also jiscGEO funded) are looking at learning material that is open-licensed and so can be reused. http://elogeo.blogspot.com/

    Point 3. Someone at your workshop mentioned that once you have made your data available you should do something cool with it: this shows that the data can be used, and in what ways, so that others can be inspired. I don't think that was to say the data provider is solely responsible for the ways in which the data can be used. The point is that once you open the data, it allows others to be innovative or creative, but it was an interesting angle to think that the provider can get the ball (and the excitement) rolling.