What to put in an ideal JoRD service

The Feasibility Study has been asking researchers, representatives of publishing houses, repository staff and librarians about their image of an ideal JoRD service, to give some indication of how to build a resource that will be useful. So far, the service that would satisfy all the stakeholders would not only include a database containing the details of every journal data sharing policy, cross-matched with funders’ requirements and lists of suitable repositories, but also employ a team of staff to update the database constantly, provide customer service and advice about best practice, and give educational workshops and seminars. This would be ideal, but expensive, and ideals cannot always be reached, at least not initially.

So, who wants what out of the service? These are the service requirements each stakeholder group suggested.

Researchers would like the service to:

  • Have a clear, visual, user-friendly website with technical support, and information about the service and its scope
  • Include summaries of policies, RCUK baseline policies, compliance statistics
  • Include the URL of journal policy
  • Provide contact details of researchers

Researchers told us that they would use the service to find the journal which is right for their data and funder’s requirements, find appropriate repositories and to look for openly accessed data.

Publishers asked for:

  • A simple attractive web page
  • An authoritative resource
  • Compliance monitoring and sanction information
  • Technical error reporting
  • Guidance about best practice, current issues, changes and trends and a model policy
  • A policy grading system
  • Levels of membership

Publishers said that they would use the service to gather competitor intelligence, as a source of advice and as a central resource for information about funders’ requirements and accredited repositories.

Both researchers and publishers wanted:

  • Guidelines about data submission, such as copyright, use licensing, ethical clearance, restrictions and embargoes, and file format
  • URLs of places where data can be archived and retrieved

As far as other stakeholders are concerned, librarians considered that the service could give publication and funding compliance guidance to researchers as well as support research data management policies. Funders thought that the service could track the development of journal data policies and influence the data sharing behaviour of researchers. Representatives of repositories thought that a central data policy bank would be a resource where they could check the consistency and compliance of journal data policies and possibly identify partner journals. It seems that a JoRD Policy Bank Service would have something to offer everyone in the research industry. The quest now, as in all research activity, is to find someone who will pay, so that the ideal service will not remain such a distant dream.

Barriers to sharing data

There is a stereotypical image of a covetous academic, dedicated to their work, who hoards the data for their research so that no-one else will achieve the acclaim for their life’s work. Presumably this stereotype arose from such stories as Isaac Newton and Gottfried Leibniz having a major dispute over which of them first discovered calculus. In hindsight, both of them discovered it independently and both deserved acclaim. Charles Darwin kept his data for “On the Origin of Species” for very many years, before being persuaded to publish what turned out to be a popular science book of its day.

But we are not in the 17th or 19th centuries; we are in the age of information, the Internet and global networks, where collaboration has become respected. Teams of scientists are now rewarded, for example the Manchester University physicists Andre Geim and Kostya Novoselov, who won the Nobel Prize for Physics for their work on graphene. The Royal Society report “Science as an Open Enterprise” (http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf) describes how an outbreak of E. coli which originated in Hamburg was contained by the work of scientists on four continents who posted their analyses of the bacterium onto open source sites. The genetic sequencing of the bacterium was completed by scientists in Hamburg and China, and was then posted onto an open source site with an open data licence. In July of last year the European Commission published a press release outlining the measures that it will take to improve open access to scientific information that is produced in Europe, because the Commission feels that open access to data will improve research and development, and increase knowledge and competitiveness in Europe (“Scientific data: open access to research results will boost Europe’s innovation capacity” http://europa.eu/rapid/press-release_IP-12-790_en.htm).

Such openness and swift communication is expected by today’s researcher. However, an EU study found that only 25% of researchers openly share their data. The researchers who participated in our study expressed the desire to share their data. Some were already sharing, but others found that, although they wanted to share, it was not easy to achieve. Many felt that there were barriers put in their way, one of which involved the old stereotype: they were not expected to share. For example, funding bodies may well be encouraging researchers to give open access to data that was paid for from public funds, but researchers believe that they will not get funding for using data that someone else has collected, although it would be an efficient and economical way of carrying out research. Researchers also reported that universities attract funding for new projects, not for re-use of data, and that there is more interest in publishing new research than in replication studies.

Practical reasons were also mentioned. For instance, personal barriers to sharing data were listed as:

  • Not knowing where to deposit data
  • Lack of time and resources to undertake the deposit of data
  • Confidentiality and sensitivity of data, restrictions from funding body or breaking trust with research participants

Barriers in the wider scientific environment were reported as the difficulty in accessing data repositories because of lack of standardisation, and a poorly supported data sharing environment. It would seem that there are two main barriers to be crossed before the open sharing of data is completely commonplace. First, the stereotype of the data-hugging scientist must disappear from the minds of researchers, funders, higher education institutions and publishing houses. Secondly, the infrastructure of data deposit sites, including how, when and where to deposit data, has to be fully resolved, publicised and implemented. Once again, it would appear that a JoRD Policy Bank Service would be of great value to researchers, because it would supply a central resource of how, when and where to share data, contribute to improving the data-depositing infrastructure and remove one barrier to the open access of data.

Season’s Greetings

The JoRD team have been beavering away this week writing up the report for the feasibility study, but I am afraid that you have to wait until after the holidays to find out the outcomes. No posts for a few weeks while the office is closed and we are all refreshing our brains and enjoying our many Christmas pursuits.

So, on behalf of the JoRD team I wish you all a Happy Christmas and Prosperous New Year!

Data comes in all sorts of shapes and sizes

The JoRD project has not set out to define the term “data” (or the singular form of the word, “datum”). This was a fortunate choice, because one of the messages that has clearly come across from all the participants of our study is that data can take many forms. The recent Royal Society report “Science as an Open Enterprise” (http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf) includes a glossary of data terms which illustrates the ways in which the term “data” can be used. For example:

  • big data – data that requires massive computing power to process
  • broad data – structured big data
  • data set – a collection of  information held in electronic form
  • linked data – data that has been allocated a unique identifying number to be able to access it from an electronic storage facility

… and those are just a few of the terms that it explains. The word “Data” is defined as “Qualitative or quantitative statement or numbers that are (or assumed to be) factual”. The researchers who took part in this study considered that their data took more forms than just statements or numbers.

Researchers described the data that their research generated as software, video footage, geodata, geological maps, ontologies, web services and data models, as can be seen in the table below. This multitude of forms makes it difficult for publishers to include data in their on-line published articles. The publishers said that linked data in a journal article should be “fit for use” and “replicable”, and consider that data in many different formats is “messy” and is currently not supplied with sufficient metadata. Another consideration is the resulting file size of an article if the publisher saves the embedded data on their own servers. Data repositories and data centres are the more practical method of data storage, with published articles incorporating linked data.

That is one reason for journals to have a data policy, and a good argument for those policies to be collected and made accessible in a centralised resource, a JoRD Policy Bank Service.

Researchers’ descriptions of data, classified in the table as qualitative (documents and text), quantitative (figures), visual data (images) or virtual data (software or protocols):

  • Collection of examiner reports and questions, supervisory reports, letters and other documentary evidence
  • Dataset of measurements and statistical analyses
  • Digitised textual sources
  • Excavation, field observation, environmental monitoring, software to collate, mine and analyse
  • Excel sheets
  • Focus group, interview transcripts, some footage of people using computers, digital photographs
  • Geodata
  • Geologic maps, chemical and isotopic analyses of earth materials, GIS datasets
  • Interview transcripts
  • Ontologies
  • Reports
  • Visualization
  • Web services, data models and specifications

Other data initiatives that are out in the world

To find out whether there are any other projects, products or services already performing the same function as JoRD, a quick survey was done to find out what other data initiatives there are, and what services they offer. So far 29 have been identified, although there may well be others. Most of them are known to be current ongoing initiatives, but some of them seem to have started, but have not been updated for a while. They are mainly funded by Universities from around the world and at least four demonstrate successful collaboration between Universities internationally. Many UK initiatives are JISC funded. Three are funded by Governments, one being an international initiative. Eleven are subject specific.

Only five of the initiatives indicate that they can advise researchers about data policies and guidelines, and four deal with best practice. Nine are concerned with linked data. Fortunately, none of them appear to be supplying the type of service that JoRD would deliver. Here are some details of the most interesting projects.

  • DaMaRo (http://damaro.oucs.ox.ac.uk/) is an initiative between JISC and Oxford University to create the University’s data management policy and build the infrastructure needed to comply with it. It is associated with DataFlow (http://www.dataflow.ox.ac.uk/) and DataBank (http://www.dataflow.ox.ac.uk/index.php/about/about-databank.), which are being developed by Oxford University and the Bodleian Library to provide an open source infrastructure that will aid the storage of large data sets and the creation of DOIs for them.
  • DRYAD (http://datadryad.org/) is a digital repository that is supported by many international scientific societies. It has been created by open source development and facilitates data storage and retrieval, provides advice on best practice, links data and attributes DOIs.
  • Global Biodiversity Information Facility or GBIF (http://www.gbif.org/) provides infrastructure and links to biodiversity data
  • SPQR (http://spqr.cerch.kcl.ac.uk/) provides links and meta-data searches to ancient documents
  • KAPTUR (http://www.vads.ac.uk/kaptur/) is a new project run by a consortium of art universities to capture, preserve and produce best practice for the data management of unusual data formats, such as sketchbooks or textile designs.

A more explanatory table can be found here.

Chart of Data initiatives

If you have any further information about the initiatives in the table, or you know about others, then please respond by comment.

Summary of workshop, discussion about the nature of JoRD

Here is another summary of the concluding discussion that took place at the workshop on 13th November. This is about the expectations and perceptions of publishers concerning the nature of the JoRD Data Bank service.

A prominent consideration of the publishers was that JoRD should be an authoritative resource, such that a JoRD compliance stamp, or quality mark, could be displayed on journals’ websites. There was discussion that for JoRD to be authoritative, the content of the database should be added, updated and maintained by the JoRD team. It was mentioned that publishers might initially populate the database, but ongoing maintenance would be the responsibility of JoRD. However, there should be a guarantee that the content is accurate, and publishers would need to commit to providing machine-readable policies so that they can be harvested automatically.
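To give a feel for what a machine-readable policy might involve, here is a minimal sketch in Python. The workshop did not define any format, so the JSON layout and every field name below (journal_title, policy_url, embargo_months and so on) are purely illustrative assumptions, not a schema proposed by JoRD or by any publisher.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JournalDataPolicy:
    """One harvested policy record. All field names are hypothetical."""
    journal_title: str
    publisher: str
    policy_url: str
    data_sharing_required: bool
    embargo_months: Optional[int] = None  # None means no embargo is stated
    recommended_repositories: List[str] = field(default_factory=list)


def parse_policy(raw: str) -> JournalDataPolicy:
    """Parse one JSON record.

    Raises TypeError if a required field is missing or an unknown field
    is present, so malformed records are rejected rather than stored.
    """
    record = json.loads(raw)
    return JournalDataPolicy(**record)


# Example record, as a publisher might expose it for harvesting (invented data).
example = """
{
  "journal_title": "Example Journal",
  "publisher": "Example Press",
  "policy_url": "http://example.org/data-policy",
  "data_sharing_required": true
}
"""

policy = parse_policy(example)
```

Rejecting incomplete records at the point of harvesting is one possible way of supporting the accuracy guarantee that the publishers asked for; a real service might well choose a different approach.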

It was suggested that the operational database should not be merely a static catalogue or encyclopaedia. It was requested that the non-compliance of a journal with a data sharing policy, or with a funder’s policy, could be flagged and reported to the publisher, although it was queried whether that was the remit of the service or of the publishers themselves. Similarly, it was questioned whether the service would mediate user complaints, and it was proposed that it would engage with complaints concerning policies only. To maintain functionality, it was asked whether there could be automatic URL checking which would send an alert to the publisher if links were broken. Updates about policy changes would also be a useful function.

The service website should include a model data policy framework or an example of a standard data policy, and offer guidance and advice to journals and funders about policy development. However, the processing and ratification of a model policy could be a time-consuming process for some publishers. It was asked whether repository policies would also be included, and there was mention of compliance with the OpenAIRE European repository network. The website should also contain:

  • Links to the publishers’ web pages
  • Dates of the records
  • Lists of links to repositories
  • A set of criteria for data-hosting repositories

It should look inviting, but businesslike and be simple and clear, but be sufficiently detailed.

Methods of funding the service were considered, along with the benefits of membership. For example, would only the policies of members of the service be entered into the database? Would there be different levels of membership or different service options that publishers could choose? And would there be extra costs for extra services? One such service could be to hold historical records and persistent records of former policies. In the publishers’ opinion, they would be prepared to pay for a service that is transparent and would save them time.

Other comments included:

  • Would the service be a member of the World Data System?
  • Could it be released in Beta?
  • There are around 400–600 titles to enter initially
  • When set up, the service could be studied to discover its effectiveness and impact
  • Further consultation may be needed

Very brief summary of JoRD workshop

On Tuesday 13th November some of the JoRD team met with representatives of several well-known journal publishers for a workshop session to discuss a number of points concerning the potential JoRD data bank service. This is a very potted summary of the discussions that took place. If any of the attendees are reading this and feel that their comments have not been correctly interpreted, then please comment to correct any misunderstandings.

Preservation of and sustained access to published supplementary material: The current situation
The group perceived that at present there are a variety of issues that impede the maintenance of data added to an on-line journal as supplementary material, or even the practice of including data within an article. The areas where difficulties lie include:
• Technology
• Data repositories
• Embargoes
• Peer review
• Licensing
• Copyright
Unstable URLs, PDF formats and the need for usable forms of preserved data present technological problems that must be solved to ensure that data can be accessed in the long term. However, transferring data to new formats presents fewer difficulties. Data may be linked to external repositories, but these present a problem because each has different policies and practices. Embargoes placed on data release complicate matters, as there is no standard for their length. To overcome these issues, an alternative would be not to include the data file with the article, but to add information about how it can be obtained directly from the researcher. However, on-line journals will be upgrading to enriched HTML and should therefore commit to including data.

The group were concerned about the peer review of data, which is currently ad hoc. It was queried whether peer reviewers have time to examine data alongside judging arguments, and it was suggested that data be reviewed by the research community. Currently, publishers’ practices concerning the licensing and copyrighting of data as supplementary material vary greatly. However, EU legislation does not allow data to be copyrighted. Authors could be offered choices of licensing, and work is being done to define data and on forms of data citation. Nevertheless, publishers do feel a duty of care to the knowledge that they publish.

About data repositories: Advantages and disadvantages
Ideally, publishers would like repositories to be a searchable archive that manages data and collects retrospectively, such as the library of Columbia University gathering data for PLOS.

Advantages

  • The situation for publishers would be made simpler should data be held in external repositories
  • Technically more able to deal with digital data
  • Guidelines about re-depositing data if closed
  • Institutional repositories could manage data then aggregate it as in Australia

Disadvantages

  • May want to take over from publishers
  • Not currently ready for influx of data
  • Funding may not be sustained
  • Discovery issues

Solutions to any of the issues posed above are not given in this post, but there is opportunity for you to comment. The remainder of the discussion focused on the structuring and content of a JoRD Policy Bank service, which will be summarised in the next post.

Online survey results part two

The second set of questions in the online survey asked for the opinions of researchers about data sharing and the usefulness of a data policy bank service. They are as follows:

  • Where do you access or locate the research output of other researchers?
  • In your opinion what are the key drivers behind increasing access to research data?
  • In your opinion what are the main problems associated with sharing research data?
  • What do you think about linking a publication with digital data that are integral to its main conclusions?
  • What do you think about linking an article with supplementary material that enhances the article?
  • Do you think that journals should provide digital data sharing policies?
  • Do you think there would be benefits in having a service offering information about journal research data policies?
  • Would you use a service of this kind?
  • What information should be included in a policy bank service?
  • Do you have any other comments?

Most of the respondents locate other researchers’ data through colleagues or within their own institution or organisation, and feel that the four most important drivers for increasing access to data are:

  • Openness
  • Accountability
  • Increased access to data
  • Increased efficiency of research resources

The most frequently expressed concern is the attribution of intellectual property rights to the data being shared. The next most frequently expressed issue is that current institutional models, and the mindsets of institutions and some individuals, create barriers to sharing data. However, just over one-third of respondents (35%) consider that linking digital data as an integral part of the main conclusions in published online journals would be useful and should be mandatory.

Linking articles to supplementary data to enhance the article was considered useful by more respondents (43%) but it would also depend on the context of the data shared. Over 74% of researchers considered that journals should provide data sharing policies and a similar percentage (73%) thought that such a service would be of benefit, because it would be a central resource. Nearly 80% of respondents said that they would use such a service, either to gather data, or as a means of selecting where to publish their work. Many ideas of what to include in a policy data bank were suggested, which included:

  • Clarity and simplicity of use
  • Archiving URLs
  • Guidelines
  • Usage licences (eg Creative Commons)

Eight researchers commented that they considered the initiative important.

The fewest respondents said that they gather other research data from their own blog, or from hard copy data sets. The concerns expressed about sharing data were those of trust, confidentiality and the need to overcome existing mindsets and institutional barriers. A small number of researchers felt that sharing data would affect the future of research and that certain conditions would have to be fulfilled before sharing data. A very low number of people (3%) said that linking data to main conclusions was neither useful nor necessary, that they would only be interested in a published article, not in any additional material, and that journals should not provide data sharing policies. One researcher commented that further research about the topic, with a trial, would help them decide whether published data sharing policies would be of personal benefit.

Three percent of respondents thought that there would be no benefit to a data policy bank service, because it is not needed, not feasible or there would be conflicting journal ethos. Twenty-one percent considered that they would not use such a service because they did not find it relevant, and one researcher stated that they would prefer to deal directly with the journal.

On balance, it appears that more respondents are pro-data sharing, have positive opinions about the JoRD policy bank service and would find it useful, than respondents who feel that there is no need or use for such a service.

Preliminary Results of Online Questionnaire

The online questionnaire closed on Monday 5th November and had been answered by 70 researchers. The survey comprised 20 questions asking for information about the researcher, their data sharing habits, their opinions of the possibility of openly sharing their data and the utility of a policy bank service. The first ten questions were as follows:

  • What is your academic discipline?
  • What is your subject?
  • How long have you been a researcher?
  • In which part of the world is your research institution based?
  • Do you generate research data/materials/programs etc?
  • What kind of data/materials/programs do you generate?
  • Where do you currently store your digital data?
  • Where do you currently store your non-digital data?
  • How accessible are your data/materials/programs to other researchers?
  • Are your data/materials/programs etc sharing habits going to change in the future?

Most of the respondents worked in the disciplines of Science or Social Science; however, there were representatives from a substantial range of fields, which means that the self-selecting sample came from a cross-section of research disciplines. The most frequently listed subject was some variety of Information Studies. Around 33% of respondents were actively working on a PhD or MPhil, and roughly 30% had been post-qualification researchers for between 5 and 14 years. The respondents were overwhelmingly based in Europe and nearly all of them considered that they generated some sort of data, which was mainly qualitative, but there was an equal balance between textual and numerical data. Most people stored digital data on their own computer and on a work server. The favoured form of other digital storage was Dropbox. However, when it came to non-digital data, many more people stored that at their workplace. Surprisingly, around 56% of respondents already share their data, albeit with their colleagues. Slightly more researchers thought that they were unlikely to change their sharing habits (approx 37%) than thought that they would change them (36%).

The fewest respondents came from the field of Economics, one respondent was studying for an MSc, and fewer respondents had been working as researchers for over 15 years. Geographically, a very small number of respondents were based in South America and Africa, and very few people answered that they did not generate any data. Visual data was the form least often generated. Few respondents stored digital data in a disciplinary digital archive, or non-digital data at an external repository. One respondent appeared to destroy all raw data after research publication. None of the respondents answered that they shared data with no-one, although certain researchers shared only with their research partner. A few considered that they would share less of their data in future, while a small number of researchers were not able to share because of the sensitive nature of the data.

Questions 11 – 20 will be analysed and reported next week.

Some interesting news on Open Data

I’ve been away from the desk for a few days. Here are some of the open data related news items I found upon my return:

From January 2013, the BMJ will require there to be a commitment to make the relevant anonymised patient level data available on reasonable request, before publishing clinical trials results.

http://www.bmj.com/content/345/bmj.e7304

Erin C. McKiernan, a researcher working primarily in experimental and theoretical neuroscience, asks Who owns research data and the rights to publish?

http://emckiernan.wordpress.com/2012/10/24/who-owns-research-data-and-the-rights-to-publish-it/

http://emckiernan.wordpress.com/2012/10/31/who-owns-research-data-and-the-rights-to-publish-part-ii/

And finally, what are the Benefits of Open Data – Impact on Economic Research

http://oanow.org/2012/11/the-benefits-of-open-data-impact-on-economic-research/