Going back to basics – reusing data

It is almost a year since the first set of data was gathered to analyse journal articles, and the benefits of saving data well are now becoming apparent. Two things are happening that mean we are getting the basic figures out, dusting them off and looking at them again. The first is a paper about the development of a model journal research data policy, which is being co-authored by the JoRD team members, and the second is in response to questions that various people are asking.

The idea of creating a model policy emerged from the mass of data that was being found in the analytical process. It was based on what journals were already doing, and on suggestions from the report “Sharing Publication-related Data and Materials: Responsibilities of Authorship in the Life Sciences” (Committee on Responsibilities of Authorship in the Biological Sciences, 2003, http://www.nap.edu/openbook.php?isbn=0309088593). The report was the outcome of a workshop in the United States involving biological scientists. Its five principles and ten recommendations were strongly in favour of open access to the data that underpins the research reported in published articles. A summary of the principles and recommendations can be found here: http://www.councilscienceeditors.org/files/scienceeditor/v26n6p192-193.pdf. The report suggested that data could either be included in the article, or deposited in a reputable repository and linked to the article. The first model data policy was therefore based on the rather patchy and inconsistent set of policies found in fewer than half the journals we analysed, together with a report biased towards one scientific discipline. It was decided to compare the initial model data policy with the needs of the stakeholders, which were examined at a later stage in the JoRD project. This has entailed not only going over the data gathered from the stakeholder interviews and questionnaire, but also digging retrospectively into the reasons the initial model criteria were chosen.

The second reason for examining the basic data has come from interesting questions asked by a number of bodies that know about the JoRD project and therefore assume that the JoRD team are experts in the field of journal research data policies, an assumption that is becoming increasingly true as more questions are answered. To answer the questions, the data needed to be looked at from a different perspective. For example, to answer “How many journals make sharing a requirement of publication?” the original data set was re-examined and journals counted, because the original analysis counted the number of policies, and some journals have up to three different data policies. Here follows a table with figures from a journal perspective (a sketch of this kind of re-aggregation appears after the table):

Results of Journal Survey
Total no. of journals surveyed: 371
Total no. of journals with data sharing policies: 162
Total no. of journals that make sharing a requirement of publication: 31
Total no. of journals that enforce the policies: 27
Total no. of journals that state consequences for non-compliance: 7
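
By way of illustration, here is a minimal sketch of the kind of re-aggregation involved, assuming hypothetical field names and example rows; it is not the JoRD team's actual analysis script, but it shows how policy-level records can be collapsed into journal-level counts of the sort reported above.

```python
# Minimal sketch only: hypothetical field names and example rows, not the
# actual JoRD data set. Each record describes one policy; a journal may
# contribute several records.
from collections import defaultdict

policies = [
    {"journal": "Journal A", "requires_sharing": True,  "enforced": True,  "states_consequences": False},
    {"journal": "Journal A", "requires_sharing": False, "enforced": False, "states_consequences": False},
    {"journal": "Journal B", "requires_sharing": False, "enforced": False, "states_consequences": False},
]

# Collapse to one record per journal: a journal meets a criterion if any
# one of its policies does.
criteria = ("requires_sharing", "enforced", "states_consequences")
by_journal = defaultdict(lambda: dict.fromkeys(criteria, False))
for p in policies:
    journal = by_journal[p["journal"]]
    for key in criteria:
        journal[key] = journal[key] or p[key]

print("Journals with at least one policy:", len(by_journal))
for key in criteria:
    print(f"Journals where {key}:", sum(j[key] for j in by_journal.values()))
```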

This process illustrates the way that well organised data, saved safely and, as in this case, in digital form, can be re-used after a particular project has ended. It is often after research has been concluded that questions arise, and the iterative process of dipping in and out of data to validate or extend the research then begins. The moral of this blog post? Manage your data well, because you never know what you will be asked.

Another week, another presentation

Early this morning, well before normal working hours, the dedicated Centre for Research Communications employees, Marianne and Jane, entered the special media communication room, which contains the video conferencing equipment, so that they could jointly present “Publisher Interest towards a role for Journals in Data Sharing: The Findings of the JoRD Project”. In the true spirit of global access and the digital world, they presented from Nottingham, UK and the presentation was seen at the ELPUB conference in Karlskrona, Sweden. We are pleased to report that the Nottingham technology worked really well, but a fellow presenter, also speaking through Adobe Connect, had difficulties with her connection and transmitted the sound of a large aircraft passing over the room where she was speaking. Jane and Marianne had chosen the high-tech route because a tram line and bridge are currently being noisily constructed outside their office window; had they presented from their own office computers, the audience would have heard heavy machinery moving, beeps and rumbles, drilling and clangs.

Here is the link for the PowerPoint slides:

JoRDELPUB

A rather long post, but quite a brief summary

Here is a summary of the project so far.

Sharing the data generated by research projects is increasingly being recognised as an academic priority by funders, researchers and publishers. The issue of the policies on sharing set out by academic journals has been raised by scientific organisations, such as the US National Academy of Sciences, which urges journals to make clear statements of their sharing policies. On the other hand, the publishing community expresses concerns over the intellectual property implications of archiving shared data, whilst broadly supporting the principle of open and accessible research data.

The JoRD Project was a feasibility study on the possible shape of a central service on journal research data policies, funded by the UK JISC under its Managing Research Data Programme. It was carried out by the Centre for Research Communications at Nottingham University (UK) with contributions from the Research Information Network and Mark Ware Consulting Ltd. The project used a mix of methods to examine the scope and form of a sustainable, international service that would collate and summarise journal policies on research data for the use of researchers, managers of research data and other stakeholders. The purpose of the service would be to provide a ready reference source of easily accessible, standardised, accurate and clear guidance and information on the journal policy landscape relating to research data. The specific objectives of the study were: to identify the current state of journal data sharing policies; to investigate the views and practices of stakeholders; to develop an overall view of stakeholder requirements and possible service specifications; to explore the market base for a JoRD Policy Bank Service; and to investigate and recommend sustainable business models for the development of a JoRD Policy Bank Service.

A review of relevant literature showed evidence that scientific institutions are attempting to draw attention to the importance of journal data policies, and a sense that the scientific community in general is in favour of the concept of data sharing. At the same time, it seems that more needs to be done to convince the publishing world of the need for greater consistency in data policy and author guidelines, particularly on vital questions such as when and where authors should deposit data for sharing.

The study of existing journal policies found that a large percentage of journals do not have a policy on data sharing, and that there are great inconsistencies between journal data sharing policies. Whilst some journals offered little guidance to authors, others stipulated specific compliance mechanisms. A valuable distinction is made in some policies between two categories of data: integral data, which directly supports the arguments and conclusions of the article, and supplementary data, which enhances the article but is not essential to its argument. What we considered the most significant study of journal policies (Piwowar & Chapman, 2008) defined journal data sharing policies as “strong”, “weak” or “non-existent”. A strong policy mandates the deposit of data as a condition of publication, whereas a weak policy merely requests the deposit of data. The indication from previous studies that researchers’ data sharing behaviour is similarly inconsistent was confirmed by our online survey. However, there is general assent to the concept of data sharing, and many researchers would be prepared to submit data for sharing along with the articles they submit to journals.

We then investigated a substantial sample of journal policies to establish our own picture of the policy landscape. A selection of 400 international and national journals was purposively chosen to represent the 200 most cited journals (high impact journals) and the 200 least cited (low impact journals), divided equally between Science and Social Science, based on the Thomson Reuters citation index. Each policy we identified relating to these journals was broken down into different aspects, such as: what, when and where to deposit data; accessibility of data; types of data; monitoring of data compliance; and consequences of non-compliance. These were then systematically entered onto a matrix for comparison. Where no policy was found, this was indicated on the matrix. Policies were categorised as either “weak”, only requesting that data be shared, or “strong”, stipulating that data must be shared.
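
To make the matrix concrete, the following is a minimal sketch of how a single row might be encoded. The field names and controlled values are assumptions drawn from the aspects listed above, not the project's published coding scheme, and the example journal is fictitious.

```python
# Minimal sketch of one row of the comparison matrix. The fields and
# controlled values are assumptions based on the aspects named above.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PolicyRecord:
    journal: str
    impact_group: str               # "high" or "low" (most/least cited 200)
    discipline: str                 # "science" or "social science"
    strength: Optional[str]         # "strong", "weak", or None if no policy found
    what_to_deposit: str = ""
    when_to_deposit: str = ""
    where_to_deposit: str = ""
    accessibility: str = ""
    data_types: str = ""
    compliance_monitoring: str = ""
    non_compliance_consequences: str = ""

# A purely illustrative entry for a fictitious journal.
example = PolicyRecord(
    journal="Hypothetical Journal of Examples",
    impact_group="high",
    discipline="science",
    strength="strong",
    what_to_deposit="integral data underpinning the article",
    where_to_deposit="recognised disciplinary repository",
)
print(asdict(example))
```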

Approximately half the journals examined had no data sharing policy. Just over three quarters of the policies we found we assessed as weak, and just under one quarter we deemed to be strong (76%:24%). The high impact journals were found to have the strongest policies, whereas not only did fewer low impact journals include a data sharing policy, those policies were also less likely to stipulate data sharing, merely suggesting that it may be done. The policies generally give little guidance on the stage of the publishing process at which data is expected to be shared.

Throughout the duration of the project, representatives of publishing and other stakeholders were consulted in different ways. Representatives of publishing were selected from a cross section of different types of publishing house; the researchers we consulted were self-selected through open invitations by way of the JoRD blog. Nine of them attended a focus group and 70 answered an online survey. They were drawn from every academic discipline and ranged over a total of 36 different subject areas. During the later phases of the study, a selection of representatives of stakeholder organisations was asked to explore the potential of the proposed JoRD service and to comment on possible business models. These included publishers, librarians, representatives of data centres or repositories, and other interested individuals. This aspect of the investigation included a workshop session with representatives of leading journal publishers in order to assess the potential for funding a JoRD Policy Bank service. Subsequently an analysis of comparator services and organisations was performed, using interviews and desk research.

Our conclusion from the various aspects of the investigation was that although the idea of making scientific data openly accessible for sharing is widely accepted in the scientific community, the practice confronts serious obstacles. The most immediate of these is the lack of a consolidated infrastructure for the easy sharing of data. In consequence, researchers quite simply do not know how to share their data. At the present juncture, when policies are either not available or provide inadequate guidance, researchers acknowledge a need for the kind of information that a policy bank would supply. The market base for a JoRD policy bank service would be the research community, and researchers did indicate that they believed such a service would be used.

Four levels of possible business models for a JoRD service were identified and put to a range of stakeholders, who found it hard to identify a clear-cut service level that would be self-sustaining. The funding models of similar services and organisations were also investigated. In consequence, an exploratory two-phase implementation of a service is suggested. The first phase would be the development of a database of data sharing policies, engagement with stakeholders and third-party API development, with the intention of building use to the level at which a second, self-sustaining phase would become possible.

The shape of a JoRD policy bank service?

We have established that researchers would certainly use a JoRD service, and that publishers, repository managers and librarians would all find their own uses for it. It has already been blogged that an ideal service containing every item requested by stakeholders would be an expensive and extensive project, so what sort of service could be offered? Four options were devised and market tested on an assortment of stakeholders: academic librarians, publishers, repository managers, researchers, funders and representatives from similar data initiatives. The options were as follows:

  • Basic – an online searchable database of journal data policies, similar in approach to RoMEO
  • Enhanced – an online searchable database of journal data policies with additional data integration, such as funder policies, lists of recommended repositories, or institutional policies
  • Advisory – the Basic and Enhanced services with the addition of research and advisory services, for example guides and instructions, best practice, a model policy and language, and updates
  • Database with Application Programming Interface (API) – as Basic and Enhanced but with no or minimal web interface, instead providing an API that would allow third parties to use the data and develop applications (a sketch of such a record follows this list)
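
As an indication of what the API option could involve, the sketch below shows the kind of JSON record such a database might expose for one journal. The fields, values and identifiers are illustrative assumptions only; no API was actually specified by the project.

```python
# Minimal sketch of the kind of JSON record a JoRD-style API might return
# for a single journal. Field names and values are illustrative assumptions.
import json

record = {
    "journal": "Hypothetical Journal of Examples",
    "issn": "0000-0000",                       # placeholder identifier
    "publisher": "Example Publishing House",
    "data_policy": {
        "strength": "strong",                  # "strong" | "weak" | "none"
        "policy_url": "https://example.org/data-policy",
        "deposit_required_at": "submission",
        "recommended_repositories": ["Example Data Repository"],
        "consequences_for_non_compliance": True,
    },
    "last_reviewed": "2013-06-01",
}

# A third-party application could harvest such records and, for example,
# filter for journals whose policies mandate deposit as a condition of
# publication.
print(json.dumps(record, indent=2))
```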

Most of the people interviewed thought that the basic option was the option they would use; a table in the original post set out the possible value propositions. However, it was thought too basic to generate any income, and some groups considered that it had limited value on its own. The enhanced service seemed to be favoured by publishers; the inclusion of funder policies, for example, would be more valuable to them than other publishers’ journal data policies. The Advisory service was the option that most people thought would offer the greatest value for money, but participants cited other advisory services that could provide the same function as that aspect of JoRD. Finally, the high quality database with an API and a strong invitation for third-party apps was thought of as a practical way to create an enhanced service. Unfortunately, none of the options emerged from the consultation as the optimum service that would generate its own income.

So, the shape of a JoRD service is still unknown and the method of funding is still unknown, but what has been achieved is that there are now no unknown unknowns.

What to put in an ideal JoRD service

The Feasibility Study has been asking researchers, representatives of publishing houses, repository staff and librarians about their image of an ideal JoRD service, to give some indication of how to build a resource that will be useful. So far, the ideal service, one that would achieve the desires of all the stakeholders, would not only include a database containing the details of every journal data sharing policy, cross-matched with funders’ requirements and lists of suitable repositories, but also employ a team of staff to constantly update the database, provide customer service and advice about best practice, and give educational workshops and seminars. This would be ideal, but expensive, and ideals cannot always be reached, at least not initially.

So, who wants what out of the service? These are the service requirements each stakeholder group suggested.

Researchers would like the service to:

  • Have a clear, visual, user-friendly website with technical support, and information about the service and its scope
  • Include summaries of policies, RCUK baseline policies, compliance statistics
  • Include the URL of each journal’s policy
  • Provide contact details of researchers

Researchers told us that they would use the service to find the journal that is right for their data and their funder’s requirements, to find appropriate repositories, and to look for openly accessible data.

Publishers asked for:

  • A simple attractive web page
  • An authoritative resource
  • Compliance monitoring and sanction information
  • Technical error reporting
  • Guidance about best practice, current issues, changes and trends and a model policy
  • A policy grading system
  • Levels of membership

Publishers said that they would use the service to gather competitor intelligence, as a source of advice, and as a central resource for information about funders’ requirements and accredited repositories.

Both researchers and publishers wanted:

  • Guidelines about data submission, such as copyright, use licensing, ethical clearance, restrictions and embargoes, and file formats
  • URLs of places where data can be archived and retrieved

As far as other stakeholders are concerned, librarians considered that the service could give publication and funding compliance guidance to researchers as well as support research data management policies. Funders thought that the service could track the development of journal data policies and influence the data sharing behaviour of researchers. Representatives of repositories thought that a central data policy bank would be a resource where they could check the consistency and compliance of journal data policies and possibly identify partner journals. It seems that a JoRD Policy Bank Service would have something to offer everyone in the research industry. The quest now, as in all research activity, is to find someone who will pay, so that the ideal service will not be such a distant dream.

Very brief summary of JoRD workshop

On Tuesday 13th November some of the JoRD team met with representatives of several well known journal publishers for a workshop session to discuss a number of points concerning the potential JoRD data bank service. This is a very potted summary of the discussions that took place. If any of the attendees are reading this and feel that their comments have not been correctly interpreted, then please comment to correct any misunderstandings.

Preservation of and sustained access to published supplementary material: The current situation
The group perceived that at present there are a variety of issues that impede the maintenance of data added to an on-line journal as supplementary material, or even the practice of including data within an article. The areas where difficulties lie include:
• Technology
• Data repositories
• Embargoes
• Peer review
• Licensing
• Copyright
Unstable URLs, PDF formats and usable forms of preserved data present technological problems that need to be solved to ensure that data can be accessed in the long term; transferring data to new formats, however, presents fewer difficulties. Data may be linked to external repositories, but these present a problem because they each have different policies and practices. Embargoes placed on data release complicate matters, as there is no standard for their length. To overcome these issues, an alternative solution would be not to include the data file with the article but to add information about where it can be obtained directly from the researcher. However, online journals will be upgrading to enriched HTML and should therefore commit to including data.

The group were concerned about the peer review of data, which is currently ad hoc. It was queried whether peer reviewers have time to examine data alongside judging arguments, and it was suggested that data be reviewed by the research community instead. Publishers’ current practices concerning the licensing and copyrighting of data as supplementary material vary greatly; however, EU legislation does not allow data itself to be copyrighted. Authors could be offered choices of licensing, and work is being done to define data and on forms of data citation. Publishers do, however, feel a duty of care towards the knowledge that they publish.

About data repositories: Advantages and disadvantages
Ideally, publishers would like repositories to be searchable archives that manage data and collect it retrospectively, as the library of Columbia University does in gathering data for PLOS.

Advantages

  • The situation for publishers would be made simpler should data be held in external repositories
  • Technically more able to deal with digital data
  • Guidelines about re-depositing data if closed
  • Institutional repositories could manage data then aggregate it as in Australia

Disadvantages

  • May want to take over from publishers
  • Not currently ready for an influx of data
  • Funding may not be sustained
  • Discovery issues

Solutions to any of the issues posed above are not given in this post, but there is opportunity for you to comment. The remainder of the discussion focused on the structuring and content of a JoRD Policy Bank service, which will be summarised in the next post.

Stakeholder Consultation – Researchers and Public Engagement

NOTES FROM THE FOCUS GROUP MEETING HELD ON MONDAY 8TH OCTOBER

ABOUT THE FOCUS GROUP

The focus group was carried out as one of the normal meetings of the Nottingham Café Scientifique et Culturel on the evening of Monday 8th October 2012. This society meets for the purpose of ‘public engagement’ with the latest ideas arising in science and culture. The audience mainly comprises academics, professionals, and students, and therefore has an interest in understanding research and associated matters. As the ‘general public’, they are also interested in how public money is spent on research, and in what happens to the outputs gained from this research.

Prior to the Focus Group starting, the purpose of the Focus Group was explained to the participants. They were then asked to sign a consent form and given a sheet of suggested questions which they could refer to throughout the Focus Group. They were asked to provide either their experiences or opinions related to this area.

The following topics arose during the course of the Focus Group Meeting.

 THE RESEARCH OR DATA OF THE FOCUS GROUP ITSELF

The focus group reported a variety of academic, practical (e.g. professional purposes such as obtaining community data to submit funding bids), and personal research projects (e.g. a database of outputs relating to personal research interests in an academic field). The focus group participants seemed to have a data sharing mindset and overall felt that data should be shared.

LOCATION OF THE DATA / LOCATING DATA

People wondered where data should be submitted so that it did not get lost – this is important, as it is a public record, often produced at the public’s expense.

What is the best method of finding data?

Journals – People still publish via journals, people are used to this model, and it means that people then know where to look for research output.

Use of Google Scholar – Google Scholar can help with locating studies, but Google itself provides a results list showing the items most frequently consulted, rather than necessarily those of the highest quality.

Institutional Repositories – These are not consistent from one organisation to another; they have different methods and the software can be configured differently. As a result, IRs may lead people to search Google instead.

WHEN DO YOU SHARE RESEARCH DATA – AT WHAT POINT IN THE PROCESS OF RESEARCH?

At what point in the research process should the data be shared?

Should there be a choice about the timing of release?

Raw Data – Should the data be in its ‘raw’ state or should it be contextualised by the researcher first? The data in some of its early states may not be comprehensible or usable by others. In these states it could be liable to misuse. It may be better to release the data once it is determined that there are no errors in it which could lead to unreliable studies by other researchers.

Interpretation before release – If people are still processing the data, they may feel the need to interpret it before sharing it. They may thus wait until the PhD, or other report is finished, before going public with the actual data. People would not necessarily want to share their data prior to producing their publications in order to maximise the number of publications.

The nature of the data – It may depend on what people want to use the data for, and the nature of the data itself as to whether shared data is useful. Is this more of an issue for qualitative data which is based on the interpretation of the researcher, rather than quantitative data?

Relevance of the data – Should the data be released while it is still of interest? Old data may lose its relevance or appeal.

Peer Review – The data should be available for the review process to enable peer-review to check the data. This could however be a time consuming process. Not all reviewers may feel they have the time to check the data as well as the article to which it relates.

BENEFITS OF DATA SHARING

New outcomes – Other people may be able to produce fresh interpretations of the data to advance the subject. Different researchers may find patterns that other people have missed.

Preservation – data which is copied and updated by others is more likely to be preserved; it is also more likely to be checked and is thus more reliable.

Ensuring reliability – e.g. making pharmaceutical data open ensures that it is not ‘rubbish’ (see arguments of Ben Goldacre)

Producing a sharing culture – everyone sharing their data means that people cannot ‘bury’ flawed research.

Collaboration/Comprehensivity – sharing a personal database of research means that other people would be able to contribute; one person cannot collect all the data necessary for the project. This would then lead to a comprehensive database.

Pooling data – sharing data would enable data to be pooled from different sources.

ISSUES WITH DATA SHARING

Confidentiality – Issues of confidentiality were raised which would make some data difficult to share.

Infrastructure – Lack of infrastructure in the researcher’s organisation may deter data sharing.

Preservation – Data formats: some are not straightforward; digital data may have been stored in formats that are no longer used (floppy discs, for example); more reliable formats are needed, and readers for obsolete data types may be required. What would assist with data preservation? (e.g. more reliable formats, such as tablets of stone or the web).

Time – It is time consuming to prepare data for sharing.

Value judgments – Who is qualified to judge what data should be preserved, given that not everything can be preserved? Who should have the job of filtering other people’s minds? Will this lead to value judgments being made about some forms of data?

Knowledge is power – it is also access to future funding. People may be concerned about sharing data if it means that it is used by others in a way which prevents them from obtaining future funding to continue with the line of research.

Misuse – future analyses may be incorrect, or cherry-picking of the data may take place to aid a particular argument – and data which does not support the argument can be ignored.

Processed data – people may claim that the data has been fiddled with (processed in some unreliable way).

Lack of Knowledge of how to share data – Someone reported that they did not know how to share data but would like to be able to do this.

Information Overload – A data sharing culture may mean that eventually there is too much information out there to manage successfully.

New research – Research could become a process of analysing old datasets rather than producing new data. Science would then become a process of interpretation.

Different languages – This could be a barrier to collaboration and sharing.

Ownership disputes – There could be disputes between authors as to who owns the data.

Verification studies – Funders do not want to fund them; they are seen as low status and not worthwhile. Journals do not want to publish straightforward replication studies; they value newness, but this does not mean that a study is necessarily worthwhile. Again, value judgements are being made here, but not by the researchers themselves. The researchers are at the joint mercy of funders and publishers.

New models of data sharing – The way data is shared changes frequently (e.g. CD v iTunes model); people have to keep up to date with the environment of sharing.

Financial models – Publishers need to make money, which may impede the process of data sharing; OA needs to find a way of being sustainable.

INCENTIVISATION

Data Citation – This ensures that all data re-use is cited so that the original researcher(s) get(s) the credit for the data they have produced.

How to incentivise?  – Given that University promotion is based on new research and high impact journals, how can researchers be incentivised to share their data if they perceive that this may weaken their professional progress?

Peer review – Peer review of data could lead to public attributions of merit.

LEVELS OF ACCESS

Free information – One attendee wanted to make their personal research available – but wanted the access to be free.

Researcher pays? – This seems like vanity publishing to one of the attendees.

QUANTITATIVE V QUALITATIVE DATA

Someone mentioned that they could not find statistical data to back up their research but anecdotal, qualitative research supported their assumption. If they had waited for the supporting figures it would have taken too long to set their community project in motion. This is why community groups are now commissioning research.

Invitation to researchers to participate in an online survey for JoRD

Researchers – as stakeholders in the data sharing and policy environment – are invited to participate in an online survey for project JoRD.

JoRD (Journal Research Data Policy Bank) is a JISC funded initiative conducting a feasibility study into the scope and shape of a service to collate and summarise journal policies on Research Data. Such a service could provide researchers, managers of research data and other stakeholders with a central source of reference to understand and comply with these policies.
 
The project has just launched an online survey aimed at researchers. The survey is part of the stakeholder consultation phase of the project and aims to gauge researcher opinions/practices concerning research data, data sharing, the policies of journals, and thoughts on the shape of such a service.

The results of the survey will allow us to build a picture of researcher needs and will help inform recommendations made by the report.

The link to the online survey is given here:

http://www.surveymonkey.com/s/GZVP5ZS

Please feel free to distribute this link to other researchers that you may know.

We plan to keep the survey open until the end of Sunday 4th November.

Stakeholder Consultation

A crucial component of the JoRD project is now under way. Central to the building of a case for the JoRD policy bank is an in-depth consultation with stakeholders that have an interest in the policies and practices deployed by academic journal publishers with regards to data produced by researchers. These stakeholders naturally include publishers and journals, but also other players such as research funders, research administrators, data managers and librarians.

The consultation is intended to build on other strands of JoRD which are identifying and categorising current relevant data policies. We will thus seek to tease out the thinking and philosophy that underlies these policies, views about how these might develop and perceptions of what research data represents in the context of the publication process.

In the first instance, the consultation takes the form of semi-structured interviews with a selection of fifteen individuals. The questions that frame the interviews, reflecting the sort of issues outlined above, are attached here for information. The individuals concerned, ten of whom come from the publishing world, have now been contacted and, with the exception of a couple from whom final confirmation is expected, all have readily expressed an interest and agreed to take part. An interview schedule is being drawn up, covering the period 17 September to 12 October; indeed, as of today (20 September), the first interviews have already taken place.

The tight timescale of the project, along with budgetary constraints, limits the number of interviews that can realistically be carried out by mid-October. However, to capitalise on the positive reactions the project has generated, and to enrich as much as possible the range of views being gathered, we are also asking an additional ten or so individuals to provide us with written responses to the interview questions. We are thus aiming to collate and synthesise the thoughts of about 25 key people. These will reflect a good diversity of circumstances; within the publishing world, we aim to represent the standpoints of commercial and learned society publishers; open-access and subscription-based publishers; small and large organisations; university presses; and individual journals, where these have policies in place that are distinct from those of their parent publishing houses. Figuring among the non-publishing organisations to be consulted will be RCUK, HEFCE, DCC, JISC and ARMA – and hopefully, to provide a non-UK perspective, the Australian National Data Service.

The outputs from the interviews and responses to the questions will be synthesised into an interim report, to be produced during the second half of October. This in turn will form the basis of a discussion at an expert workshop, which will flesh out and refine salient points that will have emerged. The event is expected to take place at a date to be confirmed in late October or during the first half of November. Several of the interviewees have already agreed in principle to take part. More about this in a later post, once matters have progressed in the initial phases of the consultation.

Stéphane Goldstein