Going back to basics – reusing data

It is almost a year since the first set of data was gathered to analyse journal articles, and now the benefits of saving data well are becoming apparent. Two things are happening that mean we are getting the basic figures out, dusting them off and looking at them again. The first is a paper about the development of a model journal research data policy, which is being co-authored by the JoRD team members, and the second is a response to certain questions that various people are asking.

The idea of creating a model policy emerged from the mass of data that was being found in the analytical process. It was based on what journals were already doing and on suggestions from the report “Sharing Publication-related Data and Materials: Responsibilities of Authorship in the Life Sciences” (Committee on Responsibilities of Authorship in the Biological Sciences, 2003, http://www.nap.edu/openbook.php?isbn=0309088593). The report was the outcome of a workshop in the United States which involved biological scientists. The five principles and ten recommendations stated in the report were strongly in favour of open access to the data that underpins the research reported in published articles. A summary of the principles and recommendations can be found here: http://www.councilscienceeditors.org/files/scienceeditor/v26n6p192-193.pdf. The report suggested that the data could either be included in the article or deposited in a reputable repository and linked to the article. The first model data policy was therefore based on the rather patchy and inconsistent set of policies that were found, from fewer than half the journals we analysed, and on a report which was biased towards one scientific discipline. It was decided to compare the initial model data policy with the needs of the stakeholders, which were examined at a later stage in the JoRD project. This has entailed not only going over the data gathered from the stakeholder interviews and questionnaire, but also digging retrospectively into the reasons why the initial model criteria were chosen.

The second reason for examining the basic data has come from interesting questions asked by a number of bodies that know about the JoRD project, and therefore assume that the JoRD team are experts in the field of journal research data policies, an assumption that is becoming increasingly true as more questions are answered. To answer the questions, the data needed to be looked at from a different perspective. For example, to answer “How many journals make sharing a requirement of publication?” the original data set was re-examined and journals were counted, because the original analysis had counted policies, and some journals have up to three different data policies. Here follows a table with figures from a journal perspective:

Results of Journal Survey

Measure                                                                  No. of journals
Total no. of journals surveyed                                           371
Total no. of journals with data sharing policies                         162
Total no. of journals that make sharing a requirement of publication     31
Total no. of journals that enforce the policies                          27
Total no. of journals that state consequences for non-compliance         7
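
As a rough illustration, and not part of the original analysis, the counts in the table can be expressed as percentages of the 371 journals surveyed. The dictionary keys and the function name below are shorthand invented for this sketch; only the counts themselves come from the table.

```python
# Counts copied from the table above; the keys are hypothetical shorthand labels.
survey = {
    "surveyed": 371,
    "with_policy": 162,
    "sharing_required": 31,
    "policy_enforced": 27,
    "consequences_stated": 7,
}


def share_of_surveyed(counts):
    """Return each count as a percentage of all journals surveyed (one decimal place)."""
    total = counts["surveyed"]
    return {
        label: round(100 * value / total, 1)
        for label, value in counts.items()
        if label != "surveyed"
    }


percentages = share_of_surveyed(survey)
for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

On these figures, journals with a data sharing policy make up roughly 44% of the sample, which is consistent with the earlier remark that policies were found in under half of the journals analysed.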

This process is an illustration of the way that well organised data, saved safely and, as in this case, in digital form, can be re-used after a particular project has ended. Surely it is generally after research has been concluded that questions arise, and the iterative process of dipping in and out of data to validate or extend the research then begins. The moral of this blog post? Manage your data well because you never know what you will be asked.

Barriers to sharing data

There is a stereotypical image of a covetous academic, dedicated to their work, who hoards the data for their research so that no-one else will achieve the acclaim for their life’s work. Presumably this stereotype arose from such stories as Isaac Newton and Gottfried Leibniz having a major dispute over which of them first discovered calculus. In hindsight, both of them discovered it independently and both deserved acclaim. Charles Darwin kept his data on “On the Origin of Species” for very many years, before being persuaded to publish what turned out to be a popular science book of its day.

But we are not in the 17th or 19th centuries; we are in the age of information, the Internet and global networks, where collaboration has become respected. Teams of scientists are now rewarded, for example the University of Manchester physicists Andre Geim and Kostya Novoselov, who won the Nobel Prize in Physics for their work on graphene. The Royal Society report “Science as an Open Enterprise” (http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf) describes how an outbreak of E. coli which originated in Hamburg was contained by the work of scientists in four continents who posted their analysis of the bacterium onto open source sites. The genetic sequencing of the bacterium was completed by scientists in Hamburg and China, and was then posted onto an open source site with an open data licence. In July of last year the European Commission published a press release outlining the measures that they will take to improve open access to scientific information that is produced in Europe, because the Commission feels that open access to data will improve research and development, and increase knowledge and competitiveness in Europe (“Scientific data: open access to research results will boost Europe’s innovation capacity” http://europa.eu/rapid/press-release_IP-12-790_en.htm).

Such openness and swift communication are expected by today’s researcher. However, an EU study found that only 25% of researchers openly share their data. The researchers who participated in our study expressed the desire to share their data. Some were already sharing, but others found that, although they wanted to share, it was not easy to achieve. Many felt that there were barriers put in their way, one of which involved the old stereotype: they were not expected to share. For example, funding bodies may well be encouraging researchers to give open access to data that was paid for from public funds, but researchers believe that they will not get funding for using data that someone else has collected, although it would be an efficient and economical way of carrying out research. Researchers also reported that universities attract funding for new projects, not for re-use of data, and that there is more interest in publishing new research than replication studies.

Practical reasons were also mentioned. For instance, personal barriers to sharing data were listed as:

  • Not knowing where to deposit data
  • Lack of time and resources to undertake the deposit of data
  • Confidentiality and sensitivity of data, restrictions from the funding body, or the risk of breaking trust with research participants

Barriers in the wider scientific environment were reported as the difficulty in accessing data repositories because of lack of standardisation, and a poorly supported data sharing environment. It would seem that there are two main barriers to be crossed before the open sharing of data is completely commonplace. First, the stereotype of the data-hugging scientist must disappear from the minds of researchers, funders, higher education institutions and publishing houses. Secondly, the infrastructure of data deposit sites (how, when and where to deposit data) has to be fully resolved, publicised and implemented. Once again, it would appear that a JoRD Policy Bank Service would be of great value to researchers, because it would supply a central resource on how, when and where to share data, contribute to improving the data-depositing infrastructure, and remove one barrier to the open access of data.

Overview of policy types from the Science journals in the sample

Policy Types – Science Publications

The following sections set out the different policy types found in an analysis of the sample of Science publications.

Integral – Data/Materials/Software (Integral to your article)

Various policies talk about the data, materials and software etc that have been generated or used in the study, which would be integral to the study findings and necessary for subsequent study replication/verification purposes or to enable other researchers to build on the findings. These are illustrated below.

1. Data Release and Materials Release Policies

Examples

Cold Spring Harbor Laboratory Press: Genome Research (Top Science)

This is a clearly laid out and extensive ‘Life Science’ type policy stating that it is a condition of publication in the journal that materials required to replicate the work must be made freely available – this principle needs to be agreed to on acceptance. Data should also be made as freely accessible as possible prior to publication. There are clear guidelines about the location of materials and a whole set of weblinks are given for the locations of the following types of material:

  • Sequence data
  • Genotype/Phenotype and genomic variation data
  • Microarray data
  • Proteomics and molecular interactions

Accession numbers must be included in the abstract. The policy says that if reasonable requests are not honoured then researchers should contact the Editor.

BioMed Central: Multidisciplinary Respiratory Medicine (Bottom Science)

In the Instructions for Authors, the following data types are listed in their policy under ‘Data and Materials release’; these are also fairly typical for Life Science disciplines:

  • Nucleotide Sequences
  • Protein Sequences
  • Mass spectrometry
  • Structures
  • Chemical structures and assays
  • Functional genomics data (such as microarray, RNA-seq or ChIP-seq data)
  • Computational Modelling
  • Plasmids

Each data type specifies the named databases for storing the data and gives weblinks for ease of access. Where relevant, external guidelines for the data, such as MIAME, are given. These materials are classed as “readily reproducible” and are to be made “freely available to any scientist wishing to use them for non-commercial purposes”. This ‘life science’ type policy only seems unusual in its timescale for inclusion of the Accession Number – which is in time to be included in the published article (rather than, say, with the submitted manuscript).

Cell Press: A range of publications including e.g. Cell (Top Science)

Cell Press publications have a ‘Distribution of Materials and Data’ section which states that it is a term and condition of publishing for authors to be willing to distribute any materials (cells, DNA, antibodies, reagents, organisms, mouse strains, ES cells) and protocols. Structures should also have their relevant information lodged with named or appropriate databases. MIAME guidelines should be followed as appropriate. Authors should contribute additional data/materials to appropriate databases and repositories. Accession numbers are required.

The Royal Society of Chemistry: Chemical Society Reviews (Top Science)

RSC journals have very comprehensive guidelines for both single crystal and powder diffraction data.  In the case of the former, authors should prepare their work in CIF (Crystallographic Information File) format. For single crystal work, structural information should be deposited with the Cambridge Crystallographic Data Centre (CCDC) and upon submission of the manuscript the CCDC reference numbers will be requested. Powder diffraction data may be submitted as a CIF file via the RSC submissions service.

Nature Publishing Group (includes all journals published by Nature which have Nature in the title)

NPG has very full guidelines for ‘Availability of data and materials’, ‘Sharing Materials’, and ‘Sharing data sets’. They refer to various named repositories and databases for many types of materials and data. Guidelines such as MIAME are noted. There is also a comprehensive Further Reading list which encompasses Nature Journal editorials on these topics. http://www.nature.com/authors/policies/availability.html

2. Data/Materials Sharing as the ‘Ethical Guidelines’ of the discipline

Several publishers/publications refer to ‘ethical  guidelines’ which are part of the landscape of the discipline concerned. Professional conduct means that data/materials should be made available for appropriate researchers to allow for further analysis and review.

American Chemical Society: Chemical Reviews (Top Science)

The American Chemical Society publishes a number of journals and has created a set of “Ethical Guidelines to Publication of Chemical Research”. Authors wishing to publish in journals such as Chemical Reviews are expected to follow these ethical guidelines. Part of the ‘Ethical Obligations of Authors’ states that “When requested, the authors should make every reasonable effort to provide data, methods, and samples of unusual materials … to other researchers” and “Authors are encouraged to submit their data to a public database, where available”.

American Physical Society: Reviews of Modern Physics (Top Science)

See under: Ethics and Values (Guidelines for Professional Conduct) – Research Results:

“The results of research should be recorded and maintained in a form that allows analysis and review”

Elsevier (e.g. Progress in Polymer Science – Top Science)

Ethics in Research Publication – Data access and retention:

“Authors may be asked to provide the raw data in connection with a paper for editorial review, and should be prepared to provide public access to such data (consistent with the ALPSP-STM Statement on Data and Databases), if practicable, and should in any event be prepared to retain such data for a reasonable time after publication.”

3. Data Sharing – not necessarily mandatory

BMJ Group – British Medical Journal (Top Science)

Authors are ‘encouraged’ to link their articles to external databases (no hosting to be done by BMJ) and then include a ‘data sharing statement’ at the end of the manuscript. This statement should state if data sharing is available or not, and if it is, where to obtain the information. BMJ is also interested in the informed consent of the research participants and reference to this should also be made in the statement.

Data sharing is thus not mandatory to the journal, but the journal recognises that it could be mandatory according to certain funders etc.

4. Database Linking – connecting with external databases

Elsevier – Current Opinion in Cell Biology (Top Science)

In the Author Information Pack, Elsevier draw attention to linking to external databases that help to build a better understanding of the described research:

“Elsevier encourages authors to connect articles with external databases”.

This is very vague, and is rather more ‘encouraging’ than ‘mandatory’.

Nature Reviews – e.g. Neuroscience, Molecular Cell Biology (Top Science)

Nature Reviews journals publish reviews of existing data in different fields in any case. Their policy states that “Proteins, protein domains, genes and diseases are linked to specific pages in relevant and high-quality public databases”.

5. Links to materials on an author’s institutional website

American Physiological Society: Physiological Reviews (Top Science)

This journal permits one of the authors to provide a working URL from their institutional website (links to additional datasets and/or detailed methods and protocols) which is to be given in an Endnote in the manuscript – under the proviso that it is recognised that this material is not peer-reviewed and may be updated from time to time. It is for readers seeking to replicate or expand on the work.

Supplementary Materials

Supplementary materials are frequently of the request/suggest type and are lodged with the journal – they are mainly of the ‘enhancing your article’ type and often include multimedia. There are, however, exceptions to this general idea of article ‘enhancement’, as some of the supplementary materials could actually be classed as ‘integral’ to the article’s findings.

1. Request/Suggest type – and happy to accept it – usually submitted with the manuscript – published with the journal

These policies are essentially similar to those which are prevalent in the Social Sciences – especially where the publisher is the same (e.g. Springer Publications such as Proceedings of the National Academy of Sciences, India Section B: Biological Sciences – Bottom Science).

Cell Press (See for example Immunity – Top Science)

They see ‘supplemental information’ as a useful resource, but recognise that it needs to be managed by structure and limits. The material is considered to be “additional or secondary support for the main conclusions” (thus implying not of the integral type). They require information to be submitted according to three headings: 1. Supplemental Data, 2. Supplemental Experimental Procedures, 3. Supplemental References. They give file formats and sizes. Alongside this, Immunity also has a Distribution of Materials and Data policy. Immunity is one of the journals to have more than one data policy.

Annual Reviews (See for example Astronomy and Astrophysics – Top Science)

A comprehensive ‘Supplemental Materials Policy’.  Preparation guidelines are provided, along with acceptable and unacceptable file types. This material is to be “supportive but not primary”.

Nature publications (e.g. Cell Biology – Top Science)

Supplementary information is not copy-edited; modifications after publication require a formal correction; the guidelines for it must be followed or publication may be delayed; and each piece of supplementary material must be referred to at least once in the text of the main article. There is a comprehensive set of guidelines for SI.

The Lancet (e.g. Infectious Diseases, Neurology – Top Science)

Unlike other publications which refer to ‘supplementary’ or ‘supplemental’ material, publications by The Lancet tend to refer to ‘Guidelines for web extra material’; however, these refer to fairly standard things such as text, tables, data, drug names, references, figures, and audio/video material. It is preferred that this material is submitted as one PDF with the paper, and it will be peer-reviewed.

The American Astronomical Society: Astrophysical Journal Supplement Series

The AAS have a policy on machine-readable tables (MRT) whereby lengthy tables should be moved to MRT format. There are full guidelines about this.

2. Hosting this material is new to us

The Canadian Field Naturalist (Bottom Science) – “Supplementary Material”

“Supplementary material is a new feature for CFN so we do not know which file formats can and cannot be accepted; please consult our journal Manager with any question about specific formats”

This journal is just starting out on the process and has yet to clarify its procedures.

3. Supporting Information – but ‘essential’ or ‘central’ for understanding the main points of the article – with journal

Wiley Online Library: Angewandte Chemie International Edition (Top Science)

From the ‘Supporting Information’ section – here, although the information is classed under the heading of ‘supporting’, it is actually deemed ‘essential’ to understanding the article and includes “experimental procedures, spectroscopic data, graphics etc”, rather than just enhancing the article. There is a blurring here of supporting material with integral material. This is interesting, as the same journal also has a policy about Crystal Structure Analysis: crystallographic data should not be sent as Supporting Information but should be lodged with the named Data Centres, and deposition numbers must be supplied with the manuscript.

American Society of Clinical Oncology: Journal of Clinical Oncology (Top Science)

This journal “requires that large data sets central to the premise of a manuscript be submitted along with the original work as a supplemental file”. It does also state that data which can be submitted to a public database should be deposited and accession numbers provided.

4. Supplementary Information – which should not be the sole evidence for the article

Wiley Online Library: Ecology Letters (Top Science)

From the ‘Online Supplementary Information’ section – the journal clearly states that “the material published on the internet cannot be used as sole evidence for the print version of the article”. This implies that more integral data – the evidence base for the findings – should also be available elsewhere.

5. Supplemental Materials – only at the Editor’s discretion

Wiley Online Library: CA: A Cancer Journal for Clinicians (Top Science)

This journal states that Supplemental Materials presented as Appendices are not permitted and should be placed within the manuscript or eliminated. Supplemental materials are published at the Editor’s discretion. This journal is not really encouraging concerning the use of supplementary materials.

American Society for Microbiology: Microbiology and Molecular Biology Reviews

The Supplemental Material section states “Please avoid supplemental material”. It is an Editorial decision if any is to be published.

6. Supplementary Information – carefully controlling the volume of SI

Nature: Neuroscience, Immunology (Top Science)

With respect to these journals, the publisher states that, because SI is proliferating and can be unwieldy, “we have therefore decided to carefully control the volume of Supplementary Information”.

New Data – Which is the Actual Publication itself

Examples

IngenieraQuimica – Chemical Engineering (Bottom Social Science):

Under their ‘Write for the Site’ section they include:

“Post, articles, images related to chemical engineering, software or spreadsheets that you have prepared.”  Here the data to be shared becomes the article.

Overview of policy types from the Social Science journals in the sample

Policy Types – Social Sciences

Integral – Data/Materials/Software (Integral to your article)

Like the Sciences, some Social Science publications also have policies for integral data.

1. Integral data – but weak policy

This is the type of policy that ought to be strong, because deposit can and should be monitored, but it is worded weakly, implying that authors may comply if they wish.

Examples

a) Elsevier – Schizophrenia Research (Top Social Science)

This policy refers to DNA sequences and GenBank Accession numbers – which in the case of strong policies are used to monitor that the data has been deposited.

However, the policy says “Many Elsevier journals cite “gene accession numbers”” and “Elsevier authors wishing to enable other scientists to use the accession numbers….” – which are not statements indicating that the data must be deposited as a requirement of publication.

b) The Royal College of Psychiatrists – The British Journal of Psychiatry (Top Social Science)

Under ‘Access to Data’ their policy states:

“If the study includes original data, at least one author must confirm that he or she had full access to all the data in the study, and takes responsibility for the integrity of the data and the accuracy of the data analysis. We strongly encourage authors to make their source data publicly available.”

This is the entirety of what it states and whilst it appears to be strong there is no indication of any monitoring or recommendations as to how the data should be made accessible. There is no recommendation here as to where data could be stored.

2. Integral data – the journal refers you to external Ethical Guidelines

a) Sage – Personality and Social Psychology Review (Top Social Science)

The Submission Guidelines refer you to the ethical guidelines of the American Psychological Association – these ethical guidelines contain the following statement:

8.14 Sharing Research Data for Verification
(a) After research results are published, psychologists do not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis and who intend to use such data only for that purpose, provided that the confidentiality of the participants can be protected and unless legal rights concerning proprietary data preclude their release. This does not preclude psychologists from requiring that such individuals or groups be responsible for costs associated with the provision of such information. http://www.apa.org/ethics/code/index.aspx?item=11

Here the journal policy refers you to data sharing policies that are part of the ethical landscape of the discipline.

This also applies to any publication that is actually published by the American Psychological Association – e.g. Psychological Methods (Top Social Science).

b) Sage – American Sociological Review (Published in association with the American Sociological Association – Top Social Science)

Under the Manuscript Submission procedures, this journal refers you to the ethical guidelines of the American Sociological Association – these ethical guidelines contain the following statement:

“Sociologists make their data available after completion of a project or its major publications, except where proprietary agreements with employers, contractors, or clients preclude such accessibility or when it is impossible to share data and protect the confidentiality of the data or the anonymity of research participants (e.g. raw field notes or detailed information from ethnographic interviews)”

3. Integral Data – but refers to the analysis of pre-existing datasets

a) Physicians Postgraduate Press – The Journal of Clinical Psychiatry (official journal of the American Society for Clinical Psychopharmacology)

See under: ‘Analyses of Preexisting Datasets’ – here the author is not necessarily the creator of the original dataset but is required to provide details about how the dataset in question can be accessed.

Supplementary Materials (Enhancing your article)

1. Request/Suggest type – and happy to accept it – submitted with the manuscript – published with the journal

Examples

a) Taylor and Francis publications – “Adding multimedia and supplementary content to your article” (generic to the publications of the publisher)

  • Reviewed in connection with Journal of Spanish Cultural Studies and Asia Pacific Journal of Social Work and Development (Bottom Social Science Journals)

This policy makes a range of suggestions about what types of material would enhance the article and is happy to accept the material to be published with the journal.

This policy refers to Animations, Movie Files, Sound files, Text files, and Supplementary Material (material that is pertinent to and supports the article).

A range of file formats, file sizes and other instructions are provided. The material must be submitted with the manuscript.

The policy is weak. The material is not a ‘requirement’ of publication.

Both Elsevier (Video Data and Supplementary Data) and the American Psychological Association Publications (Multimedia Files) have a similar generic policy on data which enhances articles.

b) Springer Publications – “Electronic Supplementary Material” (generic to publisher)

  • Reviewed in connection with Asia Europe Journal (Bottom Social Science Journal)

This generic policy similarly refers to Audio, Video, and Animations. But it does also make mention of more specialised formats such as .pdb (chemical), .wrl (VRML) and .tex.

This policy also refers to the “Accessibility” of the provided content (related to catering for disabilities etc).

c) Springer Publications: Studies in East European Thought – “Electronic Supplementary Material”

This also makes mention of large original data such as additional tables.

Wiley Blackwell also have a “Supporting Information” type policy which contains Multimedia elements but also refers to “native datasets and specialist software” (possibly moving into the integral data arena as in the section below).

2. Request/Suggest type – and happy to accept it – but you can also link to an external database or repository (but not your own website)

a) Maney Publishing: London Journal – “Supplementary Material” (Bottom Social Science)

Formats and instructions are given.

3. ‘Supplemental Type’ Materials – which should really be described as ‘Integral’ Materials

a) Project HOPE – Health Affairs – “Supplemental Materials” (Top Social Science)

Some of the materials are probably described as supplemental (and thus supplementary to the article itself) because they will be deposited with the journal. However, the material they refer to is not of the multimedia type (which would enhance the article) but concerns “supplying information that is necessary to evaluate the credibility of their work” and probably should therefore be described as ‘integral’. There is a definition issue at work here.

The journal is particularly keen on the full details of any regressions which have been used.

Other ‘supplemental materials’ “may” be submitted and are therefore properly of the ‘request/suggest’ type.

b) Lippincott Williams & Wilkins  – Epidemiology – “Online Supplemental Material” (Top Social Science)

Under the section on ‘Online Supplemental Material’ the journal makes reference to questionnaires, which should also be provided as online supplemental material. As these are foundational to the actual dataset, they are properly classed as integral materials. Questionnaires are emphasised separately here because they are very frequent research tools in the Social Sciences, yet this is the only mention of them in the policies under review in the JoRD project.

NOTES ON THE ANALYSIS GENERALLY

  • Appropriate databases and repositories for the Social Sciences are seldom mentioned in the policies unless the journal discipline is more scientifically orientated. This raises the question of where Social Science related materials would be stored if policies were to be made STRONG in the Social Sciences. What are the qualitative data databases that need to be referred to? What would the equivalent of an Accession number be?
  • Not much mention is made of specifically Social Science types of data – e.g. transcripts of interviews, focus groups, questions and questionnaires – although some of the data may be implicit in Multimedia policies (e.g. videos of scenarios under investigation in the article). The deposit of such data is complicated by the need to gain the permission of the participants who appear in the videos and recordings of interviews (see the final note below).
  • There is a debate in the Social Sciences about the nature of ‘data’ itself. Social Sciences debate the concept of whether data is ‘out there’ waiting to be found (positivist assumptions), or ‘constructed’ in a ‘reflexive’ manner between researcher and participant. Also, can the context of a previous research study be transferred to the new study, or does the new researcher bring a new reflexivity to the data in question?
  • The data landscape in the Humanities and Social Sciences is complicated by the need for collected data to respect the anonymity of individual human subjects, who may be recognisable from raw data such as field notes. Can totally raw data be provided in the Social Sciences? (This may also apply to Science journals and patient data, though, as some patients may be recognisable from their symptoms.)