Sergey Parinov edited this page May 11, 2017 · 18 revisions

1. Introduction

The CitEcCyr project, funded by RANEPA, aims to create a source of open citation data that will enable new types of citation analysis, in particular for research papers in Russian. The expected added value compared with existing sources of citation data is: a) transparency and a distributed architecture of the technology that produces the citation data; b) openness of all software built or used and of all citation data created; c) an extended set of citation data sufficient for citation content analysis; d) services for public control over the citing activity of researchers and over the quality of the citation data produced.

For citation data, the project currently uses two approaches, implemented as two project branches called CitEcCyr and CyrCitEc:

  1. the CitEcCyr branch aims to upgrade the CitEc technology so that citation relationships can be extracted and processed from the research papers in Russian available at RePEc and Socionet; and
  2. the CyrCitEc branch has to create a technology for extracting and processing citation data sufficient for citation content analysis, and to use this technology to process research papers available at RePEc and Socionet in order to produce an open source of citation content data.

Details about the CitEcCyr branch are here.

Below we provide details about the CyrCitEc branch.

The main goal of the CyrCitEc branch is to create the conditions for a radical improvement in the quality of research assessment based on traditional citation data. This can be done by moving from traditional citation analysis to citation content analysis. We understand citation content analysis (CCA) as presented in the paper by Guo Zhang, Ying Ding and Staša Milojević, “Citation content analysis (cca): A framework for syntactic and semantic analysis of citation content” (arXiv:1211.6321, submitted on 27 Nov 2012, https://arxiv.org/abs/1211.6321).

2. Citation Content Analysis requirements

Citation content analysis needs data on how references (cited papers) are mentioned in the text of citing papers. We use the term "in-text reference" to denote a construction like "[number]" or "(authors)", which is typically used to link to a publication described in detail in the reference list.

The first CCA requirement is to recognise all "in-text references" in the text of a paper and link them unambiguously to the appropriate publications in the paper's reference list. Each in-text reference should have an attribute that specifies its parent publication in the reference list.
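As a rough illustration of this first requirement, the sketch below recognises only the numeric "[number]" style and links each match to an entry of a reference list. The function name `find_numeric_refs`, the dictionary layout of `reference_list` and the sample data are assumptions made for this example and are not part of the project's software; the "(authors)" style would need a separate matcher.

```python
import re

# Matches bracketed numeric citations such as "[7]" or "[3, 12]".
NUMERIC_REF = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def find_numeric_refs(text, reference_list):
    """Return (matched_text, reference_number, reference_entry) triples.

    reference_list is a hypothetical mapping from a reference number to
    its entry in the paper's reference list. Numbers with no entry are
    linked to None so that unresolved links stay visible.
    """
    found = []
    for match in NUMERIC_REF.finditer(text):
        # A single bracket may cite several publications, e.g. "[1, 2]".
        for number in re.findall(r"\d+", match.group(1)):
            n = int(number)
            found.append((match.group(0), n, reference_list.get(n)))
    return found


# Illustrative data only.
refs = {1: "Zhang, Ding, Milojevic (2012)", 2: "Hypothetical entry B"}
sample = "CCA was proposed in [1] and extended in [1, 2]; see also [9]."
links = find_numeric_refs(sample, refs)
```

Here `[9]` has no entry in the reference list, so it is returned with `None` instead of being dropped silently.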

The paper “Citation content analysis (cca): ...” provides Table 4, “Two-dimension and two-modular code book for CCA”, which specifies the attributes of the citing and cited papers and the attributes of the citing-cited interaction.

At least the following attributes of cited papers from Table 4 can be recognised by processing the text of a paper:

Location of mentioning: 1) Abstract; 2) Introduction; 3) Literature Review; 4) Methodology; 5) Results/discussion; 6) Conclusion; 7) Others.

Frequency of mentioning: 1) Once; 2) 2 to 4 times; 3) 5 times or more.

Style of mentioning: 1) Not specifically mentioning; 2) Specifically mentioning but interpreting; 3) Direct quotation.

The second CCA requirement follows from the task of determining the "location of mentioning". We have to find the titles of all sections in a paper and map them to some classification ("introduction", "review", "methodology", etc.). To relate sections to in-text references, we have to determine text coordinates for both kinds of entity. The text coordinates are the numbers of the start and end symbols of each entity in the paper's text. So the in-text reference should have its coordinates as one more attribute.
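A minimal sketch of how text coordinates could relate an in-text reference to a section is given below. The keyword table, the function names and the choice of half-open character offsets (start included, end excluded) are assumptions for illustration; the category labels are taken from the "location of mentioning" list above.

```python
# Hypothetical keyword table: the first keyword found in a lowercased
# section title decides the category.
SECTION_KEYWORDS = [
    ("abstract", "Abstract"),
    ("introduction", "Introduction"),
    ("literature", "Literature Review"),
    ("review", "Literature Review"),
    ("method", "Methodology"),
    ("result", "Results/discussion"),
    ("discussion", "Results/discussion"),
    ("conclusion", "Conclusion"),
]


def classify_title(title):
    """Map a section title to one of the location categories."""
    lowered = title.lower()
    for keyword, label in SECTION_KEYWORDS:
        if keyword in lowered:
            return label
    return "Others"


def locate(ref_start, sections):
    """Return the category of the section that contains ref_start.

    sections is a list of (start, end, title) tuples whose coordinates
    are character offsets, assumed here to be half-open: [start, end).
    """
    for start, end, title in sections:
        if start <= ref_start < end:
            return classify_title(title)
    return "Others"


# Illustrative section layout of an imaginary paper.
sections = [
    (0, 100, "1. Introduction"),
    (100, 250, "2. Data and Methods"),
    (250, 300, "Acknowledgements"),
]
location = locate(120, sections)
```

An in-text reference that starts at offset 120 falls inside the second section and is therefore classified as "Methodology".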

The third requirement follows from the task of counting the frequency of mentioning for each in-text reference. We have to assign a unique ID to each in-text reference as an attribute, which allows us to count them.
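The counting step can be sketched as follows, under the assumption that each in-text reference is available as a pair of its own unique ID and the ID of the publication it points to. The function names and ID strings are hypothetical; the three frequency classes are those listed above.

```python
from collections import Counter


def frequency_class(count):
    """Map a mention count to one of the three frequency classes."""
    if count < 1:
        raise ValueError("a cited publication is mentioned at least once")
    if count == 1:
        return "Once"
    if count <= 4:
        return "2 to 4 times"
    return "5 times or more"


def mention_frequencies(intext_refs):
    """Count mentions per cited publication.

    intext_refs is an iterable of (intext_id, reference_id) pairs.
    Because every in-text reference has its own unique ID, a record
    that was extracted twice is counted only once.
    """
    unique = {intext_id: reference_id for intext_id, reference_id in intext_refs}
    counts = Counter(unique.values())
    return {ref: (n, frequency_class(n)) for ref, n in counts.items()}


# Illustrative data: "it3" appears twice and must not be double counted.
mentions = [("it1", "r1"), ("it2", "r2"), ("it3", "r2"), ("it3", "r2")]
frequencies = mention_frequencies(mentions)
```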

The fourth requirement originates from the task of determining the style of mentioning. For this we need the text located to the left and to the right of each in-text reference. By analysing this text we can classify the style in which cited publications are mentioned in the citing papers, and also perform many other types of semantic analysis. So one more attribute of an in-text reference is a text string to its left and to its right.
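Extracting this surrounding text is straightforward once the coordinates are known. The sketch below assumes half-open character offsets and a fixed window of `width` characters; both the function name and the default window size are illustrative choices, not values taken from the project.

```python
def context_window(text, start, end, width=50):
    """Return the text to the left and right of an in-text reference.

    start and end are character offsets of the in-text reference,
    assumed half-open: text[start:end] is the reference itself.
    The window is clipped at the beginning and end of the text.
    """
    left = text[max(0, start - width):start]
    right = text[end:end + width]
    return left, right


# Illustrative usage.
text = "As shown earlier [1], the method works."
start = text.index("[1]")
end = start + len("[1]")
left, right = context_window(text, start, end, width=10)
```

A fixed character window is the simplest choice; a sentence-based window might suit semantic analysis better but would need a sentence splitter.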

To summarise the CCA requirements: to be sufficient for CCA, an in-text reference should have at least the following attributes:

  1. a unique ID of the related publication from the reference list;
  2. text coordinates as the numbers of its start and end symbols;
  3. its own unique ID, so that it can be processed as an information object;
  4. the text context to its right and to its left.
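The four attributes listed above can be collected in one record. The following is only a sketch of such a record: the class name, field names and ID strings are assumptions made for this example, and the context is split into two fields for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InTextReference:
    """One in-text reference with the minimal attributes needed for CCA."""

    intext_id: str      # own unique ID of the in-text reference
    reference_id: str   # unique ID of the related reference-list entry
    start: int          # offset of the first symbol in the paper's text
    end: int            # offset just past the last symbol (assumed half-open)
    left_context: str   # text to the left of the in-text reference
    right_context: str  # text to the right of the in-text reference

    def __post_init__(self):
        # Reject coordinates that cannot describe a span of text.
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")


# Illustrative instance.
example = InTextReference(
    intext_id="it1",
    reference_id="r1",
    start=17,
    end=20,
    left_context="As shown earlier ",
    right_context=", the method works.",
)
```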

In the requirements above we mentioned a "publication from a reference list". This means that we also have requirements for the data extracted from a paper's reference list. The main one is that the data must allow a unique ID to be produced for each reference. The ID of a reference should properly match the content of the in-text reference.
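The text does not say how the ID of a reference is produced. One possible approach, shown purely as an assumption, is to hash a normalised key built from fields parsed out of the reference string, so that small differences in case, spacing or punctuation do not change the ID. The choice of fields (first author, year, title), the SHA-1 hash and the 16-character truncation are all illustrative.

```python
import hashlib
import re
import unicodedata


def normalise(value):
    """Casefold a string and replace punctuation runs with one space."""
    value = unicodedata.normalize("NFKC", value).casefold()
    # \w is Unicode-aware for str patterns, so Cyrillic letters are kept.
    return re.sub(r"[^\w]+", " ", value).strip()


def reference_id(first_author, year, title):
    """Build a hypothetical reference ID from parsed reference fields."""
    key = "|".join(normalise(str(part)) for part in (first_author, year, title))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


# Two spellings of the same reference give the same ID.
id_a = reference_id("Zhang", 2012, "Citation content analysis (CCA)")
id_b = reference_id(" zhang ", "2012", "Citation  Content Analysis: CCA")
```

Such a key cannot resolve real ambiguity, for example two papers by the same first author with the same title and year, so it is a starting point only.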

3. Method

To create an open data source for citation content analysis with added value such as "transparency", "distributed architecture", "openness" and "public control", we are developing the following services:

  1. Conversion of PDF. Since most research papers are PDF files, we produce converted versions of the PDF papers that allow extraction (parsing) of the additional data required for CCA. We share our conversion tools and related software on GitHub and provide open access to the converted versions of the PDF papers.

  2. Template of CCA data. Using the converted versions of PDF papers, we can parse extended citation data that include a) references, b) in-text references, c) their relationships (both internal, within the PDF paper, and external, between citing and cited papers), d) the text around in-text references, etc. This parsing can be done in a decentralised way. We propose an XML template for CCA data that allows citation data for the same papers, created by different distributed services, to be integrated.

  3. Parsing references.

  4. Linking references with metadata of the same papers.

  5. Parsing in-text references.

  6. Visualising citation content analysis data over the papers' PDFs.
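The XML template itself is not reproduced on this page. To give an idea of what a CCA record combining references and in-text references might look like, the sketch below builds one with Python's standard `xml.etree.ElementTree` module. Every element and attribute name (`paper`, `references`, `reference`, `intextrefs`, `intextref`, `left`, `right`) and the sample identifiers are invented for this example and should not be read as the project's actual template.

```python
import xml.etree.ElementTree as ET


def build_cca_record(paper_id, references, intext_refs):
    """Build a hypothetical CCA record for one paper.

    references: list of (reference_id, raw_reference_string) pairs.
    intext_refs: list of dicts with the keys id, ref, start, end,
    left and right.
    """
    root = ET.Element("paper", {"id": paper_id})

    ref_list = ET.SubElement(root, "references")
    for ref_id, raw in references:
        ET.SubElement(ref_list, "reference", {"id": ref_id}).text = raw

    mentions = ET.SubElement(root, "intextrefs")
    for item in intext_refs:
        node = ET.SubElement(mentions, "intextref", {
            "id": item["id"],
            "ref": item["ref"],  # points to a reference id above
            "start": str(item["start"]),
            "end": str(item["end"]),
        })
        ET.SubElement(node, "left").text = item["left"]
        ET.SubElement(node, "right").text = item["right"]
    return root


# Illustrative data only.
record = build_cca_record(
    "example:paper-1",
    [("r1", "Zhang G., Ding Y., Milojevic S. (2012) Citation content analysis")],
    [{"id": "it1", "ref": "r1", "start": 17, "end": 20,
      "left": "As shown earlier ", "right": ", the method works."}],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Keeping references and in-text references as separate lists joined by an ID makes it easier to merge records for the same paper produced by different services, which is the integration goal stated in item 2.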
