Lessons Learned: The Complexity of Accurate Identification of in-Text Citations

Author Department of Computer Science, Mohammad Ali Jinnah University, Pakistan ,

Keywords #

Abstract The importance of citations is widely recognized by the scientific community. Citations are used in a number of vital decisions, such as calculating the impact factor of journals, calculating the impact of a researcher (H-Index), and ranking universities and research organizations. Furthermore, citation indexes, along with other criteria, employ citation counts to retrieve and rank relevant research papers. However, citing patterns and in-text citation frequency are not used for such important decisions. The identification of in-text citations in a scientific document is an important problem, but it is a difficult task due to the ambiguity between citation tags and content. This research focuses on in-text citation analysis and makes the following specific contributions: it provides a detailed in-text citation analysis of 16,000 citations from an online journal, reports the different patterns of citation tags and their in-text citations, and highlights the problems (mathematical ambiguities, wrong allotments, commonality in content, and string variations) in identifying in-text citations in scientific documents. The accurate identification of in-text citations will help information retrieval systems, digital libraries, and citation indexes.

References with Zero in-Text Citation Frequency

Another interesting finding is that, out of 16,000 citations, we identified more than 3,000 that were not cited even once in the body text of the citing document. Such citations are nevertheless used in vital decisions such as calculating the impact factors of journals and the H-Index of authors. Authoritative systems may therefore cross-check in-text citation frequencies before making such vital decisions. Furthermore, the administration of journals and conferences, as well as reviewers, should at least make sure that every reference has been cited at least once in the body text of the document.

The International Arab Journal of Information Technology, Vol. 12, No. 5, September 2015

3.5. Identification of in-Text Citation Frequency

A citation tag is a unique combination of characters used to cite a particular reference in the body text of the paper. For example, consider the following scenario: Figure 1-a represents a typical reference from a real document in our dataset where the citation tag is [Weber 1987]. Figure 1-b represents text snippets where the citation tag [Weber 1987] has been used. It is evident from Figure 1-b that the in-text citation frequency of this reference is four. In this section, we explain the reasons for incorrect identification of in-text citations. For this purpose, we have identified clusters based on different types of citation tags.

a) Reference whose citation tag is Weber 1987.
b) In-text citations of Weber 1987 in the body of the article.
Figure 1. Typical scenario of in-text citation occurrences for a reference.

3.5.1. Clustering Citation-Tags

The clusters based on citation-tags are as follows:

Numeric: This cluster represents all citations that have a numeric citation tag, for example [1], 1., and (1).

Alphabetic: This cluster represents all citations that have an alphabetic citation tag. This cluster was the most populated one. Examples of citation-tags in this cluster are: Srinivasan, Scherbakov 1995, [Davenport and Prusak, 1998], [Staiger 1993], [Olson et al. 2002], and [MPEG-7].

Single Character: This is an interesting cluster whose citation-tags are a single character long, such as [N], [P], and [A]. However, this cluster was the least populated one.

3.5.2. Identification of Incorrect in-Text Citations

We identified a number of different reasons for incorrect identification of in-text citations, as listed below. Each reason is related to the above-mentioned clusters:

Wrong Allotment: When an in-text citation of one cited article is assigned to another cited article.

Mathematical Ambiguities: When intervals, equations, figures, or vector values are considered as in-text citations.

Commonality in Content: When normal text is considered as an in-text citation tag. For example, we have a citation tag [P] of a reference, and P is a very common character that is used frequently in the paper's content.

String Variations: When the citation tag in the text of the document is a variant of the citation tag in the reference list. These variations are normally due to the inclusion or exclusion of some characters. Sometimes, authors may refer to a citation slightly differently in the content than in the reference list. For example, the citation tag from the reference list [Davenport and Prusak, 1998] may be referred to in the text of the document in different ways, such as:

[Daven-port and Prusak, 1998], [Davenport and Prusak, 1998], and [Davenport-and Prusak, 1998].

The overall results are presented in Figure 2. The X-axis shows the different clusters discussed above. The Y-axis shows the error percentage in the different categories. The graph shows interesting patterns: the error category Commonality in Content is the most frequently occurring category in the cluster Single Character; String Variations and Wrong Allotment are related to the cluster Alphabetic; and Mathematical Ambiguities is the most prominent problem in the cluster Numeric, although String Variations is also an important issue to be addressed in that cluster.

Figure 2. Reasons for wrong identification of in-text citations (Y-axis: error rate in percentage, 0 to 100; X-axis: Numeric, Alphabetic, and Single Character clusters; series: Wrong Allotment, Mathematical Ambiguities, Commonality in Content, String Variations).

This comprehensive study of more than 16,000 citations yielded insights into the identification of in-text citations. This analysis is helpful for systems that identify in-text citations. The error categories are strongly correlated with the clusters. For example, if a citation entry has an Alphabetic citation tag, the system should focus on the issues of Wrong Allotment and String Variations.

4. Real Scenarios From Scientific Documents

Based on manual inspection and analysis of the incorrect results, we present interesting real scenarios from the documents where in-text citations were identified incorrectly. These scenarios demonstrate real issues where accurate identification of in-text citations is problematic, and they highlight the ambiguity of identifying citation tags in a typical part of a paper's content. Each scenario is detailed below and is a typical example of the common reasons identified above.

4.1. Scenario 1-Mathematical Ambiguity Interval

A reference extracted from the reference section of an article is shown in Figure 3-a. In this case, the citation tag is 2. The citation in the running text of the document could be made using the following citation tags:

[2], [2, , , 2], [2 , 2], [, 2,], or it can be hidden in the citation tag [1-5], which refers to all references from 1 to 5. However, Figure 3-b presents another snippet from the same document where [-2, 2] is part of the paper text and does not belong to a citation tag. Here, [-2, 2] is used in a mathematical formula to denote an interval. Traditional in-text citation discovery systems will incorrectly treat these interval values as an in-text citation of reference 2.

a) Reference snapshot from a paper.
b) Content snippet that can mislead the results for the above reference.
Figure 3. Scenario-1: mathematical ambiguity interval.

To tackle this type of problem, an automated tool needs to discover the context of the citation and disambiguate between an actual citation tag and the content of the paper.

4.2. Scenario 2-Mathematical Ambiguity Parenthesis

This scenario is an extension of Scenario 1. A reference from the reference section of an article is shown in Figure 4-a, where its citation tag is 8. In the body text of that article, (8) is one possible citation tag. However, Figure 4-b shows text from the same document where (8) refers to a mathematical equation defined in that article. Thus, it again becomes ambiguous for an automated tool to identify the in-text citation accurately. Another example of mathematical ambiguity is shown in Figure 5-a and Figure 5-b. In this example, the citation tag [1] is used to refer to the first reference. However, in the body text of the paper some assertions are made and referred to as (1). Therefore, it again becomes ambiguous for automated tools to correctly mark in-text citations for reference [1].

a) Reference snapshot from a paper.
b) Content snippet that can mislead the results.
Figure 4. Scenario-2: mathematical ambiguity parenthesis.

a) Reference snapshot from a paper.
b) Content snippet that can mislead the results for the above reference.
Figure 5. Scenario-2: mathematical ambiguity parenthesis.

Equation numbers and intervals were found to be two important sources of misleading content for the accurate identification of in-text citation frequencies. These types of problems increased the incorrect results, as shown in Table 1. Such problems may be addressed by disambiguating between in-text citations and the context in which such a citation tag is used in the article.

4.3. Scenario 3-String Variations

This scenario shows that a hyphen can appear within the citation tag when a particular reference is cited in the body text of the document. For example, in Figure 6-a the citation tag is [Lawvere and Schanuel 1997]; however, Figure 6-b shows a snippet from the same document where the citation tag [Law-vere and Schanuel 1997] is used to refer to that reference. The inclusion of additional characters such as a hyphen (-) in the in-text citation was another reason for incorrect identification. These types of problems can be resolved using string comparison measures such as edit distance, for example the Levenshtein distance.

a) Reference snapshot from a paper.
b) Content snippet that can mislead the results for the above reference.
Figure 6. Scenario-3: String variation.

4.4. Scenario 4-Wrong Allotment

In the J.UCS dataset, we found that some articles use author and year information for the citation tag. Multiple papers by an author with different co-authors in the same year are referenced as shown in Figure 7-a.

a) Reference snapshot from a paper.
b) Content snippet that can mislead the results.
Figure 7. Scenario-4: Wrong allotment.

There are two separate tags, one for each citation, i.e., [Viroli and Omicine, 2001] and [Viroli et al. 2001]. Automated solutions such as PDFX wrongly build a regular expression for the citation tag based only on the first author and year information. Therefore, a regular expression designed to count the in-text citations of Viroli, 2001 would mislead the results. Improper construction of regular expressions was one of the reasons that contributed to the overall incorrect marking of in-text citations, as shown in Table 1. To solve such problems, regular expressions should be designed carefully; in the above case, two separate regular expressions should be designed:

[Viroli and Omicine, 2001] and [Viroli et al., 2001]. Similarly, Figure 8 shows a reference snapshot from a paper. In this case, automated tools may fail because the regular expression for finding in-text citations is based only on the first author of a paper.

Figure 8. Reference snapshot from a paper.

4.5. Scenario 5-Commonality in Content

We found that some authors use very common citation tags. For example, the reference entry shown in Figure 9 has the citation-tag [P]. Here, contemporary systems will use only the character P as the reference tag.

Figure 9. Reference snapshot from a paper.

These kinds of citation tags are very sensitive because P is a common character that may occur many times in the full text of the paper and will mislead the calculation of in-text citation frequencies. The use of a common character as a citation tag was one of the reasons for the overall incorrect marking of in-text citations, as shown in Table 1. These types of problems may be handled by designing proper regular expressions. For example, in the above scenario, the extensive list of regular expressions would be as follows:

[P], [P, , ,P], [P , P], [,P,].

5. Summary

In-text citations can be used in a number of areas; therefore, accurate marking of in-text citations is crucial. In this paper, we have presented a detailed analysis of in-text citations and some interesting real scenarios explored during manual analysis and verification of in-text citation frequencies. The presented analysis and scenarios will help researchers understand the problems of automatically and correctly marking in-text citations. In-text citations are made with the help of citation tags. Different problems associated with different kinds of citation tags, such as tags using only numbers, alphabets, or alphanumeric characters, have been discussed. There is a need for a deeper analysis of the content of the paper to better disambiguate between mathematical equation numbers, intervals, and actual citation-tags. Besides the difficulty of accurately identifying citation tags, there are other issues related to PDF to text/XML conversion. The most important are subscripts, superscripts, and encoding. Thus, when devising an automatic solution for in-text citation exploitation, the aforementioned issues must be carefully handled so that maximum accuracy can be achieved.

6. Conclusions

This research focuses on the exploration of in-text citation frequencies in the text of scientific documents. In this paper, we have provided a detailed in-text citation analysis of 16,000 citations from an online journal, reported the different patterns of citation tags and their in-text citations, and presented some interesting real problems that a researcher may confront while exploiting in-text citations. Furthermore, the citation tags whose in-text citations were identified inaccurately were divided into three clusters: Numeric, Alphabetic, and Single Character. The Numeric and Alphabetic clusters were the most populated clusters compared to the Single Character cluster.
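
The three clusters named above can be sketched as a small classifier. This is a minimal illustration in Python, not the authors' implementation: the names `strip_delimiters`, `classify_tag`, and `risks_for` are hypothetical, and the rule that any tag which is neither all digits nor a single letter counts as Alphabetic is an assumption made for this sketch. The mapping from cluster to error categories follows the patterns reported for Figure 2.

```python
# Error categories that the analysis associates with each cluster.
CLUSTER_RISKS = {
    "Numeric": ["Mathematical Ambiguities", "String Variations"],
    "Alphabetic": ["Wrong Allotment", "String Variations"],
    "Single Character": ["Commonality in Content"],
}


def strip_delimiters(tag: str) -> str:
    """Drop surrounding brackets or parentheses and a trailing dot.

    '[1]' -> '1', '1.' -> '1', '(1)' -> '1', '[P]' -> 'P'
    """
    return tag.strip().strip("[]()").rstrip(".").strip()


def classify_tag(tag: str) -> str:
    """Assign a citation tag to one of the three clusters (assumed rules)."""
    core = strip_delimiters(tag)
    if not core:
        raise ValueError("empty citation tag")
    if core.isdigit():
        return "Numeric"
    if len(core) == 1 and core.isalpha():
        return "Single Character"
    # Assumption: everything else (author-year tags, acronyms such as
    # MPEG-7) is treated as Alphabetic.
    return "Alphabetic"


def risks_for(tag: str) -> list:
    """Return the error categories a matcher should check first for this tag."""
    return CLUSTER_RISKS[classify_tag(tag)]
```

For example, `classify_tag("[P]")` returns `"Single Character"`, so a matcher built on this sketch would check for commonality in content before counting occurrences of that tag.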
Based upon these three types of clusters, different reasons for inaccurate identification of in-text citations were discovered. The frequent errors were due to wrong allotment, mathematical ambiguities, commonality in content, and string variations. Finally, we have also highlighted possible solutions for each problem that will help future systems which focus on the identification of in-text citations in various domains. In the future, we plan to develop techniques and algorithms to tackle the discussed problems accurately and systematically. Moreover, we plan to build a comprehensive system that can mark the various types of existing in-text citations with sufficient accuracy.

References

[1] Afzal M., Kulathuramaiyer N., Maurer H., and Balke W., Creating Links into the Future, the Journal of Universal Computer Science, vol. 13, no. 9, pp. 1234-1245, 2007.

[2] Afzal M., Maurer H., Balke W., and Kulathuramaiyer N., Rule Based Autonomous Citation Mining with TIERL, the Journal of Digital Information Management, vol. 8, no. 3, pp. 196-204, 2010.

[3] Beel J. and Gipp B., Google Scholar's Ranking Algorithm: The Impact of Citation Counts (An Empirical Study), in Proceedings of the 3rd International Conference on Research Challenges in Information Science, Fez, Morocco, pp. 439-446, 2009.

[4] Ciancarini P., Iorio A., Nuzzolese A., Peroni S., and Vitali F., Semantic Annotation of Scholarly Documents and Citations, in Proceedings of the 13th International Conference of the Italian Association for Artificial Intelligence, Turin, Italy, pp. 336-347, 2013.

[5] Constantin A., Pettifer S., and Voronkov A., PDFX: Fully-Automated PDF-to-XML Conversion of Scientific Literature, in Proceedings of ACM Symposium on Document Engineering, Florence, Italy, pp. 177-180, 2013.

[6] Garfield E., Citation Analysis as a Tool in Journal Evaluation, available at: http://www.garfield.library.upenn.edu/essays/V1p527y1962-73.pdf, last visited 2013.

[7] Giles C., Bollacker K., and Lawrence S., CiteSeer: An Automatic Citation Indexing System, in Proceedings of the 3rd ACM Conference on Digital Libraries, Pennsylvania, USA, pp. 89-98, 1998.

[8] Gipp B. and Beel J., Citation Proximity Analysis (CPA)-A New Approach for Identifying Related Work based on Co-Citation Analysis, in Proceedings of the 12th International Conference on Scientometrics and Informetrics, Rio de Janeiro, Brazil, pp. 571-575, 2009.

[9] Goodall A., Should Top Universities be Led by Top Researchers and are They?: A Citations Analysis, the Journal of Documentation, vol. 62, no. 3, pp. 388-411, 2006.

[10] Hirsch J., An Index to Quantify an Individual's Scientific Research Output, the Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 46, pp. 16569-16572, 2005.

[11] Iorio A., Nuzzolese A., and Peroni S., Characterising Citations in Scholarly Documents: the CiTalO Framework, in Proceedings of Semantic Web: ESWC 2013 Satellite Events, Montpellier, France, pp. 66-77, 2013.

[12] Iorio A., Nuzzolese A., and Peroni S., Identifying Functions of Citations with CiTalO, in Proceedings of Semantic Web: ESWC 2013 Satellite Events, Montpellier, France, pp. 231-235, 2013.

[13] Iorio A., Nuzzolese A., and Peroni S., Towards the Automatic Identification of the Nature of Citations, available at: http://ceur-ws.org/Vol-994/paper-06.pdf, last visited 2013.

[14] Liu S. and Chen C., The Effects of Co-citation Proximity on Co-citation Analysis, in Proceedings of the 13th Conference of the International Society for Scientometrics and Informetrics, Durban, South Africa, pp. 474-484, 2011.

[15] Maricic S., Spaventi J., Pavicic L., and Pifat-Mrzljak G., Citation Context versus the Frequency Counts of Citation Histories, the Journal of the American Society for Information Science, vol. 49, no. 6, pp. 530-40, 1998.

[16] Noor S. and Bashir S., Evaluating Bias in Retrieval Systems for Recall Oriented Documents Retrieval, the International Arab Journal of Information Technology, vol. 12, no. 1, pp. 53-59, 2015.

[17] Ritchie A., Citation Context Analysis for Information Retrieval, Doctoral Dissertation, University of Cambridge, 2009.

[18] Shahid A., Afzal M., and Qadir M., Discovering Semantic Relatedness between Scientific Articles through Citation Frequency, Australian Journal of Basic and Applied Sciences, vol. 5, no. 6, pp. 1599-1604, 2011.

[19] Shotton D., Portwin K., Klyne G., and Miles A., Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article, available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2663789/, last visited 2013.

[20] Teufel S. and Kan M., Robust Argumentative Zoning for Sensemaking in Scholarly Documents, Springer Berlin Heidelberg, 2011.

[21] Teufel S., Citations and Sentiment, available at: http://www.nactem.ac.uk/event_slides/Teufel 291009.pdf, last visited 2013.

Abdul Shahid is a Lecturer in Computer Science at the Institute of Information Technology, Kohat University of Science and Technology, Pakistan. Currently, he is pursuing his PhD in computer science at Mohammad Ali Jinnah University, Islamabad, Pakistan. His research focuses on recommending relevant documents with the help of in-text citation frequencies and patterns. In this field, he has published a number of good-quality papers in different international conferences and journals. Besides his research activities, he is a professional software developer and has worked as a consultant for software companies for the last six years.

Muhammad Tanvir Afzal earned his masters in computer science (with Gold Medal) from Quaid-i-Azam University, Pakistan. He was awarded a PhD with distinction from Graz University of Technology, Austria. He is working as an Assistant Professor in the Department of Computer Science at Mohammad Ali Jinnah University, Pakistan, as an adjunct professor in the Institute for Information Systems and Computer Media at Graz University of Technology, Austria, and as Editor-in-Chief of the Journal of Universal Computer Science. He has published more than 60 research papers in well-reputed journals and conferences. His research interests include digital libraries, semantic web, social web, knowledge management, and sentiment analysis.

Muhammad Abdul Qadir received his PhD degree from the University of Surrey, Guildford, UK in 1995. He serves as full professor and Dean at Mohammad Ali Jinnah University, Pakistan. He has more than 25 years of experience in industry, academia, and management. Currently, he is actively involved in teaching, R&D, and academic management. He is the recipient of two research projects worth more than 55 million rupees. His current research focus is semantic web, multimedia semantics, ontologies, distributed systems, and bioinformatics. He has published more than 100 research publications in international refereed proceedings and journals.