The International Arab Journal of Information Technology (IAJIT)



A Hybrid Technique for Annotating Book Tables

Asima Latif1, Shah Khusro1, Irfan Ullah1, and Nasir Ahmad2

1Department of Computer Science, University of Peshawar, Pakistan
2Department of Computer Systems Engineering, University of Engineering and Technology Peshawar, Pakistan

Table extraction is usually complemented by table annotation, which uncovers the hidden semantics of the tables in a document or book. These semantics are determined by identifying a type for each column, finding the relationships between columns, if any, and identifying the entities in each cell. Although such approaches have been applied to small documents and web pages, they have not been extended to extracting and annotating tables in books. This paper focuses on detecting, locating, and annotating entities in book tables. More specifically, it contributes algorithms for identifying and locating tables in books and for annotating table entities using the online knowledge source DBpedia Spotlight. Entities missing from DBpedia Spotlight are then annotated using Google Snippets. The combined results were found to give higher accuracy and better performance than DBpedia alone. The approach complements existing table annotation approaches, as it makes it possible to discover and annotate entities that are not present in the catalogue. We tested our scheme on Computer Science books and obtained promising results in terms of accuracy and performance.
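The two-stage annotation idea described above (consult a primary knowledge source first, then use a second source only for the cells it misses) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function `annotate_cells`, the `Annotator` type, and the two toy lookup tables are hypothetical names introduced here, and the dictionaries merely stand in for DBpedia Spotlight and for types derived from Google Snippets so that the example runs without network access.

```python
from typing import Callable, Dict, List, Optional

# An annotator maps a cell's text to an entity label or URI,
# or returns None when it has no match for that cell.
Annotator = Callable[[str], Optional[str]]


def annotate_cells(
    cells: List[str], primary: Annotator, fallback: Annotator
) -> Dict[str, Optional[str]]:
    """Annotate each cell with the primary source; use the fallback only for misses."""
    annotations: Dict[str, Optional[str]] = {}
    for cell in cells:
        label = primary(cell)
        if label is None:
            # The primary source has no entry, so try the secondary source.
            label = fallback(cell)
        annotations[cell] = label
    return annotations


# Hypothetical stand-ins (assumptions for illustration only):
# a tiny "catalogue" playing the role of DBpedia Spotlight, and a tiny
# table of types playing the role of snippet-derived annotations.
TOY_CATALOGUE = {
    "Python": "http://dbpedia.org/resource/Python_(programming_language)",
}
TOY_SNIPPET_TYPES = {
    "Quicksort": "sorting algorithm",
}


def toy_primary(cell: str) -> Optional[str]:
    return TOY_CATALOGUE.get(cell)


def toy_fallback(cell: str) -> Optional[str]:
    return TOY_SNIPPET_TYPES.get(cell)


result = annotate_cells(["Python", "Quicksort", "xyzzy"], toy_primary, toy_fallback)
for cell, label in result.items():
    print(cell, "->", label)
```

Passing the two sources in as plain callables keeps the fallback logic independent of how each source is queried, so a real client for either service could be substituted without changing `annotate_cells`. Cells that neither source recognises stay unannotated (`None`) rather than receiving a guessed label.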


Asima Latif obtained her BS and MS degrees in Computer Science from the Department of Computer Science, University of Peshawar, Pakistan. Her research interests include information retrieval, information extraction, information semantics, and search engines.

Shah Khusro received his PhD degree from Vienna University of Technology, Vienna, Austria. He is currently working as Professor at the Department of Computer Science, University of Peshawar, Pakistan. His research interests include Web Semantics, Web Engineering, Information Retrieval, Web-based Systems, Ambient Assisted Living, and Mobile Technology for People with Special Needs.

Irfan Ullah received his MS degree in Web Engineering from the Department of Computer Science, University of Peshawar, Pakistan, in 2014. He is now pursuing his PhD at the same institute. His research interests include Web Semantics, Linked Open Data, Information Retrieval, Web Engineering, and Digital Libraries. He is also working as Assistant Professor at Shaheed Benazir Bhutto University, Sheringal, Pakistan.

Nasir Ahmad received his PhD degree from Loughborough University, UK. He is currently working as Assistant Professor at the Department of Computer Systems Engineering, University of Engineering and Technology Peshawar, Pakistan. His research interests include Speech and Video Processing, Digital Signal Processing, Pattern Recognition, and Machine Learning.