The International Arab Journal of Information Technology (IAJIT)



A Framework of Summarizing XML Documents with Schemas

eXtensible Markup Language (XML) has become one of the de facto standards for data exchange and representation in many applications. An XML document is often too complex and too large for a human to understand and use, and a summarized version of the original document is useful in such cases. Three criteria are given to evaluate the final summarized XML document: document size, information content, and information importance. We give a framework for summarizing an XML document based on both the document itself and its schema; the schema is used because it implies much important semantic and structural information. In our framework, redundant data are first removed using abnormal functional dependencies and the schema structure. Then the tags and values of the XML document are summarized based on the document itself and the schema. Our framework is a semi-automatic approach that helps users summarize an XML document, in the sense that some parameters must be specified by the users. Experiments show that the framework gives the summarized XML document a good balance of document size, information content, and information importance compared with the original one.
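As a rough illustration of the two-stage idea described above (remove redundancy first, then condense what remains under user-chosen parameters), the following Python sketch may help. It is not the framework's algorithm: dropping sibling subtrees that are exact duplicates is only a crude stand-in for redundancy detection via abnormal functional dependencies and schema structure, and capping repeated tags is only a stand-in for tag and value summarization. The function names and the `max_per_tag` parameter are hypothetical, and only the document-size criterion is measured.

```python
import xml.etree.ElementTree as ET


def canonical(elem):
    """Return a hashable signature of a subtree (tag, attributes, text, children)."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(canonical(child) for child in elem),
    )


def drop_duplicate_siblings(elem):
    """Stage 1 (assumed stand-in): remove siblings that repeat an earlier sibling exactly."""
    seen = set()
    for child in list(elem):
        drop_duplicate_siblings(child)
        signature = canonical(child)
        if signature in seen:
            elem.remove(child)
        else:
            seen.add(signature)


def cap_repeated_tags(elem, max_per_tag):
    """Stage 2 (assumed stand-in): keep at most max_per_tag same-tag children per parent."""
    counts = {}
    for child in list(elem):
        counts[child.tag] = counts.get(child.tag, 0) + 1
        if counts[child.tag] > max_per_tag:
            elem.remove(child)
        else:
            cap_repeated_tags(child, max_per_tag)


def summarize(xml_text, max_per_tag=2):
    """Return the summarized XML string and its size relative to the input."""
    root = ET.fromstring(xml_text)
    drop_duplicate_siblings(root)
    cap_repeated_tags(root, max_per_tag)
    summary = ET.tostring(root, encoding="unicode")
    return summary, len(summary) / len(xml_text)
```

Here `max_per_tag` plays the role of a user-specified parameter in a semi-automatic approach: a smaller value gives a smaller summary at the cost of information content. A realistic implementation would also need the schema to decide which elements are important enough to keep.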


Teng Lv received his PhD degree from Fudan University, China. His research interests include database and XML data management. He is the author or coauthor of more than 50 journal papers or reviewed conference papers. He is a reviewer or PC member for several journals and conferences both at home and abroad.

Ping Yan received her PhD degree from Fudan University, China. Her research interests include partial differential equations and their applications in neural networks and epidemic diseases, databases, and XML data management.