The International Arab Journal of Information Technology (IAJIT)

BPTI: Bilingual Printed Text Images Dataset for Recognition Purposes

Datasets of text images are important for optical text recognition systems. Such datasets can be used to enhance performance and recognition rates. In this research work, we present a bilingual dataset consisting of Arabic/English text images to address the lack of available bilingual text databases. The presented dataset consists of 97812 text images, which are categorized into two groups: scanned page images and digitized line images. Images in both groups are rendered in 10 fonts and four sizes, and prepared or scanned at four dpi resolutions. The dataset preparation process includes text collection, text editing, image construction, and image processing. The dataset can be used in optical text recognition, optical font recognition, language identification, and segmentation. Different text recognition and language identification experiments have been conducted using images of the dataset and a Hidden Markov Model (HMM) classifier. For the digitized image recognition experiments, the best recognition correctness achieved is 99.01% and the best accuracy is 99.01%. The font with the highest recognition rates was Tahoma. For the scanned image recognition experiments, Tahoma also showed the highest performance, with 97.86% correctness and 97.73% accuracy. For the language identification experiments, Tahoma achieved a word-language identification rate of 99.98%.
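The abstract reports both "correctness" and "accuracy" without defining them. A minimal sketch of how the two rates are commonly computed for HMM recognizers is given below, assuming the HTK-style definitions (correctness ignores insertion errors, accuracy penalizes them). Whether the paper uses exactly these definitions is an assumption; the function name and the error counts are hypothetical and chosen only for illustration.

```python
def recognition_rates(n_labels, deletions, substitutions, insertions):
    """Return (correctness, accuracy) as percentages.

    Assumed HTK-style definitions (not confirmed by the abstract):
      correctness = (N - D - S) / N
      accuracy    = (N - D - S - I) / N
    where N is the number of reference labels, and D, S, I are the
    deletion, substitution, and insertion error counts.
    """
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    hits = n_labels - deletions - substitutions
    correctness = 100.0 * hits / n_labels
    accuracy = 100.0 * (hits - insertions) / n_labels
    return correctness, accuracy


# Hypothetical counts, for illustration only (not taken from the paper).
corr, acc = recognition_rates(
    n_labels=1000, deletions=5, substitutions=10, insertions=3
)
print(f"correctness={corr:.2f}%  accuracy={acc:.2f}%")
```

Under these definitions accuracy can never exceed correctness, and the two coincide only when there are no insertion errors, which would be consistent with the identical 99.01% figures reported for the digitized images.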
