Pairwise Sequence Alignment using Bio-Database Compression by Improved Fine Tuned Enhanced Suffix Array

Author ,

Keywords #

Abstract Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences which is assumed to have same ancestral relationships. This sequence alignment method reads query sequence from the user and makes an alignment against large and genomic sequence data sets and locate targets that are similar to an input query sequence. Existi ng accurate algorithm, such as smith)waterman and F ASTA are computationally very expensive, which limits their use in practice. The existing search tools, such as BLAST and WU) BLAST, employ heuristics to improve the speed of su ch searches. However, such heuristics can sometimes miss targets, in which many cases are undesirable. Considering th e rapid growth of database sizes, this problem demands ever) growing computation resources and remains as a comp utational challenge. Most common sequence alignment algorithms like BLAST, WU)BLAST and Sequance Compar asion Tool (SCT) searches a given query sequence against set of database sequences. In this paper, Biologica l Data Base Compression Tool using Minimum Perfect Hash Function (BioDBMPHF) tool has been developed to fin d pair wise local sequence alignment by preprocessing the database. Preprocessing is done by means of finding Longest Common Substring (LCS) from the database o f sequences that have the highest local similarity with a given query sequence and reduces the size of the databas e based on frequent common subsequence. In this BioDBMPHF tool fine)tuned enhanced suffix array is constructed and used to find LCS. Experimental results show that hash index algorithm reduces the time and space complexity to access LCS. Time complexity to find LCS of the hash index algor ithm is O(2+γ) where ‘γ’ is the time taken to access the pattern. Space complexity of fine)tuned enhanced suffix arra y is 5n bytes per character for reduced enhanced Lo ngest Common Prefix (LCP) table and to store bucket table it req uires 32 bytes. Data mining technique is used to cr oss validate the result. It is proved that the developed BioDBMPHF t ool effectively compresses the database and obtains same results compared to that traditional algorithm in approxima tely half the time taken by them thereby reducing the time complexity.

References

[1] Abouelhoda M. and Dawood A., Fine Tuning the Enhanced Suffix Array, in Proceedings of the 4th Cairo International Biomedical Engineering Conference , IEEE Explore , Cairo, Eygpt, pp. 1*4, 2008.

[2] Abouelhoda M., Kurtz S., and Ohlebusch E., Replacing Suffix Trees with Enhanced Suffix Arrays, available at: http:// www.vmatch.de/ AboKurOhl2004.pdf, last visited 2004.

[3] Abouelhoda M., Stefan K., and Ohlebusch E., The Enhanced Suffix Array and its Application to Genome Analysis, in Proceedings of the 2 nd Workshop on Algorithms in Bioinformatics , Rome, Italy, pp. 1*15, 2004.

[4] Altschul S., Gap Costs for Multiple Sequence Alignment, the Journal of Theoretical Biology , vol. 138, no. 3, pp. 297*309, 1989.

[5] Altschul S., Gish W., Miller W., and Lipman J., Basic Local Alignment Search Tool, the Journal of Molecular Biology , vol. 215, no. 3, pp. 403*410, 1990.

[6] Altschul S., Madden T., Sch ffer A., Zhang J., Zhang Z., Miller W., and Lipman D., Gapped BLAST and PSI*BLAST: A New Generation of Protein Database Search Programs, Nucleic Acids Research , vol. 25, no. 17, pp. 3389*3402, 1997.

[7] Divya R., Abdullah N., and Xindong W., Using an Extended Suffix Tree to Speed*Up Sequence Alignment, in Proceedings of IADIS International Conference Applied Computing , Salamanca, Spain, pp. 655*660, 2006.

[8] Fischer J. and Heun V., A New Succinct Representation of RMQ Information and Improvements in the Enhanced Suffix Array, in Proceedings of the International Symposium on Combinatorics , Algorithms , Probabilistic and Experimental Methodologies , Hangzhou, China, pp. 459*470, 2007.

[9] Mousa H., Moustafa K., Abdel*Wahed W., and Hadhoud M., Data Hiding Based on Contrast Mapping using DNA Medium, the International Arab Journal of Information Technology , vol. 8, no. 2, pp.147* 154, 2011.

[10] Kim D., Kim M., and Park H., Linearized Suffix Tree: An Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays, Algorithmica , vol. 52, no. 3, pp. 350* 377, 2008.

[11] Kunthavai A. and Vasantharathna S., Bio* Database Compression using Enhanced Suffix Array for Pair wise Sequence Alignment, the Journal of Cell and Molecular Biology , vol. 9, no. 1, pp. 45*52, 2011.

[12] Manber U. and Myers E., Suffix Arrays: A New Method for On*Line String Searches, the SIAM Journal on Computing , vol. 22, no. 5, pp. 935*948, 1993.

[13] Needleman S. and Wunsch C., A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins, available at: http://csb.stanford.edu/class/ public/readings/Bioinformatics_I_Lecture6/Need leman_Wunsch_JMB_70_Global_alignment.pdf, last visited 2012.

[14] Pearson W. and Lipman D., Improved Tools for Biological Sequence Comparison, the Proceedings of the National Academy of Sciences of the United States of America , vol. 85, no. 8, pp. 2444*2448, 1988.

[15] Pearson W., Effective Protein Sequence Comparison, Methods in Enzymology , Elsevier, 2000.

[16] Pearson W., Flexible Sequence Similarity Searching with the FASTA3 ProProgram Package, Methods in Molecular Biology , Springer, 2000.

[17] Smith T. and Waterman M., Identification of Common Molecular Subsequences, available at: http://csb.stanford.edu/class/public/readings/Bioi nformatics_I_Lecture6/Smith_Waterman_JMB_ 81_Local_sequence_alignment.pdf, last visited 1991. Arumugam Kunthavai received BE degree in computer science and engineering from Madurai Kamarajar University, MS degree (by research) in 2008 and PhD degree in 2014 from Anna University, India. Currently, she is working as an Assistant Professor in computer scien ce and engineering at Coimbatore Institute of Technology, India. She is life member of ISTE. Her research interest includes bioinformatics, distribu ted system and data mining. Pairwise Sequence Alignment using Bio)Database Compression by Improved Fine Tuned Enhanced Suffix Array 359 Somasundaram Vasantharathna received BE degree in electrical engineering, ME degree in applied electronics and phD degree in electrical engineering all from Bharathiar University, Tamilnadu, India. Currently, she is working as a Professor and Head of electrical and electronics engineering at Coimbatore Institute of Technology, India. She is life member of ISTE. Her research int erest includes operating systems, power quality and smart energy systems and image processing applications to food adulteration. Swaminathan Thirumurugan received BE degree in computer science and engineering from Coimbatore Institute of Technology, India in 2012. Currently, he is working atInformatica. His research interest includes algorithm analysis and data mining.

Abstract: Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences which is assumed to have same ancestral relationships. This sequence alignment method reads query sequence from the user and makes an alignment against large and genomic sequence data sets and locate targets that are similar to an input query sequence. Existi ng accurate algorithm, such as smith)waterman and F ASTA are computationally very expensive, which limits their use in practice. The existing search tools, such as BLAST and WU) BLAST, employ heuristics to improve the speed of su ch searches. However, such heuristics can sometimes miss targets, in which many cases are undesirable. Considering th e rapid growth of database sizes, this problem demands ever) growing computation resources and remains as a comp utational challenge. Most common sequence alignment algorithms like BLAST, WU)BLAST and Sequance Compar asion Tool (SCT) searches a given query sequence against set of database sequences. In this paper, Biologica l Data Base Compression Tool using Minimum Perfect Hash Function (BioDBMPHF) tool has been developed to fin d pair wise local sequence alignment by preprocessing the database. Preprocessing is done by means of finding Longest Common Substring (LCS) from the database o f sequences that have the highest local similarity with a given query sequence and reduces the size of the databas e based on frequent common subsequence. In this BioDBMPHF tool fine)tuned enhanced suffix array is constructed and used to find LCS. Experimental results show that hash index algorithm reduces the time and space complexity to access LCS. Time complexity to find LCS of the hash index algor ithm is O(2+γ) where ‘γ’ is the time taken to access the pattern. Space complexity of fine)tuned enhanced suffix arra y is 5n bytes per character for reduced enhanced Lo ngest Common Prefix (LCP) table and to store bucket table it req uires 32 bytes. Data mining technique is used to cr oss validate the result. It is proved that the developed BioDBMPHF t ool effectively compresses the database and obtains same results compared to that traditional algorithm in approxima tely half the time taken by them thereby reducing the time complexity.
URL: https://iajit.org/paper/4746

,abstract={Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences which is assumed to have same ancestral relationships. This sequence alignment method reads query sequence from the user and makes an alignment against large and genomic sequence data sets and locate targets that are similar to an input query sequence. Existi ng accurate algorithm, such as smith)waterman and F ASTA are computationally very expensive, which limits their use in practice. The existing search tools, such as BLAST and WU) BLAST, employ heuristics to improve the speed of su ch searches. However, such heuristics can sometimes miss targets, in which many cases are undesirable. Considering th e rapid growth of database sizes, this problem demands ever) growing computation resources and remains as a comp utational challenge. Most common sequence alignment algorithms like BLAST, WU)BLAST and Sequance Compar asion Tool (SCT) searches a given query sequence against set of database sequences. In this paper, Biologica l Data Base Compression Tool using Minimum Perfect Hash Function (BioDBMPHF) tool has been developed to fin d pair wise local sequence alignment by preprocessing the database. Preprocessing is done by means of finding Longest Common Substring (LCS) from the database o f sequences that have the highest local similarity with a given query sequence and reduces the size of the databas e based on frequent common subsequence. In this BioDBMPHF tool fine)tuned enhanced suffix array is constructed and used to find LCS. Experimental results show that hash index algorithm reduces the time and space complexity to access LCS. Time complexity to find LCS of the hash index algor ithm is O(2+γ) where ‘γ’ is the time taken to access the pattern. Space complexity of fine)tuned enhanced suffix arra y is 5n bytes per character for reduced enhanced Lo ngest Common Prefix (LCP) table and to store bucket table it req uires 32 bytes. Data mining technique is used to cr oss validate the result. It is proved that the developed BioDBMPHF t ool effectively compresses the database and obtains same results compared to that traditional algorithm in approxima tely half the time taken by them thereby reducing the time complexity.},
keywords={},
ISSN={2413-9351},
month={Jan}}

AB - Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences which is assumed to have same ancestral relationships. This sequence alignment method reads query sequence from the user and makes an alignment against large and genomic sequence data sets and locate targets that are similar to an input query sequence. Existi ng accurate algorithm, such as smith)waterman and F ASTA are computationally very expensive, which limits their use in practice. The existing search tools, such as BLAST and WU) BLAST, employ heuristics to improve the speed of su ch searches. However, such heuristics can sometimes miss targets, in which many cases are undesirable. Considering th e rapid growth of database sizes, this problem demands ever) growing computation resources and remains as a comp utational challenge. Most common sequence alignment algorithms like BLAST, WU)BLAST and Sequance Compar asion Tool (SCT) searches a given query sequence against set of database sequences. In this paper, Biologica l Data Base Compression Tool using Minimum Perfect Hash Function (BioDBMPHF) tool has been developed to fin d pair wise local sequence alignment by preprocessing the database. Preprocessing is done by means of finding Longest Common Substring (LCS) from the database o f sequences that have the highest local similarity with a given query sequence and reduces the size of the databas e based on frequent common subsequence. In this BioDBMPHF tool fine)tuned enhanced suffix array is constructed and used to find LCS. Experimental results show that hash index algorithm reduces the time and space complexity to access LCS. Time complexity to find LCS of the hash index algor ithm is O(2+γ) where ‘γ’ is the time taken to access the pattern. Space complexity of fine)tuned enhanced suffix arra y is 5n bytes per character for reduced enhanced Lo ngest Common Prefix (LCP) table and to store bucket table it req uires 32 bytes. Data mining technique is used to cr oss validate the result. It is proved that the developed BioDBMPHF t ool effectively compresses the database and obtains same results compared to that traditional algorithm in approxima tely half the time taken by them thereby reducing the time complexity.