Research on Modelling Capability of English Multimodal File Search based on Transformer
With the exponential growth of file data in the multimedia era, file retrieval for effective data management has become an active research field. To address the need for English file search, this paper proposes an English multimodal file search model based on the Transformer. Ablation experiments on two public datasets and comparison experiments against the benchmark model verify the effectiveness and superiority of the proposed Transformer-based model in multimodal data processing. A multimodal fusion retrieval system can usually achieve better performance than a single-modal retrieval system, and the experiments here focus on three modalities: audio, image, and text. The experimental results show that the proposed method not only improves the efficiency of file search but also extracts modal features and fuses them more effectively. Future work can explore other attention mechanisms or integrate different architectures to further improve the feasibility and performance of multimodal file search.