An Anti-Spam Filter Based on One-Class IB Method in Small Training Sets
We present an approach to email filtering based on the one-class Information Bottleneck (IB) method for small training
sets. When the themes of emails change continually, the available training set that is highly relevant to the current theme
is small. Hence, we show how to fit the learning algorithm and how to filter spam when only such small training sets are
available. First, to preserve classification accuracy and avoid over-fitting while substantially reducing training set size, we
formulate learning as finding a one-class centroid averaged only over highly positive emails. Second, we
design a simple binary classification model that filters spam by comparing the similarity between emails and centroids.
Experimental results show that, on small training sets, our method significantly improves classification accuracy compared
with currently popular methods such as Naive Bayes, AdaBoost, and SVM.
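The centroid-comparison idea described above can be illustrated with a minimal sketch. This is not the authors' one-class IB procedure: it assumes the "highly positive" emails have already been selected, represents each email as a Laplace-smoothed word distribution, and uses KL divergence as the dissimilarity measure. The function names, the toy emails, and the use of two centroids (spam and legitimate) are all illustrative assumptions.

```python
import math
from collections import Counter


def term_distribution(text, vocab, alpha=1.0):
    """Laplace-smoothed word distribution of one email over a fixed vocabulary."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def centroid(distributions):
    """Average of several word distributions; the result is again a distribution."""
    n = len(distributions)
    return {w: sum(d[w] for d in distributions) / n for w in distributions[0]}


def kl_divergence(p, q):
    """KL(p || q); finite here because smoothing keeps every probability positive."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def classify(text, vocab, spam_centroid, ham_centroid):
    """Label an email by whichever centroid its word distribution is closer to."""
    p = term_distribution(text, vocab)
    d_spam = kl_divergence(p, spam_centroid)
    d_ham = kl_divergence(p, ham_centroid)
    return "spam" if d_spam < d_ham else "ham"


# Toy training emails (assumed to be the pre-selected, highly positive examples).
spam_train = ["win cash prize now", "cheap pills win prize"]
ham_train = ["meeting agenda for monday", "please review the project report"]

vocab = sorted({w for t in spam_train + ham_train for w in t.split()})
spam_c = centroid([term_distribution(t, vocab) for t in spam_train])
ham_c = centroid([term_distribution(t, vocab) for t in ham_train])

print(classify("win a cash prize", vocab, spam_c, ham_c))
print(classify("project meeting agenda", vocab, spam_c, ham_c))
```

Averaging over only a few representative emails keeps the model to one distribution per class, which is the intuition behind limiting over-fitting when training data is scarce. The choice of KL divergence here is a stand-in for whatever distortion measure the full method uses.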
The International Arab Journal of Information Technology, Vol. 13, No. 6, November 2016

Chen Yang received his BE and ME degrees from the School of Information Engineering, Zhengzhou University. Currently, he is a PhD candidate in the School of Information, Renmin University of China, China, and is also an assistant in the School of Software Engineering at Zhengzhou University of Light Industry, China. His research interests include machine learning and big data systems.

Shaofeng Zhao received his BE and ME degrees from the School of Information Engineering, Zhengzhou University. Currently, he is an assistant in the library at Henan University of Economics and Law, China. His research interests include cloud computing and cloud storage.

Dan Zhang received her BE degree from the School of Computer, Henan University of Economics and Law, and her ME degree from the School of Information Engineering, Zhengzhou University. Currently, she is an engineer in the Geophysical Exploration Center of the China Earthquake Administration. Her research interests include complex systems and machine learning.

Junxia Ma received her ME degree from Zhengzhou University. Currently, she is a lecturer in the School of Software Engineering at Zhengzhou University of Light Industry, China. Her research interests include artificial intelligence, data mining