WI2020 Zentrale Tracks
Multi-Class Detection of Abusive Language Using Automated Machine Learning

Mackenzie Jorgensen¹, Minho Choi², Marco Niemann³, Jens Brunk³, and Jörg Becker³
¹ Villanova University, Dept. of Computing Sciences, Villanova, USA
² Lewis & Clark College, Dept. of Mathematical Sciences, Portland, USA
³ University of Münster – ERCIS, Münster, Germany

Detecting abusive language online is a daunting task for moderators. We propose Automated Machine Learning (Auto-ML) to semi-automate abusive language detection and to assist moderators. In this paper, we show that multi-class classification powered by Auto-ML detects abusive language in English and German as well as or better than state-of-the-art machine learning models. We also highlight how we combated the imbalanced data problem in our datasets through feature selection and undersampling methods. We propose Auto-ML as a promising approach to abusive language detection, especially for small companies that may have little machine learning knowledge and limited computing resources.
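To make the undersampling idea concrete, the following is a minimal sketch of random undersampling, which trims every class down to the size of the rarest one. It is an illustration under stated assumptions rather than the procedure used in the paper: the abstract does not say which undersampling method was applied, and the function name `random_undersample`, the toy comments, and the three class labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(texts, labels, seed=0):
    """Randomly drop samples so that every class is as large as the rarest class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    # The rarest class determines how many samples each class keeps.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for text in rng.sample(items, target):
            balanced.append((text, label))

    # Shuffle so that the classes are not stored in contiguous blocks.
    rng.shuffle(balanced)
    new_texts = [text for text, _ in balanced]
    new_labels = [label for _, label in balanced]
    return new_texts, new_labels


# Toy, made-up data with an imbalanced label distribution (assumed labels).
texts = [f"comment {i}" for i in range(11)]
labels = ["neither"] * 6 + ["offensive"] * 3 + ["hate"] * 2

bal_texts, bal_labels = random_undersample(texts, labels)
class_counts = Counter(bal_labels)
```

After the call, `class_counts` holds the same number of samples for each class. The trade-off of this approach is that majority-class data is discarded, which is why it is often combined with other measures such as feature selection.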

Keywords: Abusive Language Detection, Automated Machine Learning, Multi-Class Classification