Application of Machine Learning in the Process of Classification of Advertised Jobs
Abstract
Institutions that provide official statistics tend to use external data sources such as administrative data sources besides regular statistical surveys. In addition to the mentioned data sources, Big Data became recognized as a new data source for the provider of official statistics. Classification of textual data is one of the elementary tasks for the provider of official statistics, regardless of data sources. In this paper, application of traditional machine learning algorithms, Multinomial Naive Bayes and Support Vector Machine, for the classification of advertised jobs according to ISCO-08, has been presented. The paper presents the methods of collecting data on advertised jobs from four websites and procedures for creating a multilingual dataset. There are different types of text preprocessing, such as converting uppercase letters into lowercase letters, stopword removal, punctuation mark removal, lemmatization, correction of commonly misspelled words, and reduction of replicated characters. We hypothesized that the application of different combinations of preprocessing methods influenced the text classification results. Two experiments had conducted to test the hypothesis. Both experiments results showed that using the Support Vector Machine algorithm on a created dataset gives better results than Multinomial Naive Bayes. Performed experiments showed that the proposed algorithms gave a good performance with an overall accuracy of up to 90% but with different accuracy for individual classes due to an imbalanced dataset.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.