A three-phase MapReduce-based algorithm for searching biomedical document databases
Abstract
Retrieving information from large document databases has been a focus of scientific research in recent years. This paper presents a parallel algorithm, based on the MapReduce technique, for searching biomedical documents. The algorithm consists of three phases: a preprocessing phase, a document representation phase, and a searching phase. In the first phase, lemmatization and stop-word elimination are performed. In the second phase, each document is represented as a list of pairs (word, tf-idf index of the word). The third phase is the main searching procedure. It uses a specially designed ranking criterion based on a combination of the term frequency-inverse document frequency (tf-idf) index and an indicator function for each query word. Four different versions of the ranking criterion are proposed and analyzed. The performance of the algorithm is tested on different subsets of the large and well-known PubMed biomedical document database. The experimental results indicate that the proposed parallel algorithm finds high-quality results in a reasonable time. Compared with the sequential variant, the parallel algorithm is more efficient, since it finds high-quality solutions in significantly less time.
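The three phases can be illustrated with a minimal single-process sketch. It is not the implementation evaluated in the paper: the abstract does not give the exact ranking criteria, so the score used here (the tf-idf weight plus a constant `alpha` for every query word that occurs in the document) is an assumed form chosen only to show how a tf-idf index and an indicator function might be combined. The stop-word list, the function names, and the parameter `alpha` are hypothetical, lemmatization is omitted, and the map and reduce steps are plain Python functions rather than calls to a MapReduce framework.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "for"}


def preprocess(text):
    """Phase 1 (simplified): lowercase, split on whitespace, drop stop words.
    Lemmatization is omitted in this sketch."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def map_doc_frequency(words):
    """Map step: emit (word, 1) once per distinct word of one document."""
    return [(w, 1) for w in set(words)]


def reduce_doc_frequency(pairs):
    """Reduce step: sum the emitted counts to get document frequencies."""
    df = defaultdict(int)
    for word, count in pairs:
        df[word] += count
    return dict(df)


def represent(words, df, n_docs):
    """Phase 2: represent one document as a list of (word, tf-idf) pairs.
    Assumed variant: tf = count / document length, idf = ln(N / df)."""
    counts = Counter(words)
    total = len(words)
    return [(w, (c / total) * math.log(n_docs / df[w])) for w, c in counts.items()]


def score(query_words, representation, alpha=1.0):
    """Phase 3: assumed ranking criterion. Each query word found in the
    document contributes its tf-idf weight plus alpha (indicator term)."""
    weights = dict(representation)
    return sum(weights[w] + alpha for w in query_words if w in weights)


def search(docs, query, alpha=1.0):
    """Run all three phases and return matching document ids, best first."""
    processed = {doc_id: preprocess(text) for doc_id, text in docs.items()}
    pairs = [p for words in processed.values() for p in map_doc_frequency(words)]
    df = reduce_doc_frequency(pairs)
    reps = {d: represent(words, df, len(docs)) for d, words in processed.items()}
    q = preprocess(query)
    ranked = sorted(((score(q, rep, alpha), d) for d, rep in reps.items()), reverse=True)
    return [d for s, d in ranked if s > 0]
```

In this sketch the indicator term rewards documents that cover more of the query words, while the tf-idf term separates documents with equal coverage. In a real MapReduce deployment the per-document work in `preprocess`, `represent`, and `score` would be distributed across mappers.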