A look at the prefixes and suffixes truncated by light-10 readily shows that the removed affixes are focused on nouns rather than verbs. To sum up this part, the literature concludes that light stemming approaches are better than heavy stemming techniques. Unlike root-based techniques, which employ Arabic rules to extract roots, light stemmers attempt to remove the most frequent prefixes and suffixes, with or without using an Arabic corpus [15] [16] [17] [18]. verbs should be distinguished from nouns. The first run, called Light 10, uses the light 10 stemmer and serves as a baseline against which the other two experiments are compared. Table 3 shows the average precision obtained for the three runs, while Figure 2 compares the three curves of the average precision at 11 recall points over the 75 queries for the three experiments. So when a word is to be stemmed, a probability is computed for each possible combination of (prefix, suffix and stem template), and the stem with the highest probability is chosen. In this paper, the Stanford Arabic POS Tagger has been used [33]. As shown in the figure, both proposed stemmers (ExtendedS and LingStem) are consistently better than light 10. During stemming, the algorithm uses those corpus statistics to choose the most appropriate stem. Several code snippets were written for this purpose and to determine which prefixes and suffixes could mitigate some of the drawbacks of the best-known light stemming approach. The first stemmer is an extended version of light 10, the best-known stemmer for Arabic. It should be noted, however, that the proposed Extended-Light stemmer described above can be used on its own or together with this second proposed linguistic stemmer. Both stemmers have been widely used in Arabic IR.
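The average precision at 11 recall points mentioned above is conventionally computed as interpolated precision at the recall levels 0.0, 0.1, ..., 1.0. The following is a minimal sketch of that standard calculation for a single query, not the evaluation code used in the experiments. The function name and the boolean-list input format are assumptions made for illustration.

```python
def eleven_point_precision(ranked_relevance, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query.

    ranked_relevance: list of booleans, one per retrieved document in rank
                      order (True means the document is relevant).
    total_relevant:   number of relevant documents that exist for the query.
    """
    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a level is the best precision observed at
    # any recall greater than or equal to that level (0.0 if none exists).
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Hypothetical ranking: relevant, non-relevant, relevant; two relevant in total.
curve = eleven_point_precision([True, False, True], total_relevant=2)
```

Averaging such curves over all queries of a run gives one curve per run, which is the kind of comparison the figure describes.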
triconsonantal root, which is not a word in itself but contains the idea or A similar technique was also used by Hmeidi et al. [27], who tested both the Dice and Manhattan coefficients for measuring the similarity between bi-grams of words in documents and queries. Thus, we believe that the process of stemming should not depend on a single approach. Standard Arabic and the modern dialects use different strategies to form the The second experiment tested the proposed Extended-Light stemmer alone, as described earlier in the paper. This technique is different from the one used by Ababneh et al. Morphological analysis that classifies words into POS tags could be used to determine which technique to apply. It is typically found in newspapers and includes many technical terms that do not originate in the language. Thus, one of the major aims of the proposed Extended-Light stemmer is to cover some of the prefixes and suffixes of both verbs and nouns that light-10 neglects. Modern Standard Arabic has ten commonly used forms. For example, the word (meaning: we will surely support them) can be decomposed as follows: (antefix: , prefix: , root: , suffix: and postfix: ). As a result of this phenomenon, both light 10 and the proposed Extended-Light stemmer fail to group many verbs that have a single meaning into the same cluster, while the proposed linguistic stemmer does. The experiments reported in that work concluded that stemming in this way outperformed the use of full words. The third experiment, called LingStem, was conducted to show the impact of the proposed linguistic stemmer, in which nouns are stemmed differently from verbs. However, each of the two paradigms has its pros and cons. Consequently, the latter problem on its own affects performance, as it reduces the likelihood of matching posted queries with indexed documents.
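To make the bi-gram similarity idea above concrete, here is a minimal sketch of the Dice coefficient computed over character bi-grams of two words. It is not the implementation from [27]: treating the bi-grams of a word as a set (ignoring repeats) and the helper names are assumptions, and the Manhattan coefficient is not shown.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(word_a, word_b):
    """Dice similarity: 2 * |A & B| / (|A| + |B|) over the bi-gram sets."""
    a, b = char_bigrams(word_a), char_bigrams(word_b)
    if not a and not b:
        # Words shorter than two characters have no bi-grams to compare.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Latin-script toy example: "night" and "nacht" share only the bi-gram "ht".
similarity = dice_coefficient("night", "nacht")
```

Words whose similarity exceeds a chosen threshold can then be treated as variants of the same term. The threshold itself would have to be tuned and is not specified here.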
antefixes, prefixes, suffixes) to these roots, resulting in a large number of possible words in Arabic for each root. However, a considerable number of published papers concluded that stem-based methods are much more effective than root-based ones. For example, a word like (meaning: the proper noun Kamil), which matches the pattern , will be preserved in Ababneh's study as it matches the entry in the stated list. and in the examples above will be stemmed as and , in which only the definite article was removed. The Khoja and Garside [9] and Buckwalter [10] stemmers are examples of heavy stemming algorithms, which attempt to extract the root of the input Arabic word. This feature of light stemming approaches is important for Arabic nouns, which represent a very large share of Arabic words. [22] in two points. This may be the major reason why only small text collections were used to evaluate the approaches in both Al-Shammari and Lin [28] and Mansour et al. as a template to discuss word formation. Examples also include the combination of the letter (FAA) and the letter (BAA) to form as in (meaning: By the motherland). Motivated by the results reported in the literature, in which light stemming is the best-known approach for Arabic, the first approach is a new light stemmer built on top of the best light stemmer reported in the Arabic literature. Examples include the Dice coefficient and Mutual Information, among others. The prefixes and are also not included in light 10 although they are often attached to nouns. From that perspective, the proposed stemmer is made up of several clues/subcomponents, each of which plays a specific role in the process. Table 3. However, unlike the popular Semitic languages, words are often written in a cursive (non-concatenative), rather than discontinuous, longhand style [2] but with spaces to delimit words from each other.
To extract such a strong feature, different similarity and association measures have been used. First, each word is matched against predefined lists of noun and verb patterns. This is totally different from the assumption behind light 10, which states that stemmed words may consist of 2 or 3 letters under certain conditions. One explanation is that the majority of Arabic words cannot be identified from the preceding words alone. In particular, Arabic titles with their descriptions were used as queries. Because some Arabic characters have orthographic variants, letter normalization often maps the different forms of a letter to a single Unicode representation. For instance, the suffixes in the Extended-Light stemmer include the letter (pronounced as TAA), which is used in Arabic for singular first person masculine, second person masculine and third person feminine, as in (meanings can be: I played, you played and she played). Heavy stemming always tries to extract the stem or the root from the input word. The new stemmers are compared to the best reported light stemmer, light-10. This can be achieved by first matching the word against the prefixes listed in the defined set. An Arabic suffix may also include postfixes, which are used to indicate pronouns (i.e. dialects generally preserve the bulk of these forms, but may lack some of them If the word is classified as a verb, a root-based stemmer, specifically Khoja, is employed. The technique is based on some Arabic morphological rules and syntactic knowledge. Diacritics and vowels are usually omitted in Arabic script. The main rhyme in Arabic, as previously illustrated, is the pattern (f--l), which preserves f, and l in the same order. The varieties of Arabic differ in gender is differentiated, yielding paradigms of 13 forms.
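The letter normalization mentioned above can be sketched as follows. The text does not list which letters are unified, so the mapping below is an assumption based on choices commonly made in Arabic light stemmers: hamzated and madda forms of alef are mapped to bare alef, alef maqsura to yeh, teh marbuta to heh, and diacritics and tatweel are dropped. Unlike some stemmers, this sketch applies the mapping at every position rather than only word-finally.

```python
import re

# Assumed normalization table (not taken from the paper).
_LETTER_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})

# Short vowels, tanween, shadda and sukun (U+064B through U+0652).
_DIACRITICS = re.compile("[\u064B-\u0652]")

_TATWEEL = "\u0640"  # elongation character, carries no meaning


def normalize_arabic(text):
    """Strip diacritics and tatweel, then unify variant letter forms."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_LETTER_MAP)
```

Applying the same function to both documents and queries before stemming keeps the two sides consistent, whichever mapping is chosen.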
Mustafa and Al-Radaideh [25] reported that using di-grams is better than using tri-grams, but that the richness of the Arabic language makes such an approach a poor option for indexing. their use of particles. Since many words can be formed from a single root or pattern, the opposite process, known as stemming, is the task of reducing all the variant forms of a word to a single form known as the stem. During the same phase, the words preceding the word being processed are also used to distinguish verbs from nouns, under the hypothesis that some words, especially imperfect verbs (like , ) and stop words, precede nouns, e.g. It was also noticed that the strippable prefixes and suffixes in light 10 are mainly focused on nouns. First, that study used only a single list of patterns for all words. Buckwalter relies on stem tables that include prefixes, possible stems and suffixes. It should be noted that the prefix can also be used with nouns. The answer to this question lies in the proposed heuristic rules, which remove only certain prefixes and suffixes under certain conditions. Khoja relies on stored predefined patterns and lists of roots. MSA is also used in official speech and communication, and it is the formal language of the media and education across the Arab world. During stemming, some rules are applied based on these dictionaries to extract roots. to create a wider range of tenses. Their techniques for removing prefixes and suffixes and their stemmers are based on very restricted Arabic rules. Classical Arabic was the language of early Arabic-speaking people, e.g. However, although a similar approach has been implemented in a small number of studies, our work is different. Following this, a new linguistic stemmer has also been proposed. The work presented above shows the importance of stemming for morphologically rich languages such as Arabic.
This work differs from the best-known approach to stemming in two respects. Traditionally, Arabic grammarians have used the root f--l 'do' In their CLIR (Cross-Language Information Retrieval) experiments, Darwish and Oard [15] implemented a brute-force removal of the most common suffixes and prefixes, with no particular rules governing when these affixes are removed. The authors claimed that 95% accuracy was achieved, but only a few words were chosen to test the algorithm.
Let us consider the following nouns: , , , and (meanings respectively: the republic of the Sudan, the festival, spatial, the US leader Barack Obama). On the other hand, stemming may erroneously group words with different meanings and concepts into a single stem. For instance, consider the words (meaning: cell-phone), (meaning: path), and (meaning: pleasure). [29]. In fact, light stemming approaches are dominant among the existing approaches for stemming Arabic. All experiments were conducted using the Lucene IR system with the Okapi IBM BM25 weighting. For instance, a word like (meaning: Table 1. template taCaaCaC, and verbs of this form often have the meaning of a reciprocal It is the native language of more than four hundred million people [1], concentrated in the Arab region, which includes North African and Middle Eastern countries.
[22] proposes matching each word against a set of predefined Arabic patterns. On the other hand, light stemming attempts to stem the input word lightly by stripping off affixes (i.e. The researchers used a TREC (Text Retrieval Conference) corpus to decompose each word into its possible stems. Accordingly, if a word is tagged as a verb, it is stemmed using Khoja, which is a root-based stemmer. Shalabi et al. [14], in their root-based stemmer, proposed extracting roots and patterns based on the positions of excessive letters. Although this argument suggests a good feature that can be credited to the light 10 stemmer, it is also the major reason for the under-stemming problem, in which words with the same meaning may be clustered into different groups, as described earlier. known as a "form" or "measure". Step 2: the algorithm truncates the prefixes. Nevertheless, Xu et al. [26] reported contradictory results, in which tri-grams were found to be better than bi-grams. For example, tags like NN (produced by the tagger for nouns), DTNN (for a noun with an attached definite article) and PTNNS (for plural nouns with an attached definite article) are all collapsed into a single tag, noun, while the different categories of verbs, like VB (for the surface form of verbs) and VBG (for present verbs), are collapsed into a single tag called verb. Three official runs were conducted. The next subsection describes the second technique that is also proposed in this paper for stemming Arabic words. Parallel corpora, which contain several monolingual sub-collections in different languages, have also been explored [18]. On the other hand, for a word like (meaning: for one hour), the letter will be eliminated as the number of the remaining letters, after removing the letter , is greater than 3.
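The tag collapsing and the verb/noun dispatch described above can be sketched as follows. Only the tags named in the text are listed, spelled as they appear there; a real tagger produces more tags than these. The stemmers are passed in as plain functions because neither Khoja nor the Extended-Light stemmer is reproduced here, and leaving words with any other tag unchanged is an assumption of this sketch.

```python
# Fine-grained tags named in the text, spelled as they appear there.
NOUN_TAGS = {"NN", "DTNN", "PTNNS"}
VERB_TAGS = {"VB", "VBG"}


def collapse_tag(tag):
    """Map a fine-grained POS tag to 'noun', 'verb' or 'other'."""
    if tag in NOUN_TAGS:
        return "noun"
    if tag in VERB_TAGS:
        return "verb"
    return "other"


def stem_by_pos(word, tag, noun_stemmer, verb_stemmer):
    """Route a tagged word to the stemmer that matches its coarse class."""
    coarse = collapse_tag(tag)
    if coarse == "verb":
        return verb_stemmer(word)  # root-based stemmer in the text (Khoja)
    if coarse == "noun":
        return noun_stemmer(word)  # light stemmer in the text
    return word                    # assumption: other words stay untouched


# Stand-in stemmers so the routing can be observed without real ones.
def fake_noun_stemmer(word):
    return "N:" + word


def fake_verb_stemmer(word):
    return "V:" + word


routed = stem_by_pos("w1", "DTNN", fake_noun_stemmer, fake_verb_stemmer)
```

Keeping the routing separate from the stemmers makes it possible to swap either stemmer without touching the tag handling.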
First, we carried out an in-depth analysis to identify the types of problems that may occur when using the light 10 stemmer, beyond those already discussed. In the experiments, the Arabic topics were used. Table 2 shows some examples of Arabic words that were stemmed with both the proposed Extended-Light and light 10 stemmers. Using only a single list rests on an invalid assumption, because many verbs and nouns may share the same pattern. The major difficulty here is that, when a prefix or a suffix is found, the decision to remove it should be taken only after applying some rules, in order to avoid removing an affix that is actually part of the word being stemmed. This happens when a word does not match any of the rhymed patterns or has entries in both the noun and verb pattern lists. Developing Two Different Novel Techniques for Arabic Text Stemming. Arabicization is the process of writing words from other languages in Arabic letters, e.g. The major pattern, from which the majority of Arabic words are derived, is the pattern (transliterated as f--l), which corresponds to triliteral roots. This relatively worse performance is caused by the fact that light 10 clusters words with the same meaning (those that are semantically related to each other) into different conflation classes, although Arabic derives many words from a single stem or a single verb. In the proposed stemmer, prefixes are definite articles, conjunctions (prepositions, clitics and clitics attached to prepositions) or some letters that are often added to verbs, such as , according to additive verb rules in Arabic. The imperfective conjugation of Standard Arabic has a system As in many other Semitic languages, Arabic verb formation is based on a (usually) An explanation for this behavior in light 10 is that the stemmer avoids removing letters like since many proper nouns begin with this letter. , (the Arabic letter ).
This is done after removing the longest prefixes and suffixes that match. To achieve this goal, the proposed linguistic stemmer is a combined approach that couples an analysis of the words to be stemmed with the proposed Extended-Light stemmer. When part of the word being stemmed matches a suffix, the algorithm removes that suffix. Due to this classification, the term Arabic refers to both MSA and dialectal Arabic [3] [5]. This could reduce the need for a POS tagger. On the one hand, using the proposed Extended-Light stemmer reduces the impact of the under-stemming problem, because the stemmer has been extended to include more prefixes and suffixes that are not covered by light 10. In the same study, the authors also developed a light stemmer that strips off the most common prefixes and suffixes, with removal based on some heuristic rules. Artificial intelligence techniques such as Genetic Algorithms (GA) [30] and Back-Propagation Neural Networks (BPNN) with multi-class classification [31] have also been investigated. Lucene is an experimental information retrieval system that has been extensively used in previous editions of the CLEF, NTCIR and TREC joint evaluation experiments. Each of these templates is associated with a As a result of this analysis, it was concluded that the set of affixes defined in the light 10 stemmer is not sufficient to achieve the best stemming performance. Such affixes enable the stemmer to group a variety of verbs and/or words into the same conflation class, unlike light 10, which always suffers from the under-stemming problem. At the end of this step, words are clustered into two different classes: verbs and nouns. It contains 383,872 documents compiled from the Agence France Presse (AFP) Arabic Newswire from 1994 to 2000. They also attempt to handle arabicized words. Results showed that the light stemmer is better than the clustering-based stemmer.
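The removal of the longest matching prefix and suffix, guarded by a minimum remaining length, can be sketched as follows. This is not the Extended-Light stemmer itself: its affix lists and rules are not reproduced in this section, so the lists below are illustrative assumptions written in a Latin transliteration, and the single `min_stem` threshold stands in for the paper's condition-specific rules.

```python
def strip_longest_affixes(word, prefixes, suffixes, min_stem=3):
    """Remove the longest matching prefix, then the longest matching suffix.

    An affix is removed only if at least `min_stem` letters remain, so that
    letters belonging to a short word are not stripped by mistake.
    """
    # Try longer prefixes first; stop at the first one that can be removed.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    # Same procedure for suffixes, applied to what is left of the word.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


# Illustrative affix lists in Latin transliteration (assumed, not the paper's).
PREFIXES = ["wAl", "bAl", "kAl", "fAl", "Al", "ll", "w"]
SUFFIXES = ["hA", "An", "At", "wn", "yn", "yh", "yp", "h", "p", "y"]

stem = strip_longest_affixes("wAlktAb", PREFIXES, SUFFIXES)
```

With these assumed lists, a three-letter word that merely begins with a prefix letter is left intact because fewer than `min_stem` letters would remain, which is the kind of safeguard the rules above are meant to provide.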
On the other hand, since only nouns (not every word, as in light 10) will be stemmed by the proposed Extended-Light stemmer, the effect of under-stemming is also reduced, as the problem originates from stemming verbs into different clusters. For instance, from the three-consonant triliteral root (meaning: to farm), several words can be formed, such as: (meaning: farmed), (meaning: farmer), (for singular feminine in nominative, accusative and genitive cases), (for dual masculine in nominative case), (meaning: farm), etc. On the other hand, linguistic rules often take the input word and attempt to remove its prefixes and suffixes after matching them against a pre-stored list of affixes. The most widely used of the set is light 10; the stemmers differ from one another in the total number of prefixes and suffixes to be removed. The three curves of the average precision at 11 recall points of the 75 queries. (2019) Developing Two Different Novel Techniques for Arabic Text Stemming. This would help in deciding which stemming approach to use. However, the two major approaches are heavy stemming (also known as root-based stemming) and light stemming. This experimental run was called ExtendedS. The problem, however, may be even worse when such a word is erroneously grouped with the verb (meaning: to stuff). Document texts were tokenized on white space and punctuation marks. Second, further analysis of light 10 shows that Arabic verbs are partially ignored in its lists of prefixes and suffixes. These two key sets of patterns differ, as the patterns used for nouns in Arabic are not the same as those used for verbs. The modern For instance, a word like (meaning: Sudanese) has been formed by adding the definite prefix () and the plural masculine suffix (), resulting in . The next section describes the proposed stemmers in more detail.