Lemmatization and selection of specific types
Important prerequisites: download the file at the following link, create a directory, and decompress the file's contents into it.
This example creates a data set in which each record holds one of the distinct words, together with its lemma, found in documents that have already been parsed and written to an ADaMSoft data set.
The next step (dataset dict=lemmas out=selected_lemmas) selects only the words whose lemmas are of type NOM (nouns), ADJ (adjectives), VER (verbs), ADV (adverbs) or CON (conjunctions).
Moreover, each selected word is replaced with its lemma and, when a verb is in its conditional form, an additional record is written to carry that information.
To run this example, paste the following statements into the Command area and press the EXECUTE button. To view the resulting data set, go to the PATH tab and open the one named Selected_Lemmas.
define directory_treetagger=WRITE HERE THE PATH OF THE DIRECTORY WHERE THE PREREQUISITE FILES WERE EXTRACTED, BY USING THE CHARACTER "/" INSTEAD OF "\";
PROC WORDTAGGER DICT=Documents_words OUT=Lemmas;
var word;
TREETAGGEREXE &directory_treetagger/tree-tagger.exe;
PARAMETERFILE &directory_treetagger/italian-utf8.par;
file_encoding utf8;
RUN;
dataset dict=Lemmas out=Selected_Lemmas;
if (lemma.equals("
boolean write_record=false;
if (type.startsWith("NOM")) write_record=true;
if (type.startsWith("ADJ")) write_record=true;
if (type.startsWith("VER")) write_record=true;
if (type.startsWith("ADV")) write_record=true;
if (type.startsWith("CON")) write_record=true;
if (!lemma.equalsIgnoreCase("<unknown>")) write_record=true;
if (write_record) output;
if (type.startsWith("VER:COND"))
{
lemma="conditional_verb";
output;
};
run;
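The selection rule described in the text can also be tried outside ADaMSoft. The following is a minimal sketch in plain Java, not the ADaMSoft implementation: the class name LemmaSelector, the methods isSelected and select, and the sample lemmas and tags are all hypothetical and exist only for illustration. It keeps a lemma when its type starts with one of the five listed prefixes and adds a "conditional_verb" record for types starting with VER:COND. How the script treats lemmas equal to "<unknown>" is not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the selection rule described in the text.
final class LemmaSelector {

    // Type prefixes kept by the selection step, as listed in the text.
    static final List<String> KEPT_PREFIXES =
            Arrays.asList("NOM", "ADJ", "VER", "ADV", "CON");

    // True when the type tag starts with one of the kept prefixes.
    static boolean isSelected(String type) {
        for (String prefix : KEPT_PREFIXES) {
            if (type.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Returns the lemmas that would be written for one tagged word:
    // the lemma itself when the type is selected, followed by the marker
    // "conditional_verb" when the type denotes a conditional verb form.
    static List<String> select(String lemma, String type) {
        List<String> records = new ArrayList<>();
        if (isSelected(type)) {
            records.add(lemma);
        }
        if (type.startsWith("VER:COND")) {
            records.add("conditional_verb");
        }
        return records;
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration only.
        System.out.println(select("casa", "NOM"));        // [casa]
        System.out.println(select("andare", "VER:COND")); // [andare, conditional_verb]
        System.out.println(select("il", "DET"));          // []
    }
}
```

A conditional verb therefore produces two records, one with the verb's lemma and one with the marker, which matches the two output statements reached in the dataset step for such a word.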