Introduction and Definitions
-
- Text Normalization:
- Every NLP process starts with a task called Text Normalization.
- Text Normalization is the process of transforming text into a single canonical form that it might not have had before.
- Importance: Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it.
- Steps:
- Segmenting/Tokenizing words in running text.
- Normalizing word formats.
- Segmenting sentences in running text.
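A minimal sketch of the first two steps in Python; the regex tokenizer and the tiny lemma table are illustrative assumptions, not standard resources (sentence segmentation, step 3, has its own section below):

```python
import re

# Toy lemma table; a real system would use a dictionary-backed lemmatizer.
LEMMAS = {"cats": "cat", "ran": "run"}

def normalize(text):
    # Step 1: segment/tokenize words in running text (naive regex tokenizer).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Step 2: normalize word formats (case-folding plus lemma lookup).
    return [LEMMAS.get(token.lower(), token.lower()) for token in tokens]

print(normalize("The cats ran home."))  # ['the', 'cat', 'run', 'home', '.']
```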
-
- Methods for Normalization:
-
- Case-Folding: reducing all letters to lower case.
Possibly with the exception of capital letters mid-sentence (e.g. “US” vs. “us”).
- Lemmatization: reducing inflections or variant forms to base form.
Basically, finding the correct dictionary headword form.
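A short demonstration of both methods; the lemmatization lines assume `nltk` is installed and its WordNet data has been downloaded:

```python
from nltk.stem import WordNetLemmatizer  # assumes nltk + WordNet data

lemmatizer = WordNetLemmatizer()

# Case-folding: reduce all letters to lower case.
print("The Cats Were Running".lower())           # the cats were running

# Lemmatization: reduce variant forms to the dictionary headword,
# given the right part of speech ("n" = noun, "v" = verb).
print(lemmatizer.lemmatize("cats", pos="n"))     # cat
print(lemmatizer.lemmatize("running", pos="v"))  # run
```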
-
- Morphology:
- The study of words, how they are formed, and their relationship to other words in the same language.
-
- Morphemes: the small meaningful units that make up words.
- Stems: the core meaning-bearing units of words.
- Affixes: the bits and pieces that adhere to stems (often with grammatical functions).
-
- Word Equivalence in NLP:
- Two words have the same
- Lemma, if they have the same:
- Stem
- POS
- Rough Word-Sense
cat & cats -> same Lemma
- Wordform, if they have the same:
- full inflected surface form
cat & cats -> different wordforms
-
- Types and Tokens:
-
- Type: an element of the vocabulary.
It is the class of all tokens containing the same character sequence.
- Token: an instance of that type in running text.
It is an instance of a sequence of characters that are grouped together.
-
- Notation:
-
- N = Number of Tokens.
- V = Vocabulary = set of Types.
- \(\|V\|\) = size/cardinality of the vocabulary.
-
- Growth of the Vocabulary:
- Church and Gale (1990) suggested that the size of the vocabulary grows larger than the square root of the number of tokens in a piece of text:
- \[\|V\| > \mathcal{O}(N^{1/2})\]
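A small sketch computing these quantities; whitespace tokenization is an assumed simplification:

```python
import math

text = "the cat sat on the mat and the dog sat too"

tokens = text.split()   # naive whitespace tokenization
types = set(tokens)     # V: the set of distinct types

N = len(tokens)         # N = number of tokens
print(N, len(types), math.sqrt(N))
# 11 8 3.316...  -> |V| = 8 exceeds sqrt(N), consistent with the bound above
```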
Tokenization
-
- Tokenization:
- It is the task of chopping up a given character sequence, within a defined document unit, into pieces, called tokens.
It may involve throwing away certain characters, such as punctuation.
-
- Methods for Tokenization:
-
- Regular Expressions
- A Flag: specific sequences of characters.
- Delimiters: specific separating characters.
- Dictionary: explicit definitions given by a dictionary.
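A sketch of delimiter- and regex-based tokenization; the pattern below is an illustrative assumption, not a standard tokenizer:

```python
import re

text = "Mr. O'Neill paid $12.50, didn't he?"

# Delimiter-based: split on whitespace only (punctuation stays attached).
print(text.split())
# ['Mr.', "O'Neill", 'paid', '$12.50,', "didn't", 'he?']

# Regex-based: keep prices and word-internal apostrophes together,
# split other punctuation off as separate tokens.
pattern = r"\$?\d+(?:\.\d+)?|\w+(?:['.]\w+)*|[^\w\s]"
print(re.findall(pattern, text))
# ['Mr', '.', "O'Neill", 'paid', '$12.50', ',', "didn't", 'he', '?']
```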
-
- Categorization:
- Tokens are categorized by:
- Character Content
- Context
within a data stream.
- Categories:
- Identifiers: names the programmer chooses.
- Keywords: names already defined in the programming language.
- Operators: symbols that operate on arguments and produce results.
- Grouping Symbols
- Data Types
- Categories are used for post-processing of the tokens either by the parser or by other functions in the program.
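A minimal sketch of such categorization for a toy expression language; the categories and patterns below are illustrative assumptions:

```python
import re

# Category -> regex, tried in order; keywords are matched before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("GROUPING",   r"[(){}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(code):
    # Yield (category, lexeme) pairs for the parser to consume.
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(lex("if (x > 10) return x + 1")))
# [('KEYWORD', 'if'), ('GROUPING', '('), ('IDENTIFIER', 'x'), ...]
```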
Word-Normalization (Stemming)
-
- Stemming:
- is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
- The stem need not map to a valid root in the language.
-
Basically, stemming is a crude chopping of affixes.
-
Example: “automate”, “automatic”, “automation” -> “automat”.
-
- Porter’s Algorithm:
- The most common English stemmer.
- It is an iterated series of simple replace rules.
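A quick demonstration using NLTK's implementation of the Porter stemmer (assumes `nltk` is installed):

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["running", "ponies", "automate", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
# e.g. "running" -> "run"; note that stems like "poni" need not be valid words
```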
-
- Algorithms:
-
- The Production Technique: the lookup table used by a naive stemmer is produced semi-automatically.
- Suffix-Stripping Algorithms: these algorithms avoid lookup tables; instead, they use a small list of rules to find the root forms of word forms (see the sketch after this list).
- Lemmatization Algorithms: the lemmatization process starts by determining the part of speech of a word and then applies different normalization rules for each part of speech.
- Stochastic Algorithms: these algorithms are trained on a table of root-form-to-inflected-form relations to develop a probabilistic model.
The model looks like a set of rules, similar to the suffix-stripping list of rules.
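A minimal sketch of the suffix-stripping idea; the rule list below is an illustrative assumption, far cruder than Porter's:

```python
# Ordered (suffix, replacement) rules; the first match wins. Illustrative only.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def strip_suffix(word):
    for suffix, replacement in RULES:
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "running", "cats"]:
    print(w, "->", strip_suffix(w))
# caresses -> caress, ponies -> poni, running -> runn, cats -> cat
```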
Sentence Segmentation
-
- Sentence Segmentation:
- It is the problem of dividing a piece of text into its component sentences.
-
- Identifiers:
- Identifiers such as “!”, “?” are unambiguous; they usually signify the end of a sentence.
- The period “.” is quite ambiguous, since it can be used in other ways, such as in abbreviations and in decimal number notation.
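A naive splitter that treats every “.”, “!”, or “?” followed by whitespace as a boundary illustrates the ambiguity (an assumed toy example):

```python
import re

text = "Dr. Smith earned 4.5 points. Amazing!"

# Naive: split after any sentence-final mark followed by whitespace.
print(re.split(r"(?<=[.!?])\s+", text))
# ['Dr.', 'Smith earned 4.5 points.', 'Amazing!']
# The decimal in "4.5" survives, but the abbreviation "Dr." is wrongly split.
```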
-
- Dealing with Ambiguous Identifiers:
- One way of dealing with ambiguous identifiers is by building a Binary Classifier.
On a given occurrence of a period, the classifier has to decide between “Yes, this is the end of a sentence” and “No, this is not the end of a sentence”.
- Types of Classifiers:
- Decision Trees
- Logistic Regression
- SVM
- Neural-Net
- Decision Trees are a common classifier used for this problem (a hand-rolled sketch follows).
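A hand-written, decision-tree-style sketch for the period case; the abbreviation list, features, and function name are illustrative assumptions:

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}  # assumed list

def is_sentence_end(prev_token, next_token):
    """Decide whether the period ending prev_token closes a sentence."""
    # Node 1: periods inside known abbreviations rarely end sentences.
    if prev_token.lower() in ABBREVIATIONS:
        return False
    # Node 2: a decimal number like "4.5" is one token, not a boundary.
    if prev_token.replace(".", "").isdigit():
        return False
    # Node 3: a following capitalized word suggests a new sentence.
    return next_token[:1].isupper()

print(is_sentence_end("Dr.", "Smith"))        # False
print(is_sentence_end("points.", "Amazing"))  # True
```

A trained Decision Tree, Logistic Regression, SVM, or neural classifier would learn such tests and thresholds from labeled data instead of hand-coding them.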