Introduction and Definitions
-
- Text Normalization:
- Every NLP process starts with a task called Text Normalization.
- Text Normalization is the process of transforming text into a single canonical form that it might not have had before.
- Importance: Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it.
- Steps:
- Segmenting/Tokenizing words in running text.
- Normalizing word formats.
- Segmenting sentences in running text.
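A minimal sketch of the first two steps in Python; the regex tokenizer and the tiny lemma table are illustrative assumptions, not standard resources (sentence segmentation, step 3, has its own section below):

```python
import re

# Toy lemma table; a real system would use a dictionary-backed lemmatizer.
LEMMAS = {"cats": "cat", "ran": "run"}

def normalize(text):
    # Step 1: segment/tokenize words in running text (naive regex tokenizer).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Step 2: normalize word formats (case-folding plus lemma lookup).
    return [LEMMAS.get(token.lower(), token.lower()) for token in tokens]

print(normalize("The cats ran home."))  # ['the', 'cat', 'run', 'home', '.']
```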
-
- Methods for Normalization:
-
- Case-Folding: reducing all letters to lower case.
Possibly with the exception of capital letters mid-sentence (e.g. “US” vs. “us”).
- Lemmatization: reducing inflections or variant forms to base form.
Basically, finding the correct dictionary headword form.
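A short demonstration of both methods; the lemmatization lines assume `nltk` is installed and its WordNet data has been downloaded:

```python
from nltk.stem import WordNetLemmatizer  # assumes nltk + WordNet data

lemmatizer = WordNetLemmatizer()

# Case-folding: reduce all letters to lower case.
print("The Cats Were Running".lower())           # the cats were running

# Lemmatization: reduce variant forms to the dictionary headword,
# given the right part of speech ("n" = noun, "v" = verb).
print(lemmatizer.lemmatize("cats", pos="n"))     # cat
print(lemmatizer.lemmatize("running", pos="v"))  # run
```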
-
- Morphology:
- The study of words, how they are formed, and their relationship to other words in the same language.
-
- Morphemes: the small meaningful units that make up words.
- Stems: the core meaning-bearing units of words.
- Affixes: the bits and pieces that adhere to stems (often with grammatical functions).
-
- Word Equivalence in NLP:
- Two words have the same
- Lemma, if they have the same:
- Stem
- POS
- Rough Word-Sense
cat & cats -> same Lemma
- Wordform, if they have the same:
- full inflected surface form
cat & cats -> different wordforms
-
- Types and Tokens:
-
- Type: an element of the vocabulary.
It is the class of all tokens containing the same character sequence.
- Token: an instance of that type in running text.
It is an instance of a sequence of characters that are grouped together.
-
- Notation:
-
- N = Number of Tokens.
- V = Vocabulary = set of Types.
- \(\|V\|\) = size/cardinality of the vocabulary.
-
- Growth of the Vocabulary:
- Church and Gale (1990) suggested that the size of the vocabulary grows larger than the square root of the number of tokens in a piece of text:
- \[\|V\| > \mathcal{O}(N^{1/2})\]
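A small sketch computing these quantities; whitespace tokenization is an assumed simplification:

```python
import math

text = "the cat sat on the mat and the dog sat too"

tokens = text.split()   # naive whitespace tokenization
types = set(tokens)     # V: the set of distinct types

N = len(tokens)         # N = number of tokens
print(N, len(types), math.sqrt(N))
# 11 8 3.316...  -> |V| = 8 exceeds sqrt(N), consistent with the bound above
```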
Tokenization
-
- Tokenization:
- It is the task of chopping up a given character sequence, within a defined document unit, into pieces, called tokens.
It may involve throwing away certain characters, such as punctuation.
-
- Methods for Tokenization:
-
- Regular Expressions
- A Flag: specific sequences of characters.
- Delimiters: specific separating characters.
- Dictionary: explicit definitions given by a dictionary.
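A sketch of delimiter- and regex-based tokenization; the pattern below is an illustrative assumption, not a standard tokenizer:

```python
import re

text = "Mr. O'Neill paid $12.50, didn't he?"

# Delimiter-based: split on whitespace only (punctuation stays attached).
print(text.split())
# ['Mr.', "O'Neill", 'paid', '$12.50,', "didn't", 'he?']

# Regex-based: keep prices and word-internal apostrophes together,
# split other punctuation off as separate tokens.
pattern = r"\$?\d+(?:\.\d+)?|\w+(?:['.]\w+)*|[^\w\s]"
print(re.findall(pattern, text))
# ['Mr', '.', "O'Neill", 'paid', '$12.50', ',', "didn't", 'he', '?']
```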
-
- Categorization:
- Tokens are categorized by:
- Character Content
- Context
within a data stream.
- Categories:
- Identifiers: names the programmer chooses.
- Keywords: names already defined in the programming language.
- Operators: symbols that operate on arguments and produce results.
- Grouping Symbols
- Data Types
- Categories are used for post-processing of the tokens either by the parser or by other functions in the program.
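A minimal sketch of such categorization for a toy expression language; the categories and patterns below are illustrative assumptions:

```python
import re

# Category -> regex, tried in order; keywords are matched before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("GROUPING",   r"[(){}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(code):
    # Yield (category, lexeme) pairs for the parser to consume.
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(lex("if (x > 10) return x + 1")))
# [('KEYWORD', 'if'), ('GROUPING', '('), ('IDENTIFIER', 'x'), ...]
```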
Word-Normalization (Stemming)
-
- Stemming:
- is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
- The stem need not map to a valid root in the language.
-
Basically, stemming is a crude chopping of affixes.
-
Example: “automate”, “automatic”, “automation” -> “automat”.
-
- Porter’s Algorithm:
- The most common English stemmer.
- It is an iterated series of simple replace rules.
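A quick demonstration using NLTK's implementation of the Porter stemmer (assumes `nltk` is installed):

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["running", "ponies", "automate", "automatic", "automation"]:
    print(word, "->", stemmer.stem(word))
# e.g. "running" -> "run"; note that stems like "poni" need not be valid words
```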
-
- Algorithms:
-
- The Production Technique: the lookup table used by a naive stemmer is produced semi-automatically.
- Suffix-Stripping Algorithms: these algorithms avoid lookup tables; instead, they use a small list of rules to find the root forms of word forms (see the sketch after this list).
- Lemmatization Algorithms: the lemmatization process starts by determining the part of speech of a word and then applies different normalization rules for each part of speech.
- Stochastic Algorithms: these algorithms are trained on a table of root-form-to-inflected-form relations to develop a probabilistic model.
The model looks like a set of rules, similar to the suffix-stripping list of rules.
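A minimal sketch of the suffix-stripping idea; the rule list below is an illustrative assumption, far cruder than Porter's:

```python
# Ordered (suffix, replacement) rules; the first match wins. Illustrative only.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def strip_suffix(word):
    for suffix, replacement in RULES:
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "running", "cats"]:
    print(w, "->", strip_suffix(w))
# caresses -> caress, ponies -> poni, running -> runn, cats -> cat
```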
Sentence Segmentation
-
- Sentence Segmentation:
- It is the problem of dividing a piece of text into its component sentences.
-
- Identifiers:
- Identifiers such as “!”, “?” are unambiguous; they usually signify the end of a sentence.
- The period “.” is quite ambiguous, since it can be used in other ways, such as in abbreviations and in decimal number notation.
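A naive splitter that treats every “.”, “!”, or “?” followed by whitespace as a boundary illustrates the ambiguity (an assumed toy example):

```python
import re

text = "Dr. Smith earned 4.5 points. Amazing!"

# Naive: split after any sentence-final mark followed by whitespace.
print(re.split(r"(?<=[.!?])\s+", text))
# ['Dr.', 'Smith earned 4.5 points.', 'Amazing!']
# The decimal in "4.5" survives, but the abbreviation "Dr." is wrongly split.
```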
-
- Dealing with Ambiguous Identifiers:
- One way of dealing with ambiguous identifiers is by building a Binary Classifier.
On a given occurrence of a period, the classifier has to decide between “Yes, this is the end of a sentence” and “No, this is not the end of a sentence”.
- Types of Classifiers:
- Decision Trees
- Logistic Regression
- SVM
- Neural-Net
- Decision Trees are a common classifier used for this problem (a hand-rolled sketch follows).
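A hand-written, decision-tree-style sketch for the period case; the abbreviation list, features, and function name are illustrative assumptions:

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}  # assumed list

def is_sentence_end(prev_token, next_token):
    """Decide whether the period ending prev_token closes a sentence."""
    # Node 1: periods inside known abbreviations rarely end sentences.
    if prev_token.lower() in ABBREVIATIONS:
        return False
    # Node 2: a decimal number like "4.5" is one token, not a boundary.
    if prev_token.replace(".", "").isdigit():
        return False
    # Node 3: a following capitalized word suggests a new sentence.
    return next_token[:1].isupper()

print(is_sentence_end("Dr.", "Smith"))        # False
print(is_sentence_end("points.", "Amazing"))  # True
```

A trained Decision Tree, Logistic Regression, SVM, or neural classifier would learn such tests and thresholds from labeled data instead of hand-coding them.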