Introduction and Definitions

  1. Text Normalization:
    Every NLP process starts with a task called Text Normalization.
    Text Normalization is the process of transforming text into a single canonical form that it might not have had before.
    Importance: Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it.
    Steps:
    1. Segmenting/Tokenizing words in running text.
    2. Normalizing word formats.
    3. Segmenting sentences in running text.
  2. Methods for Normalization:
    • Case-Folding: reducing all letters to lower case.

      A possible exception is capital letters mid-sentence, which often mark proper nouns or acronyms (see the case-folding sketch at the end of this section).

    • Lemmatization: reducing inflections or variant forms to base form.

      Basically, finding the correct dictionary headword form.

  3. Morphology:
    The study of words, how they are formed, and their relationship to other words in the same language.
    • Morphemes: the smallest meaningful units that make up words.
    • Stems: the core meaning-bearing units of words.
    • Affixes: the bits and pieces that adhere to stems (often with grammatical functions).
  4. Word Equivalence in NLP:
    Two words have the same
    • Lemma, if they have the same:
      • Stem
      • POS
      • Rough Word-Sense

        cat & cats -> same Lemma

    • Wordform, if they have the same:
      • full inflected surface form

        cat & cats -> different wordforms

  5. Types and Tokens:
    • Type: an element of the vocabulary.
      It is the class of all tokens containing the same character sequence.
    • Token: an instance of that type in running text.
      It is an instance of a sequence of characters in a document that are grouped together as a useful semantic unit for processing.
  6. Notation:
    • N = Number of Tokens.
    • V = Vocabulary = set of Types.
    • \(\|V\|\) = size/cardinality of the vocabulary.
  7. Growth of the Vocabulary:
    Church and Gale (1990) suggested that the size of the vocabulary grows larger than the square root of the number of tokens in a piece of text:
    \[\|V\| > \mathcal{O}(N^{1/2})\]
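  8. Example (Python Sketch):
    A minimal sketch, using only the Python standard library, of case-folding and of counting tokens (N) and types (\(\|V\|\)) on a toy two-sentence text:

      import re

      text = "The cat saw the cats. The cats saw the cat."

      # Case-folding: reduce all letters to lower case.
      folded = text.lower()

      # Crude tokenization on runs of letters (see the Tokenization section below).
      tokens = re.findall(r"[a-z]+", folded)
      types = set(tokens)  # V: the vocabulary, i.e. the set of types

      print("N   =", len(tokens))  # N   = 10 tokens
      print("|V| =", len(types))   # |V| = 4 types: 'the', 'cat', 'saw', 'cats'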

Tokenization

  1. Tokenization:
    Given a character sequence and a defined document unit, it is the task of chopping the sequence up into pieces, called tokens.
    It may involve throwing away certain characters, such as punctuation.
  2. Methods for Tokenization:
    • Regular Expressions (see the sketch at the end of this section).
    • A Flag: specific sequences of characters.
    • Delimiters: specific separating characters.
    • Dictionary: explicit definitions given by a dictionary.
  3. Categorization:
    Tokens are categorized by:
    • Character Content
    • Context within a data stream.
    Categories:
    • Identifiers: names the programmer chooses.
    • Keywords: names already defined in the programming language.
    • Operators: symbols that operate on arguments and produce results.
    • Grouping Symbols
    • Data Types
    Categories are used for post-processing of the tokens either by the parser or by other functions in the program.
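  4. Example (Python Sketch):
    A minimal regex-based tokenizer that also categorizes each token; the category names and the tiny keyword set are illustrative assumptions, not a complete lexer:

      import re

      KEYWORDS = {"if", "else", "return"}  # illustrative keyword list

      TOKEN_PATTERN = re.compile(r"""
          (?P<NUMBER>   \d+(\.\d+)? )   # integer or decimal literal
        | (?P<NAME>     [A-Za-z_]\w* )  # identifier or keyword (split below)
        | (?P<OPERATOR> [-+*/=<>]+ )    # operator symbols
        | (?P<GROUPING> [()\[\]{}] )    # grouping symbols
      """, re.VERBOSE)

      def tokenize(source):
          """Yield (category, lexeme) pairs; unmatched characters are skipped."""
          for match in TOKEN_PATTERN.finditer(source):
              category, lexeme = match.lastgroup, match.group()
              if category == "NAME":
                  category = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
              yield category, lexeme

      print(list(tokenize("if x > 3.5 return (x + y)")))
      # [('KEYWORD', 'if'), ('IDENTIFIER', 'x'), ('OPERATOR', '>'), ('NUMBER', '3.5'),
      #  ('KEYWORD', 'return'), ('GROUPING', '('), ('IDENTIFIER', 'x'),
      #  ('OPERATOR', '+'), ('IDENTIFIER', 'y'), ('GROUPING', ')')]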

Word-Normalization (Stemming)

  1. Stemming:
    is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
    The stem need not map to a valid root in the language.

    Basically, stemming is a crude chopping of affixes.

    Example: “automate”, “automatic”, “automation” -> “automat”.

  2. Porter’s Algorithm:
    The most common English stemmer.
    It is an iterated series of simple replacement rules (a short sketch using an off-the-shelf implementation appears at the end of this section).
  3. Algorithms:
    • The Production Technique: the lookup table used by a naive stemmer is produced semi-automatically.
    • Suffix-Stripping Algorithms: these algorithms avoid lookup tables; instead they use a small list of rules to navigate through the text and find the root forms of word forms.
    • Lemmatisation Algorithms: the lemmatization process starts by determining the part of speech of a word and then applies normalization rules for each part of speech.
    • Stochastic Algorithms: these algorithms are trained on a table of root-form-to-inflected-form relations to develop a probabilistic model.
      The model looks like a set of rules, similar to the suffix-stripping list of rules.
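  4. Example (Python Sketch):
    A minimal sketch of Porter stemming, assuming the NLTK library is installed; its PorterStemmer is rule-based and needs no extra data downloads:

      from nltk.stem.porter import PorterStemmer

      stemmer = PorterStemmer()
      for word in ["automate", "automatic", "automation", "cats", "ponies"]:
          # The exact output can differ slightly from the idealized
          # "automat" example above (e.g. NLTK may produce "autom").
          print(word, "->", stemmer.stem(word))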

Sentence Segmentation

  1. Sentence Segmentation:
    It is the problem of dividing a piece of text into its component sentences.
  2. Identifiers:
    Identifiers such as “!” and “?” are relatively unambiguous; they usually signify the end of a sentence.
    The period “.” is quite ambiguous, since it can be used in other ways, such as in abbreviations and in decimal number notation.
  3. Dealing with Ambiguous Identifiers:
    One way of dealing with ambiguous identifiers is by building a Binary Classifier.
    On a given occurrence of a period, the classifier has to decide between one of “Yes, this is the end of a sentence” or “No, this is not the end of a sentence”.
    Types of Classifiers:
    • Decision Trees
    • Logistic Regression
    • SVM
    • Neural-Net
    Decision Trees are a common classifier used for this problem; a simple rule-based sketch of the period decision follows.
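  4. Example (Python Sketch):
    A minimal rule-based sketch of the binary period decision described above; a real system would train one of the listed classifiers on features like these, and the abbreviation list here is purely illustrative:

      ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "vs"}

      def is_sentence_end(text, i):
          """Decide whether the period at index i ends a sentence."""
          words_before = text[:i].split()
          prev_word = words_before[-1].lower().rstrip(".") if words_before else ""
          if prev_word in ABBREVIATIONS:                  # e.g. "Dr." rarely ends a sentence
              return False
          if i + 1 < len(text) and text[i + 1].isdigit():
              return False                                # decimal point, e.g. "3.14"
          following = text[i + 1:].lstrip()
          return following == "" or following[0].isupper()

      text = "Dr. Smith paid 3.14 dollars. He left."
      boundaries = [i for i, ch in enumerate(text) if ch == "." and is_sentence_end(text, i)]
      print(boundaries)  # indices of the periods after "dollars" and "left"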