Natural Language Processing

From Wikiversity

Introduction

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

The field is divided into two major categories: speech (recognition and generation) and text (recognition and generation).

The technologies needed for the two are very different: "speech" is normally addressed by electronic or telecommunications engineers, while "text" is more often addressed by computer scientists. Additionally, speech recognition and speech generation are normally built as a layer on top of a text recognition/generation engine.

To follow a course on NLP, you must know about formal languages and grammars and be able to program fluently. For the practicals, you must learn a logic programming language; Prolog is recommended. A rule-based programming language can serve as well, although it is perhaps more difficult to master.

Document Decomposition

[Figure: Abstract Syntax Tree of a document]

If you look at the text above:

  • the document consists of sections,
  • each section consists of paragraphs,
  • each paragraph consists of sentences,
  • each sentence consists of words and
  • words consist of characters/symbols.

Furthermore, the text contains more information that we could identify, e.g. a header or links. In general, a document can contain other document elements such as images, audio, video, or interactive elements like forms or geometric constructions that can be manipulated by the reader.
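The nesting described above (paragraphs, sentences, words, characters) can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: paragraphs are taken to be separated by blank lines, a sentence is assumed to end at ".", "!" or "?" followed by whitespace, and the function name decompose is hypothetical. Real text (abbreviations, quotations, other languages) needs a more careful tokenizer.

```python
import re

def decompose(text):
    """Split text into paragraphs -> sentences -> words (naive rules)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        # Words are runs of letters, digits or underscores.
        result.append([re.findall(r"\w+", s) for s in sentences])
    return result

sample = "NLP is fun. It studies language.\n\nTrees help."
tree = decompose(sample)

# The last level of the decomposition: a word consists of characters.
characters = list(tree[1][0][0])
```

Here tree is a nested list: the outer list holds paragraphs, each paragraph holds sentences, and each sentence holds its words. Splitting a document into sections would add one more level on top, using the header syntax of the document format.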

Abstract Syntax Tree

A tree can represent the decomposition of a text into substructures. For example, the decomposition of the text into sections is the top level of an Abstract Syntax Tree (AST).
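As a rough sketch of such a tree, the following Python snippet models a document as a root node whose children are its sections. The Node class, its field names, and the render helper are hypothetical names chosen for illustration. The section titles are the headings of this page, and placing them all directly under the root is an assumption about the page's nesting.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                      # e.g. "document", "section", "paragraph"
    label: str = ""                # e.g. the section title
    children: List["Node"] = field(default_factory=list)

def render(node, depth=0):
    """Return an indented, one-node-per-line view of the tree."""
    line = "  " * depth + f"{node.kind}: {node.label}"
    return "\n".join([line] + [render(child, depth + 1) for child in node.children])

# Top level of the tree: the document and its sections.
doc = Node("document", "Natural Language Processing", [
    Node("section", "Introduction"),
    Node("section", "Document Decomposition"),
    Node("section", "Abstract Syntax Tree"),
    Node("section", "Learning Activities"),
    Node("section", "See also"),
])

print(render(doc))
```

Deeper levels (paragraphs, sentences, words) would be added by filling the children list of each section node in the same way.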

Learning Activities

  • (Abstract Syntax Tree) Draw the top level of the Abstract Syntax Tree (AST) that decomposes this text into sections.
  • (Language Section Decomposition) Look into specific languages such as HTML, Markdown, LaTeX, ... and identify the syntax elements they provide for marking headers in a text.
  • (Regular Expression) Write a regular expression, e.g. in JavaScript, that extracts headers (e.g. "Introduction") and the level of each header (i.e. section, subsection, ...).
    • In HTML the levels are defined with the h1, h2, ... tags, e.g. for a section header "Introduction" at level 1 and a subsection "My Subsection" at level 2:
  <h1>Introduction</h1>
   Text 
  <h2>My Subsection </h2>
   Text of subsection
    • In LaTeX the levels are defined with the \section{...}, \subsection{...}, ... commands, e.g. a section header "Introduction" at level 1 and a subsection "My Subsection" at level 2 are defined with the following commands:
  \section{Introduction}
   Text 
  \subsection{My Subsection}
   Text of subsection
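One possible approach to the regular expression activity is sketched below in Python (the activity suggests JavaScript only as an example, and the patterns carry over with minor syntax changes). The names HTML_HEADER, LATEX_HEADER, LATEX_LEVELS, html_headers and latex_headers are hypothetical. The patterns assume simple, well-formed headers without nested tags or nested braces, so they are a starting point rather than a full HTML or LaTeX parser.

```python
import re

# <h1>...</h1> to <h6>...</h6>; the digit is the level, \1 forces a matching close tag.
HTML_HEADER = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)

# \section{...}, \subsection{...}, \subsubsection{...}, with an optional star.
LATEX_HEADER = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^{}]*)\}")
LATEX_LEVELS = {"section": 1, "subsection": 2, "subsubsection": 3}

def html_headers(source):
    """Return (level, title) pairs for HTML headers, in document order."""
    return [(int(level), title.strip()) for level, title in HTML_HEADER.findall(source)]

def latex_headers(source):
    """Return (level, title) pairs for LaTeX sectioning commands."""
    return [(LATEX_LEVELS[cmd], title.strip()) for cmd, title in LATEX_HEADER.findall(source)]

html = """<h1>Introduction</h1>
 Text
<h2>My Subsection </h2>
 Text of subsection"""

latex = r"""\section{Introduction}
 Text
\subsection{My Subsection}
 Text of subsection"""

print(html_headers(html))    # [(1, 'Introduction'), (2, 'My Subsection')]
print(latex_headers(latex))  # [(1, 'Introduction'), (2, 'My Subsection')]
```

Both functions return the header level together with the title, which is the information needed to build the section level of the Abstract Syntax Tree.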

See also
