Applied Programming/RegEx

From Wikiversity
Jump to navigation Jump to search

This lesson introduces regular expressions.

Objectives and Skills[edit | edit source]

Objectives and skills for this lesson include:

  • Understand regular expression use and syntax
  • Read directories and files
  • Apply regular expression processing to large data files

Readings[edit | edit source]

  1. Wikipedia: Regular expression
  2. Wikipedia: Metacharacter
  3. Wikipedia: Kleene star
  4. Wikipedia: Greedy algorithm
  5. Python for Everyone: Regular expressions

Multimedia[edit | edit source]

  1. YouTube: Regular Expressions
  2. YouTube: Matching Patterns
  3. YouTube: Matching Patterns part 2
  4. YouTube: Regex Pattern in Java
  5. YouTube: Regular Expressions (RegEx) Tutorial Playlist
  6. YouTube: Learn Regular Expressions: Groups

Examples[edit | edit source]

Activities[edit | edit source]

  1. Complete one or more of the following tutorials:
  2. Review Wikitech: Analytics/Archive/Data/Pagecounts-all-sites. Use Wikimedia Dumps to download two files of hourly log data for all Wikimedia projects. The filename of each log file is in the format pageviews-yyyymmdd-hh0000. The format for each log file is:
        domain_code page_title count_views total_response_size
    Wikiversity's domain_code is en.v. Sample data files are available at Sample Data 1 and Sample Data 2. Write a program that reads all pageview log files in the current directory and uses RegEx groups to parse the data. For en.v records only, create a dictionary using page_title as the key and the sum of count_views as the value.
  3. After reading all files and summing count_views, display the top 100 pages and corresponding count_views sorted in descending order by count_views, and alphabetically in the case of a tie.
  4. The format for a page_title is Title[/Subpage...]. The title without subpages may be considered the overall learning project. Iterate over the dictionary and use RegEx to separate titles from subpages. Create a separate dictionary with a key for for each learning project and the sum of the page and its subpage count_views as the value. Display the top 100 learning projects and corresponding count_views sorted in descending order by count_views, and alphabetically in the case of a tie.
  5. For each of the above, use separate functions for each type of processing. Reuse functions where possible, such as in sorting and searching. Avoid using global variables by passing parameters and returning results. Include appropriate data validation and parameter validation. Add program and function documentation, consistent with the documentation standards for your selected programming language.

Lesson Summary[edit | edit source]

  • Regular expressions, also known as regexes, comprise a language within themselves in order to express patterns that match substrings and text.[1]
    • Most programming languages include a standard library implementation of regex.[1]
    • There are several dialects of regex: the most popular being Perl-like regex and POSIX-curated regex.[1]
  • The phrase regular expressions is often used to mean the specific, standard textual syntax for representing patterns that matching text need to conform to. Each character in a regular expression is understood to be a metacharacter, with its special meaning, or a regular character, with its literal meaning.[1]
  • Regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis.[1]
    • Other applications include data validation, data scraping (especially web scraping), data wrangling, simple parsing, the production of syntax highlighting systems, and many other
  • While regexes would be useful on Internet search engines, processing them across the entire database could consume excessive computer resources depending on the complexity and design of the regex.[1]
  • Metacharacters and literal characters can be used to identify textual material of a given pattern, or process a number of instances of it. Pattern-matches can vary from a precise equality to a very general similarity (controlled by the metacharacters).[1]
  • Capturing groups (...) are useful to capture a sub-expression (or full expression) of a regular expression match as a numbered group which can then be back-referenced by that number.[1]
  • The caret '^' and the dollar sign '$' are examples of regex metacharacters, both referred to as anchors. The caret matches from the beginning of a line or string, while the dollar sign represents the end of a line or string.[1]
  • The language is separated into metacharacters that are imbued with a special meaning by the engine and literal characters that are to be taken as is.[2]
    • If you want to literally use a symbol that is a metacharacter, you must escape it first, usually with a backslash ('\'). [2]
  • Depending on the regex processor there are 14 metacharacters. If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash in order to drop their special meaning and be treated literally inside an expression: [2]
    • the open/close square brackets, "[" and "]”
    • the backslash "\"
    • the caret "^"
    • the dollar sign "$"
    • the period or dot "."
    • the vertical bar or pipe symbol "|"
    • the question mark "?"
    • the asterisk "*"
    • the plus-sign "+"
    • open/close curly braces, "{" and "}"
    • open/close parenthesis, "(" and ")".
  • In many regular expression engines, the . (dot) character matches any character, not just a dot.[2]
  • In many programming languages, strings are delimited using quotes. Using an escape character is one method to avoid delimiter collision. Example : "He said : \"Hello\"".[2]
  • In mathematical logic and computer science, the Kleene star (or Kleene operator), which is widely used for regular expressions, is a unary operation, either on sets of strings or on sets of symbols or characters.[3]
  • The Kleene star (*) is a quantifier used for matching zero or more characters, whereas the Kleene plus (+) is used for matching one or more characters.[3]
    • These operations are named after the mathematician Stephen Cole Kleene who formulated the basic conceptual theory behind regex.[1]
  • A greedy algorithm is one that chooses the "local optimum" at every possible stage in the solving of a problem, hoping to find a holistic solution that is maximally optimized or at least passable.[4]
    • In the context of regular expressions, a greedy match is one that matches as many characters as it can given the pattern. A lazy match is one that matches just the first occurrence of the pattern encountered.[4]
    • A greedy algorithm could fail to produce the optimal solution, and may even produce the unique worst possible solution.[4]
  • The matching pursuit is an example of greedy algorithm applied on signal approximation.[4]
  • In Python, dictionaries cannot be iterated through since the values rely on their keys. Because of this, it is often useful to make a sorted list of tuples consisting of the dictionaries content. Thomas Cokelaer's Blog contains the syntax for creating a sorted list of tuples easily.
  • Usually, regex patterns are expressed in Python code using raw string notation. For example, r"\n".[5]
    • Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals.[5]
    • The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.[5]
  • The difference between re.findall() and re.finditer():[6]
    • re.findall(): return all non-overlapping matches of pattern in string, as a list of strings.
    • re.finditer(): return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
  • The difference between Group(), Groups() and Groupdict():[7]
    • A group() expression returns one or more subgroups of the match.
    • A groups() expression returns a tuple containing all the subgroups of the match.
    • A groupdict() expression returns a dictionary containing all the named subgroups of the match, keyed by the subgroup name.

Key Terms[edit | edit source]

escape character
If you want to literally use a symbol that is a metacharacter, you must escape it first, usually with a backslash ('\'). Using an escape character is one method to avoid delimiter collision.[2]
greedy algorithm
Is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.[4]
Kleene star
A quantifier used for matching zero or more characters. The Kleene plus (+) is used for matching one or more characters.[3]
A character that fulfills a special purpose or function and is no longer to meant be taken literally.[1]
an algebraic structure with a single associative binary operation and an identity element [3]
A regular expression against which text is either matched or captured.[1]
A modifier that follows another regex token, enumerating the number of times you expect it to appear.[1]
regex processor
A processor that translates a regular expression in the above syntax into an internal representation which can be executed and matched against a string representing the text being searched.[1]
regular expression
is a sequence of characters that define a search pattern. Usually, this pattern is used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.[8]
string-matching algorithms
is an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.[9]
The dot (.) metacharacter is a wildcard, a generic character that stands in for anything.[1]

See Also[edit | edit source]

References[edit | edit source]