# Web Science/Part2: Emerging Web Properties/MoocIndex

--MOOC-Index

# lesson|How big is the World Wide Web

- learningGoals=

- Scientific: Some questions are hard (impractical or even impossible) to answer.
- Understand the limitations and benefits of scientific models.
- Be able to name three different points of view (or model choices) for studying the World Wide Web (software system, text corpus, graph).
- Be aware of three different modelling techniques (descriptive, generative and predictive models).
- The best way to understand the web is to be able to describe it properly via various models.

## unit|Problems with the question about the size of the Web

- furtherReading=

- learningGoals=

- The question will remain unanswered during the lesson and the entire course.
- The question of size is underspecified because answering it requires a measure.
- The measure depends heavily on how we choose to model the Web.
- We have not yet defined what we mean when we say World Wide Web.

- video=File:Problems_with_the_Question_of_%22How_big_is_the_World_Wide_Web%22.webm

## unit|3 ways to study the Web

- furtherReading=# World Wide Web
- learningGoals=

- The web as a software system.
- The web as a collection of text documents.
- The web as a graph of interlinked documents.
- Even after choosing one point of view, there are fundamentally different ways of modelling.

- numThreads=2
- numThreadsOpen=2
- video=File:3_ways_to_look_at_the_World_Wide_Web.webm

## unit|A simplistic descriptive model

- furtherReading=# Descriptive statistics
- learningGoals=

- understand that only the model is described.
- description of the model can be used for interpretation.
- within the descriptive model one chooses measures to describe the object of study.
- understand the notion of a modelling choice
- be able to criticise a descriptive model and the modelling choices

- video=File:Introduction_to_descriptive_modelling_of_the_World_Wide_Web.webm

## unit|An unrealistic, simplistic generative model

- furtherReading=# Scientific modelling
- learningGoals=

- A generative model can be used to try to explain why something works.
- A generative model needs to be run more than once.
- Understand the notion of a modelling parameter.
- The output of a generative model will be compared to the descriptive model of our object of study.

- numThreads=3
- numThreadsOpen=2
- video=File:Introduction_to_Generative_Models_of_the_World_Wide_Web.webm

## unit|Summary, further reading, homework

- furtherReading=

- a
- b

- learningGoals=

- Understand why it is important to have web models (e.g. for information retrieval, spam detection, and understanding the epidemic flow of information on the web).
- Be aware that models are not reality.
- Even if a generative model yields the same statistics and distributions as a predictive model, it might fail badly.

- video=File:Under_construction_icon-blue.svg

# lesson|Simple statistical descriptive Models for the Web

- learningGoals=

- Formulating a research hypothesis and testing it by means of simple descriptive statistics
- Reading diagrams

## unit|Counting Words And Documents

- furtherReading=

- tba

- learningGoals=

- Understand why we selected simple English Wikipedia as a toy example for modeling the web
- Understand that a task already as simple as counting words includes modeling choices
- Be familiar with the term “unique word token”
- Know some basic tools to count words and documents

- video=File:Counting-Words-And-Documents.webm
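
The point that counting words already involves modelling choices can be illustrated with a minimal Python sketch. The tokenization regex, the `lowercase` switch and the two toy documents are assumptions made for this illustration; they are not taken from the unit.

```python
import re
from collections import Counter

def tokenize(text, lowercase=True):
    """Split text into word tokens. The regex and the lowercasing are
    modelling choices (assumed here), not the only reasonable ones."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[A-Za-z0-9']+", text)

def count_words(documents, lowercase=True):
    """Count how often each word token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc, lowercase))
    return counts

# Toy corpus, made up for this example.
docs = ["The web is big.", "How big is the Web?"]
counts = count_words(docs)
total_tokens = sum(counts.values())   # all word occurrences
unique_tokens = len(counts)           # distinct word tokens
```

With lowercasing, "Web" and "web" count as the same unique token; calling `count_words(docs, lowercase=False)` keeps them apart and changes the number of unique tokens.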

## unit|Typical length of a document

- furtherReading=# tba
- learningGoals=

- Be familiar with some basic statistical objects like Median, Mean, and Histograms
- Should be able to relate a histogram to its cumulative distribution function

- numThreads=3
- numThreadsOpen=2
- video=File:Typical-length-of-a-document-Histograms.webm
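
As a rough sketch of how mean, median, histogram and cumulative distribution function relate, the following Python example works on a made-up list of document lengths. The function names and the data are illustrative assumptions.

```python
from collections import Counter
from itertools import accumulate
from statistics import mean, median

def histogram(lengths):
    """Map each document length to the number of documents of that length,
    ordered by increasing length."""
    return dict(sorted(Counter(lengths).items()))

def cumulative_distribution(hist):
    """Turn a histogram into an empirical CDF:
    the fraction of documents with length <= x."""
    total = sum(hist.values())
    running = accumulate(hist.values())
    return {x: c / total for x, c in zip(hist, running)}

# Hypothetical document lengths (in words); one long outlier.
lengths = [2, 3, 3, 5, 40]
hist = histogram(lengths)
cdf = cumulative_distribution(hist)
typical_mean = mean(lengths)      # pulled upwards by the outlier
typical_median = median(lengths)  # robust against the outlier
```

The single long document moves the mean far away from the median, which is one reason why "typical length" depends on the chosen statistic.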

## unit|How to formulate a research hypothesis

- furtherReading=# tba
- learningGoals=

- Understand the ongoing, cyclic process of research
- Know what falsifiable means and why every research hypothesis needs to be falsifiable
- Be able to formulate your own research hypothesis

- numThreads=1
- numThreadsOpen=1
- video=File:How-to-formulate-a-research-hypothesis.webm

## unit|Number of words needed to understand most of Wikipedia

- furtherReading=

- learningGoals=

- Understand what a log-log plot is
- Improve your skills in reading and interpreting diagrams
- Know about the word rank / frequency plot
- Should be able to transfer a histogram or curve into a cumulative distribution function

- numThreads=1
- numThreadsOpen=0
- video=File:Number-of-words-needed-to-understand-most-of-Wikipedia.webm
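
One way to read the word rank / frequency data cumulatively is to ask how many of the most frequent words are needed to cover a given share of all word occurrences. The sketch below assumes that reading; `words_needed` and the toy token list are hypothetical.

```python
from collections import Counter

def words_needed(tokens, coverage):
    """Smallest number of top-ranked words whose occurrences together
    make up at least `coverage` (between 0 and 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # most_common() yields words by decreasing frequency, i.e. by rank.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= coverage * total:
            return rank
    return len(counts)

# Toy token stream: "a" occurs 5 times, "b" 3 times, "c" and "d" once each.
tokens = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
half = words_needed(tokens, 0.5)
three_quarters = words_needed(tokens, 0.75)
```

Plotting the covered fraction against the rank gives the cumulative curve that corresponds to the rank / frequency plot.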

## unit|Linguists' way of checking simplicity of text

- furtherReading=

- tba

- learningGoals=

- Get a feeling for interdisciplinary research
- Know the Automated Readability Index
- Have a strong sense of support for our research hypothesis
- Be able to critically discuss the limits of our models

- video=File:Linguists-way-of-checking-simplicity-of-text.webm
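
The Automated Readability Index is commonly given as 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The Python sketch below applies that formula; the way sentences and words are split is a simplifying assumption and will differ from more careful linguistic tooling.

```python
import re

def automated_readability_index(text):
    """Compute the ARI of a text with a naive sentence and word splitter."""
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

simple = automated_readability_index("The cat sat. The dog ran.")
hard = automated_readability_index("Extraordinarily complicated terminology.")
```

Shorter words and shorter sentences lower the score, so under this measure a simpler text should score lower than a more complex one.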

# lesson|Advanced statistical descriptive models for the Web

- learningGoals=

- Fitting a curve
- Working with logarithmic plots
- Zipf's law
- Power laws

## unit|The Zipf law for text

- furtherReading=

- tba

- learningGoals=

- Be able to name some fundamental properties about how frequencies of words in texts are distributed
- Be a little bit more cautious about visual impressions when looking at log-log plots
- Know both formulations of Zipf’s law

- video=File:Questioning-the-Zipf-law.webm

## unit|Visually straight lines on log log plots

- furtherReading=

- tba

- learningGoals=

- Be able to do a coordinate transformation to change the scales of your plots
- Understand in which scenario power functions appear as straight lines
- Know in which scenarios exponential functions appear as straight lines
- Be even more cautious about your visual impressions

- video=File:Visually-straight-lines-on-log-log-plots.webm
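
The two scenarios in the learning goals follow from taking logarithms. The symbols $c$, $a$ and $b$ are generic constants chosen for this illustration.

```latex
% Power function: a straight line when BOTH axes are logarithmic.
y = c\,x^{a}
\;\Longrightarrow\;
\log y = \log c + a \log x
\qquad \text{(slope } a \text{ in the } (\log x, \log y) \text{ plane)}

% Exponential function: a straight line when ONLY the y axis is logarithmic.
y = c\,e^{b x}
\;\Longrightarrow\;
\log y = \log c + b\,x
\qquad \text{(slope } b \text{ in the } (x, \log y) \text{ plane)}
```

The coordinate transformation is thus $(x, y) \mapsto (\log x, \log y)$ for the log-log plot and $(x, y) \mapsto (x, \log y)$ for the semi-log plot.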

## unit|Fitting a curve on a log log plot

- furtherReading=# tba
- learningGoals=

- Know the axioms for a distance measure and how they relate to norms.
- Know at least two distance measures on function spaces.
- Understand why changing to the CDF makes sense when looking at the distance between functions.
- Understand the principle of the Kolmogorov-Smirnov test for fitting curves.

- numThreads=3
- numThreadsOpen=0
- video=File:Fitting-a-curve-on-a-log-log-plot.webm
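
The distance behind the Kolmogorov-Smirnov test is the largest vertical gap between two cumulative distribution functions. The sketch below computes that gap between the empirical CDF of a sample and a candidate model CDF. It assumes a continuous model CDF, and `ks_distance` and `uniform_cdf` are names invented for this example. Turning the distance into an accept/reject decision needs critical values, which are not covered here.

```python
def ks_distance(samples, model_cdf):
    """Largest absolute gap between the empirical CDF of `samples`
    and the function `model_cdf` (supremum distance on CDFs)."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = model_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so both sides of the jump have to be checked.
        gap = max(gap, abs((i + 1) / n - f), abs(f - i / n))
    return gap

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustrative model)."""
    return min(max(x, 0.0), 1.0)

good_fit = ks_distance([0.25, 0.5, 0.75], uniform_cdf)
bad_fit = ks_distance([0.9, 0.9, 0.9], uniform_cdf)
```

A smaller distance means the candidate curve follows the data more closely, which makes the distance usable as an objective when comparing fits.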

## unit|Zipf law, power law or Pareto law

- furtherReading=

- tba

- learningGoals=

- Know how to transform a rank-frequency diagram into a power law plot.
- Understand how power law and Pareto plots relate to each other.
- Be able to explain why a Pareto plot is just an inverted rank-frequency diagram.
- Be able to transform the Zipf coefficient into the power law and Pareto coefficients and vice versa.
- Understand that building the CDF is basically like building the integral.

- video=File:Zipf-law-powerlaw-or-pareto-law.webm
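
The relations between the three coefficients can be sketched as follows. The symbols $b$ (Zipf), $k$ (Pareto) and $a$ (power law) are notation assumed for this illustration and may differ from the notation used in the video.

```latex
% Zipf (rank-frequency): the word of rank r has frequency x.
x(r) \propto r^{-b}

% Pareto (cumulative): exactly r words have frequency >= x(r),
% so inverting the Zipf relation gives the axes-swapped diagram.
P(X \ge x) \propto x^{-k}, \qquad k = \frac{1}{b}

% Power law (density): differentiating the cumulative form.
p(x) \propto x^{-a}, \qquad a = k + 1 = 1 + \frac{1}{b}
```

Read from bottom to top, integrating the power law density gives the Pareto form, which is the sense in which building the CDF resembles building the integral.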

# lesson|Modelling Similarity of Text

- learningGoals=

## unit|Similarity Measures and their Applications

- furtherReading=# tba
- learningGoals=

- Know the properties of a similarity measure
- Be able to relate similarity and distance measures
- Know of two applications for modelling similarity

- numThreads=1
- numThreadsOpen=1
- video=File:Similarity-Measures-and-their-Applications.webm

## unit|Jaccard Similarity for Sets

- furtherReading=# tba
- learningGoals=

- Understand how text documents can be modeled as sets
- Know the Jaccard coefficient as a similarity measure on sets
- Know a trick for remembering the formula
- Be aware of the possible outcomes of the Jaccard index
- As always be able to criticize your model

- video=File:Jaccard-Similarity-for-Sets.webm
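
The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. A minimal Python sketch, assuming documents are modelled as sets of whitespace-separated words (a modelling choice made for this example):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections,
    each treated as a set."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the web is big".split()
doc2 = "the web is a graph".split()
similarity = jaccard(doc1, doc2)
```

The result always lies between 0 (no shared words) and 1 (identical word sets). Word order and word frequency are ignored, which is one point on which the model can be criticized.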

## unit|Cosine Similarity for Vector Spaces

- furtherReading=# tba
- learningGoals=

- Be familiar with the vector space model for text documents
- Be aware of term frequency and (inverse) document frequency
- Have reviewed the definitions of basis and dimension
- Realize that the angle between two vectors can be seen as a similarity measure

- video=File:Cosine-Similarity-For-Vectorspaces.webm
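
In the vector space model each distinct term is one dimension, and the cosine of the angle between two document vectors serves as their similarity. The sketch below uses raw term frequencies as vector components; weighting by inverse document frequency is left out to keep the example short, and the function name is an assumption.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors
    of two tokenized documents."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention assumed here: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

sim = cosine_similarity(["web", "web", "graph"], ["web", "graph", "graph"])
```

Because term frequencies are never negative, the value lies between 0 (no shared terms) and 1 (same direction).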

## unit|Probabilistic Similarity Measures: Kullback-Leibler Divergence

- furtherReading=# tba
- learningGoals=

- Be aware of a unigram Language Model
- Know Laplacian (aka +1) smoothing
- Know the query likelihood model
- Know the Kullback-Leibler divergence
- See how a similarity measure can be derived from the Kullback-Leibler divergence

- video=File:Probabilistic-Similarity-Measures-Kullback-Leibler-Divergence.webm
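
A minimal sketch of a unigram language model with Laplacian (+1) smoothing and of the Kullback-Leibler divergence between two such models. It assumes a shared vocabulary that contains every token of both documents; the function names and toy documents are illustrative. How the unit turns the divergence into a similarity measure is not shown here.

```python
import math
from collections import Counter

def unigram_model(tokens, vocabulary):
    """Laplace-smoothed unigram probabilities over `vocabulary`.
    Assumes every token of the document is in the vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)  # +1 pseudo-count per word
    return {w: (counts[w] + 1) / total for w in vocabulary}

def kl_divergence(p, q):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)) over a shared vocabulary.
    Smoothing guarantees q(w) > 0, so the logarithm is always defined."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

doc_a = ["web", "web", "graph"]
doc_b = ["graph", "text"]
vocabulary = set(doc_a) | set(doc_b)
p = unigram_model(doc_a, vocabulary)
q = unigram_model(doc_b, vocabulary)
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

The divergence is zero for identical models and is not symmetric (`d_pq` and `d_qp` differ), so it is not a distance in the strict sense.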

## unit|Comparing Results of Similarity Measures

- furtherReading=# tba
- learningGoals=

- Understand that different modeling choices can produce very different results.
- Have a feeling for how you could statistically compare the differences between the models.
- Know how you could extract keywords from documents with the tf-idf approach.
- Try to argue which model you like best in a certain scenario.

- video=File:Comparing-Results-of-Similarity-Merasures.webm

# lesson|Generative Models for the Web

- learningGoals=

## unit|Introduction to generative modelling

- furtherReading=# tba
- learningGoals=

- Understand the principal methodology for building generative models
- Remember why people are interested in generative models
- Know why descriptive models are needed when evaluating a generative model
- Be aware of one way to create a model for text generation

- video=File:Introduction-to-generative-modelling.webm

## unit|Sampling from a probability distribution

- furtherReading=# tba
- learningGoals=

- Understand how to sample values from an arbitrary probability distribution
- Have seen yet another application of the cumulative distribution function
- Understand that sampling from a distribution is just a coordinate transformation of the uniform distribution

- video=File:Sampling-from-a-probability-distribution.webm
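
Sampling via the cumulative distribution function can be sketched as follows: draw a uniform random number and look up the first outcome whose cumulative probability exceeds it. The example assumes a discrete distribution given as a dictionary; `make_sampler` is a hypothetical helper, not part of any library.

```python
import bisect
import random
from itertools import accumulate

def make_sampler(distribution, rng=random):
    """Return a function that samples outcomes from `distribution`
    (a dict mapping outcome -> probability) using its CDF."""
    outcomes = list(distribution)
    cdf = list(accumulate(distribution[o] for o in outcomes))

    def sample():
        # Uniform value in [0, total probability); scaling by cdf[-1]
        # tolerates probabilities that do not sum exactly to 1.
        u = rng.random() * cdf[-1]
        # Index of the first cumulative value strictly greater than u.
        index = min(bisect.bisect_right(cdf, u), len(outcomes) - 1)
        return outcomes[index]

    return sample

# Usage with a fixed seed so the draws are reproducible.
sampler = make_sampler({"a": 0.7, "b": 0.2, "c": 0.1}, random.Random(0))
draws = [sampler() for _ in range(10000)]
```

The lookup in the CDF is the coordinate transformation mentioned in the learning goals: uniform values on the vertical axis are mapped back to outcomes on the horizontal axis.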

## unit|Evaluating a generative model

- furtherReading=# tba
- learningGoals=

- See that it makes sense to compare statistics
- Understand that comparing statistics is not a well defined task
- Be aware of the fact that very different models could lead to the same statistics

- numThreads=1
- numThreadsOpen=0
- video=File:Evaluating-a-generative-model.webm

## unit|Pitfalls when increasing the number of model parameters

- furtherReading=# tba
- learningGoals=

- See that one can always increase the number of model parameters
- Know that increasing the number of model parameters often yields a more accurate model
- Be aware of the bigram and mixed models as examples of our generative processes

- video=File:Pittfalls-when-increasing-the-number-of-model-parameters.webm

# lesson|Modeling the Web as a graph

- learningGoals=

## unit|Reviewing terms from graph theory

- furtherReading=#
- learningGoals=

- Be familiar with a set theoretic way of denoting a graph
- Know at least 4 different types of graphs
- Have practiced your abilities in reading and writing mathematical formulas

- video=File:Reviewing terms from graph theory.webm

## unit|The standard web graph model

- furtherReading=# tba
- learningGoals=

- Be able to model web pages as a graph
- Know that the authorship graph is bipartite
- Know what kind of graph the graph of web pages is
- (as always) be aware of the fact that modeling is done by making choices

- video=File:The standard web graph model.webm

## unit|Descriptive statistics of the web graph

- furtherReading=# tba
- learningGoals=

- Know terms like Size and (unique) volume
- Be able to count the in-degree and out-degree of web pages
- Have an idea what kind of law the in-degree and out-degree distributions follow
- Know that degree is not distributed in a fair way
- Know that the Gini coefficient can be used to measure fairness

- video=File:Descriptive statistics of the web graph.webm
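
The Gini coefficient summarizes how unevenly a quantity such as the in-degree is spread over the nodes: 0 means every node has the same degree, and values near 1 mean a few nodes hold almost all links. The sketch below uses one common closed form on sorted values; the degree lists are made up for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values, computed as
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted ascending and i counted from 1."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Convention assumed here: nothing to distribute means no inequality.
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

fair = gini([1, 1, 1, 1])    # every page has the same in-degree
unfair = gini([0, 0, 0, 4])  # one page receives all links
```

For a finite list of n values the maximum is (n − 1) / n rather than exactly 1, which `unfair` illustrates for n = 4.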

## unit|Topology of the web graph

- furtherReading=# tba
- learningGoals=

- Understand the notion of a path in a (directed) graph
- Know that shortest paths between nodes need not be unique
- Understand the notion of a strongly connected component
- Know about the diameter of a graph
- Be aware of the bow tie structure of the Web

- numThreads=1
- numThreadsOpen=1
- video=File:Topology of the web graph.webm

## unit|Modelling graphs with linear algebra

- furtherReading=# tba
- learningGoals=

- Be able to read and build an adjacency matrix of a graph
- Know some basic matrix vector multiplications to generate some statistics out of the adjacency matrix
- Understand what is encoded in the components of the k-th power of the Adjacency matrix of a graph

- numThreads=1
- numThreadsOpen=1
- video=File:Modelling-graphs-with-linear-algebra.pdf.webm
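
A small sketch of the learning goals above, using plain Python lists as matrices. It assumes the convention that entry (i, j) of the adjacency matrix is 1 when there is a link from node i to node j; the three-node graph and the helper names are invented for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(a, k):
    """k-th power of a square matrix (k = 0 gives the identity)."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, a)
    return result

def mat_vec(a, v):
    """Multiply a matrix by a column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

# Directed toy graph with links 0->1, 0->2, 1->2 and 2->0.
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [1, 0, 0]]
ones = [1, 1, 1]
out_degrees = mat_vec(adjacency, ones)            # row sums
in_degrees = mat_vec(transpose(adjacency), ones)  # column sums
walks_2 = mat_power(adjacency, 2)
```

Entry (i, j) of the k-th power counts the directed walks of length k from node i to node j. For example, `walks_2[0][2]` is 1 because 0→1→2 is the only walk of length two from node 0 to node 2.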