Web Science/Part2: Emerging Web Properties/MoocIndex
--MOOC-Index
lesson|How big is the World Wide Web
- learningGoals=
- Scientific: Some questions are hard (impractical or even impossible) to answer.
- understand the limitations and benefits of scientific models.
- be able to name 3 different points of view (or modelling choices) we could take when we want to study the World Wide Web (software system, text corpus, graph).
- be aware of 3 different Modelling techniques (Descriptive, Generative and Predictive Models).
- The best way to understand the web is to be able to describe it properly via various models.
unit|Problems with the question about the size of the Web
- furtherReading=
- learningGoals=
- The question will remain unanswered during the lesson and the entire course.
- question of size is underspecified because a measure is needed.
- measure depends heavily on the choice of how we model the web.
- We have not yet defined what we mean when we say World Wide Web.
- video=File:Problems_with_the_Question_of_%22How_big_is_the_World_Wide_Web%22.webm
unit|3 ways to study the Web
- furtherReading=# World Wide Web
- learningGoals=
- The web as a software system.
- The web as a collection of text documents.
- The web as a graph of interlinked documents.
- Even when choosing 1 point of view we have fundamentally different ways of modelling.
- numThreads=3
- numThreadsOpen=3
- video=File:3_ways_to_look_at_the_World_Wide_Web.webm
unit|A simplistic descriptive model
- furtherReading=# Descriptive statistics
- learningGoals=
- understand that only the model is described.
- description of the model can be used for interpretation.
- within the descriptive model one chooses measures to describe the object of study.
- understand the notion of a modelling choice
- be able to criticise a descriptive model and the modelling choices
- video=File:Introduction_to_descriptive_modelling_of_the_World_Wide_Web.webm
unit|An unrealistic, simplistic generative model
- furtherReading=# Scientific modelling
- learningGoals=
- Can be used to try to give a reason why something works.
- need to be run more than once!
- understand the notion of a modeling parameter
- will be compared to the descriptive model of our object of study.
- numThreads=3
- numThreadsOpen=2
- video=File:Introduction_to_Generative_Models_of_the_World_Wide_Web.webm
unit|Summary, further reading, homework
- furtherReading=
- a
- b
- learningGoals =
- understand why it is important to have web models. (e.g. for information retrieval, spam detection, understanding epidemic flow of information on the web)
- be aware that models are not the reality.
- even if a generative model yields the same statistics and distributions as a predictive model, it might still fail badly
- video=File:Under_construction_icon-blue.svg
lesson|Simple statistical descriptive Models for the Web
- learningGoals=
- Formulating a research hypothesis and testing it by means of simple descriptive statistics
- Reading diagrams
unit|Counting Words And Documents
- furtherReading=
- tba
- learningGoals=
- Understand why we selected simple English Wikipedia as a toy example for modeling the web
- Understand that a task already as simple as counting words includes modeling choices
- Be familiar with the term “unique word token”
- Know some basic tools to count words and documents
- video=File:Counting-Words-And-Documents.webm
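The point that counting words already involves modelling choices can be illustrated with a minimal sketch. The tokenisation rule (lowercasing, splitting on anything that is not a letter, digit or apostrophe) and the function names are assumptions made for this example, not the method used in the video.

```python
import re
from collections import Counter

def tokenize(text):
    # Modelling choice (assumed here): lowercase everything and treat
    # runs of letters, digits and apostrophes as one word token.
    return re.findall(r"[a-z0-9']+", text.lower())

def count_words(documents):
    # Returns (number of documents, total word tokens, unique word tokens).
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return len(documents), sum(counts.values()), len(counts)

docs = ["The Web is big.", "How big is the Web?"]
print(count_words(docs))  # (2, 9, 5)
```

A different choice, such as keeping upper and lower case apart, would change the number of unique word tokens for the same documents.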
unit|Typical length of a document
- furtherReading=# tba
- learningGoals=
- Be familiar with some basic statistical objects like Median, Mean, and Histograms
- Should be able to relate a histogram to its cumulative distribution function
- numThreads=3
- numThreadsOpen=2
- video=File:Typical-length-of-a-document-Histograms.webm
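The relation between a histogram and its cumulative distribution function can be sketched as follows. The document lengths are made-up numbers and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean, median

def histogram(lengths):
    # Maps each document length to the number of documents of that length,
    # ordered by increasing length.
    return dict(sorted(Counter(lengths).items()))

def cumulative_distribution(hist):
    # Running sum over the (length-ordered) histogram, normalised so that
    # the last value is 1.0.
    total = sum(hist.values())
    cdf, running = {}, 0
    for length, count in hist.items():
        running += count
        cdf[length] = running / total
    return cdf

lengths = [3, 5, 5, 8, 100]  # hypothetical document lengths in words
print(mean(lengths), median(lengths))
print(cumulative_distribution(histogram(lengths)))
```

The single long document pulls the mean far above the median, which is one reason to look at the whole histogram rather than at one number.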
unit|How to formulate a research hypothesis
- furtherReading=# tba
- learningGoals=
- Understand the ongoing, cyclic process of research
- Know what falsifiable means and why every research hypothesis needs to be falsifiable
- Be able to formulate your own research hypothesis
- numThreads=1
- numThreadsOpen=1
- video=File:How-to-formulate-a-research-hypothesis.webm
unit|Number of words needed to understand most of Wikipedia
- furtherReading=
- learningGoals=
- Understand what a log-log plot is
- Improve your skills in reading and interpreting diagrams
- Know about the word rank / frequency plot
- Should be able to transfer a histogram or curve into a cumulative distribution function
- numThreads=2
- numThreadsOpen=1
- video=File:Number-of-words-needed-to-understand-most-of-Wikipedia.webm
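A small sketch of the word rank / frequency idea: sort the word frequencies in descending order and accumulate them until a chosen share of all tokens is covered. The function names and the toy token list are assumptions for illustration.

```python
from collections import Counter

def rank_frequency(tokens):
    # Frequencies sorted in descending order; index 0 corresponds to rank 1.
    return sorted(Counter(tokens).values(), reverse=True)

def words_needed(tokens, coverage):
    # Smallest number of top-ranked words whose occurrences make up at
    # least the fraction `coverage` of all tokens.
    freqs = rank_frequency(tokens)
    total = sum(freqs)
    covered = 0
    for rank, freq in enumerate(freqs, start=1):
        covered += freq
        if covered >= coverage * total:
            return rank
    return len(freqs)

tokens = "a a a a b b c d".split()
print(rank_frequency(tokens))      # [4, 2, 1, 1]
print(words_needed(tokens, 0.75))  # 2
```

Plotting `rank_frequency` with both axes logarithmic gives the log-log rank / frequency plot; the accumulated sums correspond to the cumulative distribution.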
unit|Linguists' way of checking simplicity of text
- furtherReading=
- tba
- learningGoals=
- Get a feeling for interdisciplinary research
- Know the Automated Readability Index
- Have a strong sense of support for our research hypothesis
- Be able to critically discuss the limits of our models
- video=File:Linguists-way-of-checking-simplicity-of-text.webm
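The Automated Readability Index is computed from characters per word and words per sentence: ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The sketch below applies this formula; how sentences, words and characters are detected is an assumption of this example and is itself a modelling choice.

```python
import re

def automated_readability_index(text):
    # Assumed conventions: sentences end at '.', '!' or '?'; words are
    # whitespace-separated; characters are letters and digits only.
    # The text must contain at least one word and one sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    characters = sum(ch.isalnum() for ch in text)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

simple = "The cat sat. The dog ran."
harder = "Interdisciplinary research frequently necessitates sophisticated terminology."
print(automated_readability_index(simple))
print(automated_readability_index(harder))
```

Longer words and longer sentences both raise the index, so the second text scores higher than the first.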
lesson|Advanced statistical descriptive models for the Web
- learningGoals=
- fitting a curve
- work with logarithmic plots
- Zipf's law
- power law
unit|The Zipf law for text
- furtherReading=
- tba
- learningGoals=
- Be able to name some fundamental properties about how frequencies of words in texts are distributed
- Be a little bit more cautious about visual impressions when looking at log-log plots
- Know both formulations of Zipf’s law
- video=File:Questioning-the-Zipf-law.webm
unit|Visually straight lines on log log plots
- furtherReading=
- tba
- learningGoals=
- Be able to do a coordinate transformation to change the scales of your plots
- Understand in which scenario power functions appear as straight lines
- Know in which scenarios exponential functions appear as straight lines
- Be even more cautious about your visual impressions
- video=File:Visually-straight-lines-on-log-log-plots.webm
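The two straight-line scenarios follow from taking logarithms. The symbols $c$, $a$ and $b$ are chosen for this sketch and do not come from the video.

```latex
% Power function: a straight line when BOTH axes are logarithmic
% (slope a, intercept log c).
y = c\,x^{a} \quad\Longrightarrow\quad \log y = \log c + a \log x

% Exponential function: a straight line when ONLY the y-axis is
% logarithmic (slope log b, intercept log c).
y = c\,b^{x} \quad\Longrightarrow\quad \log y = \log c + x \log b
```

In both cases the coordinate transformation turns the curve into a linear function of the transformed variables, which is why a visually straight line alone says little without knowing the scales of the axes.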
unit|Fitting a curve on a log log plot
- furtherReading=# tba
- learningGoals=
- Know the axioms for a distance measure and how they relate to norms.
- Know at least two distance measures on function spaces.
- Understand why changing to the CDF makes sense when looking at distance between functions.
- Understand the principle of the Kolmogorov-Smirnov test for fitting curves
- numThreads=4
- numThreadsOpen=1
- video=File:Fitting-a-curve-on-a-log-log-plot.webm
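The quantity behind the Kolmogorov-Smirnov test is the largest vertical gap between two cumulative distribution functions. The sketch below computes this gap for two samples; using two empirical CDFs (rather than one sample and a fitted curve) and the function names are assumptions of this example.

```python
import bisect

def empirical_cdf(sample):
    # Returns a function F with F(x) = fraction of sample values <= x.
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n

def ks_distance(sample_a, sample_b):
    # Largest absolute gap between the two empirical CDFs. Both are step
    # functions that only change at observed values, so it is enough to
    # evaluate the gap at those values.
    cdf_a, cdf_b = empirical_cdf(sample_a), empirical_cdf(sample_b)
    points = set(sample_a) | set(sample_b)
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)

print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

The result is a supremum-type distance on functions and always lies between 0 (identical CDFs) and 1 (samples that do not overlap at all).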
unit|Zipf law, power law or Pareto law
- furtherReading=
- tba
- learningGoals=
- Know how to transform a rank-frequency diagram into a power-law plot.
- Understand how power-law and Pareto plots relate to each other.
- Be able to explain why a Pareto plot is just an inverted rank-frequency diagram
- Be able to transform the Zipf coefficient into the power-law and Pareto coefficients and vice versa.
- Understand that building the CDF is basically like building the integral.
- video=File:Zipf-law-powerlaw-or-pareto-law.webm
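The relation between the three coefficients can be sketched as a short derivation. The symbols $s$ (Zipf), $\beta$ (Pareto) and $\alpha$ (power law) are naming assumptions of this sketch, and proportionality constants are ignored.

```latex
% Zipf: the frequency x of the word at rank r.
x(r) \propto r^{-s}

% Pareto: the rank of a word with frequency x is the number of words
% with frequency at least x, so inverting the Zipf relation gives
P(X \ge x) \propto r(x) \propto x^{-1/s}
\quad\Longrightarrow\quad \beta = \frac{1}{s}

% Power law: the density is the negative derivative of the Pareto form.
p(x) = -\frac{d}{dx} P(X \ge x) \propto x^{-(\beta + 1)}
\quad\Longrightarrow\quad \alpha = 1 + \beta = 1 + \frac{1}{s}
```

For example, a Zipf coefficient of $s = 1$ corresponds to $\beta = 1$ and $\alpha = 2$. Going from the power-law density back to the Pareto form is an integration, which matches the goal that building the CDF is basically like building the integral.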
lesson|Modelling Similarity of Text
- learningGoals=
unit|Similarity Measures and their Applications
- furtherReading=# tba
- learningGoals=
- Know the properties of a similarity measure
- Be able to relate similarity and distance measures
- Know of two applications for modelling similarity
- numThreads=1
- numThreadsOpen=1
- video=File:Similarity-Measures-and-their-Applications.webm
unit|Jaccard Similarity for Sets
- furtherReading=# tba
- learningGoals=
- Understand how text documents can be modeled as sets
- Know the Jaccard coefficient as a similarity measure on sets
- Know a trick for remembering the formula
- Be aware of the possible outcomes of the Jaccard index
- As always be able to criticize your model
- video=File:Jaccard-Similarity-for-Sets.webm
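A minimal sketch of the Jaccard coefficient, modelling each document as the set of its lowercased, whitespace-separated words. That set model and the convention for two empty documents are assumptions of this example.

```python
def jaccard(doc_a, doc_b):
    # Size of the intersection divided by the size of the union.
    set_a = set(doc_a.lower().split())
    set_b = set(doc_b.lower().split())
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty documents count as identical
    return len(set_a & set_b) / len(union)

print(jaccard("the web is big", "the web is a graph"))  # 0.5
```

The outcome always lies between 0 (no shared words) and 1 (identical word sets). Word order and repetition are lost in the set model, which is one point where the model can be criticised.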
unit|Cosine Similarity For Vectorspaces
- furtherReading=# tba
- learningGoals=
- Be familiar with the vector space model for text documents
- Be aware of term frequency and (inverse) document frequency
- Have reviewed the definitions of basis and dimension
- Realize that the angle between two vectors can be seen as a similarity measure
- numThreads=2
- numThreadsOpen=2
- video=File:Cosine-Similarity-For-Vectorspaces.webm
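The angle-based similarity can be sketched with plain term-frequency vectors, where every distinct word is one dimension. Leaving out the inverse document frequency weighting is a simplification made for this example.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    # Term-frequency vectors; missing words have frequency 0.
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)

print(cosine_similarity("a a b", "a b b"))
```

Because term frequencies are never negative, the value lies between 0 (no shared words, a right angle) and 1 (vectors pointing in the same direction).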
unit|Probabilistic Similarity Measures Kullback Leibler Divergence
- furtherReading=# tba
- learningGoals=
- Be aware of a unigram Language Model
- Know Laplacian (aka +1) smoothing
- Know the query likelihood model
- The Kullback Leibler Divergence
- See how a similarity measure can be derived from Kullback Leibler Divergence
- video=File:Probabilistic-Similarity-Measures-Kullback-Leibler-Divergence.webm
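A sketch of a unigram language model with Laplace (+1) smoothing and the Kullback Leibler divergence between two such models. The function names are hypothetical, and the vocabulary is assumed to contain every word of both documents.

```python
import math
from collections import Counter

def unigram_model(tokens, vocabulary):
    # Laplace (+1) smoothing: every vocabulary word gets one extra count,
    # so no word has probability zero.
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}

def kl_divergence(p, q):
    # D(p || q) = sum over words w of p(w) * log(p(w) / q(w)).
    # Not symmetric: D(p || q) and D(q || p) generally differ.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

doc_a, doc_b = "a a b".split(), "a b b c".split()
vocabulary = set(doc_a) | set(doc_b)
p = unigram_model(doc_a, vocabulary)
q = unigram_model(doc_b, vocabulary)
print(kl_divergence(p, q), kl_divergence(q, p))
```

Without smoothing, the word `c` would have probability zero in the first model and the logarithm in `kl_divergence(q, p)` would be undefined.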
unit|Comparing Results of Similarity Measures
- furtherReading=# tba
- learningGoals=
- Understand that different modeling choices can produce very different results.
- Have a feeling for how you could statistically compare the differences of the models.
- Know how you could extract keywords from documents with the tf-idf approach.
- Try to argue which model you like best in a certain scenario.
- video=File:Comparing-Results-of-Similarity-Merasures.webm
lesson|Generative Models for the Web
- learningGoals=
unit|Introduction to generative modelling
- furtherReading=# tba
- learningGoals=
- Understand the principal methodology for building generative models
- Remember why people are interested in generative models
- Know why descriptive models are needed when evaluating a generative model
- Be aware of one way to create a model for text generation
- numThreads=1
- numThreadsOpen=1
- video=File:Introduction-to-generative-modelling.webm
unit|Sampling from a probability distribution
- furtherReading=# tba
- learningGoals=
- Understand how to sample values from an arbitrary probability distribution
- Have seen yet another application of the cumulative distribution function
- Understand that sampling from a distribution is just a coordinate transformation of the uniform distribution
- numThreads=1
- numThreadsOpen=1
- video=File:Sampling-from-a-probability-distribution.webm
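Sampling from an arbitrary discrete distribution can be sketched with the cumulative distribution function: draw a uniform number in [0, 1) and look up the first outcome whose cumulative probability exceeds it. The helper name `make_sampler` and the example distribution are assumptions for illustration.

```python
import bisect
import random
from itertools import accumulate

def make_sampler(distribution, rng=random):
    # distribution: dict mapping outcome -> probability (summing to 1).
    outcomes = list(distribution)
    cdf = list(accumulate(distribution[o] for o in outcomes))

    def sample():
        u = rng.random()  # uniform in [0, 1)
        # First index whose cumulative probability exceeds u; min() guards
        # against rounding when the CDF ends slightly below 1.
        index = min(bisect.bisect_right(cdf, u), len(outcomes) - 1)
        return outcomes[index]

    return sample

sampler = make_sampler({"the": 0.5, "web": 0.3, "graph": 0.2})
print([sampler() for _ in range(5)])
```

Each outcome owns a slice of [0, 1) whose width equals its probability, so the uniform number is mapped to outcomes with the desired frequencies.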
unit|Evaluating a generative model
- furtherReading=# tba
- learningGoals=
- See that it makes sense to compare statistics
- Understand that comparing statistics is not a well defined task
- Be aware of the fact that very different models could lead to the same statistics
- numThreads=1
- numThreadsOpen=0
- video=File:Evaluating-a-generative-model.webm
unit|Pitfalls when increasing the number of model parameters
- furtherReading=# tba
- learningGoals=
- See that one can always increase the model parameters
- Know that increasing model parameters often yields a more accurate model
- Be aware of the bigram and mixed models as examples for our generative processes
- video=File:Pittfalls-when-increasing-the-number-of-model-parameters.webm
lesson|Modeling the Web as a graph
- learningGoals=
unit|Reviewing terms from graph theory
- furtherReading=#
- learningGoals=
- Be familiar with a set theoretic way of denoting a graph
- Know at least 4 different types of graphs
- Have practiced your abilities in reading and writing mathematical formulas
- video=File:Reviewing terms from graph theory.webm
unit|The standard web graph model
- furtherReading=# tba
- learningGoals=
- Be able to model web pages as a graph
- Know that the authorship graph is bipartite
- Know what kind of graph the graph of web pages is
- (as always) be aware of the fact that modeling is done by making choices
- video=File:The standard web graph model.webm
unit|Descriptive statistics of the web graph
- furtherReading=# tba
- learningGoals=
- Know terms like Size and (unique) volume
- Be able to count the in and out degree of web pages
- Have an idea what kind of law (in & out) degree distributions follow
- Know that degree is not distributed in a fair way
- Know that the Gini coefficient can be used to measure fairness
- video=File:Descriptive statistics of the web graph.webm
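Counting in-degrees and measuring how unevenly they are distributed can be sketched as follows. The Gini coefficient is computed with the rank-weighted formula for sorted values; the function names and the tiny example graph are assumptions.

```python
from collections import Counter

def in_degrees(nodes, edges):
    # edges: list of (source, target) pairs; returns one in-degree per node.
    counts = Counter(target for _, target in edges)
    return [counts[node] for node in nodes]

def gini(values):
    # 0 means all values are equal; values close to 1 mean that a few
    # nodes hold almost everything.
    ordered = sorted(values)
    n, total = len(ordered), sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

nodes = ["a", "b", "c", "d"]
edges = [("a", "d"), ("b", "d"), ("c", "d"), ("d", "a")]
degrees = in_degrees(nodes, edges)
print(degrees)        # [1, 0, 0, 3]
print(gini(degrees))
```

Swapping the roles of source and target in `in_degrees` gives the out-degrees in the same way.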
unit|Topology of the web graph
- furtherReading=# tba
- learningGoals=
- Understand the notion of a path in a (directed) graph
- Know that shortest paths between nodes need not be unique
- Understand the notion of a strongly connected component
- Know about the diameter of a graph
- Be aware of the bow tie structure of the Web
- numThreads=1
- numThreadsOpen=1
- video=File:Topology of the web graph.webm
unit|Modelling graphs with linear algebra
- furtherReading=# tba
- learningGoals=
- Be able to read and build an adjacency matrix of a graph
- Know some basic matrix vector multiplications to generate some statistics out of the adjacency matrix
- Understand what is encoded in the components of the k-th power of the Adjacency matrix of a graph
- numThreads=1
- numThreadsOpen=1
- video=File:Modelling-graphs-with-linear-algebra.pdf.webm
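A sketch of the adjacency matrix of a small directed graph and of what its powers encode, in plain Python without a linear algebra library. The example graph and the helper names are assumptions for illustration.

```python
def adjacency_matrix(nodes, edges):
    # Entry [i][j] is 1 if there is a directed edge from nodes[i] to nodes[j].
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix

def multiply(a, b):
    # Product of two square matrices of the same size.
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def power(matrix, k):
    # k-th power for k >= 1; entry [i][j] counts the directed walks of
    # length k from node i to node j.
    result = matrix
    for _ in range(k - 1):
        result = multiply(result, matrix)
    return result

nodes = ["a", "b", "c"]
A = adjacency_matrix(nodes, [("a", "b"), ("b", "c"), ("a", "c")])
out_degrees = [sum(row) for row in A]        # A times the all-ones vector
in_degrees = [sum(col) for col in zip(*A)]   # all-ones vector times A
print(out_degrees, in_degrees)  # [2, 1, 0] [0, 1, 2]
print(power(A, 2))              # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

The single 1 in the squared matrix corresponds to the one walk of length 2 in this graph, from `a` via `b` to `c`.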