Web Science/Part2: Emerging Web Properties/MoocIndex



lesson|How big is the World Wide Web

  • learningGoals=
  1. Scientific: Some questions are hard (impractical or even impossible) to answer.
  2. understand the limitations and benefits of scientific models.
  3. be able to name 3 different points of view (or model choices) we could have when we want to study the World Wide Web (Software system, Text Corpus, Graph).
  4. be aware of 3 different Modelling techniques (Descriptive, Generative and Predictive Models).
  5. The best way to understand the web is to be able to describe it properly via various models.

unit|Problems with the question about the size of the Web

  • furtherReading=
  1. http://www.worldwidewebsize.com/
  • learningGoals=
  1. The question will remain unanswered during the lesson and the entire course.
  2. question of size is underspecified because a measure is needed.
  3. measure depends heavily on the choice of how we model the web.
  4. We have not yet defined what we mean when we say World Wide Web.
  • video=File:Problems_with_the_Question_of_%22How_big_is_the_World_Wide_Web%22.webm

unit|3 ways to study the Web

  1. The web as a software system.
  2. The web as a collection of text documents.
  3. The web as a graph of interlinked documents.
  4. Even when choosing 1 point of view we have fundamentally different ways of modelling.
  • numThreads=3
  • numThreadsOpen=3
  • video=File:3_ways_to_look_at_the_World_Wide_Web.webm

unit|A simplistic descriptive model

  1. understand that only the model is described.
  2. description of the model can be used for interpretation.
  3. within the descriptive model one chooses measures to describe the object of study.
  4. understand the notion of a modelling choice
  5. be able to criticise a descriptive model and the modelling choices
  • video=File:Introduction_to_descriptive_modelling_of_the_World_Wide_Web.webm

unit|An unrealistic, simplistic generative model

  1. Can be used to try to give a reason why something works.
  2. Needs to be run more than once!
  3. understand the notion of a modeling parameter
  4. will be compared to the descriptive model of our object of study.
  • numThreads=3
  • numThreadsOpen=2
  • video=File:Introduction_to_Generative_Models_of_the_World_Wide_Web.webm

unit|Summary, further reading, homework

  • furtherReading=
  1. a
  2. b
  • learningGoals =
  1. understand why it is important to have web models. (e.g. for information retrieval, spam detection, understanding epidemic flow of information on the web)
  2. be aware that models are not the reality.
  3. even if a generative model yields the same statistics and distributions as a predictive model, it might fail badly
  • video=File:Under_construction_icon-blue.svg

lesson|Simple statistical descriptive Models for the Web

  • learningGoals=
  1. Formulate a research hypothesis and test it by means of simple descriptive statistics
  2. Reading diagrams

unit|Counting Words And Documents

  • furtherReading=
  1. tba
  • learningGoals=
  1. Understand why we selected simple English Wikipedia as a toy example for modeling the web
  2. Understand that a task already as simple as counting words includes modeling choices
  3. Be familiar with the term “unique word token”
  4. Know some basic tools to count words and documents
  • video=File:Counting-Words-And-Documents.webm
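
A minimal Python sketch of the counting task, assuming a deliberately naive tokenizer (lowercased runs of the letters a to z). The names `tokenize` and `count_words` and the two sample documents are illustrative and not taken from the course material. Changing the tokenizer changes the counts, which is the kind of modelling choice the learning goals refer to.

```python
import re
from collections import Counter


def tokenize(text):
    # Modelling choice (assumption): lowercase the text and treat every
    # run of the letters a-z as one word token.
    return re.findall(r"[a-z]+", text.lower())


def count_words(documents):
    # Count how often each word token occurs across all documents.
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


docs = ["The Web is big.", "The web is a graph."]  # toy data
counts = count_words(docs)
print("documents:", len(docs))
print("word tokens:", sum(counts.values()))
print("unique word tokens:", len(counts))
```

With a case-sensitive tokenizer, "Web" and "web" would count as two unique word tokens instead of one.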

unit|Typical length of a document

  • furtherReading=# tba
  • learningGoals=
  1. Be familiar with some basic statistical objects like Median, Mean, and Histograms
  2. Should be able to relate a histogram to its cumulative distribution function
  • numThreads=3
  • numThreadsOpen=2
  • video=File:Typical-length-of-a-document-Histograms.webm
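
As a rough illustration of these statistical objects, the following Python sketch computes mean, median, a histogram and the cumulative distribution function built from that histogram. The document lengths are made-up toy values and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def histogram(lengths):
    # Map each observed length to the number of documents with that length.
    return dict(sorted(Counter(lengths).items()))


def cumulative_distribution(hist):
    # Running sum of the histogram, normalised by the number of documents:
    # cdf[x] is the fraction of documents with length <= x.
    total = sum(hist.values())
    running = 0
    cdf = {}
    for value, count in hist.items():
        running += count
        cdf[value] = running / total
    return cdf


lengths = [1, 2, 2, 3, 12]  # toy document lengths in words
hist = histogram(lengths)
print("mean:", mean(lengths), "median:", median(lengths))
print("histogram:", hist)
print("CDF:", cumulative_distribution(hist))
```

The single long document pulls the mean well above the median, which is one reason to look at the whole histogram rather than at one number.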

unit|How to formulate a research hypothesis

  • furtherReading=# tba
  • learningGoals=
  1. Understand the ongoing, cyclic process of research
  2. Know what falsifiable means and why every research hypothesis needs to be falsifiable
  3. Be able to formulate your own research hypothesis
  • numThreads=1
  • numThreadsOpen=1
  • video=File:How-to-formulate-a-research-hypothesis.webm

unit|Number of words needed to understand most of Wikipedia

  • furtherReading=
  1. http://www.courses.vcu.edu/PHY-rhg/astron/html/mod/006/index.html
  2. Falsifiability
  • learningGoals=
  1. Understand what a log-log plot is
  2. Improve your skills in reading and interpreting diagrams
  3. Know about the word rank / frequency plot
  4. Should be able to transfer a histogram or curve into a cumulative distribution function
  • numThreads=2
  • numThreadsOpen=1
  • video=File:Number-of-words-needed-to-understand-most-of-Wikipedia.webm

unit|Linguists' way of checking simplicity of text

  • furtherReading=
  1. tba
  • learningGoals=
  1. Get a feeling for interdisciplinary research
  2. Know the Automated Readability Index
  3. Have a strong sense of support for our research hypothesis
  4. Be able to critically discuss the limits of our models
  • video=File:Linguists-way-of-checking-simplicity-of-text.webm
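
The Automated Readability Index is commonly given as 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The sketch below applies that formula in Python. How words and sentences are detected is an assumption made here for illustration (alphanumeric runs, and splits on ".", "!" and "?"), so the scores will differ from tools that segment text differently.

```python
import re


def automated_readability_index(text):
    # Assumption: a word is a run of letters or digits, and a sentence
    # ends at '.', '!' or '?'. Real tools use more careful segmentation.
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    characters = sum(len(w) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


simple = "The cat sat. The dog ran."
harder = "Interdisciplinary collaboration necessitates methodological pluralism."
print(automated_readability_index(simple))
print(automated_readability_index(harder))
```

Longer words and longer sentences both raise the score, so the second text scores higher than the first.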

lesson|Advanced statistical descriptive models for the Web

  • learningGoals=
  1. fitting a curve
  2. work with logarithmic plots
  3. Zipf's law
  4. power law

unit|The Zipf law for text

  • furtherReading=
  1. tba
  • learningGoals=
  1. Be able to name some fundamental properties about how frequencies of words in texts are distributed
  2. Be a little bit more cautious about visual impressions when looking at log-log plots
  3. Know both formulations of Zipf’s law
  • video=File:Questioning-the-Zipf-law.webm

unit|Visually straight lines on log log plots

  • furtherReading=
  1. tba
  • learningGoals=
  1. Be able to do a coordinate transformation to change the scales of your plots
  2. Understand in which scenario power functions appear as straight lines
  3. Know in which scenarios exponential functions appear as straight lines
  4. Be even more cautious about your visual impressions
  • video=File:Visually-straight-lines-on-log-log-plots.webm
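
A power function y = c * x^a becomes log y = log c + a * log x after taking logarithms of both axes, so it is a straight line with slope a on a log-log plot. An exponential function y = c * e^(b x) becomes log y = log c + b * x, so it is straight when only the y-axis is logarithmic. The Python sketch below checks both statements numerically. The constants and the helper `least_squares_slope` are chosen here for illustration.

```python
import math


def least_squares_slope(xs, ys):
    # Slope of the ordinary least-squares line through the points (x, y).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Power function y = 5 * x^-2: transform both coordinates (log-log).
xs = [1, 2, 4, 8, 16]
power = [5 * x ** -2 for x in xs]
loglog_slope = least_squares_slope([math.log(x) for x in xs],
                                   [math.log(y) for y in power])

# Exponential function y = 3 * e^(0.5 x): transform only y (semi-log).
ts = [0, 1, 2, 3, 4]
exponential = [3 * math.exp(0.5 * t) for t in ts]
semilog_slope = least_squares_slope(ts, [math.log(y) for y in exponential])

print(loglog_slope)   # close to the exponent -2
print(semilog_slope)  # close to the rate 0.5
```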

unit|Fitting a curve on a log log plot

  • furtherReading=# tba
  • learningGoals=
  1. Know the axioms for a distance measure and how they relate to norms.
  2. Know at least two distance measures on function spaces.
  3. Understand why changing to the CDF makes sense when looking at distance between functions.
  4. Understand the principle of the Kolmogorov-Smirnov test for fitting curves
  • numThreads=4
  • numThreadsOpen=1
  • video=File:Fitting-a-curve-on-a-log-log-plot.webm
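
The idea behind the Kolmogorov-Smirnov approach can be sketched as the largest absolute difference between two cumulative distribution functions. The Python example below computes only that distance for two discrete distributions over the same ordered support. It is not a full statistical test with significance levels, and the function name and numbers are illustrative assumptions.

```python
from itertools import accumulate


def ks_distance(p, q):
    # p and q are probability lists over the same ordered support.
    # Build both CDFs and return the largest absolute gap between them.
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return max(abs(a - b) for a, b in zip(cdf_p, cdf_q))


observed = [0.5, 0.3, 0.2]  # toy empirical distribution
model = [0.4, 0.4, 0.2]     # toy candidate model
print(ks_distance(observed, model))
```

Fitting a curve then amounts to choosing the model parameters that make this distance as small as possible.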

unit|Zipf law, power law or Pareto law

  • furtherReading=
  1. tba
  • learningGoals=
  1. Know how to transform a rank frequency diagram to a power-law plot.
  2. Understand how power-law and Pareto plots relate to each other.
  3. Be able to explain why a Pareto plot is just an inverted rank frequency diagram
  4. Be able to transform the Zipf coefficient to the power-law and Pareto coefficient and vice versa.
  5. Understand that building the CDF is basically like building the integral.
  • video=File:Zipf-law-powerlaw-or-pareto-law.webm

lesson|Modelling Similarity of Text

  • learningGoals=

unit|Similarity Measures and their Applications

  • furtherReading=# tba
  • learningGoals=
  1. Know the properties of a similarity measure
  2. Be able to relate similarity and distance measures
  3. Know of two applications for modelling similarity
  • numThreads=1
  • numThreadsOpen=1
  • video=File:Similarity-Measures-and-their-Applications.webm

unit|Jaccard Similarity for Sets

  • furtherReading=# tba
  • learningGoals=
  1. Understand how text documents can be modeled as sets
  2. Know the Jaccard coefficient as a similarity measure on sets
  3. Know a trick for remembering the formula
  4. Be aware of the possible outcomes of the Jaccard index
  5. As always be able to criticize your model
  • video=File:Jaccard-Similarity-for-Sets.webm
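
A small Python sketch of the Jaccard coefficient, |A ∩ B| / |A ∪ B|, on documents modelled as sets of lowercased, whitespace-separated words. The tokenization and the convention of returning 1.0 for two empty documents are assumptions made here, not part of the course material.

```python
def jaccard(doc_a, doc_b):
    # Model each document as the set of its lowercased words.
    set_a = set(doc_a.lower().split())
    set_b = set(doc_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty documents are identical
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard("the web is a graph", "the web is big"))  # 3 shared of 6 words
print(jaccard("web graph", "web graph"))                # identical sets
print(jaccard("web", "graph"))                          # disjoint sets
```

The result always lies between 0 (no shared words) and 1 (identical word sets). Word order and word frequency are ignored, which is one point on which this model can be criticized.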

unit|Cosine Similarity For Vectorspaces

  • furtherReading=# tba
  • learningGoals=
  1. Be familiar with the vector space model for text documents
  2. Be aware of term frequency and (inverse) document frequency
  3. Have reviewed the definitions of base and dimension
  4. Realize that the angle between two vectors can be seen as a similarity measure
  • numThreads=2
  • numThreadsOpen=2
  • video=File:Cosine-Similarity-For-Vectorspaces.webm
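
A minimal Python sketch of cosine similarity in the vector space model, assuming raw term frequencies as vector components and whitespace tokenization. Weighting by inverse document frequency is left out to keep the example short, so this is a simplification rather than the full tf-idf scheme.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    # Each document becomes a term-frequency vector; every distinct
    # word is one dimension of the vector space.
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)


print(cosine_similarity("web web graph", "web graph"))
print(cosine_similarity("web", "graph"))
```

The value is the cosine of the angle between the two vectors: 1 for vectors pointing in the same direction, 0 for documents with no words in common.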

unit|Probabilistic Similarity Measures Kullback Leibler Divergence

  • furtherReading=# tba
  • learningGoals=
  1. Be aware of a unigram Language Model
  2. Know Laplacian (aka +1) smoothing
  3. Know the query likelihood model
  4. The Kullback Leibler Divergence
  5. See how a similarity measure can be derived from Kullback Leibler Divergence
  • video=File:Probabilistic-Similarity-Measures-Kullback-Leibler-Divergence.webm
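
The following Python sketch combines a unigram language model with Laplacian (+1) smoothing and the Kullback Leibler divergence, sum over w of p(w) * log(p(w) / q(w)). The vocabulary, the token lists and the function names are toy assumptions. Smoothing matters here because it keeps every q(w) above zero, so the logarithm is always defined.

```python
import math
from collections import Counter


def unigram_model(tokens, vocabulary):
    # Laplacian smoothing: add 1 to every count, so unseen words
    # still receive a small non-zero probability.
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}


def kl_divergence(p, q):
    # D(p || q) over a shared vocabulary, using the natural logarithm.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


vocabulary = {"web", "graph"}
p = unigram_model(["web", "web"], vocabulary)
q = unigram_model(["graph", "graph"], vocabulary)
print(p, q)
print(kl_divergence(p, q))
print(kl_divergence(p, p))  # a distribution has divergence 0 from itself
```

The divergence is not symmetric in general, which is why an extra step is needed to turn it into a similarity measure.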

unit|Comparing Results of Similarity Measures

  • furtherReading=# tba
  • learningGoals=
  1. Understand that different modeling choices can produce very different results.
  2. Have a feeling how you could statistically compare the differences of the models.
  3. Know how you could extract keywords from documents with the tf-idf approach.
  4. Try to argue which model you like best in a certain scenario.
  • video=File:Comparing-Results-of-Similarity-Merasures.webm

lesson|Generative Models for the Web

  • learningGoals=

unit|Introduction to generative modelling

  • furtherReading=# tba
  • learningGoals=
  1. Understand the general methodology for building generative models
  2. Remember why people are interested in generative models
  3. Know why descriptive models are needed when evaluating a generative model
  4. Be aware of one way to create a model for text generation
  • numThreads=1
  • numThreadsOpen=1
  • video=File:Introduction-to-generative-modelling.webm

unit|Sampling from a probability distribution

  • furtherReading=# tba
  • learningGoals=
  1. Understand how to sample values from an arbitrary probability distribution
  2. Have seen yet another application of the cumulative distribution function
  3. Understand that sampling from a distribution is just a coordinate transformation of the uniform distribution
  • numThreads=1
  • numThreadsOpen=1
  • video=File:Sampling-from-a-probability-distribution.webm
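
One common way to do this is inverse transform sampling: draw a uniform random number and look up where it falls in the cumulative distribution function. The Python sketch below does this for a discrete distribution. The word probabilities and the name `make_sampler` are illustrative assumptions.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def make_sampler(distribution, rng):
    # distribution maps each value to its probability.
    values = list(distribution)
    cdf = list(accumulate(distribution[v] for v in values))

    def sample():
        # Uniform number in [0, total probability); the first CDF entry
        # that exceeds it selects the sampled value.
        u = rng.random() * cdf[-1]
        index = min(bisect_right(cdf, u), len(values) - 1)
        return values[index]

    return sample


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
sample = make_sampler({"the": 0.5, "web": 0.3, "graph": 0.2}, rng)
draws = [sample() for _ in range(10000)]
for word in ("the", "web", "graph"):
    print(word, draws.count(word) / len(draws))
```

The relative frequencies of the draws should approach the given probabilities as the number of draws grows.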

unit|Evaluating a generative model

  • furtherReading=# tba
  • learningGoals=
  1. See that it makes sense to compare statistics
  2. Understand that comparing statistics is not a well defined task
  3. Be aware of the fact that very different models could lead to the same statistics
  • numThreads=1
  • numThreadsOpen=0
  • video=File:Evaluating-a-generative-model.webm

unit|Pitfalls when increasing the number of model parameters

  • furtherReading=# tba
  • learningGoals=
  1. See that one can always increase the model parameters
  2. Know that increasing model parameters often yields a more accurate model
  3. Be aware of the bigram and mixed models as examples for our generative processes
  • video=File:Pittfalls-when-increasing-the-number-of-model-parameters.webm

lesson|Modeling the Web as a graph

  • learningGoals=

unit|Reviewing terms from graph theory

  • furtherReading=#
  • learningGoals=
  1. Be familiar with a set theoretic way of denoting a graph
  2. Know at least 4 different types of graphs
  3. Have practiced your abilities in reading and writing mathematical formulas
  • video=File:Reviewing terms from graph theory.webm

unit|The standard web graph model

  • furtherReading=# tba
  • learningGoals=
  1. Be able to model web pages as a graph
  2. Know that the authorship graph is bipartite
  3. Know what kind of graph the graph of web pages is
  4. (as always) be aware of the fact that modeling is done by making choices
  • video=File:The standard web graph model.webm

unit|Descriptive statistics of the web graph

  • furtherReading=# tba
  • learningGoals=
  1. Know terms like Size and (unique) volume
  2. Be able to count the in and out degree of web pages
  3. Have an idea what kind of law (in & out) degree distributions follow
  4. Know that degree is not distributed in a fair way
  5. Know that the Gini coefficient can be used to measure fairness
  • video=File:Descriptive statistics of the web graph.webm
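
As an illustration of measuring fairness, the Python sketch below computes the Gini coefficient of a list of degrees using one standard formula on the sorted values, G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n with ranks i starting at 1. The degree lists are made-up toy data, and returning 0.0 for an empty or all-zero list is a convention assumed here.

```python
def gini(values):
    # Sort ascending, then weight each value by its rank (1-based).
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # assumed convention: nothing to distribute
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))  # every page has the same in-degree
print(gini([0, 0, 0, 4]))  # one page receives all links
```

A value of 0 means all pages have the same degree. The more the links concentrate on a few pages, the closer the value gets to 1.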

unit|Topology of the web graph

  • furtherReading=# tba
  • learningGoals=
  1. Understand the notion of a path in a (directed) graph
  2. Know that shortest paths between nodes need not be unique
  3. Understand the notion of a strongly connected component
  4. Know about the diameter of a graph
  5. Be aware of the bow tie structure of the Web

  • numThreads=1
  • numThreadsOpen=1
  • video=File:Topology of the web graph.webm

unit|Modelling graphs with linear algebra

  • furtherReading=# tba
  • learningGoals=
  1. Be able to read and build an adjacency matrix of a graph
  2. Know some basic matrix vector multiplications to generate some statistics out of the adjacency matrix
  3. Understand what is encoded in the components of the k-th power of the Adjacency matrix of a graph
  • numThreads=1
  • numThreadsOpen=1
  • video=File:Modelling-graphs-with-linear-algebra.pdf.webm
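
A short Python sketch of these ideas on a toy directed graph with three pages, written with plain lists so that no external library is assumed. Row sums of the adjacency matrix give out-degrees, column sums give in-degrees, and entry (i, j) of the k-th power counts the walks of length k from node i to node j. The edge list and helper names are illustrative.

```python
def adjacency_matrix(n, edges):
    # A[i][j] = 1 if there is a link from node i to node j, else 0.
    matrix = [[0] * n for _ in range(n)]
    for source, target in edges:
        matrix[source][target] = 1
    return matrix


def matmul(a, b):
    # Plain matrix multiplication for square lists of lists.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(a, k):
    # Start from the identity matrix and multiply by a exactly k times.
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, a)
    return result


A = adjacency_matrix(3, [(0, 1), (1, 2), (0, 2), (2, 0)])
out_degree = [sum(row) for row in A]
in_degree = [sum(A[i][j] for i in range(3)) for j in range(3)]
print("out-degree:", out_degree)
print("in-degree:", in_degree)
print("walks of length 2:", matrix_power(A, 2))
```

In this toy graph the only walk of length 2 from node 0 to node 2 goes through node 1, so the corresponding entry of the squared matrix is 1.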