# Web Science/Part2: Emerging Web Properties/How big is the World Wide Web

## Lessons

- The question will remain unanswered during the lesson and the entire course.
- The question of size is underspecified because a measure is needed.
- The measure depends heavily on how we choose to model the web.
- We have not yet defined what we mean when we say World Wide Web.

- The web as a software system.
- The web as a collection of text documents.
- The web as a graph of interlinked documents.
- Even after choosing one point of view, there are fundamentally different ways of modelling.

- Understand that only the model is described.
- The description of the model can be used for interpretation.
- Within the descriptive model, one chooses measures to describe the object of study.
- Understand the notion of a modelling choice.
- Be able to criticise a descriptive model and its modelling choices.

- Generative models can be used to try to give a reason why something works.
- They need to be run more than once!
- Understand the notion of a modeling parameter.
- Their output will be compared to the descriptive model of our object of study.

- Understand why we selected simple English Wikipedia as a toy example for modeling the web
- Understand that a task already as simple as counting words includes modeling choices
- Be familiar with the term “unique word token”
- Know some basic tools to count words and documents
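
The modeling choices hidden in word counting can be made concrete with a small sketch. The tokenization rule (runs of letters) and the `lowercase` switch are assumptions made for illustration, not the course's prescribed definition of a word.

```python
import re
from collections import Counter


def tokenize(text, lowercase=True):
    """Split text into word tokens.

    Both the regular expression (runs of letters only) and the optional
    case folding are modeling choices; other choices give other counts.
    """
    if lowercase:
        text = text.lower()
    return re.findall(r"[^\W\d_]+", text)


sample = "The web is big. The Web is a graph."

tokens = tokenize(sample)
counts = Counter(tokens)

print(len(tokens))  # number of word tokens
print(len(counts))  # number of unique word tokens
print(len(set(tokenize(sample, lowercase=False))))  # unique tokens, case kept
```

Switching case folding off changes the number of unique word tokens for the same text, which is exactly the kind of choice one should be able to name and criticize.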

- Be familiar with some basic statistical objects like Median, Mean, and Histograms
- Be able to relate a histogram to its cumulative distribution function
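
As a minimal sketch of how these objects relate, the following computes mean, median, a histogram, and the cumulative distribution function derived from that histogram. The list `doc_lengths` is made-up illustration data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical document lengths (in words), for illustration only.
doc_lengths = [3, 5, 5, 8, 13, 5, 3, 21]


def histogram(values):
    """Map each distinct value to how often it occurs, ordered by value."""
    return dict(sorted(Counter(values).items()))


def cumulative_distribution(hist):
    """Turn a histogram into a CDF: the fraction of observations <= value."""
    total = sum(hist.values())
    cdf, running = {}, 0
    for value, count in hist.items():
        running += count
        cdf[value] = running / total
    return cdf


hist = histogram(doc_lengths)
cdf = cumulative_distribution(hist)

print(mean(doc_lengths), median(doc_lengths))
print(hist)
print(cdf)
```

The CDF is a running sum over the histogram, normalized by the total count, so its last value is always 1.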

- Understand the ongoing, cyclic process of research
- Know what falsifiable means and why every research hypothesis needs to be falsifiable
- Be able to formulate your own research hypothesis

- Understand what a log-log plot is
- Improve your skills in reading and interpreting diagrams
- Know about the word rank / frequency plot
- Be able to transform a histogram or curve into a cumulative distribution function
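
The data behind a word rank / frequency plot can be sketched as follows. The function names are illustrative, and the toy token list is made up; plotting itself is left out so the example has no dependencies.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs; rank 1 is the most frequent word."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def to_log_log(points):
    """Coordinates of the same points as they appear on a log-log plot."""
    return [(math.log10(rank), math.log10(freq)) for rank, freq in points]


tokens = "a b a c a b".split()
points = rank_frequency(tokens)

print(points)
print(to_log_log(points))
```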

- Get a feeling for interdisciplinary research
- Know the Automated Readability Index
- Have a strong sense of support for our research hypothesis
- Be able to critically discuss the limits of our models
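
The Automated Readability Index is commonly given as 4.71 · (characters / words) + 0.5 · (words / sentences) − 21.43. The sketch below applies that formula; the way it splits sentences and words is a deliberately naive assumption and is itself a modeling choice.

```python
import re


def automated_readability_index(text):
    """Compute the ARI with naive sentence and word splitting (an assumption)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(word) for word in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


print(automated_readability_index("The cat sat. The dog ran."))
```

Very short words and sentences push the index below zero, which is a reminder that the score is only meaningful relative to other texts measured the same way.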

- Be able to name some fundamental properties about how frequencies of words in texts are distributed
- Be a little bit more cautious about visual impressions when looking at log-log plots
- Know both formulations of Zipf’s law

- Be able to do a coordinate transformation to change the scales of your plots
- Understand in which scenario power functions appear as straight lines
- Know in which scenarios exponential functions appear as straight lines
- Be even more cautious about your visual impressions
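
The two straight-line cases can be checked numerically. In this sketch, with made-up constants, a power function becomes linear after taking logarithms of both axes, while an exponential function becomes linear after taking the logarithm of the y-axis only; in both cases the fitted slope recovers the exponent.

```python
import math


def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


xs = [1, 2, 4, 8, 16]
power = [3 * x ** -2 for x in xs]                   # y = 3 * x^(-2)
exponential = [3 * math.exp(-0.5 * x) for x in xs]  # y = 3 * e^(-0.5 x)

log_xs = [math.log(x) for x in xs]

# log-log coordinates: log y = log 3 - 2 * log x, a line with slope -2
power_slope = slope(log_xs, [math.log(y) for y in power])

# semi-log coordinates: log y = log 3 - 0.5 * x, a line with slope -0.5
exponential_slope = slope(xs, [math.log(y) for y in exponential])

print(power_slope, exponential_slope)
```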

- Know the axioms for a distance measure and how they relate to norms.
- Know at least two distance measures on function spaces.
- Understand why changing to the CDF makes sense when looking at distance between functions.
- Understand the principle of the Kolmogorov-Smirnov test for fitting curves
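
The distance at the heart of the Kolmogorov-Smirnov test is the largest vertical gap between two cumulative distribution functions. The sketch below computes this statistic for two samples; it covers only the distance, not the significance test, and comparing two samples (rather than a sample against a fitted curve) is a simplification chosen here.

```python
import bisect


def empirical_cdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs.

    Both CDFs are step functions that only change at observed values,
    so it is enough to compare them at those values.
    """
    cdf_a, cdf_b = empirical_cdf(sample_a), empirical_cdf(sample_b)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


print(ks_distance([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))
print(ks_distance([1, 2], [3, 4]))
```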

- Know how to transform a rank frequency diagram to a power-law plot.
- Understand how power-law and Pareto plots relate to each other.
- Be able to explain why a Pareto plot is just an inverted rank frequency diagram
- Be able to transform the Zipf coefficient to the power-law and Pareto coefficient and vice versa.
- Understand that building the CDF is basically like building the integral.
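
The relation between the three coefficients can be summarized in a short derivation. The symbol names $b$ (Zipf), $k$ (Pareto), and $a$ (power law) are labels chosen here for illustration and may differ from the notation used in the lesson.

```latex
\begin{align*}
\text{Zipf (rank-frequency):}\quad & f(r) \propto r^{-b} \\
\text{Pareto (cumulative form):}\quad & P(X \ge x) \propto x^{-k}, & k &= \frac{1}{b} \\
\text{Power law (density form):}\quad & p(x) \propto x^{-a}, & a &= 1 + k = 1 + \frac{1}{b}
\end{align*}
```

If the word of rank $r$ has frequency $x$, then $r$ words have frequency at least $x$. Solving $x \propto r^{-b}$ for $r$ gives $r \propto x^{-1/b}$, which is the Pareto form with the axes swapped. Differentiating the cumulative form with respect to $x$ raises the exponent by one and yields the power-law density.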

- Know the properties of a similarity measure
- Be able to relate similarity and distance measures
- Know of two applications for modelling similarity

- Understand how text documents can be modeled as sets
- Know the Jaccard coefficient as a similarity measure on sets
- Know a trick for remembering the formula
- Be aware of the possible outcomes of the Jaccard index
- As always, be able to criticize your model
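
A minimal sketch of the Jaccard coefficient on documents modeled as sets of words: the size of the intersection divided by the size of the union. Splitting on whitespace and returning 1.0 for two empty sets are conventions chosen here, not part of the definition.

```python
def jaccard(doc_a, doc_b):
    """Jaccard coefficient of two whitespace-tokenized documents."""
    set_a, set_b = set(doc_a.split()), set(doc_b.split())
    if not set_a and not set_b:
        return 1.0  # convention chosen for this sketch
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard("the web is a graph", "the web is big"))
print(jaccard("the web", "the web"))
print(jaccard("the web", "a graph"))
```

The result always lies between 0 (no shared words) and 1 (identical sets). Because a set ignores word order and repetition, two quite different texts can still receive a score of 1.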

- Be familiar with the vector space model for text documents
- Be aware of term frequency and (inverse) document frequency
- Have reviewed the definitions of basis and dimension
- Realize that the angle between two vectors can be seen as a similarity measure
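
The angle-based similarity can be sketched with raw term frequencies as vector components. Each distinct word is treated as one dimension; weighting by inverse document frequency is left out to keep the example short.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


print(cosine_similarity("a a b", "a b b"))
print(cosine_similarity("a b", "c d"))
```

Documents without shared words are orthogonal and score 0; documents whose vectors point in the same direction score 1, regardless of document length.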

- Be aware of a unigram Language Model
- Know Laplacian (aka +1) smoothing
- Know the query likelihood model
- Know the Kullback-Leibler divergence
- See how a similarity measure can be derived from the Kullback-Leibler divergence
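
As a rough sketch, a unigram language model with +1 smoothing and the Kullback-Leibler divergence between two such models could look like this. The shared vocabulary and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_model(tokens, vocabulary):
    """Unigram probabilities with Laplacian (+1) smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}


def kl_divergence(p, q):
    """D(p || q) in nats; both models must cover the same vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


doc_a = "a a b".split()
doc_b = "b c".split()
vocabulary = set(doc_a) | set(doc_b)

p = unigram_model(doc_a, vocabulary)
q = unigram_model(doc_b, vocabulary)

print(kl_divergence(p, q), kl_divergence(q, p))
```

Smoothing guarantees that no word has probability zero, so the logarithm is always defined. The two printed values differ, which shows that the divergence is not symmetric and therefore needs a further step before it can serve as a similarity or distance measure.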

- Understand that different modeling choices can produce very different results.
- Have a feeling for how you could statistically compare the differences between the models.
- Know how you could extract keywords from documents with the tf-idf approach.
- Try to argue which model you like best in a certain scenario.
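
Keyword extraction with tf-idf can be sketched as follows, assuming the common weighting tf · log(N / df); other variants of the weighting exist, and the three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf score} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


def top_keywords(score, k=2):
    """The k highest-scoring terms; ties are broken alphabetically."""
    return sorted(score, key=lambda term: (-score[term], term))[:k]


docs = ["the web is a graph", "the web is a collection", "the web is a system"]
scores = tf_idf(docs)

print(top_keywords(scores[0], k=1))
```

Terms that occur in every document receive a score of zero, so only the word that distinguishes a document from the rest of the collection surfaces as a keyword.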

- Understand the general methodology for building generative models
- Remember why people are interested in generative models
- Know why descriptive models are needed when evaluating a generative model
- Be aware of one way to create a model for text generation

- Understand how to sample values from an arbitrary probability distribution
- Have seen yet another application of the cumulative distribution function
- Understand that sampling from a distribution is just a coordinate transformation of the uniform distribution
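
Sampling from an arbitrary discrete distribution via its cumulative distribution function can be sketched like this. The word probabilities are made up, and `make_sampler` is a hypothetical helper name.

```python
import bisect
import itertools
import random


def make_sampler(distribution, rng=random):
    """Build a sampler that maps uniform random numbers through the CDF."""
    values = list(distribution)
    cdf = list(itertools.accumulate(distribution[v] for v in values))

    def sample():
        u = rng.random() * cdf[-1]  # uniform draw, scaled to the CDF's range
        index = bisect.bisect_right(cdf, u)  # first CDF step above u
        return values[min(index, len(values) - 1)]  # guard against rounding

    return sample


# Hypothetical word probabilities, for illustration only.
distribution = {"the": 0.5, "web": 0.3, "graph": 0.2}

sampler = make_sampler(distribution, rng=random.Random(0))
draws = [sampler() for _ in range(10000)]
frequencies = {word: draws.count(word) / len(draws) for word in distribution}
print(frequencies)
```

Each uniform number lands in exactly one step of the CDF, and the width of that step equals the probability of the corresponding value, so the observed frequencies approach the target distribution.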

- See that it makes sense to compare statistics
- Understand that comparing statistics is not a well defined task
- Be aware of the fact that very different models could lead to the same statistics

- See that one can always increase the model parameters
- Know that increasing model parameters often yields a more accurate model
- Be aware of the bigram and mixed models as examples for our generative processes

- Be familiar with a set theoretic way of denoting a graph
- Know at least 4 different types of graphs
- Have practiced your abilities in reading and writing mathematical formulas

- Be able to model web pages as a graph
- Know that the authorship graph is bipartite
- Know what kind of graph the graph of web pages is
- (as always) be aware of the fact that modeling is done by making choices

- Know terms like size and (unique) volume
- Be able to count the in- and out-degree of web pages
- Have an idea what kind of law the in- and out-degree distributions follow
- Know that degree is not distributed in a fair way
- Know that the Gini coefficient can be used to measure fairness
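
Counting degrees and measuring how unevenly they are distributed can be sketched as below. The edge list is a made-up toy graph, and the Gini coefficient uses the standard formula on sorted values.

```python
def degree_counts(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    in_degree = {node: 0 for node in nodes}
    out_degree = {node: 0 for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def gini(values):
    """Gini coefficient: 0 means perfectly equal, values near 1 very unequal."""
    ordered = sorted(values)
    n, total = len(ordered), sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical link graph: three pages link to "b", and "b" links to "a".
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")]
in_degree, out_degree = degree_counts(edges)

print(gini(in_degree.values()), gini(out_degree.values()))
```

In this toy graph every page has exactly one outgoing link, so the out-degree is distributed perfectly evenly, while the incoming links concentrate on a single page.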

- Understand the notion of a path in a (directed) graph
- Know that shortest paths between nodes need not be unique
- Understand the notion of a strongly connected component
- Know about the diameter of a graph
- Be aware of the bow tie structure of the Web
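
A strongly connected component and the surrounding parts of a bow tie can be sketched with two reachability searches. This is a simple illustration on a made-up graph, not an efficient algorithm for large graphs; the node names are hypothetical.

```python
def reachable(graph, start):
    """All nodes that can be reached from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def reverse_graph(graph):
    """The same graph with every edge direction flipped."""
    reverse = {}
    for source, targets in graph.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse


def bow_tie(graph, node):
    """Split the nodes connected to `node` into (IN, SCC, OUT)."""
    forward = reachable(graph, node)
    backward = reachable(reverse_graph(graph), node)
    core = forward & backward  # nodes that reach node and are reached by it
    return backward - core, core, forward - core


graph = {"in": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "out"], "out": []}
print(bow_tie(graph, "a"))
```

Two nodes belong to the same strongly connected component exactly when each can be reached from the other, which is why the intersection of forward and backward reachability gives the component.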

- Be able to read and build an adjacency matrix of a graph
- Know some basic matrix vector multiplications to generate some statistics out of the adjacency matrix
- Understand what is encoded in the components of the k-th power of the adjacency matrix of a graph
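
The adjacency matrix operations can be sketched with plain lists. The convention assumed here is that entry `A[i][j]` is 1 if there is an edge from node `i` to node `j`; the three-node graph is made up.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_power(a, k):
    """k-th power of a square matrix; the 0-th power is the identity."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, a)
    return result


def transpose(a):
    return [list(column) for column in zip(*a)]


# Edges: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [1, 0, 0]]
ones = [[1], [1], [1]]  # column vector of ones

out_degree = [row[0] for row in mat_mul(adjacency, ones)]
in_degree = [row[0] for row in mat_mul(transpose(adjacency), ones)]
walks_of_length_2 = mat_power(adjacency, 2)

print(out_degree, in_degree)
print(walks_of_length_2)
```

Multiplying by a vector of ones sums each row, which gives the out-degrees; doing the same with the transposed matrix gives the in-degrees. Entry `(i, j)` of the k-th power counts the walks of length k from node `i` to node `j`, where a walk may visit a node more than once.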