Web Science/Part2: Emerging Web Properties

From Wikiversity

1. How big is the World Wide Web?

Problems with the question about the size of the Web

The question will remain unanswered during the lesson and the entire course.
The question of size is underspecified because a measure is needed.
The measure depends heavily on how we choose to model the Web.
We have not yet defined what we mean when we say World Wide Web.

3 ways to study the Web

The web as a software system.

The web as a collection of text documents.

The web as a graph of interlinked documents.

Even after choosing one point of view, there are fundamentally different ways of modelling.

A simplistic descriptive model

Understand that only the model is described

Know that the description of the model can be used for interpretation

Know that within the descriptive model one chooses measures to describe the object of study

Understand the notion of a modelling choice

Be able to criticise a descriptive model and the modelling choices

An unrealistic, simplistic generative model

Can be used to try to explain why something works

Must be run more than once

Understand the notion of a modelling parameter

Will be compared to the descriptive model of our object of study

Summary, further reading, homework

no learning goals defined

2. Simple statistical descriptive Models for the Web

Counting Words And Documents

Understand why we selected simple English Wikipedia as a toy example for modeling the web

Understand that a task already as simple as counting words includes modeling choices

Be familiar with the term “unique word token”

Know some basic tools to count words and documents
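
As an illustration of how counting words already involves modelling choices, here is a minimal Python sketch. The function names `tokenize` and `count_words`, the regular expression, and the decision to lowercase are all assumptions made for this example, not the tools used in the course.

```python
import re
from collections import Counter

def tokenize(text, lowercase=True):
    """Split text into word tokens.

    Modelling choices (assumptions of this sketch): a word is a maximal
    run of letters, and case is folded unless lowercase=False.
    """
    if lowercase:
        text = text.lower()
    return re.findall(r"[^\W\d_]+", text)

def count_words(documents, lowercase=True):
    """Count how often each word token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc, lowercase))
    return counts

docs = ["The Web is big.", "The web is a graph."]
counts = count_words(docs)
print("word tokens:", sum(counts.values()))   # all occurrences
print("unique word tokens:", len(counts))     # distinct words
# Switching off case folding changes the result: "Web" and "web" differ.
print("unique (case-sensitive):", len(count_words(docs, lowercase=False)))
```

Changing the regular expression (for example, to keep numbers or hyphenated words) would give yet another count for the same text.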

Typical length of a document

Be familiar with some basic statistical objects like Median, Mean, and Histograms

Be able to relate a histogram to its cumulative distribution function
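
The relation between a histogram and its cumulative distribution function can be sketched in a few lines of Python. The document lengths below are made-up toy values, and `histogram` and `cumulative_distribution` are hypothetical helper names.

```python
from collections import Counter
from statistics import mean, median

def histogram(values):
    """Map each observed value to the number of times it occurs."""
    return dict(sorted(Counter(values).items()))

def cumulative_distribution(hist):
    """Turn a histogram into a CDF: the fraction of observations <= value."""
    total = sum(hist.values())
    running = 0
    cdf = {}
    for value in sorted(hist):
        running += hist[value]
        cdf[value] = running / total
    return cdf

# Toy document lengths in words (invented for illustration).
lengths = [3, 5, 5, 8, 40]
print("mean:", mean(lengths))
print("median:", median(lengths))
print("histogram:", histogram(lengths))
print("CDF:", cumulative_distribution(histogram(lengths)))
```

The single long document pulls the mean well above the median, which is one reason to look at the whole distribution rather than a single number.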

How to formulate a research hypothesis

Understand the ongoing, cyclic process of research

Know what falsifiable means and why every research hypothesis needs to be falsifiable

Be able to formulate your own research hypothesis

Number of words needed to understand most of Wikipedia

Understand what a log-log plot is

Improve your skills in reading and interpreting diagrams

Know about the word rank / frequency plot

Be able to transform a histogram or curve into a cumulative distribution function

Linguists' way of checking the simplicity of text

Get a feeling for interdisciplinary research

Know the Automated Readability Index

Have a strong sense of support for our research hypothesis

Be able to critically discuss the limits of our models
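
For reference, the Automated Readability Index is usually given as 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The sketch below applies that formula; the way it detects sentences and words is a deliberately crude assumption and is itself a modelling choice.

```python
import re

def automated_readability_index(text):
    """Compute the ARI of a text with very simple tokenisation.

    Assumptions of this sketch: sentences end at '.', '!' or '?',
    words are runs of letters, digits and apostrophes, and only
    letters and digits count as characters.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(ch.isalnum() for word in words for ch in word)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

print(automated_readability_index("The cat sat. The dog ran."))
```

Short words and short sentences give a low score; very simple toy text like the one above can even score below zero.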

3. Advanced statistical descriptive models for the Web

The Zipf law for text

Be able to name some fundamental properties about how frequencies of words in texts are distributed

Be a little bit more cautious about visual impressions when looking at log-log plots

Know both formulations of Zipf’s law

Visually straight lines on log-log plots

Be able to do a coordinate transformation to change the scales of your plots

Understand in which scenario power functions appear as straight lines

Know in which scenarios exponential functions appear as straight lines

Be even more cautious about your visual impressions
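
The coordinate transformation can be checked numerically. A power function y = c * x^a becomes log y = log c + a * log x, a straight line with slope a when both axes are logarithmic. An exponential function y = c * e^(b*x) becomes log y = log c + b * x, a straight line with slope b when only the y-axis is logarithmic. The constants in this sketch are arbitrary.

```python
import math

def slopes(xs, ys):
    """Slopes of the line segments between consecutive points."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]

xs = [1, 2, 4, 8, 16]
power = [3 * x ** -2 for x in xs]                   # y = 3 * x^-2
exponential = [3 * math.exp(-0.5 * x) for x in xs]  # y = 3 * e^(-0.5 x)

# Power function: log-transform both axes.
loglog_slopes = slopes([math.log(x) for x in xs],
                       [math.log(y) for y in power])
# Exponential function: log-transform the y-axis only.
semilog_slopes = slopes(xs, [math.log(y) for y in exponential])

print(loglog_slopes)   # all close to -2
print(semilog_slopes)  # all close to -0.5
```

Constant slopes in the transformed coordinates are what "appears as a straight line" means; real data will only approximate this.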

Fitting a curve on a log-log plot

Know the axioms for a distance measure and how they relate to norms.

Know at least two distance measures on function spaces.

Understand why changing to the CDF makes sense when looking at distance between functions.

Understand the principle of the Kolmogorov-Smirnov test for fitting curves
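
The core of the Kolmogorov-Smirnov idea is to compare two distributions by the largest gap between their cumulative distribution functions. The following sketch computes only that distance for two discrete distributions over the same ordered outcomes; it is not a full statistical test, and the probabilities are invented.

```python
from itertools import accumulate

def ks_distance(p, q):
    """Largest absolute difference between the CDFs of p and q.

    p and q are lists of probabilities over the same ordered outcomes
    (for example word ranks 1, 2, 3, ...).
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same outcomes")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return max(abs(a - b) for a, b in zip(cdf_p, cdf_q))

observed = [0.5, 0.3, 0.2]   # toy empirical distribution
model = [0.4, 0.4, 0.2]      # toy candidate fit
print(ks_distance(observed, model))
```

A candidate curve with a smaller distance to the observed CDF would be considered the better fit under this measure.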

Zipf law, power law, or Pareto law

Know how to transform a rank-frequency diagram to a power law plot.

Understand how power law and Pareto plots relate to each other.

Be able to explain why a Pareto plot is just an inverted rank-frequency diagram

Be able to transform the Zipf coefficient to the power law and Pareto coefficients and vice versa.

Understand that building the CDF is basically like building the integral.
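
The coefficient conversions can be written down compactly under one common convention, which is an assumption here and may differ from the notation used in the lecture: Zipf's law as frequency ~ rank^(-b), the Pareto law as P(X >= x) ~ x^(-k), and the power law as p(x) ~ x^(-a). Inverting the axes of the rank-frequency diagram gives k = 1/b, and since the Pareto form is the integral of the power law density, a = k + 1.

```python
def zipf_to_pareto(b):
    """Pareto exponent k for Zipf exponent b (axes inverted): k = 1 / b."""
    return 1.0 / b

def zipf_to_power_law(b):
    """Power law exponent a for Zipf exponent b: a = 1 + 1 / b."""
    return 1.0 + 1.0 / b

def power_law_to_zipf(a):
    """Zipf exponent b for power law exponent a: b = 1 / (a - 1)."""
    return 1.0 / (a - 1.0)

b = 1.0  # the classic Zipf exponent
print(zipf_to_pareto(b), zipf_to_power_law(b))
```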

4. Modelling Similarity of Text

Similarity Measures and their Applications

Know the properties of a similarity measure

Be able to relate similarity and distance measures

Know of two applications for modelling similarity

Jaccard Similarity for Sets

Understand how text documents can be modeled as sets

Know the Jaccard coefficient as a similarity measure on sets

Know a trick for remembering the formula

Be aware of the possible outcomes of the Jaccard index

As always, be able to criticize your model
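
A minimal sketch of the Jaccard coefficient, the size of the intersection divided by the size of the union, with documents modelled as sets of words. Splitting on whitespace and returning 1.0 for two empty sets are conventions chosen for this example.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention of this sketch: two empty sets are identical
    return len(a & b) / len(a | b)

doc1 = "the web is a graph".split()
doc2 = "the web is big".split()
print(jaccard(doc1, doc2))  # 3 shared words out of 6 distinct words
```

The result always lies between 0 (no shared words) and 1 (identical sets). Because sets ignore how often a word occurs, a word used once and a word used a hundred times count the same, which is one point on which the model can be criticized.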

Cosine Similarity for Vector Spaces

Be familiar with the vector space model for text documents

Be aware of term frequency and (inverse) document frequency

Have reviewed the definitions of basis and dimension

Realize that the angle between two vectors can be seen as a similarity measure
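
As a sketch of the vector space model, each document can be represented by its raw term frequencies, with one dimension per word, and compared by the cosine of the angle between the two vectors. Using raw term frequencies without inverse document frequency weighting is a simplification made for this example.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term frequency vectors of two documents."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention of this sketch for empty documents
    return dot / (norm_a * norm_b)

print(cosine_similarity(["web", "graph"], ["web", "text"]))
```

Because term frequencies are never negative, the value lies between 0 (no shared words, orthogonal vectors) and 1 (vectors pointing in the same direction).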

Probabilistic Similarity Measures: Kullback-Leibler Divergence

Be aware of a unigram Language Model

Know Laplace (aka +1) smoothing

Know the query likelihood model

The Kullback-Leibler Divergence

See how a similarity measure can be derived from the Kullback-Leibler Divergence
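
The pieces above can be combined in a small sketch: a unigram language model with +1 smoothing for each document, and the Kullback-Leibler divergence between two such models. The helper names and the toy documents are assumptions for illustration. The smoothing matters here because the divergence is undefined when the second model assigns probability zero to a word the first model can produce.

```python
import math
from collections import Counter

def unigram_model(tokens, vocabulary):
    """Unigram probabilities with +1 smoothing over a shared vocabulary.

    Every token is assumed to be contained in the vocabulary.
    """
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}

def kl_divergence(p, q):
    """D(p || q) = sum over words of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

doc_a = "the web is a graph".split()
doc_b = "the web is big big big".split()
vocabulary = set(doc_a) | set(doc_b)
p = unigram_model(doc_a, vocabulary)
q = unigram_model(doc_b, vocabulary)
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values generally differ, since the divergence is not symmetric and is therefore not a distance measure in the strict sense.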

Comparing Results of Similarity Measures

Understand that different modeling choices can produce very different results.

Have a feeling for how you could statistically compare the differences of the models.

Know how you could extract keywords from documents with the tf-idf approach.

Try to argue which model you like best in a certain scenario.

5. Generative Models for the Web

Introduction to generative modelling

Understand the principal methodology for building generative models

Remember why people are interested in generative models

Know why descriptive models are needed when evaluating a generative model

Be aware of one way to create a model for text generation

Sampling from a probability distribution

Understand how to sample values from an arbitrary probability distribution

Have seen yet another application of the cumulative distribution function

Understand that sampling from a distribution is just a coordinate transformation of the uniform distribution
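
One way to sample from an arbitrary discrete distribution is to draw a uniform random number and look it up in the cumulative distribution function. The sketch below does this with a binary search; `make_sampler` is a hypothetical helper name and the word probabilities are invented.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def make_sampler(outcomes, probabilities, rng=random):
    """Return a function that samples outcomes by inverting the CDF."""
    cdf = list(accumulate(probabilities))

    def sample():
        # A uniform value in [0, total) is mapped to the first outcome
        # whose cumulative probability exceeds it.
        u = rng.random() * cdf[-1]
        index = bisect_right(cdf, u)
        return outcomes[min(index, len(outcomes) - 1)]  # guard against rounding

    return sample

rng = random.Random(42)  # fixed seed so that runs are repeatable
sample_word = make_sampler(["the", "web", "graph"], [0.6, 0.3, 0.1], rng)
print([sample_word() for _ in range(10)])
```

Outcomes with a large probability occupy a large interval of the CDF and are hit more often by the uniform value, which is the coordinate transformation in action.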

Evaluating a generative model

See that it makes sense to compare statistics

Understand that comparing statistics is not a well-defined task

Be aware of the fact that very different models could lead to the same statistics

Pitfalls when increasing the number of model parameters

See that one can always increase the number of model parameters

Know that increasing model parameters often yields a more accurate model

Be aware of the bigram and mixed models as examples of our generative processes

6. Modeling the Web as a graph

Reviewing terms from graph theory

Be familiar with a set theoretic way of denoting a graph

Know at least 4 different types of graphs

Have practiced your abilities in reading and writing mathematical formulas

The standard web graph model

Be able to model web pages as a graph

Know that the authorship graph is bipartite

Know what kind of graph the graph of web pages is

(as always) be aware of the fact that modeling is done by making choices

Descriptive statistics of the web graph

Know terms like size and (unique) volume

Be able to count the in-degree and out-degree of web pages

Have an idea what kind of law in-degree and out-degree distributions follow

Know that degree is not distributed in a fair way

Know that the Gini coefficient can be used to measure fairness
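
As a sketch of how fairness could be measured, the Gini coefficient of a list of degrees can be computed from the sorted values x_1 <= ... <= x_n as G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n. This is one standard formulation, and the degree lists below are toy data.

```python
def gini(values):
    """Gini coefficient: 0 means perfectly equal, values near 1 very unequal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([1, 1, 1, 1]))  # every page has the same in-degree
print(gini([0, 0, 0, 1]))  # one page receives all links
```

For n values, the largest possible result is (n - 1) / n, reached when a single page holds all the links.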

Topology of the web graph

Understand the notion of a path in a (directed) graph

Know that shortest paths between nodes need not be unique

Understand the notion of a strongly connected component

Know about the diameter of a graph

Be aware of the bow tie structure of the Web

Modelling graphs with linear algebra

Be able to read and build an adjacency matrix of a graph

Know some basic matrix-vector multiplications to generate some statistics out of the adjacency matrix

Understand what is encoded in the components of the k-th power of the adjacency matrix of a graph
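
A small sketch of these ideas in plain Python, using the convention (an assumption of this example) that entry A[i][j] is 1 if there is a link from page i to page j. Row sums then give out-degrees, column sums give in-degrees, and entry (i, j) of the k-th power counts the walks of length k from i to j. The four-edge graph is toy data.

```python
def adjacency_matrix(n, edges):
    """Build an n x n matrix with A[i][j] = 1 for each directed edge (i, j)."""
    A = [[0] * n for _ in range(n)]
    for source, target in edges:
        A[source][target] = 1
    return A

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(A, k):
    """Compute A^k for k >= 0 by repeated multiplication."""
    n = len(A)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        result = matmul(result, A)
    return result

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency_matrix(3, edges)
out_degrees = [sum(row) for row in A]            # A times the all-ones vector
in_degrees = [sum(column) for column in zip(*A)] # transpose of A times all-ones
print(out_degrees, in_degrees)
print(matrix_power(A, 2))  # walks of length 2 between every pair of pages
```

For larger graphs a numerical library would be the practical choice; the explicit loops here are only meant to show what the matrix operations compute.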
