# Web Science/Part2: Emerging Web Properties/Modelling Similarity of Text

- Understand how text documents can be modeled as sets
- Know the Jaccard coefficient as a similarity measure on sets
- Know a trick for remembering the formula
- Be aware of the possible outcomes of the Jaccard index
- As always, be able to criticize your model
- Be familiar with the vector space model for text documents
- Be aware of term frequency and (inverse) document frequency
- Have reviewed the definitions of basis and dimension
- Realize that the angle between two vectors can be seen as a similarity measure
- Be aware of the unigram language model
- Know Laplace (aka +1) smoothing
- Know the query likelihood model
- Know the Kullback-Leibler divergence
- See how a similarity measure can be derived from the Kullback-Leibler divergence
- Understand that different modeling choices can produce very different results
- Have a feeling for how you could statistically compare the differences between the models
- Know how you could extract keywords from documents with the tf-idf approach
- Try to argue which model you like best in a certain scenario
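
Two of the models named in the goals above, the set model with the Jaccard coefficient and the vector space model with the angle between term vectors, can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names `tokenize`, `jaccard`, and `cosine` are hypothetical, tokenization is plain lowercased whitespace splitting, and the vectors use raw term frequencies without any inverse document frequency weighting.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace splitting is enough for this sketch.
    return text.lower().split()


def jaccard(doc_a, doc_b):
    # Set model: size of the intersection divided by size of the union.
    set_a, set_b = set(tokenize(doc_a)), set(tokenize(doc_b))
    if not set_a and not set_b:
        return 1.0  # Assumed convention: two empty documents count as identical.
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(doc_a, doc_b):
    # Vector space model: cosine of the angle between term-frequency vectors.
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Assumed convention: an empty document is similar to nothing.
    return dot / (norm_a * norm_b)


doc1 = "the web is a graph"
doc2 = "the web is a network"

print(round(jaccard(doc1, doc2), 3))  # 0.667 (4 shared terms, 6 distinct terms)
print(round(cosine(doc1, doc2), 3))   # 0.8
```

Even on this toy pair the two models disagree on the score, which is a small instance of the point that different modeling choices can produce different results.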

# Modelling Similarity of Text

## Associated units

- Know the properties of a similarity measure
- Be able to relate similarity and distance measures
- Know of two applications for modelling similarity
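
As a rough illustration of how a similarity measure and a distance measure can relate, the sketch below turns the Jaccard similarity on sets into a distance by taking `1 - similarity`. The properties checked here (values in the range 0 to 1, self-similarity of 1, symmetry) are commonly expected of a similarity measure; they are an assumption of this sketch, not a list taken from this unit. The function names are hypothetical.

```python
def jaccard_similarity(a, b):
    # Similarity on sets: 1.0 for identical sets, 0.0 for disjoint sets.
    if not a and not b:
        return 1.0  # Assumed convention for two empty sets.
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    # A bounded similarity in [0, 1] can be turned into a distance by 1 - s.
    return 1.0 - jaccard_similarity(a, b)


x = {"web", "science", "graph"}
y = {"web", "science", "text"}
z = {"cooking", "recipes"}

print(jaccard_similarity(x, y))  # 0.5 (2 shared elements, 4 distinct elements)
print(jaccard_distance(x, x))    # 0.0 (identical sets have distance 0)
print(jaccard_distance(x, z))    # 1.0 (disjoint sets have maximal distance)
print(jaccard_similarity(x, y) == jaccard_similarity(y, x))  # True (symmetry)
```

High similarity corresponds to low distance and vice versa, so either view can be used depending on what the application expects.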
