# Web Science/Part2: Emerging Web Properties/Modelling Similarity of Text

# Modelling Similarity of Text

## Associated units

- Know the properties of a similarity measure
- Be able to relate similarity and distance measures
- Know of two applications for modelling similarity

- Understand how text documents can be modeled as sets
- Know the Jaccard coefficient as a similarity measure on sets
- Know a trick for remembering the formula
- Be aware of the range of values the Jaccard index can take
- As always, be able to criticize your model
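
The set-based objectives above can be illustrated with a minimal sketch. It assumes a document is modelled as the set of its lowercase, whitespace-separated words; the function name `jaccard` and the treatment of two empty documents are illustrative choices, not part of the course material.

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two documents modelled as word sets."""
    set_a = set(doc_a.lower().split())
    set_b = set(doc_b.lower().split())
    union = set_a | set_b
    if not union:
        # Assumption made for this sketch: two empty documents count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard("the cat sat", "the cat ran"))  # 2 shared words out of 4 distinct ones
```

The value is 0 for documents without a common word and 1 for documents with identical word sets. Word order and repetition are ignored, which is one point where this model can be criticized.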

- Be familiar with the vector space model for text documents
- Be aware of term frequency and (inverse) document frequency
- Have reviewed the definitions of basis and dimension
- Realize that the angle between two vectors can be seen as a similarity measure
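
As a rough sketch of the vector space objectives above, the following computes the cosine of the angle between two documents. It assumes raw term frequencies as vector components and whitespace tokenisation; the name `cosine_similarity` is hypothetical, and a weighting such as tf-idf could be substituted for the raw counts.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two documents."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only terms occurring in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("a a b", "a b b"))
```

Each distinct term spans one dimension of the space. Because term frequencies are never negative, the cosine lies between 0 (no shared terms) and 1 (same direction).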

- Be aware of the unigram language model
- Know Laplace (add-one, aka +1) smoothing
- Know the query likelihood model
- Know the Kullback-Leibler divergence
- See how a similarity measure can be derived from the Kullback-Leibler divergence
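
The language model objectives above can be sketched as follows. The code builds an add-one smoothed unigram model over the joint vocabulary of two documents and computes the Kullback-Leibler divergence between the models. The last step is an assumption of this sketch rather than the course's definition: the divergence is symmetrised and mapped to a similarity with `1 / (1 + d)`. All function names are hypothetical.

```python
import math
from collections import Counter


def laplace_unigram(tokens, vocabulary):
    """Unigram model with +1 smoothing; vocabulary must contain every token."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {t: (counts[t] + 1) / total for t in vocabulary}


def kl_divergence(p, q):
    """D(p || q) in bits; smoothing guarantees q[t] > 0 for every term."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def kl_similarity(doc_a: str, doc_b: str) -> float:
    tokens_a = doc_a.lower().split()
    tokens_b = doc_b.lower().split()
    vocabulary = set(tokens_a) | set(tokens_b)
    p = laplace_unigram(tokens_a, vocabulary)
    q = laplace_unigram(tokens_b, vocabulary)
    # Assumed construction: symmetrise the divergence, then map [0, inf) to (0, 1].
    distance = kl_divergence(p, q) + kl_divergence(q, p)
    return 1.0 / (1.0 + distance)


print(kl_similarity("a a b", "c c b"))
```

Without smoothing, a term missing from one document would get probability zero and the divergence would be undefined, which is why the +1 step comes first.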

- Understand that different modeling choices can produce very different results.
- Have a sense of how you could statistically compare the differences between the models.
- Know how you could extract keywords from documents with the tf-idf approach.
- Try to argue which model you like best in a certain scenario.
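
For the keyword extraction objective above, a minimal sketch might look like this. It assumes the common weighting tf × log(N / df) with raw term counts and whitespace tokenisation; other tf-idf variants exist, and the function name `tfidf_keywords` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(corpus, doc_index, top_k=3):
    """Return the top_k terms of corpus[doc_index] ranked by tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    tf = Counter(tokenised[doc_index])
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Sort by descending score; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_keywords(corpus, 0, top_k=1))
```

A term that occurs in every document, such as "the" in this toy corpus, gets an idf of zero and therefore never ranks as a keyword, however frequent it is.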
