Web Science/Part2: Emerging Web Properties/Modelling Similarity of Text

From Wikiversity

Modelling Similarity of Text

Similarity Measures and their Applications

Know the properties of a similarity measure
Be able to relate similarity and distance measures
Know of two applications for modelling similarity

Jaccard Similarity for Sets

Understand how text documents can be modeled as sets

Know the Jaccard coefficient as a similarity measure on sets

Know a trick for remembering the formula

Be aware of the possible outcomes of the Jaccard index

As always, be able to criticize your model
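The objectives above can be illustrated with a minimal sketch of the Jaccard coefficient, |A ∩ B| / |A ∪ B|, on documents modeled as sets of words. The function name `jaccard`, the whitespace tokenization, and the convention for two empty documents are assumptions made for this illustration, not part of the course material.

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard coefficient of two documents modeled as sets of lowercase words."""
    # Assumption: a document is the set of its whitespace-separated,
    # lowercased tokens, so word order and repetitions are discarded.
    set_a = set(doc_a.lower().split())
    set_b = set(doc_b.lower().split())
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty documents count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard("the cat sat", "the cat ran"))  # two shared words out of four distinct words
```

The result always lies between 0 (no shared words) and 1 (identical word sets). Because the set model ignores how often a word occurs, "a a a b" and "a b" are treated as the same document, which is one possible point of criticism of the model.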

Cosine Similarity for Vector Spaces

Be familiar with the vector space model for text documents

Be aware of term frequency and (inverse) document frequency

Have reviewed the definitions of basis and dimension

Realize that the angle between two vectors can be seen as a similarity measure
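As a rough illustration of the vector space model, the sketch below represents each document as a vector of raw term frequencies (one dimension per word) and computes the cosine of the angle between two such vectors. The function name `cosine_similarity` and the choice of raw counts without idf weighting are assumptions for this example.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    # Assumption: each distinct word is one dimension and the coordinate
    # is the raw term frequency (no idf weighting in this sketch).
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("a a b", "a b b"))
```

Since term frequencies are never negative, the value lies between 0 (orthogonal vectors, no shared words) and 1 (vectors pointing in the same direction).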

Probabilistic Similarity Measures: Kullback-Leibler Divergence

Be aware of the unigram language model

Know Laplace (also known as add-one) smoothing

Know the query likelihood model

Know the Kullback-Leibler divergence

See how a similarity measure can be derived from the Kullback-Leibler divergence
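A minimal sketch of the ingredients named above: a unigram language model with add-one smoothing over a shared vocabulary, and the Kullback-Leibler divergence between two such models. The function names, the whitespace tokenization, and the use of base-2 logarithms are assumptions for this illustration. The sketch stops at the divergence itself and does not fix one particular way of turning it into a similarity.

```python
import math
from collections import Counter


def smoothed_unigram(doc: str, vocabulary: set) -> dict:
    """Unigram model with add-one smoothing over a given vocabulary."""
    # Assumption: the vocabulary contains every word of the document,
    # otherwise the probabilities would not sum to 1.
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) + len(vocabulary)
    # Every word gets one extra pseudo-count, so no probability is zero.
    return {term: (counts[term] + 1) / total for term in vocabulary}


def kl_divergence(p: dict, q: dict) -> float:
    """D(P || Q) in bits; both models must share the same vocabulary."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


doc_a = "a a b"
doc_b = "b c c"
vocab = set(doc_a.split()) | set(doc_b.split())
p = smoothed_unigram(doc_a, vocab)
q = smoothed_unigram(doc_b, vocab)
print(kl_divergence(p, q))
```

Smoothing matters here because the divergence is undefined when a word has probability zero under Q but not under P. The divergence is 0 for identical models and grows as they differ, and in general D(P || Q) and D(Q || P) are not equal, so it is not a distance in the strict sense.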

Comparing Results of Similarity Measures

Understand that different modeling choices can produce very different results.

Have a feeling for how you could statistically compare the differences between the models.

Know how you could extract keywords from documents with the tf-idf approach.

Try to argue which model you like best in a certain scenario.
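The keyword-extraction objective above can be sketched as follows, assuming one common tf-idf variant: raw term frequency multiplied by the natural logarithm of (number of documents / document frequency). Many other weightings exist, and the function name `tfidf_keywords`, its `top_k` parameter, and the alphabetical tie-breaking are hypothetical choices made for this example.

```python
import math
from collections import Counter


def tfidf_keywords(documents: list, top_k: int = 3) -> list:
    """Return the top_k highest-scoring tf-idf terms for each document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw tf times log(N / df).
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_keywords(docs, top_k=1))
```

A word that appears in every document, such as "the" in this toy corpus, receives a score of 0 no matter how often it occurs, while a word unique to one document ranks highest there.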

Retrieved from "https://en.wikiversity.org/w/index.php?title=Web_Science/Part2:_Emerging_Web_Properties/Modelling_Similarity_of_Text&oldid=1639588"