Word similarity dataset


A word similarity dataset is a dataset of similarity ratings between words, usually given for pairs of words. Such datasets are often used to evaluate tools that provide methods for lexical semantics, such as semantic similarity measures. Researchers have established several word similarity datasets.


| Name | Year | Pairs | Range | Reference |
|---|---|---|---|---|
| Rubenstein-Goodenough (RG) | 1965 | | 0.02–3.94 | Contextual correlates of synonymy |
| Miller-Charles (MC) | 1991 | | | Contextual correlates of semantic similarity |
| WordSim-353 | 2001 | 353 | | Placing search in context: the concept revisited |
| MEN | 2012 | | | Multimodal distributional semantics |
| SimLex-999 | 2014 | 999 | 0.23–9.80 | SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation |
| SimVerb-3500 | 2016 | 3500 | | SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity |
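
As a rough illustration of how such a dataset can be used for evaluation, one common approach is to compare a tool's similarity scores with the human ratings using Spearman's rank correlation. The sketch below uses only the Python standard library; the ratings and model scores are made-up placeholders, not values from any of the datasets listed above, and the function names are chosen here for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings and model scores for four word pairs
human_ratings = [9.1, 7.4, 3.2, 0.8]
model_scores = [0.83, 0.40, 0.71, 0.35]

rho = spearman(human_ratings, model_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho close to 1 would indicate that the tool orders the word pairs much as the human annotators did. Libraries such as SciPy offer an equivalent function, which may be preferable in practice.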

Quiz

What is the value of the synonymy judgement for the word pair "coast" and "forest"?


The SimLex-999 dataset may be read in Python with:

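from io import BytesIO
from zipfile import ZipFile

import requests
from pandas import read_csv

url = "https://www.cl.cam.ac.uk/~fh295/SimLex-999.zip"
filename = "SimLex-999/SimLex-999.txt"

# The response body is bytes, so it is wrapped in BytesIO rather than StringIO
response = requests.get(url)
response.raise_for_status()
with ZipFile(BytesIO(response.content)) as archive:
    # The file inside the archive is tab-separated
    simlex = read_csv(archive.open(filename), sep="\t")

What is the mean similarity rating?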
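
As a hint for this question, the mean of a column can be computed directly on the data frame. The sketch below runs on a tiny stand-in table so that it works without a download; its rows and ratings are invented for illustration, and the column name `SimLex999` is an assumption about the header of the real file that should be checked against `simlex.columns`.

```python
from io import StringIO

import pandas as pd

# Tiny tab-separated stand-in for SimLex-999.txt.
# The word pairs and ratings are invented, not taken from the dataset,
# and the column names are assumed to match the real header.
sample = (
    "word1\tword2\tPOS\tSimLex999\n"
    "alpha\tbeta\tN\t2.0\n"
    "gamma\tdelta\tV\t5.0\n"
    "epsilon\tzeta\tA\t8.0\n"
)

simlex_sample = pd.read_csv(StringIO(sample), sep="\t")

# Mean of the rating column over all word pairs
mean_rating = simlex_sample["SimLex999"].mean()
print(mean_rating)
```

With the real data loaded as `simlex`, the corresponding call would be `simlex["SimLex999"].mean()`, provided the column carries that name.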


External links