- In the query likelihood retrieval model, we rank documents by the probability that the query text could be generated by the document language model.
- We calculate the probability that we could pull the query words out of the “bucket” of words representing the document.
- This is a model of topical relevance, in the sense that the probability of query generation measures how likely it is that a document is about the same topic as the query.
- Smoothing refers to the process of adjusting the maximum likelihood estimator to account for inaccuracy due to data sparseness.
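To make the data-sparseness problem concrete, here is a minimal sketch of the unsmoothed (maximum likelihood) query likelihood, where each query word's probability is its count in the document divided by the document length. The function name `mle_query_likelihood` and the toy document are illustrative assumptions, not part of this project's code.

```python
from collections import Counter


def mle_query_likelihood(query_terms, doc_terms):
    """Unsmoothed P(Q|D): product of f(q, D) / |D| over the query terms."""
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0
    prob = 1.0
    for term in query_terms:
        # A term that never occurs in the document contributes a factor of 0.
        prob *= counts[term] / doc_len
    return prob


# Toy document, invented for illustration only.
doc = "the quick brown fox jumps over the lazy dog".split()
print(mle_query_likelihood(["quick", "fox"], doc))  # both terms occur in the document
print(mle_query_likelihood(["quick", "cat"], doc))  # "cat" is missing, so the product is 0.0
```

A single missing query word drives the whole score to zero, even if the document is otherwise on topic. This is the inaccuracy that smoothing is meant to correct.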
- Jelinek-Mercer smoothing is a linear interpolation of the document language model and the collection language model: each word's probability is a weighted sum of its document probability and its collection probability, where the coefficient λ determines the weighting between the two terms.
- The best value of λ depends on the query. With λ taken as the weight on the collection model, experiments have shown that a small value, around 0.1, works well for short queries and a higher value, around 0.7, works well for long queries.
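The scoring described above can be sketched as follows. This is a minimal illustration, not this project's implementation. It assumes λ (`lam`) is the weight on the collection model, so each query term is scored as `(1 - lam) * f(q, D) / |D| + lam * c(q) / |C|`. It sums log probabilities to avoid numerical underflow. The function name, the toy documents, and the choice to skip terms unseen in the whole collection are all assumptions.

```python
import math
from collections import Counter


def jm_log_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.1):
    """Log query likelihood with Jelinek-Mercer smoothing.

    lam is the weight on the collection model (an assumed convention).
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(term, 0) / coll_len
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: it cannot separate
            # documents, so this sketch simply skips it.
            continue
        score += math.log(p)
    return score


# Toy collection, invented for illustration only.
docs = {
    "d1": "information retrieval ranks documents for a query".split(),
    "d2": "language models assign probabilities to word sequences".split(),
}
coll_counts = Counter(t for terms in docs.values() for t in terms)
coll_len = sum(coll_counts.values())

query = "language models".split()
ranking = sorted(
    docs,
    key=lambda d: jm_log_likelihood(query, docs[d], coll_counts, coll_len, lam=0.1),
    reverse=True,
)
print(ranking)  # d2 contains both query terms, so it ranks first
```

Because of the collection term, `d1` still receives a finite score even though it contains neither query word.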
- The CACM collection was obtained from http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/
- The CACM collection consists of titles and abstracts from the journal Communications of the ACM (CACM).
- The collection consists of the following files:
  - cacm.all - Text of documents
  - cite.info - Key to citation info
  - common_words - Stop words used by SMART
  - qrels.text - List of relevance judgements
  - query.text - Original text of the query
- CACM HTML documents are obtained from: https://github.com/kaanosm/inb344/tree/845ae8c8c6e5e193e4f8e9c399ddc9f3c82e39f0/week%201/Resources
- The collection contains 64 queries and 3204 HTML documents.
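As a rough sketch of how the raw `cacm.all` file could be read, the snippet below assumes the SMART-style layout: each record starts with a `.I <id>` line, and fields are introduced by two-character markers such as `.T` (title) and `.W` (abstract). The function name `parse_smart_records` and the sample text are hypothetical. The sample is invented and is not taken from the collection.

```python
def parse_smart_records(text, keep_fields=("T", "W")):
    """Split SMART-style text into {doc_id: {field: text}}."""
    records = {}
    doc_id = None
    field = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith(".I "):
            # Start of a new record.
            doc_id = int(line[3:].strip())
            records[doc_id] = {}
            field = None
        elif len(line) == 2 and line[0] == "." and line[1].isupper():
            # Field marker such as .T or .W; following lines belong to it.
            field = line[1]
        elif doc_id is not None and field in keep_fields:
            previous = records[doc_id].get(field, "")
            records[doc_id][field] = (previous + " " + line.strip()).strip()
    return records


# Invented sample in the assumed layout.
sample = """.I 1
.T
A Hypothetical Title
.W
First line of a made-up abstract.
Second line.
.B
CACM 1960
.I 2
.T
Another Hypothetical Title
"""
records = parse_smart_records(sample)
print(records[1]["T"])
print(records[1]["W"])
```

Only the fields listed in `keep_fields` are retained. Other markers are recognized and their content is skipped.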