Query-Likelihood-Retrieval-Model

In the query likelihood retrieval model, we rank documents by the probability that the query text could be generated by the document language model.
We calculate the probability that we could pull the query words out of the “bucket” of words representing the document.
This is a model of topical relevance, in the sense that the probability of query generation measures how likely it is that a document is about the same topic as the query.
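As a rough illustration of the "bucket of words" idea, the sketch below scores a query against a single document using the unsmoothed maximum likelihood estimate. The function name `query_likelihood_mle` and the toy document are assumptions made for this example and are not taken from the repository code.

```python
from collections import Counter


def query_likelihood_mle(query_terms, doc_terms):
    """Return P(Q|D) under the unsmoothed maximum likelihood document model.

    Each query term contributes f(term, D) / |D|, and the contributions
    are multiplied together (terms are assumed to be independent).
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0
    prob = 1.0
    for term in query_terms:
        prob *= counts[term] / doc_len  # Counter returns 0 for unseen terms
    return prob


# Hypothetical toy document with 7 tokens.
doc = "information retrieval ranks documents for a query".split()

# Both query terms occur once in the document: (1/7) * (1/7).
print(query_likelihood_mle(["retrieval", "query"], doc))

# "index" never occurs in the document, so the whole product is zero.
print(query_likelihood_mle(["retrieval", "index"], doc))
```

The second call shows the weakness of the unsmoothed estimate: a single missing query term drives the score of the whole document to zero, however well the other terms match.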

Jelinek-Mercer Smoothing

Smoothing refers to the process of adjusting the maximum likelihood estimator to account for inaccuracy due to data sparseness.
Jelinek-Mercer smoothing is a linear interpolation of the document language model and the collection language model: each word's probability is a weighted sum of its document probability and its collection probability, and the coefficient λ determines the weighting between the two terms.
The best value of λ depends on the query. Taking λ as the weight on the collection model, experiments have shown that a small value, around 0.1, works well for short queries, and a larger value, around 0.7, works better for long queries.
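The interpolation can be sketched as follows. This is a minimal example, not the code in RetrievalModels.py: the function name `jm_score`, the toy documents, and the choice to skip query terms that never occur in the collection are all assumptions. It also assumes that λ (`lam`) is the weight on the collection model, so that P(w|D) = (1 - λ) f(w,D)/|D| + λ c(w)/|C|, and it sums log probabilities to avoid numerical underflow.

```python
import math
from collections import Counter


def jm_score(query_terms, doc_terms, collection_counts, collection_len, lam=0.1):
    """Log query likelihood with Jelinek-Mercer smoothing.

    Assumption: lam is the weight on the collection model, i.e.
    P(w|D) = (1 - lam) * f(w, D) / |D| + lam * c(w) / |C|.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumed policy: a term unseen in the whole collection cannot
            # distinguish documents, so it is skipped.
            continue
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score


# Hypothetical two-document collection.
docs = {
    "d1": "information retrieval ranks documents".split(),
    "d2": "language model smoothing for retrieval".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(len(terms) for terms in docs.values())

query = ["smoothing", "retrieval"]
ranking = sorted(
    docs,
    key=lambda d: jm_score(query, docs[d], collection_counts, collection_len),
    reverse=True,
)
print(ranking)
```

Here d1 does not contain "smoothing", yet it still receives a finite score because the collection model supplies a small non-zero probability for that term. d2 contains both query terms and is ranked first.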

The CACM collection dataset has been acquired from http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/
The CACM collection consists of titles and abstracts from the journal Communications of the ACM (CACM).
The collection consists of the following files:
cacm.all - Text of documents
cite.info - Key to citation info
common_words - Stop words used by SMART
qrels.text - List of relevance judgements
query.text - Original text of the query
CACM HTML documents are obtained from: https://github.com/kaanosm/inb344/tree/845ae8c8c6e5e193e4f8e9c399ddc9f3c82e39f0/week%201/Resources
The collection contains 64 queries and 3204 HTML documents.
