Information Retrieval Project

This project develops a search engine that retrieves text documents based on user queries. Users enter a query, and the system returns the relevant documents. The project is divided into three phases:

Phase 1: Creating a Basic Information Retrieval Model

In this phase, the focus is on creating a simple information retrieval model. The documents are indexed with a positional index, which is then used to retrieve relevant documents. The main tasks in this phase include:

Data preprocessing
Creating a positional index
Query processing and retrieval
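
As an illustration of the indexing and retrieval tasks above, here is a minimal sketch of a positional index with phrase lookup. It assumes documents are already tokenized; the names build_positional_index and phrase_query are hypothetical and not taken from the project code.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for pre-tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted ids of documents that contain the phrase terms consecutively."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[phrase[0]])
    for term in phrase[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in candidates:
        # Try each occurrence of the first term as the start of the phrase.
        for start in index[phrase[0]][doc_id]:
            if all(start + i in index[term][doc_id] for i, term in enumerate(phrase)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Toy collection (made up for illustration).
docs = {
    1: ["new", "search", "engine", "released"],
    2: ["engine", "for", "search"],
}
index = build_positional_index(docs)
print(phrase_query(index, ["search", "engine"]))
```

Storing positions, rather than only document ids, is what lets the lookup tell "search engine" apart from a document that merely contains both words.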

Before building the positional index, the texts must be preprocessed. The required preprocessing steps are:

Token extraction
Text normalization
Stop-word removal
Stemming
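
The four steps above can be sketched as a small pipeline. This is only an illustration: the stop-word list and the suffix-stripping rule are toy stand-ins (a real system would likely use a proper stemmer and a language-specific stop-word list), and all names here are hypothetical.

```python
import re

# Toy stop-word list, assumed for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}
# Toy suffix stripper; this is not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Token extraction: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Text normalization: here, simply lowercase the token."""
    return token.lower()


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, normalization, stop-word removal, and stemming in order."""
    tokens = [normalize(t) for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The engines are indexing documents"))
# prints ['engin', 'index', 'document']
```

The same preprocess function would need to be applied to queries as well, so that query terms match the terms stored in the index.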

Phase 2: Extending the Information Retrieval Model

In this phase, the goal is to extend the information retrieval model and represent documents as vectors to rank search results based on their relevance to the user's query. The steps involved in this phase are:

Document modeling in the vector space
Query representation in the vector space
Calculating the similarity between the query vector and document vectors
Ranking the search results based on similarity scores

The documents are represented using a tf-idf weighting scheme. The formula is shown in tfidf.png.
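
Since the formula image is not reproduced in this text, the sketch below assumes one common tf-idf variant, w(t, d) = (1 + log10 tf(t, d)) * log10(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The project may use a different variant; the function names are hypothetical.

```python
import math
from collections import Counter


def compute_idf(docs):
    """Return idf(t) = log10(N / df(t)) for every term in the collection."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    return {term: math.log10(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Return a sparse {term: weight} vector using (1 + log10 tf) * idf."""
    tf = Counter(tokens)
    return {
        term: (1 + math.log10(count)) * idf[term]
        for term, count in tf.items()
        if term in idf  # terms unseen in the collection are ignored
    }


# Toy collection (made up for illustration).
docs = {
    "d1": ["search", "engine", "search"],
    "d2": ["news", "engine"],
}
idf = compute_idf(docs)
vectors = {doc_id: tfidf_vector(tokens, idf) for doc_id, tokens in docs.items()}
print(vectors["d1"])
```

In this toy collection "engine" appears in every document, so its idf and therefore its weight is zero, while "search" receives a positive weight in d1.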

When the user submits a query, the corresponding query vector is built by calculating the weights of the query terms. Then, using a similarity measure, the system finds the documents with the highest similarity (minimum distance) to the query and displays the results in order of similarity. Various measures can be used for this task; the simplest is cosine similarity, which is the cosine of the angle between two vectors. The formula is shown in cosine similarity.png.
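
Cosine similarity between a query vector q and a document vector d is their dot product divided by the product of their lengths. A minimal ranking sketch, assuming sparse vectors stored as {term: weight} dictionaries (the names cosine_similarity and rank are hypothetical):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs, top_k=10):
    """Return up to top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Toy weights (made up for illustration).
doc_vecs = {
    "d1": {"search": 1.0, "engine": 1.0},
    "d2": {"news": 2.0},
    "d3": {"search": 3.0},
}
query_vec = {"search": 1.0}
results = rank(query_vec, doc_vecs)
print(results)
```

Here d3 ranks above d1 because its vector points in exactly the same direction as the query, and d2 is dropped because it shares no terms with the query.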

Phase 3: Machine Learning Applied in Document Retrieval

In this phase, the search engine developed in the previous phases is further enhanced. To handle large volumes of input documents, the documents are clustered so that a query is compared only with the subset of documents in a cluster rather than with the entire collection. Additionally, news categorization is implemented to map each news article to a category, allowing users to see the news categories of search results.

K-means

In this stage, the documents are clustered using the K-means clustering algorithm. The algorithm can be run multiple times with different initial centroids, and the clustering with the lowest residual sum of squares (RSS) is selected.
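
A minimal sketch of this idea, assuming dense document vectors stored as lists of floats and taking RSS to be the sum of squared distances from each point to its assigned centroid. The function names, the number of runs, and the seed are illustrative assumptions, not values from the project.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, rng, max_iters=100):
    """One K-means run; returns (centroids, assignment, rss)."""
    centroids = rng.sample(points, k)  # random distinct points as initial centroids
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    rss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, rss


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat K-means with different random starts and keep the lowest-RSS result."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)), key=lambda result: result[2])


# Toy 2-D points (made up for illustration).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assignment, rss = best_of_runs(points, k=2)
print(assignment, rss)
```

At query time, one way to use the result is to compare the query vector with the centroids first and then search only the documents assigned to the closest centroid.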

KNN

For document categorization, the k-nearest neighbors (KNN) algorithm is used with different values of k. The category of a document is determined from the categories of its nearest neighbors.
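
A minimal sketch of KNN categorization, assuming sparse {term: weight} vectors, cosine similarity as the closeness measure, and a simple majority vote among the k nearest labeled documents. The source does not specify these choices, so they are assumptions, and the names and training data below are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query_vec, labeled_docs, k=3):
    """Return the most common label among the k most similar labeled documents."""
    neighbours = sorted(
        labeled_docs,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy labeled vectors (made up for illustration).
training = [
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"match": 1.0, "team": 1.0}, "sports"),
    ({"election": 1.0, "vote": 1.0}, "politics"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
]
article = {"match": 1.0, "goal": 0.5}
for k in (1, 3):
    print(k, knn_classify(article, training, k=k))
```

Trying several values of k, as in the loop above, is a simple way to see how sensitive the predicted category is to the neighborhood size.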

