This Python project analyzes and explores the speeches delivered in the Greek Parliament from 1989 to 2020. It provides a search engine along with several analytics tools: keyword extraction, Latent Semantic Indexing (LSI), clustering, Named Entity Recognition (NER), and pairwise similarity computation.
- Speech Search Engine: Utilizes advanced search algorithms to enable users to efficiently search through the vast repository of parliamentary speeches.
- Keyword Extraction: Automatically identifies and extracts key terms and phrases from the speeches, aiding in content summarization and analysis.
- Pairwise Similarity Computation: Calculates the similarity between speeches, allowing for comparative analysis and identifying similarities between different parliamentary sessions.
- Latent Semantic Indexing (LSI): Applies LSI to uncover hidden relationships between terms and documents, facilitating a deeper understanding of the content's underlying structure.
- Clustering: Groups speeches into clusters based on similarity, providing insights into thematic trends and patterns over time.
- Named Entity Recognition (NER): Identifies and categorizes named entities such as people, organizations, and locations mentioned in the speeches, enriching the data for further analysis.
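
As a rough illustration of the keyword extraction and pairwise similarity features listed above, the sketch below weights terms with TF-IDF and compares speeches with cosine similarity. It is a dependency-free approximation written for this README, not the project's actual code. The function names (`tfidf_vectors`, `top_keywords`, `cosine_similarity`) and the sample sentences are hypothetical, and the whitespace tokenizer stands in for the Greek stemming and stopword removal a real pipeline would need.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumption: no stemming or stopword removal)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * smoothed idf."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=3):
    """Return the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical English placeholders; the real corpus consists of Greek speeches.
speeches = [
    "the budget increases taxes on fuel",
    "the budget reduces taxes on income",
    "the new hospital opens in thessaloniki",
]
vectors = tfidf_vectors(speeches)
keywords = top_keywords(vectors[2])
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

With these placeholder sentences, the two budget speeches score a higher similarity than the budget and hospital speeches, and a word shared by every speech (such as "the") is ranked below rarer terms. On the real corpus, a library vectorizer and a proper Greek preprocessing step would be the more practical choice.
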
Each task is accompanied by its own README.md file. These files provide detailed instructions for setting up the environment, running the scripts, and interpreting the results specific to each task. Before diving into the analysis, it's recommended to refer to the respective README.md file to ensure a smooth setup process and to understand the expected output formats.
A demonstration of the application can be found at the following link: https://youtu.be/9pT36bUvwyo
- https://github.com/hb20007/hands-on-nltk-tutorial/blob/main/7-1-NLTK-with-the-Greek-Script.ipynb
- https://github.com/skroutz/greek_stemmer
- https://github.com/stopwords-iso/stopwords-el
- https://spacy.io/models/el
- https://en.wikipedia.org/wiki/Named-entity_recognition
- https://github.com/eellak/gsoc2018-spacy
- https://stackoverflow.com/questions/76206507/spacy-where-are-terminologies-defined
- https://whoosh.readthedocs.io/en/latest/schema.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
For any questions, email gatmatzidi@gmail.com.
This project originated as a university endeavor for the Information Retrieval course within the Bachelor's degree program in Computer Science at Aristotle University of Thessaloniki. We extend our gratitude to the instructor and fellow students whose feedback and collaboration contributed to the development and refinement of this project.