erogluegemen/TDK-Dataset

TDK Dataset

This dataset contains a collection of Turkish dictionary definitions extracted from the official website of the Turkish Language Association (TDK). It provides comprehensive definitions for a wide range of Turkish words and phrases.

The dataset is intended to be a valuable resource for researchers, linguists, language enthusiasts, and anyone interested in the Turkish language. It can be used for various purposes, such as natural language processing tasks, language analysis, and educational projects.

Please note that the dataset is provided for informational purposes and should not be used for official or legal purposes. The definitions are based on the TDK's official dictionary and may not cover all possible meanings or variations in usage.

Disclaimer: This dataset is created by an individual and is not an official TDK dataset. It is provided "as is" without any warranty or guarantee of accuracy. Users should exercise their own judgment and discretion when using the data.

Sources:

The data is sourced from the official website of the Turkish Language Association (TDK), the authoritative body responsible for the development and regulation of the Turkish language. It was obtained through the publicly available API provided by TDK, which gives access to dictionary definitions for Turkish words and phrases.

The scraping and data extraction process involved accessing the TDK website's API endpoint for retrieving definitions based on unique word identifiers. This process was done programmatically, adhering to the terms of service and guidelines provided by TDK.

For more information about the TDK and its official language resources, please visit https://sozluk.gov.tr/

Collection Methodology

The following steps were involved in the data collection process:

  1. Identify the unique word identifiers: A range of word identifiers or IDs was determined to cover a substantial number of words in the Turkish language.
  2. Access the TDK API: The API endpoint for dictionary definitions was accessed using the identified word identifiers.
  3. Retrieve and store the data: The data, including word definitions and associated metadata, was retrieved from the API response and stored in a structured format.
  4. Data cleaning and processing: The collected data was cleaned and processed to ensure consistency and proper formatting.

The web scraping process was designed to be respectful and compliant with the terms of service and guidelines provided by TDK. The data collection focused solely on retrieving dictionary definitions, and no other personal or sensitive information was accessed or stored.
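
The collection steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this dataset: the endpoint path `gts_id`, the `id` query parameter, the response field names (`madde`, `anlamlarListe`, `anlam`), and the helper names `fetch_entry`, `clean_entry`, and `collect` are all assumptions made for the example and should be checked against the live API before use.

```python
import json
import time
from typing import Callable, Dict, Iterable, List, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: endpoint path and parameter name are illustrative only.
BASE_URL = "https://sozluk.gov.tr/gts_id"


def fetch_entry(word_id: int, timeout: float = 10.0) -> Optional[list]:
    """Step 2: request the raw JSON for one word ID; None if nothing usable comes back."""
    url = f"{BASE_URL}?{urlencode({'id': word_id})}"
    request = Request(url, headers={"User-Agent": "tdk-dataset-sketch"})
    with urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumption: a hit is a JSON list of entries; any other shape is treated as a miss.
    return payload if isinstance(payload, list) and payload else None


def clean_entry(word_id: int, raw: list) -> Dict[str, object]:
    """Step 4: keep only the headword and its non-empty, whitespace-trimmed meanings."""
    entry = raw[0]
    meanings = [(m.get("anlam") or "").strip() for m in entry.get("anlamlarListe") or []]
    return {
        "id": word_id,
        "word": (entry.get("madde") or "").strip(),
        "meanings": [m for m in meanings if m],
    }


def collect(
    word_ids: Iterable[int],
    fetch: Callable[[int], Optional[list]] = fetch_entry,
    delay: float = 0.0,
) -> List[Dict[str, object]]:
    """Steps 1-3: walk a range of IDs, fetch each one, and store the cleaned rows."""
    rows = []
    for word_id in word_ids:
        raw = fetch(word_id)
        if raw:
            rows.append(clean_entry(word_id, raw))
        time.sleep(delay)  # pause between requests to stay polite to the server
    return rows
```

A call such as `collect(range(1, 101), delay=0.5)` would walk the first hundred IDs with a half-second pause between requests. Passing the fetch function as a parameter keeps the loop easy to test offline and makes it simple to swap in a parallel fetcher later.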

More information is available here:
Kaggle: https://www.kaggle.com/datasets/erogluegemen/tdk-turkish-words
HuggingFace: https://huggingface.co/datasets/erogluegemen/TDK_Turkish_Words
Medium: https://erogluegemen.medium.com/sequential-vs-parallel-processing-in-python-ef0ef3cc34c9