The HashTable Tokenizer is a C++ implemented hash table designed for efficient tokenization. This project provides a solution to convert text into numeric tokens and vice versa, facilitating the intermediate processing of large language modele. The implementation uses a custom hash function with separate chaining to resolve collisions, optimized for quick retrieval and insertion of token-word pairs.
- Custom Hashing: Implements a unique hashing function for distributing words efficiently across the hash table.
- Separate Chaining: Utilizes separate chaining to handle collisions, ensuring fast access even with hash conflicts.
- No STL Libraries: Developed without the use of Standard Template Library (STL) to meet project constraints.
- Efficient Tokenization: Supports operations to tokenize text to numeric tokens and retrieve text from tokens.
- Memory Management: Careful handling of memory allocation and deallocation to prevent leaks and ensure optimal performance.
Ensure you have a C++ compiler installed on your system (GCC recommended). This project does not require any external libraries.
To compile the project, navigate to the project directory and use the provided Makefile:
make
After compilation, you can run the program using:
./a.out < input_file
Where input_file
is a file containing a sequence of commands as described in the "Commands" section.
This project accepts the following commands:
M m
: Initializes a new hash table with sizem
.INSERT word
: Inserts a word into the tokenizer.READ filename
: Reads words from a specified file.TOKENIZE word
: Returns the token associated with the word.RETRIEVE t
: Retrieves the word associated with tokent
.STOK string_of_words
: Tokenizes a string of words.TOKS string_of_tokens
: Turns a string of tokens back into words.PRINT k
: Prints the keys stored at positionk
in the hash table.
The input files should end with the string "EXIT", which will terminate the program. The output for each command is either a success or failure message, a token, a word, or a series of tokens or words depending on the command.
For a detailed explanation of the program's design and implementation, refer to the design document included in the repository.
To test the project with Valgrind for memory leaks:
valgrind ./a.out < test01.in
Replace test01.in
with your test input file.
Contributions to the HashTable Tokenizer are welcome. Please feel free to fork the repository, make changes, and submit pull requests.