07 Apr 2018: JATE 2.0 Beta.11 released. The main changes include: 1) migration to Solr 7.2.1. WARNING: the index files created by this version of Solr are not compatible with previous versions; 2) fixes for a couple of minor bugs documented in the Issues page; 3) two more example configurations for the TTC corpora; 4) two new algorithms, Basic and ComboBasic; 5) an improved introduction page.
02 Apr 2018: JATE 2.0 Beta.9 released. The main change is the migration to Solr 6.6.0 (thanks to MysterionRise). WARNING: the index files created by this version of Solr are not compatible with previous versions. Please consider this before upgrading!
- Introduction
- Cite JATE
- Reasons for using JATE
- Support
- Contributing
- Other downloads
- License
- Contact
- Release history
JATE (Java Automatic Term Extraction) is an open source library for Automatic Term Extraction (or Recognition) from text corpora. It is implemented within the Apache Solr framework (currently Solr 7.2.1) and supports more than 10 ATE algorithms and almost any kind of term candidate pattern. The integration with Solr allows JATE to be easily customised and adapted to different document formats, domains, and languages.
JATE is not just a library for ATE. It also implements several text processing utilities that can be easily used for other general-purpose indexing, such as tokenisation and advanced phrase and n-gram extraction. See Reasons for using JATE.
Please support us by citing JATE as below:
If you use the version from this Git repository: Zhang, Z., Gao, J., Ciravegna, F. 2016. JATE 2.0: Java Automatic Term Extraction with Apache Solr. In The Proceedings of the 10th Language Resources and Evaluation Conference, May 2016, Portorož, Slovenia
If you use the old JATE 1.11 available here (no longer supported except an outdated JATE 1.0 wiki page): Zhang, Z., Iria, J., Brewster, C., and Ciravegna, F. 2008. A Comparative Evaluation of Term Recognition Algorithms. In Proceedings of The 6th Language Resources and Evaluation Conference, May 2008, Marrakech, Morocco.
A wide range of ATE tools and libraries have been developed over the years. In comparison, there are five reasons why JATE is unique:
- Free to use, for commercial or non-commercial purposes.
- Built on the Apache Solr framework to benefit from its powerful text analysis libraries, high compatibility and scalability, and rigorous community support. As examples, you can plug in the Tika library to process different document formats, use different text preprocessing (e.g., character filtering, HTML entity conversion), tokenisation and normalisation methods available through Lucene, or index your documents and boost your queries with extracted terms easily thanks to its integration with Solr.
- Highly configurable linguistic processors for candidate term extraction, such as noun phrases, PoS patterns, and n-grams.
- 10 state-of-the-art ATE scoring and ranking algorithms.
- A set of highly configurable, complex text processing utilities that can be used as Solr plugins for general-purpose text indexing and retrieval. Examples include a sentence splitter, statistical tokeniser, lemmatiser, PoS tagger, phrase chunker, and n-gram extractors that are aware of sentence context and can remove stopwords.
For terminology practitioners, this means you can quickly build highly customisable ATE tools that suit your data and domain, at no cost. For terminology researchers and developers, this means that you have many necessary building blocks for developing novel ATE methods, and a uniform environment where you can evaluate and compare different methods. For general information retrieval users, you have a range of advanced text processing utilities that you can easily plug into your existing Solr or Lucene based indexing and retrieval applications.
JATE is currently maintained by a team of two members, who have other full-time roles but spend as much of their spare time as possible on this work. We try our best to respond to your queries, but we apologise for any delays this may cause. However, there are many ways you can contribute to JATE to make it better. Currently you can obtain support from us in the following ways:
- A wiki page to get you started.
- A Google Group to ask questions about JATE.
- An issues page to report bugs - only bug reporting please. For any questions please use the Google Group above.
- Contact the team directly - please use this only if your query does not fall into any of the above categories.
JATE is research software that originates from an EPSRC funded project, 'Abraxas'. As you may appreciate, since the project ended there has been no more funding to support the software, and therefore all subsequent development and its current maintenance have been undertaken voluntarily by the team. JATE is far from perfect, and yet we are thrilled to see it becoming one of the most popular free text mining tools in the community, thanks to your support. We are also keen to make it better, and therefore we would be grateful for your contributions in many forms:
We would be grateful if you could tell us a little more about your use cases with JATE: are you using JATE to conduct cutting-edge research in another (or the same) subject area? Or are you using JATE to enable your business applications? By sharing detailed use cases, you help us make a compelling case when applying for funding from various institutions to support the development and maintenance of JATE. Please get in touch with us by email and share your story with us - it costs you no money but just a little of your time!
We are keen to collaborate with any partners (academia or industry) to develop new project ideas. This can include, but is not limited to, any of the following:
- further development of JATE, by adding new algorithms, text processing capacities, a user-friendly interface, support for other programming languages, etc.
- integration with other, existing implementations of ATE methods, frameworks, or platforms.
- using ATE for downstream applications, such as ontology engineering, information retrieval etc.
Please get in touch with us by email to discuss your ideas.
We welcome bug fixes, improvements, new features, etc. Before embarking on significant changes, please open an issue and ask first so that you do not risk duplicating effort or spending time on something that may be out of scope. To contribute code, please follow these steps:
1. Fork the project, clone your fork, and add the upstream to your remote:
$ git clone git@github.com:<your-username>/jate.git
$ cd jate
$ git remote add upstream https://github.com/ziqizhang/jate
2. If you cloned a while ago, get the latest changes from upstream:
$ git checkout master
$ git fetch upstream
$ git merge upstream/master
3. Create a new topic branch for your feature, change, or fix:
$ git checkout -b <feature-branch-name>
4. Please try to commit your changes in logical chunks and reference the issue number in your commit messages:
$ git commit -m "Issue #<issue-number> - <commit-message>"
5. Push your topic branch to your fork:
$ git push origin <feature-branch-name>
6. Open a Pull Request against the upstream master branch. Please give your pull request a clear title and description and note which issue(s) your pull request fixes.
Important: By submitting a patch, you agree to allow the project owners to license your work under the LGPLv3 license.
A crucial resource for developing ATE methods is data, particularly 'annotated' data that consists of text corpora together with a list of the 'real' terms expected to be found within those corpora. We call this a 'gold standard'. It is critical for evaluating and improving the performance of ATE in particular domains.
If you would like to share any data you have created, please also get in touch by email. We will credit you and share a download within the Other downloads section, subject to your consent.
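To illustrate how a gold standard is typically used in evaluation, the following is a minimal sketch in Java that scores a ranked list of extracted terms against a set of gold-standard terms using precision at K and recall. The class `GoldStandardEval`, its method names, and the sample terms are all hypothetical and are not part of the JATE API. Exact matching after trimming and lower-casing is an assumed simplification.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of JATE): scores ranked terms against a gold standard. */
public class GoldStandardEval {

    /** Trims and lower-cases a term so matching ignores case and outer whitespace. */
    static String normalise(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the set of normalised forms of the given terms. */
    static Set<String> normaliseAll(Iterable<String> terms) {
        Set<String> out = new HashSet<>();
        for (String t : terms) {
            out.add(normalise(t));
        }
        return out;
    }

    /** Fraction of the top-k ranked candidates that appear in the gold standard. */
    static double precisionAtK(List<String> ranked, Set<String> gold, int k) {
        int n = Math.min(k, ranked.size());
        if (n <= 0) {
            return 0.0;
        }
        Set<String> goldNorm = normaliseAll(gold);
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (goldNorm.contains(normalise(ranked.get(i)))) {
                hits++;
            }
        }
        return (double) hits / n;
    }

    /** Fraction of gold-standard terms found anywhere in the candidate list. */
    static double recall(List<String> ranked, Set<String> gold) {
        Set<String> goldNorm = normaliseAll(gold);
        if (goldNorm.isEmpty()) {
            return 0.0;
        }
        Set<String> found = normaliseAll(ranked);
        found.retainAll(goldNorm); // keep only candidates that are gold terms
        return (double) found.size() / goldNorm.size();
    }

    public static void main(String[] args) {
        // Made-up sample data, ordered from highest to lowest score.
        List<String> ranked = Arrays.asList(
                "term extraction", "the corpus", "Solr index", "noun phrase");
        Set<String> gold = new HashSet<>(Arrays.asList(
                "term extraction", "noun phrase", "gold standard"));

        System.out.println("P@2    = " + precisionAtK(ranked, gold, 2));
        System.out.println("Recall = " + recall(ranked, gold));
    }
}
```

In this sample, one of the top two candidates is a gold term, so precision at 2 is 0.5, and two of the three gold terms are recovered overall. Real evaluations often apply further normalisation (for example lemmatisation) before matching, which this sketch leaves out.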
This Git repository only hosts the most recent version of JATE. You can obtain some of the previous versions below:
- JATE 1.11: download here
- Other JATE 2.0 based versions in the Maven central repository
We share datasets used for the development and evaluation of ATE below.
- Ziqi Zhang's research data page contains 4 datasets used for ATE research.
JATE is licensed under LGPL 3.0, which permits free commercial and non-commercial use. See details here.
The team members' personal webpages contain their email contacts.
- JATE 2.0 Beta.11 version - 07 Apr 2018
- JATE 2.0 Beta version - 20 May 2016
- JATE 2.0 Alpha version - 04 Apr 2016