Resources, Tools, Repositories

Free NLP components

FSU Jena has started making its NLP components for UIMA freely available for download on its website.


The BioLexicon is a large-scale lexical resource designed specifically to contain and manage data from bio-databases. The lexicon data model has been designed to comply with the ISO-ratified international standards for lexicons; its associated data categories also reflect the ISO Data Category conceptual model available in the ISO Registry.


The Gene Regulation Ontology (GRO) is a rigorously crafted formal ontology covering the domain of gene regulation. It integrates knowledge that is partially assembled in the alternative ontological resources we relied upon, viz. the Gene Ontology (GO), the Sequence Ontology (SO), Chemical Entities of Biological Interest (ChEBI), INOH Molecule Role (IMR), INOH Event Ontology (IEV), the NCBI Taxonomy and TransFac.

E. coli Corpora

Four corpora relevant to E. coli are available:

Text Analytics Toolkit

The BOOTStrep Text Analytics Toolkit is a collection of almost forty human language technology modules which cover virtually all phases of text analytics: text segmentation (sentence splitting, tokenisation), morpho-lexical analysis (stemming, lemmatisation, acronym and abbreviation resolution, term recognition), syntactic analysis (part-of-speech tagging, coordination resolution, chunking, parsing), semantic analysis (named entity recognition and interpretation, relation and event extraction) and discourse-level analysis (co-reference resolution). Some modules could be employed on an as-is basis (e.g., statistical term recognition systems such as TerMine, or rule-based parsers such as Enju), but machine learning-based systems in particular had to be retrained. This required the creation of training material, so BOOTStrep partners had to develop several text corpora annotated with formal text structure, syntactic, semantic and discourse information.
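
To make the first phase listed above concrete, the following is a minimal Python sketch of text segmentation (sentence splitting followed by tokenisation). It is not taken from the BOOTStrep toolkit: the function names split_sentences, tokenise and segment are hypothetical, the regular expressions are deliberately naive, and the sample text is invented for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (illustrative assumption, not a toolkit module).

    Splits on whitespace that follows '.', '!' or '?' and precedes an
    uppercase letter.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenise(sentence):
    """Naive tokeniser: hyphen-joined word runs, or single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def segment(text):
    """Run both segmentation steps and return one token list per sentence."""
    return [tokenise(s) for s in split_sentences(text)]


if __name__ == "__main__":
    # Invented sample text, used only to exercise the two steps.
    sample = "The regulator binds the promoter. Transcription is then repressed."
    for tokens in segment(sample):
        print(tokens)
```

A splitter this simple will break wrongly after abbreviations that are followed by a capitalised word, which is one reason the later phases in the list are typically handled by dedicated, trained components rather than by hand-written rules.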


The BOOTStrep BioFactStore is a database for biological researchers and developers that contains factual information (empirical assertions, statements) about gene regulation in E. coli. The BioFactStore merges RegulonDB, the most authoritative manually curated database of regulatory networks in many species, with the automatically harvested factoids relating to gene regulation in E. coli from the Knowledge Reaper (see (5)). It reports not only on selected types of events and the agents and patients involved, but also on the polarity of the relation and on physical contact data. Modality (the certainty of the information) is covered, as are additional parameters such as environment parameters: