This article is part of the supplement: 7th German Conference on Chemoinformatics: 25 CIC-Workshop

Open Access Poster presentation

Large scale chemical patent mining with UIMA and UNICORE

Alexander Klenner1*, Sandra Bergmann2, Marc Zimmermann1 and Mathilde Romberg2

Author Affiliations

1 Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, 53754, Germany

2 Forschungszentrum Juelich GmbH, Juelich, 52425, Germany


Journal of Cheminformatics 2012, 4(Suppl 1):P19  doi:10.1186/1758-2946-4-S1-P19


The electronic version of this article is the complete one and can be found online at: http://www.jcheminf.com/content/4/S1/P19


Published: 1 May 2012

© 2012 Klenner et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Poster presentation

Finding information about annotated chemical reactions for drugs and small compounds is a crucial step for the pharmaceutical industry. This data is often presented in unstructured documents (especially patents), and extracting it manually is time-consuming and costly.

In our project UIMA-HPC [1], we describe the combined usage of the Unstructured Information Management Architecture (UIMA) and the Uniform Interface to Computing Resources (UNICORE) for large-scale chemical patent mining. Our approach will incorporate existing software such as chemoCR for image processing (image-to-structure) and OCR for text reconstruction. All components are wrapped inside the UIMA framework pipeline. Using the UIMA framework ensures compatibility between the different components of the pipeline and makes it possible to connect arbitrary annotation modules to the system. Scale-out for large document collections is achieved by the UNICORE framework on High Performance Clusters (HPC), which enables parallelization of all UIMA nodes. The aim is a fully annotated PDF collection in which all biomedical entities (compound names, reaction schemes, etc.) are connected by references and can thus be easily browsed and searched by the user. The planned schematic workflow is shown in Figure 1.
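
The idea of chaining interchangeable annotation components over a shared document structure, and then distributing whole documents across workers, can be illustrated with a minimal Python sketch. It does not use the UIMA or UNICORE APIs: the `Document` and `Annotation` classes, the two toy annotators, and the thread pool standing in for cluster job distribution are all hypothetical names introduced here for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
import re


@dataclass
class Annotation:
    kind: str    # e.g. "compound_name"
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    source: str  # name of the component that produced the annotation


@dataclass
class Document:
    doc_id: str
    text: str
    annotations: list = field(default_factory=list)


def toy_compound_annotator(doc):
    # Hypothetical stand-in for a real name recognizer: tags a fixed word list.
    for m in re.finditer(r"\b(aspirin|ibuprofen)\b", doc.text, re.IGNORECASE):
        doc.annotations.append(
            Annotation("compound_name", m.start(), m.end(), "toy_compound"))
    return doc


def toy_reaction_annotator(doc):
    # Hypothetical stand-in for a reaction recognizer: tags "react" word forms.
    for m in re.finditer(r"\breact(?:ed|s|ion)?\b", doc.text, re.IGNORECASE):
        doc.annotations.append(
            Annotation("reaction_term", m.start(), m.end(), "toy_reaction"))
    return doc


def run_pipeline(doc, components):
    # Every component reads and extends the same document object, so
    # components can be added, removed, or reordered independently.
    for component in components:
        doc = component(doc)
    return doc


def process_collection(docs, components, workers=4):
    # Documents are independent of each other, so the collection can be
    # split across workers. A local thread pool is used here purely as a
    # placeholder for distributing jobs to cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: run_pipeline(d, components), docs))


if __name__ == "__main__":
    collection = [
        Document("p1", "Aspirin was reacted with acetic anhydride."),
        Document("p2", "No entities here."),
    ]
    pipeline = [toy_compound_annotator, toy_reaction_annotator]
    for doc in process_collection(collection, pipeline):
        print(doc.doc_id, [(a.kind, doc.text[a.begin:a.end]) for a in doc.annotations])
```

Because each component only appends to the shared annotation list, swapping the toy recognizers for real modules would not change `run_pipeline` or `process_collection`, which is the property the framework-based design relies on.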

Figure 1. Planned workflow of our UIMA framework. 'Recognition' and 'annotation' are CPU-intensive parts that are parallelized on demand using the UNICORE framework. 'Merging' checks for cross-annotations (entity in text and image). Finally, an annotated PDF is presented as output.
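
As a rough illustration of what a 'merging' step could look like, the Python sketch below links entities found in text with entities found in images when both carry the same identifier. The source does not describe how merging is implemented; the dictionary layout, the `key` field (imagined here as some normalized structure identifier), and the function name are assumptions made for this example.

```python
from collections import defaultdict


def merge_cross_annotations(text_entities, image_entities):
    """Group entities by a shared key and keep keys seen in both sources.

    Each entity is a dict with at least a "key" entry (hypothetical
    normalized identifier). The result maps each key that occurs in text
    AND in an image to the matching entities from both sources.
    """
    by_key = defaultdict(lambda: {"text": [], "image": []})
    for entity in text_entities:
        by_key[entity["key"]]["text"].append(entity)
    for entity in image_entities:
        by_key[entity["key"]]["image"].append(entity)
    # Keep only true cross-annotations: present in text and in an image.
    return {k: v for k, v in by_key.items() if v["text"] and v["image"]}


if __name__ == "__main__":
    text_hits = [{"key": "cmpd-A", "page": 1}, {"key": "cmpd-B", "page": 2}]
    image_hits = [{"key": "cmpd-A", "page": 3}]
    print(merge_cross_annotations(text_hits, image_hits))
```

Entities that appear in only one source are left out of the result here; a real system might instead keep them as unlinked annotations.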

Funding

BMBF grant 01IH1101.

References

  1. UIMA-HPC project website: http://www.uima-hpc.org