If you use LA-PDFText in your project, please cite us as follows:
Ramakrishnan, C., A. Patnia, E. Hovy and G. Burns (2012). "Layout-Aware Text Extraction from Full-text PDF of Scientific Articles." Source Code for Biology and Medicine 7(1): 7. [http://www.scfbm.org/content/7/1/7/abstract]
The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and handle computationally, presenting challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems we introduce Layout-Aware PDF Text Extraction (LA-PDFText).
See Overview for a list of commands that you can execute with this tool. This includes simple and more detailed text extraction from PDF files.
LA-PDFText has been developed by members of the Biomedical Knowledge Engineering group at the Information Sciences Institute. It is intended for use by both scientists and NLP engineers who need access to the text within specific sections of research articles. The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize. This means that the system works quite well for most applications, though it might occasionally make mistakes and extract the wrong text; in those cases you can always 'hack' your own rules to improve performance.
For questions about future development or support of the current tool, please contact Gully Burns (email@example.com). For discussions concerning the work contributing to this project, please contact any member of the research team: Gully Burns, Cartic Ramakrishnan (firstname.lastname@example.org) or Ed Hovy (email@example.com).