e-Xploration
antropologiaNet, dataviz, collective intelligence, algorithms, social learning, social change, digital humanities
Curated by luiy
Scooped by luiy

lapdftext - Layout-Aware Text #Extraction from Full-text PDF of Scientific Articles | #semantic #scientometrics

luiy's insight:

Publications

 

If you use LA-PDFText in your project, please cite us as follows:

Ramakrishnan, C., A. Patnia, E. Hovy and G. Burns (2012). "Layout-Aware Text Extraction from Full-text PDF of Scientific Articles." Source Code for Biology and Medicine 7(1): 7. [http://www.scfbm.org/content/7/1/7/abstract]

 

Introduction

 

The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and handle computationally, presenting challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems we introduce Layout-Aware PDF Text Extraction (LA-PDFText).

See Overview for a list of commands that you can execute with this tool. This includes simple and more detailed text extraction from PDF files.

 

LA-PDFText has been developed by members of the Biomedical Knowledge Engineering group at the Information Sciences Institute. It is intended for use by both scientists and NLP engineers interested in getting access to text within specific sections of research articles. The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize. This means that the system works quite well for most applications (though it might occasionally make mistakes and extract the wrong text), but it is always possible to 'hack' your own rules and improve performance.

 

For questions about future development or support of the current tool, please contact Gully Burns (gully@usc.edu). For discussions concerning the work contributing to this project, please contact any of the research team: Gully Burns, Cartic Ramakrishnan (rcartic@gmail.com) or Ed Hovy (hovy@isi.edu).

Rescooped by luiy from The New Global Open Public Sphere

The Public Sphere of the 21st Century, by Pierre Lévy


The public sphere of the 21st century


Via Pierre Levy