Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books [r-libre/202]

Kaser, Owen and Lemire, Daniel (2007). Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books. In Spencer, Bruce; Story, Margaret-Ann and Stewart, Darlene (eds.), Proceedings of the 2007 Conference of the Center for Advanced Studies on Collaborative Research (CASCON '07). Riverton, NJ, USA: IBM.

File(s) associated with this document:

PDF - gutheader_CASCON2007.pdf

Document type:	Papers in conference proceedings
Peer reviewed:	Yes
Publication status:	Published
Abstract:	Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
Depositor:	Lemire, Daniel
Contact person:	Daniel Lemire
Deposited:	16 Jul 2007
Last modified:	16 Jul 2015 00:47
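
The abstract mentions statistical techniques based on frequent occurrences. As a rough illustration only, and not the authors' actual algorithm, the Python sketch below flags lines that recur in a large share of documents and trims them from the start and end of a text. The function names (`normalize`, `frequent_lines`, `strip_boilerplate`) and the `min_fraction` threshold are hypothetical choices made for this example.

```python
from collections import Counter


def normalize(line):
    """Lowercase a line and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(line.lower().split())


def frequent_lines(documents, min_fraction=0.5):
    """Return normalized lines that occur in at least min_fraction of the documents."""
    doc_counts = Counter()
    for text in documents:
        # Count each distinct non-empty line at most once per document.
        seen = {normalize(line) for line in text.splitlines()}
        seen.discard("")
        doc_counts.update(seen)
    threshold = min_fraction * len(documents)
    return {line for line, count in doc_counts.items() if count >= threshold}


def strip_boilerplate(text, frequent):
    """Drop leading and trailing lines that are blank or frequent across the corpus."""
    lines = text.splitlines()

    def is_removable(line):
        return not line.strip() or normalize(line) in frequent

    start = 0
    while start < len(lines) and is_removable(lines[start]):
        start += 1
    end = len(lines)
    while end > start and is_removable(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])
```

A caller would first build the frequent-line set over the whole corpus with `frequent_lines(documents)` and then pass each text through `strip_boilerplate`. Exact line matching of this kind would miss preambles that were retyped with small variations, which is one reason a purely rule-like pass is unlikely to cover every document.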
