heureCLÉA: Project Report

heureCLÉA was a project funded by the BMBF (German Federal Ministry of Education and Research) that ran from 2013 to 2016. In this interdisciplinary research initiative two teams worked hand in hand: a group of literary scholars based at Hamburg University (PI: Jan Christoph Meister) and a group of computer scientists based at Heidelberg University (PI: Michael Gertz). heureCLÉA’s overall goal was to explore ways of bridging the often-stated methodological gap between qualitative, hermeneutically inspired text analysis in Literary Studies and automated, machine-learning-based approaches in Computer Science that model textual phenomena statistically. As we were able to show, this gap does not really exist: our two approaches to text are not mutually exclusive; rather, they represent complementary positions on a methodological continuum.

During the active project phase this website was used to document our ongoing work and incremental research output. The site now serves as a long-term project archive, which documents the major results we jointly achieved in a highly innovative and stimulating cooperation across two disciplines.

Main results

heureCLÉA featured two main work packages: (1) corpus creation/collaborative manual annotation, and (2) machine learning/automating annotations.

(1) Annotated corpus

Using a collaborative manual annotation approach, heureCLÉA produced an annotated corpus of considerable complexity:

  • 21 German-language short stories comprising a total of 80,000 tokens were annotated manually
  • a set of 57 narratological concepts was marked up using the annotation tool CATMA
  • annotated concepts spanned six distinct categories of temporal phenomena and narrative levels
  • a total of 32,000 annotation instances was generated.

The annotations were produced by trained annotators working from detailed annotation guidelines. Each annotation satisfies one of two criteria, namely either

  • the criterion of voluntarily achieved inter-annotator agreement, or
  • the criterion of “informed disagreement” in cases where annotators chose different annotations but were able to state their narratological rationale for the choice of categories, which was also captured in a meta-annotation.

Adherence to these principles makes the heureCLÉA-based annotations highly relevant for further investigation both in the fields of literary studies and of natural language processing.

(2) Machine learning

In parallel, annotation was automated for three of the six manually annotated narratological categories, using the human-generated markup as a training corpus. These categories were tense, temporal signals, and temporal order. Two of these automated annotation functions, which produce robust results, were then integrated into the text analysis and annotation tool CATMA. Implementation of the third automatic annotation function will follow shortly.

In addition to these predefined project goals, a number of further results were achieved. These include

  • the development and methodological specification of a collaborative annotation workflow that fits the needs of both narratological and hermeneutic/interpretive annotation, and
  • the generation of new theoretical insights into the interdependency of specific narrative phenomena hitherto not noticed in narratological research.

heureCLÉA’s results are either publicly available (on this website as well as on www.catma.de) or can be made available upon request. All code and software developed in heureCLÉA is open source. Reuse of the annotated corpus is planned in a number of projects, including the development of a collaborative text exploration platform in the forTEXT initiative (Hamburg).