Below you'll get a first idea of our corpus and our research. Note that more comprehensive documentation of the corpus, both technical and philological, is currently being prepared for publication on this web page (as well as in papers). Stay tuned!
What's behind the acronym?
KOLIMO stands for "Corpus of Literary Modernism". More precisely, KOLIMO is a digital comparative corpus of German narrative literary Modernism. We built the corpus for our own style analysis, but since we're convinced that such a resource should be shared with the research community, the corpus compilation has been carried out with publication in mind. However, do note that KOLIMO is NOT a "corpus tool". The web interface allows researchers to run queries, but not to conduct analysis proper. We're preparing a download option on this website.
"Beta" means that we're not fully there yet, but feel confident enough to share what we have with you.
- KOLIMO is now available in its present (beta) version in zip folders. You can download headers and text bodies in XML, or opt for separate folders for headers and plain texts (TXT), as well as for an overview of all contents in table-format. Check them out here: KOLIMO public
- The web interface offers an overview of KOLIMO's content and metadata (as of now). It also allows full-text queries, and will facilitate a POS view.
- Please note that KOLIMO still contains more "noise" than we'd like. For one thing, there is noise in the metadata: as these things go, the metadata are sometimes faulty (e.g., typos in authors' names, or titles of texts in the authors' place), sometimes missing (e.g., dates of publication), sometimes quite rough approximations (for some cases we've needed to compute an "average" date of publication), or in need of a more consistent schema (we're still working on our classification of narrative genre). Our text documents are also still noisy (see below), and we're working on the issue of "representativeness" in text sampling.
- The web interface is newly minted and may not always run smoothly. Please let us know in case you experience any kind of technical difficulty. The community's feedback is essential to making the database better. Here's our address for feedback: email@example.com
What does "digital comparative narrative corpus of German literary Modernism" mean?
- First of all, you'll note that although KOLIMO is a "literary corpus", it does contain a great deal of non-literary text.
- Also, despite being a "Modernism corpus" (Modernism referring to the period roughly stretching from 1880 to 1930), KOLIMO contains thousands of texts from older periods.
- And finally, KOLIMO is designed as a corpus of narrative texts. These come in many forms, i.e. genres, and there are gray areas.
We've designed KOLIMO as such for a reason:
- In order to say anything about "literature", we need to compare literary artifacts to non-literary ones (i.e. texts that were published as non-literary ones and/or that have been judged by us as non-literary texts on philological grounds*).
- Along the same lines, we need texts from before 1880 to say anything about "Modernism".
- We know that poetry and dramatic texts have properties quite different from narrative texts. Since our main aim is a style analysis of narrative text types that is as precise as possible, we decided to keep the main genre constant. KOLIMO will thus eventually contain only narrative texts. We're working on it. However, in the current beta version, you'll note that there is still some poetry in there, as well as some drama. Such texts will be removed in our next round of cleanup.
So, KOLIMO is a comparative narrative corpus. For this reason it needs more than literature and more than Modernism, but nothing other than narrative texts.
KOLIMO is for style analysis
KOLIMO is the backbone of the ongoing corpus-stylistic project Q-LIMO (Quantitative and Qualitative Analysis of Literary Modernism). As such, it is intended to be an eventually representative corpus of Modern German narrative Literature. It is also intended to be successively enriched by more metadata, as well as multi-layered linguistic and literary-aesthetic annotations.
Our aim in making KOLIMO is to enable quantitative and qualitative-hermeneutic style analyses for German narrative texts from the exciting period at the turn of the 20th Century - across such variables as narrative genre, author, and time. For instance, we are interested in the specifics of Franz Kafka's style - what kinds of words, phrases, and aesthetic figures are indicative of his writing? Can we trace any possible influences from non-literary discourse domains (such as juridical discourse), from other literary writers (such as the contemporary Robert Walser and, one century earlier, Heinrich von Kleist), or even non-canonical literary genres (such as adventure literature)? Note that KOLIMO contains KAREK, the "Kafka reference corpus". In our research project, where Modernism is the heart of KOLIMO, Kafka has been the heart of Modernism. This is not a value judgement, but one of research interest: In order to say something about Kafka's style, we need more contemporary authors, and also "older ones", for comparison.
KOLIMO is made of existing digital sources, but goes beyond
We have drawn the bulk of our texts from digital resources that are freely available. You might say: if they are out there, why bother? Well: although there are several existing repositories (e.g., the TextGrid Repository, the German Text Archive [DTA], Gutenberg-DE and Gutenberg.org), we have combined them because, taken together, they are more than the sum of their parts.
Before KOLIMO, despite a number of initiatives, a digital corpus of German narrative Literary Modernism was missing. Certainly there was none that aspires to be representative, that is designed for comparative analysis (see above), and that bears consistent and manually improved metadata. What is more, KOLIMO comes with a first set of linguistic annotation. KOLIMO is hence a unique resource. We are happy to make it publicly available. The metadata initially stem from the same sources as the texts, but we've worked hard on improving them, for example by adding and validating GND identifiers for the metadatum "author" and by filling in the blanks for "year of publication" (and there were many) using a rough, but practical, procedure. KOLIMO is stored in an XML database, eXist-db (eXist-db App for KOLIMO sources: Query and edit metadata). Each document features a standardized TEI header that comprises all kinds of useful information for each individual text, including the metadata, as well as selected style markers.
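To give an idea of how such a TEI header can be read programmatically, here is a minimal sketch in Python. It assumes a generic TEI P5 header layout (`teiHeader/fileDesc/titleStmt` and `publicationStmt`); the sample document, the field selection, and the function name `read_header` are hypothetical and do not reproduce the actual KOLIMO schema, which may store these values elsewhere.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document; the real KOLIMO headers may look different.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Der Prozess</title>
        <author>Kafka, Franz</author>
      </titleStmt>
      <publicationStmt>
        <publisher>hypothetical</publisher>
        <date>1925</date>
      </publicationStmt>
      <sourceDesc><p>hypothetical</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Jemand musste Josef K. verleumdet haben.</p></body></text>
</TEI>"""


def read_header(xml_string):
    """Return title, author and date from a TEI header (None if absent)."""
    root = ET.fromstring(xml_string)

    def first(path):
        node = root.find(path, TEI_NS)
        if node is None or node.text is None:
            return None
        return node.text.strip()

    file_desc = "tei:teiHeader/tei:fileDesc/"
    return {
        "title": first(file_desc + "tei:titleStmt/tei:title"),
        "author": first(file_desc + "tei:titleStmt/tei:author"),
        "date": first(file_desc + "tei:publicationStmt/tei:date"),
    }


if __name__ == "__main__":
    print(read_header(SAMPLE))
```

Returning `None` for missing fields, rather than raising an error, is a deliberate choice here, since some metadata (such as dates of publication) may be absent.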
What kind of style markers can I find in KOLIMO?
KOLIMO allows running quantitative profiles of language use. This essentially means "counting stuff at the language surface". We work with the basic assumption that "style" may be assessed by means of frequency counts of those textual features that are easily distinguishable by the computer: characters, syllables, words, sentences, and so on, which combine into measures such as word length, sentence length, type-token ratios, most frequent word lists, etc. Abundant stylometric research shows that this straightforward approach to style is quite fruitful. Of course, any more advanced analysis may start here, including cluster analysis of text similarity based on most frequent word counts (e.g., Delta), or entropy measures of lexical variation. Anything goes, as long as we have a hold on the discrete entities that make up the texts (characters, strings etc.).
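The kind of surface counting described above can be sketched in a few lines of Python. This is an illustration only, not the procedure used to compute the values in KOLIMO: the function name `surface_profile` is hypothetical, and the regex-based tokenization and sentence splitting are deliberately naive assumptions (abbreviations such as "z. B." would be split incorrectly).

```python
import re
from collections import Counter


def surface_profile(text, top_n=3):
    """Compute simple surface statistics for a plain-text string."""
    # Naive sentence split on ., ! and ? (an assumption, not a real segmenter).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased word tokens; \w also matches umlauts and ß in Python 3.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or not sentences:
        raise ValueError("text contains no countable tokens")
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
        "most_frequent": counts.most_common(top_n),
    }


if __name__ == "__main__":
    print(surface_profile("Der Hund bellt. Der Hund schläft."))
```

Note that the type-token ratio depends strongly on text length, so comparisons across texts are usually made on samples of equal size.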
This being said, we also believe in the value of linguistic (and other types of) annotation: KOLIMO has been tagged for part-of-speech (POS), and we've run readability analyses (Flesch index, Wiener Sachtextformel). The readability values are stored in the TEI headers, as are basic descriptive statistics about word counts. More types of annotation are coming up (e.g., we're working on metaphor, and have experimented with annotation of speech/thought representation).
We believe in the benefit of precision, for example when annotators collaborate in a controlled fashion, with their agreement tested by measures of inter-coder reliability. To ingest new annotations by multiple annotators, we chose a stand-off XML format for our database (eXist). What is more, eXist allows for effortless publishing of data on the web and powerful queries (XQuery). We have also been experimenting with annotation applications for eXist.
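For readers unfamiliar with readability indices, the following Python sketch shows how a Flesch-type score is computed from word, sentence, and syllable counts. It assumes Amstad's adaptation of the Flesch Reading Ease formula for German (180 minus average sentence length minus 58.5 times average syllables per word); which variant was used for the values stored in KOLIMO is not stated here, so treat this as an illustration. The vowel-group syllable counter and both function names are hypothetical simplifications.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of adjacent vowels."""
    groups = re.findall(r"[aeiouyäöü]+", word.lower())
    return max(1, len(groups))


def flesch_amstad(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease in Amstad's German adaptation (assumed variant)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    avg_sentence_length = n_words / n_sentences
    avg_syllables_per_word = n_syllables / n_words
    return 180.0 - avg_sentence_length - 58.5 * avg_syllables_per_word


if __name__ == "__main__":
    # 100 words in 10 sentences with 200 syllables in total.
    print(flesch_amstad(100, 10, 200))
```

Higher scores indicate text that is easier to read. The syllable heuristic treats every vowel cluster as one syllable, which is only an approximation for German.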
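As an example of an inter-coder reliability measure, here is a minimal Python sketch of Cohen's kappa for two annotators. Cohen's kappa is chosen only for illustration; the text above does not say which measure is used in the project. The function name and the label values ("M" for metaphor, "N" for non-metaphor) are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["M", "M", "N", "N"]
    annotator_2 = ["M", "N", "N", "N"]
    print(cohens_kappa(annotator_1, annotator_2))
```

A value of 1 means perfect agreement, while 0 means agreement no better than chance. For more than two annotators, a different measure (such as Fleiss' kappa or Krippendorff's alpha) would be needed.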
Our digital style research is essentially an open research process, with hypothesis testing seconded by new and critical perspectives emerging from the analyses.
License and attribution information
KOLIMO is for research purposes only. We compiled KOLIMO as a set of German-language texts, extracting texts and metadata from the resources stated in the TEI header (and below). We corrected some of the existing metadata and added new metadata for (some) authors, (some) dates of publication, and genre, as well as style statistics. The repositories from which we have drawn our data all contain texts whose copyright has expired.
KOLIMO is made available under a Creative Commons license, in line with the licenses of the source repositories. The texts from TextGrid are available under a CC-BY attribution license (see textgrid digitale bibliothek). However, the Deutsches Textarchiv documents are made available under a CC BY-NC 3.0 license (non-commercial, see creative commons), meaning that commercial use of those texts is prohibited (see nutzungsbedingungen dta). Gutenberg-DE documents are made available in a way that we understand as a CC-BY-NC-SA license (non-commercial, share alike, see creative commons), meaning that commercial use of those texts is prohibited, and that remixing, transforming, or building upon the material is permitted only under the same license as the original. When using the corpus, please make sure to cite the sources of the texts as required by the distinct statements, and give credit to the KOLIMO team listed below for doing the text compilation and annotation, and for working on the metadata.
The "Gutenberg" section was extracted from Gutenberg-DE Edition 13 DVD-ROM (released in November 2013, see gutenberg-DE) and converted from HTML to XML and TXT; the "Deutsches Textarchiv" section was extracted from the "Referenzkorpus der neuhochdeutschen Sprache" (see Version 8 July 2015); and the "TextGrid" section was extracted from the "Texte Korpus Version II" (see textgrid digitale bibliothek).
How to cite KOLIMO beta
When pointing to the corpus in scholarly communication, please cite: Herrmann, Berenike / Lauer, Gerhard (2017): KOLIMO. A corpus of Literary Modernism for comparative analysis. URL: https://kolimo.uni-goettingen.de/about
More information on KOLIMO and its sub-corpus, the Kafka/reference corpus (KAREK), can be found here:
- Berenike Herrmann
- Gerhard Lauer
- Franz Mertins (student assistant)
- Lara Pahlitsch (student assistant)
- Markus Paluch (student assistant)
- Maxi Weß (student assistant)
- Ronald Weller (student assistant)
- Simone Rebora (visiting postdoc fellow)
- Mathias Göbel (eXist/xQuery)
- Gabriela Rotari (graduate student)
- Anastasiia Kysliak (student assistant)
- Christina Schmidt (student assistant)
- Benedict Spermoser (student assistant)