Modernizing Historical Documents: a User Study
Pattern Recognition and Human Language Technology Research Center
Universitat Politècnica de València - Camino de Vera s/n, 46022 Valencia, Spain
Accessibility to historical documents is mostly limited to scholars. This is due to the language barrier inherent in human language and the linguistic properties of these documents. Given a historical document, modernization aims to generate a new version of it, written in the modern version of the document’s language. Its goal is to tackle the language barrier, decreasing the comprehension difficulty and making historical documents accessible to a broader audience. In this work, we proposed a new neural machine translation approach that profits from modern documents to enrich its systems. We tested this approach with both automatic and human evaluation, and conducted a user study. Results showed that modernization is successfully reaching its goal, although it still has room for improvement.
Historical documents are an important part of our cultural heritage. However, the nature of human language, which evolves with the passage of time, and the linguistic properties of these documents—due to the lack of a spelling convention, orthography changes depending on the time period and author—increase the difficulty of comprehending them. For this reason, historical documents are mostly accessible to scholars.
Modernization aims to tackle this language barrier and increase the accessibility of historical documents to a broader audience. With this purpose, it generates a new version of a historical document, written in the modern version of the document’s original language. Figure 1 shows an example of modernizing a document. In this case, part of the language structures and rhymes have been lost. However, the modern version is easier for a broader audience to read and comprehend.
While normalizing orthography to account for the lack of a spelling convention has been extensively researched for years (Laing, 1993; Baron and Rayson, 2008; Porta et al., 2013; Hämäläinen et al., 2018), modernization of historical documents is a young research field. One of the first related works was a shared task for translating historical text to contemporary language (Tjong Kim Sang et al., 2017). The task was focused on normalizing the document’s spelling. However, they also approached document modernization using a set of rules. Domingo et al. (2017) proposed a modernization approach based on statistical machine translation (SMT). A neural machine translation (NMT) approach was proposed by Domingo and Casacuberta (2018). Finally, Sen et al. (2019) augmented the training data by extracting pairs of phrases and adding them as new training sentences.
In this work, we followed a machine translation (MT) approach to tackle the modernization problem. Similarly to Domingo and Casacuberta (2018), we profited from modern documents to enrich the modernization systems. However, we applied a data selection technique to make better use of these documents, selecting only the most relevant sentences for each task. We evaluated our approach both automatically and with the help of 4 scholars specialized in classic Spanish literature. Additionally, we conducted a user study with 42 people to assess whether modernization is able to decrease the difficulty of comprehending historical documents. Our main contributions are as follows:
We proposed a new NMT approach that successfully profits from modern documents to enrich its modernization systems.
We tested our proposal using 3 datasets from different languages and time periods.
We assessed the quality of our proposal using both automatic and human evaluation, conducted by 4 scholars specialized in classic Spanish literature.
To the best of our knowledge, this is the first time that an NMT modernization approach performs similarly to or better than an SMT modernization approach.
We conducted a study with 42 users to assess whether modernization successfully decreases the difficulty of comprehending historical documents.
The rest of this document is structured as follows: Section 2 presents the modernization approach. Then, in Section 3, we describe the experimental framework of our work. After that, in Section 4, we present and discuss the evaluation conducted in order to assess our approach. Section 5 describes and presents the user study. Finally, in Section 6, conclusions are drawn.
2 Modernization approaches
In this section, we present the state-of-the-art SMT modernization approach and our NMT-based proposal. Both approaches rely on MT which, given a source sentence $x$, aims at finding the most likely translation $\hat{y}$ (Brown et al., 1993):

$$\hat{y} = \arg\max_{y} \Pr(y \mid x) \qquad (1)$$
2.1 SMT approach
For years, SMT has been the prevailing approach to compute Equation 1, using models that rely on a log-linear combination of different models (Och and Ney, 2002): phrase-based alignment models, reordering models and language models, among others (Zens et al., 2002; Koehn et al., 2003).
In this approach, modernization is tackled as a conventional translation task: training an SMT system from a parallel corpus in which, for each sentence of the original document, its corresponding modernized version is available. For training this system, the language of the original document is considered as the source language, and its modernized version as the target language.
2.2 NMT approach
NMT models Equation 1 with a neural network which usually follows an encoder-decoder architecture, in which the source sentence is projected into a distributed representation at the encoding step. Then, at the decoding step, the decoder generates its most likely translation, word by word, using a beam search method (Sutskever et al., 2014).
The system’s input is a word sequence in the source language. An embedding matrix linearly projects each word to a fixed-size real-valued vector. These word embeddings are then fed into a bidirectional (Schuster and Paliwal, 1997) long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) network. As a result, a sequence of annotations is produced by concatenating the hidden states from the forward and backward layers. An attention mechanism (Bahdanau et al., 2015) allows the decoder to focus on parts of the input sequence, computing a weighted mean of the annotation sequence. A soft alignment model computes these weights, scoring each annotation against the previous decoding state. Another LSTM network, conditioned on the representation computed by the attention model and the last word generated, is used for the decoder. Finally, a distribution over the target language vocabulary is computed by the deep output layer (Pascanu et al., 2013). The whole model is trained jointly by stochastic gradient descent to maximize the log-likelihood over a bilingual parallel corpus.
Like the SMT approach (see Section 2.1), our proposal tackles modernization as a conventional translation task, but using NMT instead of SMT. Additionally, since NMT systems need larger quantities of training data, and a frequent problem when working with historical documents is the scarce availability of parallel training data (Bollmann and Søgaard, 2016), we created synthetic data in order to profit from modern documents to enrich the NMT models. First, we applied the feature decay algorithm (Biçici and Yuret, 2015) to select those documents which are closer to the ones we have to modernize. After that, we followed a backtranslation approach (Sennrich et al., 2015) to create a parallel synthetic corpus. Backtranslation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios (Poncelas et al., 2018). Given a monolingual corpus in the target language and an MT system trained to translate from the target language to the source language, the synthetic data is generated by translating the monolingual corpus with the MT system: the resulting data is used as the source part of the corpus, and the monolingual data as the target part.
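The attention step described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the names `softmax`, `attend` and `dot_score` are hypothetical, and the dot-product score is only a stand-in (an assumption) for the learned soft alignment model.

```python
import math


def softmax(scores):
    """Turn raw alignment scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(annotation, state):
    """Hypothetical stand-in for the learned soft alignment model."""
    return sum(a * s for a, s in zip(annotation, state))


def attend(annotations, decoder_state, score=dot_score):
    """Return (weights, context) for a single decoding step.

    annotations   -- one vector per source word (lists of floats)
    decoder_state -- previous decoder state (list of floats)
    score         -- alignment function(annotation, decoder_state) -> float
    """
    weights = softmax([score(a, decoder_state) for a in annotations])
    dim = len(annotations[0])
    # The context vector is the weighted mean of the annotations.
    context = [
        sum(w * a[i] for w, a in zip(weights, annotations))
        for i in range(dim)
    ]
    return weights, context
```

For example, `attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])` scores both annotations equally, so each receives half of the weight and the context is their plain average.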
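The two steps of this data pipeline can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the experiments: `select_sentences` is a greedy selection in the spirit of the feature decay algorithm (n-gram features, a fixed `decay` factor and the length normalization are all assumptions), and `modern_to_old` is a hypothetical callable standing in for the MT system that translates from the modern language into the historical one.

```python
def ngrams(tokens, max_n=2):
    """All n-grams of the token list, for n = 1 .. max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def sentence_score(sentence, value, max_n):
    """Sum of the current feature values, normalized by sentence length."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(value.get(f, 0.0) for f in ngrams(tokens, max_n)) / len(tokens)


def select_sentences(pool, in_domain, k, decay=0.5, max_n=2):
    """Greedily pick k sentences from pool that best cover in_domain n-grams.

    Each time a sentence is selected, the value of the features it covers
    is decayed, so later picks favour features not yet covered.
    """
    value = {f: 1.0 for s in in_domain for f in ngrams(s.split(), max_n)}
    remaining = list(pool)
    selected = []
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda s: sentence_score(s, value, max_n))
        remaining.remove(best)
        selected.append(best)
        for feature in set(ngrams(best.split(), max_n)):
            if feature in value:
                value[feature] *= decay
    return selected


def backtranslate(modern_sentences, modern_to_old):
    """Build synthetic (source, target) pairs from monolingual modern text.

    The machine-translated output becomes the source side and the genuine
    modern sentence stays as the target side.
    """
    return [(modern_to_old(s), s) for s in modern_sentences]
```

A typical use would first call `select_sentences` on the monolingual modern collection, using the text to be modernized as `in_domain`, and then pass the selected sentences to `backtranslate`.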
3 Experimental framework
In this section, we describe the MT systems, corpora and evaluation metrics from our experimental framework.
3.1 MT systems
SMT systems were trained with Moses (Koehn et al., 2007), following the standard procedure: we estimated a 5-gram language model, smoothed with the improved Kneser-Ney method, using SRILM (Stolcke, 2002), and optimized the weights of the log-linear model with MERT (Och, 2003). SMT systems were used both for the SMT modernization approach and for generating synthetic data (see Section 2).
We built NMT systems using OpenNMT-py (Klein et al., 2017). We used long short-term memory units (Gers et al., 2000), with all model dimensions set to . We trained the system using Adam (Kingma and Ba, 2014) with a fixed learning rate of and a batch size of . We applied label smoothing of (Szegedy et al., 2015). At inference time, we used beam search with a beam size of 6. In order to reduce vocabulary, we applied joint byte pair encoding (BPE) (Sennrich et al., 2016) to all corpora, using merge operations. NMT systems were trained using synthetic data and, then, were fine-tuned with the training data.
Dutch Bible (Tjong Kim Sang et al., 2017): a collection of different versions of the Dutch Bible. Among others, it contains a version from 1637, which we consider the original version, and another from 1888, which we consider the modern version (using 19th-century Dutch as if it were modern Dutch).
El Quijote (Domingo and Casacuberta, 2018): the well-known 17th-century Spanish novel by Miguel de Cervantes, and its corresponding 21st-century version.
(Sen et al., 2019): contains the original 11th-century English text The Homilies of the Anglo-Saxon Church and a 19th-century version, which we consider modern English.
As reflected in Table 1, the corpora sizes are small. Hence the use of synthetic data to profit from modern documents and increase the training data (see Section 2.2). As modern documents, we made use of the collection of Dutch books available at the Digitale Bibliotheek voor de Nederlandse letteren (http://dbnl.nl/) for Dutch, and OpenSubtitles (Lison and Tiedemann, 2016), a collection of movie subtitles in different languages, for Spanish and English.
Modernization adopted evaluation metrics from MT. In order to assess our proposal, we made use of:
Translation Error Rate (TER) (Snover et al., 2006): number of word edit operations (insertion, substitution, deletion and swapping), normalized by the number of words in the final translation.
BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002): geometric average of the modified n-gram precision, multiplied by a brevity factor.
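To make the BLEU definition above concrete, here is a minimal single-reference sketch in plain Python. It is an illustration only: it splits on whitespace and applies no smoothing, so it will not reproduce the scores of a standard tool, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of order n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed BLEU of one hypothesis against one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Modified precision: each hypothesis n-gram is credited at most
        # as many times as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity factor: penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

A hypothesis identical to its reference scores 1, while a hypothesis that is a strict prefix of the reference keeps perfect n-gram precision and is reduced only by the brevity factor.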
We used sacreBLEU (Post, 2018) in order to ensure consistent BLEU scores. Additionally, we applied approximate randomization tests (Riezler and Maxwell, 2005), with repetitions and using a p-value of , to determine whether the differences between two systems were statistically significant.
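The idea behind an approximate randomization test can be sketched as follows. This is a simplified illustration, not the exact test configuration of this work: it compares the means of per-sentence scores rather than recomputing a corpus-level metric on each shuffle, and the function name, the default `repetitions` and the `seed` are assumptions.

```python
import random


def approximate_randomization(scores_a, scores_b, repetitions=1000, seed=0):
    """Estimate a p-value for the difference between two systems.

    scores_a, scores_b -- per-sentence scores of systems A and B on the
                          same test sentences, in the same order
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(repetitions):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so each pair of outputs is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_extreme += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (repetitions + 1)
```

The returned estimate is then compared with the chosen significance threshold: a value below the threshold suggests that the observed difference is unlikely to be due to chance.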
In order to assess the quality of our modernization approaches, we started by performing an automatic evaluation. Then, with the help of 4 scholars, we conducted a human evaluation.
4.1 Automatic evaluation
Table 2 presents the results of the experimental session. All approaches significantly improved the modernization quality. Differences between the SMT and NMT approaches were only statistically significant for Dutch Bible. In that case, the NMT approach yielded the best results: an overall improvement of points according to TER and points according to BLEU; and an improvement of and points according to TER and BLEU respectively, with respect to the SMT approach.
To the best of our knowledge, this is the first time that an NMT modernization approach is able to achieve these kinds of results. Domingo and Casacuberta (2018) already tried to profit from modern documents to enrich the neural models. However, their approach only improved the modernization quality in some cases, and never enough to reach the quality of the SMT approach, while in others it lowered it significantly. Our approach was based on theirs, but we used a data selection technique to filter the monolingual data in order to generate synthetic data more suitable for each task.
4.2 Human evaluation
The human evaluation was performed by 4 scholars specialized in classic Spanish literature. For this reason, it was conducted using El Quijote. We randomly selected 100 sentences, checking that modernizations were different from the original sentences. We showed each sentence together with its modernization (50 sentences modernized with the SMT approach and another 50 with the NMT approach) and asked the scholars to give a rating according to the quality of the following aspects: fluency, lexical meaning, syntax, semantics and modernization. To avoid any bias, we shuffled the sentences and did not give any detail to the evaluators about how modernizations had been produced. Table 3 shows the results of the evaluation.
While the automatic evaluation (see Section 4.1) did not show any significant differences between the SMT and NMT approaches, the human evaluators slightly preferred SMT over NMT. Scores vary considerably depending on the evaluator: two of the scholars gave higher scores than the other two. However, all evaluators agreed that fluency is the strongest point of both approaches. In general, scores are above the average, which seems to correlate with the automatic evaluation.
When we asked evaluators about their opinion, they commented that the main problems were related to punctuation and diacritical marks. They also mentioned that, sometimes, part of the sentence was lost in the modernization, a known issue related to NMT (Wu et al., 2016). Additionally, one scholar commented that, overall, the quality of the modernization was acceptable. However, a scholar commented that if they had to correct the mistakes, they would prefer to do the modernization from scratch.
5 User study
In order to assess whether modernization is able to decrease the difficulty of comprehending historical documents and, thus, make them accessible to a broader audience, we conducted a user study using El Quijote. 42 participants took part in this study. Considering that El Quijote is well-known in Spain, we asked participants about their familiarity with it. The accompanying figure shows some information about the users’ ages and their familiarity with El Quijote.