The DECRYPT Project: Automatic Decryption of Historical Manuscripts

Thousands of enciphered historical manuscripts are buried in libraries and archives. Examples of such material are diplomatic correspondence and intelligence reports, private letters and diaries, as well as manuscripts related to secret societies or other (religious) groups on the margins of society. The bulk of these historical manuscripts will remain undeciphered unless we can automate the processes involved in decoding them. Our aim is to develop resources and computer-aided tools for decoding historical source material by using AI and cross-disciplinary research involving computational linguistics, cryptology, history, linguistics and philology.

Within the DECRYPT project, we release open-access resources and tools to facilitate research in historical cryptology, supporting the collection, analysis and decryption of historical ciphertexts. The resources are collections of encrypted sources, historical texts and language models. The tools support the processing of encrypted sources from transcription to decryption, including cryptanalysis. Below we describe the resources and tools that we develop in more detail; they are also described in our scientific papers listed here.

RESOURCES

The DECODE database contains a collection of digitized images of ciphertexts and encryption keys along with metadata about their provenance, location, transcription, and possible cryptanalysis or commentary. The database is searchable, and all records in it are open to the public. Due to license restrictions, the images of some records are private and cannot be viewed or downloaded. Users with a database account can also upload and download ciphertexts and keys together with their metadata and related documents.

HistCorp is a collection of historical corpora and other useful resources and tools for researchers working with historical text. Currently, you may download historical corpora for fourteen European languages: Czech, Dutch, English, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Portuguese, Slovene, Spanish, and Swedish. We also provide language models derived from these historical sources, which may be downloaded from the Language Models section. You may also create your own language models by uploading historical sources of your choice. Furthermore, we provide tools for the automatic processing of historical texts. So far, we provide a tool for spelling normalisation, which automatically transforms historical spelling to modern spelling. You may enter a text or upload a file to have it normalised, or you may download the necessary tools to run the normalisation locally on your own computer.

TOOLS

The TRANSCRIPT tool is intended for the interactive transcription of images, using image processing with clustering. Under development.

The CODEBREAKER is an interactive online tool for semi-automatic decryption of transcribed images containing ciphertext. The tool has been developed for simple and homophonic substitution ciphers. Under development.

CrypTool 2 (CT2) is an open-source e-learning platform for cryptography and cryptanalysis, offering a visual programming GUI for experimenting with cryptographic procedures. CrypTool 2 provides a variety of cryptanalytical tools to analyze or break classical (as well as modern) ciphers and can be downloaded for Windows.

The DECODER takes an uploaded, transcribed ciphertext and a key of your choice, and decodes the ciphertext with that key. The plaintext output is compared to language models, and a list of probable languages is provided.
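The two steps described for the DECODER, applying a key and then comparing the output to language models, can be illustrated with a minimal Python sketch. It is not the DECODER's implementation: the function names, the toy key, and the tiny character-bigram models with add-one smoothing are all assumptions made for illustration, and real models would be trained on much larger historical corpora.

```python
from collections import Counter
from math import log

def decode(cipher_symbols, key, unknown="?"):
    """Replace each transcribed cipher symbol by its plaintext unit from the key."""
    return "".join(key.get(symbol, unknown) for symbol in cipher_symbols)

def train_bigram_model(text):
    """Collect character-bigram counts from a training text (toy language model)."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts, set(text)

def avg_log_prob(text, model):
    """Average add-one-smoothed bigram log-probability of text under a model."""
    bigrams, contexts, vocab = model
    size = len(vocab) + 1  # one extra slot for characters unseen in training
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    total = sum(log((bigrams[p] + 1) / (contexts[p[0]] + size)) for p in pairs)
    return total / len(pairs)

def rank_languages(text, models):
    """Return language names ordered from most to least probable."""
    return sorted(models, key=lambda lang: avg_log_prob(text, models[lang]),
                  reverse=True)

# Hypothetical homophonic key: codes 12 and 13 both stand for "e".
key = {"11": "v", "12": "e", "13": "e", "20": "n", "21": "i"}
models = {
    "latin": train_bigram_model("veni vidi vici"),
    "english": train_bigram_model("the quick brown fox"),
}
plaintext = decode("11 12 20 21".split(), key)
print(plaintext, rank_languages(plaintext, models))
```

Symbols missing from the key are rendered as "?" so that gaps in a partial key remain visible in the output.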

The ENCODER encodes a text with a historical encryption method of your choice, selected from the most common ones. Under development.
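As an illustration of one common historical method, homophonic substitution, the following Python sketch encodes a text by picking one of several codes for each plaintext character. It is only a sketch under stated assumptions: the function name and the homophone table are hypothetical and do not reflect how the ENCODER works internally.

```python
import random

def encode_homophonic(plaintext, homophones, rng=None):
    """Encode text by choosing, for each character, one of its cipher codes.

    Characters without an entry in the table are dropped (an assumption
    of this sketch; a real tool might keep or flag them instead).
    """
    rng = rng or random.Random()
    return [rng.choice(homophones[ch]) for ch in plaintext if ch in homophones]

# Hypothetical table: the frequent letter "e" gets three homophones.
homophones = {
    "e": ["12", "13", "14"],
    "n": ["20"],
    "i": ["21", "22"],
    "v": ["11"],
}
print(" ".join(encode_homophonic("veni", homophones, random.Random(1))))
```

Passing a seeded `random.Random` makes the choice of homophones reproducible, which is convenient when generating test ciphertexts.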

ANACODE analyses transcribed ciphertexts using standard methods of cryptanalysis, such as n-gram frequencies, n-gram distances, index of coincidence, entropy measures, and pattern dictionaries. Under development.
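Three of the measures named above can be computed in a few lines. The Python sketch below uses the textbook definitions of n-gram frequency, index of coincidence and unigram Shannon entropy; the function names and the sample transcription are assumptions for illustration and are not taken from ANACODE.

```python
from collections import Counter
from math import log2

def ngram_frequencies(symbols, n=1):
    """Count the n-grams in a sequence of transcribed cipher symbols."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

def index_of_coincidence(symbols):
    """Probability that two symbols drawn without replacement are identical."""
    total = len(symbols)
    if total < 2:
        return 0.0
    counts = Counter(symbols)
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))

def shannon_entropy(symbols):
    """Unigram entropy in bits per symbol."""
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical transcription: one numeric code per cipher symbol.
ciphertext = "12 7 33 7 12 90 7 33".split()
print(ngram_frequencies(ciphertext, 2).most_common(3))
print(index_of_coincidence(ciphertext), shannon_entropy(ciphertext))
```

The functions operate on lists of symbol codes rather than on character strings, since transcriptions of historical ciphers often use multi-character codes for a single cipher symbol.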

ANAKEY analyses the structure of a transcribed key and summarises its code structure, its symbol system, the encoded plaintext entities (alphabets, syllables, words, nulls) and the cipher type (simple, homophonic or polyphonic substitution) at the alphabet and nomenclature level. Under development.
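The distinction between the three cipher types can be sketched in Python using the usual definitions: a key is homophonic if some plaintext entity has several codes, and polyphonic if some code stands for several plaintext entities. The function name, the input format of (code, plaintext) pairs, and the choice to report "polyphonic" when both conditions hold are assumptions of this sketch, not a description of ANAKEY.

```python
from collections import defaultdict

def classify_key(entries):
    """Classify a key given as (code, plaintext) pairs.

    Returns "polyphonic" if any code has more than one plaintext reading,
    otherwise "homophonic" if any plaintext entity has more than one code,
    otherwise "simple".
    """
    codes_per_plain = defaultdict(set)
    plains_per_code = defaultdict(set)
    for code, plain in entries:
        codes_per_plain[plain].add(code)
        plains_per_code[code].add(plain)
    if any(len(plains) > 1 for plains in plains_per_code.values()):
        return "polyphonic"
    if any(len(codes) > 1 for codes in codes_per_plain.values()):
        return "homophonic"
    return "simple"

# Hypothetical key fragment: "e" has two codes, so the key is homophonic.
print(classify_key([("11", "v"), ("12", "e"), ("13", "e"), ("20", "n")]))
```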

RELATED

We submit ciphers from the DECODE database to MysteryTwister C3 (MTC3), which is an international Crypto Cipher Contest offering a broad variety of challenges, a moderated forum and an ongoing hall-of-fame of the best cipher crackers.

CryptoBooks is a bibliography of literature about cryptology from the 15th century until now, created by one of our team members, Nils Kopal.

ACKNOWLEDGMENTS

We are grateful to our colleagues in the HICRYPT network and the Histocrypt community for valuable discussions and suggestions concerning the need for infrastructural support for historical cryptology, especially and in alphabetical order: Camille Desenclos, Kevin Knight, Anne-Simone Rous, and Gerhard Strasser. We would like to thank all users and contributors who are willing to share these fascinating historical sources.

CONTRIBUTION

We collect encrypted sources and are developing tools for transcription and decryption. Volunteers, especially historians, philologists, and cryptologists, as well as highly motivated students planning to write their bachelor's or master's thesis in the area of historical cryptography or digital humanities, are welcome to join our team.

COMMENTS

We would be happy to hear your comments and suggestions for improvement.

The DECRYPT team
Beáta Megyesi, PI
Email: decode@cl.lingfil.uu.se

This work is supported by the Swedish Research Council, grant 2018-06074: DECRYPT - Decryption of historical manuscripts.