Large-scale Optical Character Recognition of Pre-modern Chinese Texts

Posted on January 11, 2019 by dsturgeon

This paper appears in International Journal of Buddhist Thought and Culture 28(2) (December 2018). [Full paper]


Optical character recognition (OCR) – the fully automated transcription of text appearing in a digitized image – offers transformative opportunities for the scholarly study of written materials produced prior to the digital age. Digitization, in the sense of photographic reproduction, is a largely straightforward, mechanical process, and one with significant value in its own right for purposes of preservation as well as access to rare materials. As a result, hundreds of millions of pages of pre-modern Chinese works have been digitized by libraries and academic institutions around the world – a significant portion of this increasingly being made freely available online.

To make use of this material efficiently, transcriptions of the textual content of these images are needed. Given the enormous volume of image data in existence – and its continual production as digitization continues – this task is only feasible if it can be fully automated: performed by software without manual intervention. Individually, reliable transcriptions produced by OCR offer enormous time savings to researchers, making it possible to efficiently navigate materials in ways not possible without digital transcription. In aggregate, however, these transcriptions make possible entirely new ways of exploring historical materials – making it possible to rapidly identify material that one suspects may exist somewhere, without knowing in advance where that might actually be. It is also a prerequisite also to virtually any type of statistical analysis of these materials – the potential utility of which continues to increase as a larger and larger proportion of the extant corpus is transcribed.

This paper introduces a procedure for OCR of pre-modern Chinese written materials, both printed and handwritten, describing the complete process from digitized image through to automated transcription and manual correction of remaining errors, with particular attention to issues arising in this domain. The process described has been applied to over 25 million pages of pre-modern Chinese works, and the paper also introduces the Chinese Text Project platform used to both make these results available to scholars as well as provide a distributed, crowdsourced mechanism for facilitating manual corrections at scale as well as further analysis of these materials.

Noise removal

Character pitch identification

Seal isolation

About the Author: DH