Recital | Benjamin Hervy

During a post-doc at Polytech Nantes, I worked on a crowdsourcing project called Recital. The project aims to transcribe 25,000 pages from an 18th-century corpus of accounting records from the Comédie Italienne. Because the pages vary widely (format, handwriting, etc.), machine learning methods fail to achieve good accuracy on optical character recognition and labelling. However, labelling and transcribing text from images is particularly well suited to HITs (Human Intelligence Tasks), which is why we investigated crowdsourcing.

Screenshot of a daily accounting page to be labeled and transcribed

Figure of worker activity on the Recital platform since its launch in 2017

By November 2021, the platform had registered more than 1 million contributions from over 1,800 workers. It is worth noting that stimulating this activity was not easy and called for several initiatives (newsletter, forum, remote hackathons, etc.).

Under the supervision of Guillaume Raschia, I worked on the ETL (Extract-Transform-Load) pipeline that processes the workers' contributions collected through our dedicated web platform. My work included natural language processing, databases, data visualization, and web development, among other things.
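To give an idea of what an ETL pipeline looks like in its simplest form, here is a minimal sketch. It is not the actual Recital code: the record fields (`worker_id`, `page_id`, `field`, `value`), the table name, and the choice of SQLite are all assumptions made for illustration.

```python
import sqlite3


def extract(raw_rows):
    """Extract: yield raw contribution records (here, plain dicts)."""
    yield from raw_rows


def transform(rows):
    """Transform: collapse whitespace and skip contributions left empty."""
    for row in rows:
        value = " ".join(row["value"].split())
        if value:
            yield (row["worker_id"], row["page_id"], row["field"], value)


def load(conn, records):
    """Load: store the cleaned records in a (hypothetical) contributions table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contributions "
        "(worker_id TEXT, page_id TEXT, field TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO contributions VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, raw_rows):
    """Chain the three stages; generators keep memory use low."""
    load(conn, transform(extract(raw_rows)))
```

Each stage only depends on the output of the previous one, so a stage can be replaced (for instance, reading from a web API instead of a list) without touching the others.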

Screenshot of the Recital dashboard with the final visualization

Screenshot of the Recital dashboard with statistics and faceting search

Once the processing pipeline and the database were in place, we could design a dashboard to provide analytics and data visualizations. This part of the work was mainly done by my colleague Olivier Aubert.

Most of the work consisted of developing processing algorithms that rely on data matching or natural language processing techniques, either to perform entity recognition or to quantify the quality of the output data. In addition, we worked on data traceability to give the end experts insight into the whole processing pipeline, from the original source to the reconstructed values.
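The ideas of data matching, quality scoring, and traceability can be illustrated with a small sketch. It is only an assumption of how such a step could look, not the algorithm used in Recital: the `Contribution` fields, the 0.85 similarity threshold, and the majority-vote rule are hypothetical. It groups near-identical transcriptions of the same field, keeps the most frequent one, and reports both an agreement score and the workers behind the retained value.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Contribution:
    worker_id: str  # hypothetical field names, for illustration only
    value: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def reconcile(contributions, threshold=0.85):
    """Cluster near-identical values and return the majority reading."""
    if not contributions:
        raise ValueError("no contributions to reconcile")
    clusters = []
    for contrib in contributions:
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if similarity(contrib.value, cluster[0].value) >= threshold:
                cluster.append(contrib)
                break
        else:
            clusters.append([contrib])
    best = max(clusters, key=len)
    value = Counter(normalize(c.value) for c in best).most_common(1)[0][0]
    return {
        "value": value,                               # reconstructed value
        "agreement": len(best) / len(contributions),  # simple quality score
        "sources": [c.worker_id for c in best],       # traceability
    }


# Made-up example data: four workers transcribe the same field.
result = reconcile([
    Contribution("w1", "Arlequin"),
    Contribution("w2", "arlequin "),
    Contribution("w3", "Arlequim"),
    Contribution("w4", "Pantalon"),
])
```

Keeping the list of contributing workers next to each reconstructed value is what lets an expert trace a result back to its sources.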