Ongoing crowdsourcing project about the Comédie-Italienne

During a post-doc at Polytech Nantes, I worked on a crowdsourcing project called Recital. This project aims to transcribe 25,000 pages from an 18th-century corpus of accounting records related to the Comédie-Italienne. Because the pages are so diverse (format, handwriting, etc.), machine learning methods fail to achieve good accuracy on optical character recognition and labelling. However, labelling and transcribing text from images is particularly well suited to Human Intelligence Tasks (HITs), which is why we investigated crowdsourcing.

By November 2021, the platform had registered more than 1 million contributions from over 1,800 workers. It is worth noting that stimulating activity was not easy, despite a newsletter, a forum, remote hackathons, etc.

Under the supervision of Guillaume Raschia, I worked on the ETL (Extract-Transform-Load) pipeline that processes the contributions workers submit through our dedicated web platform. My work included natural language processing, databases, data visualization, and web development, among other things.

Once the processing pipeline and the database are in place, we can design a dashboard to provide analytics and data visualizations. This part of the work was mainly done by my colleague Olivier Aubert.

Most of the work consisted of developing processing algorithms that use data matching or natural language processing techniques to perform entity recognition or to quantify the quality of the output data. In addition, we worked on data traceability to give the experts who use the final data some insight into the whole processing pipeline, from the original source to the reconstructed values.
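To make the idea of matching contributions, scoring quality, and keeping traceability more concrete, here is a minimal sketch in Python. It is not the Recital implementation: the `Contribution` record, its field names, the `normalize` and `reconcile` helpers, and the choice of a simple majority vote are all assumptions made for illustration. It reconciles several workers' transcriptions of the same field into one value, reports the share of workers who agree as a rough quality indicator, and keeps the list of workers behind the chosen value so the result can be traced back to its sources.

```python
from collections import Counter
from dataclasses import dataclass
import unicodedata


@dataclass(frozen=True)
class Contribution:
    """One worker's transcription of one field (hypothetical schema)."""
    page_id: str
    field: str
    worker_id: str
    text: str


def normalize(text: str) -> str:
    """Lowercase, strip accents and collapse whitespace so that
    near-identical transcriptions are counted as the same answer."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def reconcile(contributions):
    """Majority vote over normalized transcriptions of a single field.

    Returns the winning value, the fraction of workers who agree with it,
    and the sorted ids of those workers (kept for traceability).
    """
    if not contributions:
        raise ValueError("at least one contribution is required")
    votes = Counter(normalize(c.text) for c in contributions)
    value, count = votes.most_common(1)[0]
    sources = sorted(c.worker_id for c in contributions if normalize(c.text) == value)
    return {
        "value": value,
        "agreement": count / len(contributions),
        "sources": sources,
    }


if __name__ == "__main__":
    # Made-up example data: three workers transcribe the same label.
    answers = [
        Contribution("p1", "label", "w1", "Recette"),
        Contribution("p1", "label", "w2", " recette "),
        Contribution("p1", "label", "w3", "Recete"),
    ]
    print(reconcile(answers))
```

In this made-up example, two of the three workers agree once case and whitespace are normalized, so the reconciled value comes with an agreement of two thirds and the ids of the two supporting workers. A real pipeline would likely need fuzzier matching (for instance edit distance) and weighting by worker reliability, which this sketch deliberately leaves out.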