During a post-doc at Polytech Nantes, I worked on a crowdsourcing project called Recital. The project aims to transcribe 25,000 pages from an 18th-century corpus of accounting records from the Comédie Italienne. Because the pages vary so much (format, handwriting, etc.), machine learning methods fail to achieve good accuracy on optical character recognition and labelling. However, labelling and transcribing text from images is particularly well suited to HITs (Human Intelligence Tasks), which is why we investigated crowdsourcing.
Under the supervision of Guillaume Raschia, I worked on the ETL (Extract-Transform-Load) pipeline that processes the workers’ contributions collected through our dedicated web platform. My work included natural language processing, databases, data visualization, and web development, among other things.
Most of the work consisted of developing processing algorithms based on data matching and natural language processing techniques, either to perform entity recognition or to quantify the quality of the output data. In addition, we worked on data traceability to give the experts who use the final data some insight into the whole processing pipeline, from the original source to the reconstructed values.
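
As a rough illustration of what matching contributions and quantifying output quality can look like, here is a minimal sketch. It is not the Recital code: the function names `normalize` and `consensus`, the majority-vote rule, and the similarity-based agreement score are all assumptions chosen for the example, and the sample strings are invented. It takes several workers' transcriptions of the same field, picks the most frequent one after light normalization, and reports how closely the contributions agree with it.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def consensus(transcriptions: list[str]) -> tuple[str, float]:
    """Return the most frequent normalized transcription and an agreement score.

    The score is the mean similarity (0.0 to 1.0) between each contribution
    and the chosen consensus value. Both the voting rule and the score are
    illustrative assumptions, not the project's actual method.
    """
    if not transcriptions:
        raise ValueError("at least one transcription is required")
    cleaned = [normalize(t) for t in transcriptions]
    # Majority vote over the normalized strings.
    best, _ = Counter(cleaned).most_common(1)[0]
    # Average character-level similarity of every contribution to the winner.
    score = sum(SequenceMatcher(None, best, c).ratio() for c in cleaned) / len(cleaned)
    return best, score


if __name__ == "__main__":
    # Hypothetical contributions from three workers for the same field.
    value, agreement = consensus(
        ["Paid to Mr. Dupont", "paid to  mr. dupont", "Paid to Mr. Dupond"]
    )
    print(value, round(agreement, 3))
```

A low agreement score could be used to flag a field for expert review, and keeping the individual contributions next to the consensus value is one simple way to preserve traceability back to the source.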