Came across this pre-print paper about using machine learning to assess quality of community health data sets, and wanted to share here! Seems highly relevant to the DataKind - Medic project on data confidence and engineering data quality assessments. Here’s the abstract:
Machine learning has tremendous potential to provide targeted interventions in low-resource communities, however the availability of high-quality public health data is a significant challenge. In this work, we partner with field experts at a non-governmental organization (NGO) in India to define and test a data collection quality score for each health worker who collects data. This challenging unlabeled data problem is handled by building upon domain-expert’s guidance to design a useful data representation that is then clustered to infer a data quality score. We also provide a more interpretable version of the score. These scores already provide for a measurement of data collection quality; in addition, we also predict the quality for future time steps and find our results to be very accurate. Our work was successfully field tested and is in the final stages of deployment in Rajasthan, India.
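
For anyone wondering what "cluster a per-worker representation to infer a quality score" could look like in practice, here is a rough sketch in Python. To be clear, this is not the paper's method: the abstract doesn't say which features or clustering algorithm the authors used. The two features (`rounded_rate`, `duplicate_rate`), the choice of plain k-means, and the distance-based score are all assumptions made up for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every row to every centroid, shape (n_rows, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


def quality_scores(X, k=2, seed=0):
    """Hypothetical score in [0, 1] per worker; higher means cleaner data."""
    centroids, _ = kmeans(X, k, seed=seed)
    # Assumption: every feature is an "anomaly rate", so the centroid with
    # the smallest feature sum is treated as the clean cluster and the one
    # with the largest sum as the problematic cluster.
    sums = centroids.sum(axis=1)
    good = centroids[sums.argmin()]
    bad = centroids[sums.argmax()]
    d_good = np.linalg.norm(X - good, axis=1)
    d_bad = np.linalg.norm(X - bad, axis=1)
    # Relative closeness to the clean centroid.
    return d_bad / (d_good + d_bad + 1e-12)


# Made-up data: one row per health worker, columns are hypothetical
# features [rounded_rate, duplicate_rate], each a fraction of that
# worker's records.
X = np.array([
    [0.05, 0.02],
    [0.10, 0.04],
    [0.08, 0.03],
    [0.70, 0.55],
    [0.65, 0.60],
    [0.80, 0.50],
])
scores = quality_scores(X)
print(np.round(scores, 2))
```

The first three (made-up) workers land near the clean cluster and score high, the last three score low. In a real setting the hard part is the piece the abstract credits to domain experts: deciding which features actually capture data collection quality.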