Hidden characters in Unicode ruining SMS reports

binod · June 17, 2022, 11:15am

In an SMS project, we have received complaints from multiple sources that valid SMS reports are sometimes not recognized as reports and end up in the Messages tab instead.

Sample SMS:

ज‍ 12345 2‌

Here,

ज‍: Form code
12345: Patient ID
2‌: Days

The SMS message looks fine, but when using a Unicode text analyzer, we can see there are hidden (non-printing) characters:

Zero-width joiner - U+200D
Zero-width non-joiner - U+200C
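
For anyone who wants to check a message without an online analyzer, here is a minimal Python sketch that lists the format characters (Unicode category Cf) in a string. The function name `describe_hidden` is only illustrative, and the sample is the message above written with explicit escapes so the hidden characters are visible in the source.

```python
import unicodedata

def describe_hidden(text):
    """Return (index, code point, Unicode name) for each format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# Sample SMS from above: form code + ZWJ, patient ID, days + ZWNJ
sample = "\u091c\u200d 12345 2\u200c"
for entry in describe_hidden(sample):
    print(entry)
```

For the sample this reports U+200D at index 1 and U+200C at index 10.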

In Nepali Unicode, these non-printing characters are used to enable/disable the transformation of certain other characters and ligatures.
Examples:
Without ZWJ: प+र्+यो = पर्यो (incorrect)
Using ZWJ: प+र्+ZWJ+यो = पर्‍यो (correct)

Without ZWNJ: अहम्+को = अहम्को (incorrect)
With ZWNJ: अहम्+ZWNJ+को = अहम्‌को (correct)

The CHWs use various phone models (mostly keypad/feature phones) and we don’t know which key combinations are being used to enter these invisible characters. This is most likely user error, but users can enter these characters without realizing it. The problem is also hard to identify because everything looks normal in the CHT app. Since this is not limited to a few users or a few instances, should we consider handling this in the CHT?

Could we ignore invisible characters (there may be more than these two) in all fields except text fields?

diana · June 17, 2022, 11:56am

Great find and awesome write-up @binod

I agree that we should add some support for these characters. Probably the easiest way to deal with them is to create a list of characters, and remove them before parsing (that way parsing regexes won’t require any changes).
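
A rough sketch of that idea, in Python purely for illustration (this is not CHT code, and the names `INVISIBLE_CHARS`, `strip_invisible` and `split_fields` are hypothetical). The list only contains the two characters confirmed in this thread; others would need to be added as they are found.

```python
import re

# Characters to drop before parsing. Only U+200C (ZWNJ) and U+200D (ZWJ)
# are confirmed in this thread; extend the list as more are identified.
INVISIBLE_CHARS = "\u200c\u200d"
_INVISIBLE_RE = re.compile("[" + INVISIBLE_CHARS + "]")

def strip_invisible(message):
    """Remove the listed invisible characters from the raw SMS text."""
    return _INVISIBLE_RE.sub("", message)

def split_fields(message):
    """Clean the message first, then split on whitespace as a stand-in for the real parser."""
    return strip_invisible(message).split()

# Sample SMS from the first post, with the hidden characters written as escapes
print(split_fields("\u091c\u200d 12345 2\u200c"))
```

Note that this strips the characters from the whole message, so a free-text field that legitimately uses ZWJ or ZWNJ (as in the Nepali examples above) would lose them too. Skipping text fields would require knowing the field types before cleaning.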

Please open a CHT-core issue to get this scheduled.

binod · June 20, 2022, 9:59am

Thanks, @diana. It has just been reported here.