
Consider Unicode spaces when counting words

Jeremy Soller requested to merge vladimiroff:wc-utf8 into master

Created by: vladimiroff

To support the Unicode derived core property White_Space when counting words, this change iterates over chars instead of bytes. Otherwise, if a file contains, for instance, the Greek letter Π (U+03A0, encoded in UTF-8 as 0xCE 0xA0) and is iterated over as bytes, the second byte of that letter is interpreted as U+00A0 (NBSP), which counts as whitespace, and wc therefore reports a wrong word count.
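The difference can be illustrated with a minimal sketch. The function names `count_words_bytes` and `count_words_chars` are hypothetical and are not taken from this change; they only contrast the two iteration strategies, assuming that whitespace is detected with `char::is_whitespace`. The letter Π (U+03A0) is used because its UTF-8 encoding, 0xCE 0xA0, ends in the byte 0xA0.

```rust
/// Hypothetical illustration of the byte-wise approach: each byte is cast
/// to a `char`, so a continuation byte such as 0xA0 is mistaken for
/// U+00A0 (NBSP) and splits a word in two.
fn count_words_bytes(input: &[u8]) -> usize {
    let mut count = 0;
    let mut in_word = false;
    for &b in input {
        if (b as char).is_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            count += 1;
        }
    }
    count
}

/// Hypothetical illustration of the char-wise approach: the input is
/// decoded as UTF-8 first, so only real White_Space code points
/// separate words.
fn count_words_chars(input: &str) -> usize {
    let mut count = 0;
    let mut in_word = false;
    for c in input.chars() {
        if c.is_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            count += 1;
        }
    }
    count
}
```

With the input `"αΠβ"`, the byte-wise version sees the second byte of Π as whitespace and counts two words, while the char-wise version counts one. Conversely, for `"a\u{2003}b"` (separated by an EM SPACE) only the char-wise version recognizes the separator.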

A test file is also added to confirm correct behavior. It contains several lines, different kinds of spaces, many Unicode symbols from different languages and, of course, an emoji. It might also be useful to other utils for testing Unicode support.
