Consider Unicode spaces when counting words
Created by: vladimiroff
To support the Unicode derived core property White_Space when counting words, this change iterates over chars instead of bytes. Otherwise, if a file containing, for instance, the Greek letter ς (U+03C2) is iterated over as bytes, the second byte of that letter is recognized as U+00A0 (NBSP), and wc therefore returns the wrong word count.
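The difference between the two iteration strategies can be sketched as follows. This is a minimal illustration, not the code from this change: the function names `count_words_by_bytes` and `count_words_by_chars` are hypothetical, and the input uses the Greek letter Π (U+03A0), chosen here because its UTF-8 encoding is the byte pair 0xCE 0xA0, so its second byte coincides with the NBSP code point.

```rust
/// Hypothetical byte-wise counter: treats every byte as if it were a
/// code point, which is the kind of mistake this change avoids.
fn count_words_by_bytes(input: &str) -> usize {
    let mut count = 0;
    let mut in_word = false;
    for b in input.bytes() {
        // Casting 0xA0 to char yields U+00A0 (NBSP), which has the
        // White_Space property, so a continuation byte splits the word.
        if (b as char).is_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            count += 1;
        }
    }
    count
}

/// Hypothetical char-wise counter: decodes UTF-8 first, then asks
/// whether each code point has the White_Space property.
fn count_words_by_chars(input: &str) -> usize {
    let mut count = 0;
    let mut in_word = false;
    for c in input.chars() {
        if c.is_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            count += 1;
        }
    }
    count
}
```

For the input `"a\u{03A0}b c"` (the text `aΠb c`), the byte-wise version counts 3 words because it splits `aΠb` at the 0xA0 byte, while the char-wise version counts 2.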
A test file is also added to confirm correct behavior. It contains several lines, different kinds of spaces, many Unicode symbols from different languages and, of course, an emoji. It might also be useful to other utils for testing correct Unicode support.