Some of my colleagues are regular users of Clojure, a JVM-based Lisp. I do like Lisp-based languages. I’ve attempted last year’s Advent of Code in Racket, and it was a great experience (although I am apparently not as great a coder as I thought I was 😢). Anyway.
JVM-based languages are normally not great for scripting, on account of the longish startup time of the virtual machine. The aforementioned colleagues have sung the praises of Babashka, which is a version of Clojure suitable for scripting “where you would use bash otherwise”.
I’ve just encountered such a problem, and wanted to use Babashka to write a little script to solve it for me.
The problem: Mixed-encoding files
The current project I’m working on uses Spring and Thymeleaf for templating. Internationalization in Thymeleaf is done via properties files. Properties files used to be encoded in ISO-8859-1 by default, but this changed in Java 9¹. However, not everyone has gotten the memo: apparently, IntelliJ still tries to encode properties files in ISO-8859-1 sometimes. When different developers with different IDE settings edit the same file, this can result in a terrible, no-good mess: a file that contains characters in two different encodings, namely UTF-8 and ISO-8859-1. At that point, you can’t really do anything besides manually editing the file and replacing the offending characters. I’ve provided an example file containing an ISO-8859-1 ‘ü’ character as well as a UTF-8 ‘ä’ character. Just try running iconv or uconv on it: both will choke or emit replacement characters.
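For what it’s worth, a file like that is easy to produce by accident, and just as easy to produce on purpose. Here’s a small, hypothetical Clojure/Babashka sketch (the file name and property keys are mine, not the post’s example file) that writes one line in each encoding:

```clojure
;; Hypothetical reconstruction (file name and keys are made up): write one
;; line encoded as ISO-8859-1 and one encoded as UTF-8 into the same file.
(import '(java.io FileOutputStream))

(with-open [out (FileOutputStream. "mixed.properties")]
  ;; "grüße" with 'ü' and 'ß' as single ISO-8859-1 bytes (0xFC, 0xDF)
  (.write out (.getBytes "greeting=gr\u00fc\u00dfe\n" "ISO-8859-1"))
  ;; "bär" with 'ä' as the two-byte UTF-8 sequence 0xC3 0xA4
  (.write out (.getBytes "animal=b\u00e4r\n" "UTF-8")))
```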
Some Background: Unicode in a Nutshell
Unicode is complicated, so we’ll need to explain some concepts to better understand the underlying problem. First, characters. Directly from the Unicode standard: “Characters are the abstract representations of the smallest components of written language that have semantic value”. Examples would be the well-known 💩 emoji, or regular letters. Then there are codepoints. Unicode codepoints are written as “U+” followed by the hexadecimal value of the codepoint. Each character corresponds to a single codepoint. However, this “character” may not be visible, such as the zero-width space, or control characters. Some characters serve only to modify other characters. For example, an Ö can be represented by the codepoint sequence U+4F, U+308, which is the codepoint for the letter O followed by the codepoint for a combining diaeresis². Some characters might be visible individually, such as the regional indicators 🇩 and 🇪, which together make a flag (🇩🇪). The Unicode standard does not restrict which combining characters can be combined, and what the results are³. You can change an emoji’s skin tone with modifier characters, for example. There are lots of possibilities, which also means that fonts can’t simply map Unicode characters one-to-one to their own glyphs (or “user-perceived characters”). There’s lots more to know, like text direction or grapheme clustering, but we’ll skip over that for now.
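Since we’ll be using Clojure later anyway, here’s a small REPL sketch (my own, not part of the script below, and I haven’t checked every class against Babashka’s built-in class list) that shows the precomposed Ö next to the O-plus-combining-diaeresis form:

```clojure
;; The same user-perceived character can be one codepoint or two.
(def precomposed "\u00D6")   ; Ö as the single codepoint U+D6
(def combined    "O\u0308")  ; Ö as U+4F followed by the combining diaeresis U+308

(println precomposed combined)        ; both render as Ö
(println (= precomposed combined))    ; false - different codepoint sequences
(println (.codePointCount precomposed 0 (count precomposed)))  ; 1
(println (.codePointCount combined 0 (count combined)))        ; 2
;; Unicode normalization (NFC) maps the combining form to the precomposed one:
(println (= precomposed
            (java.text.Normalizer/normalize combined
                                            java.text.Normalizer$Form/NFC)))  ; true
```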
After these preliminaries about the character set, let’s talk about encodings. Because Unicode has such a huge range, writing an arbitrary codepoint takes at least 3 bytes in a fixed-length encoding. Three is an inconvenient number for computers, and so the UTF-32 encoding was created, where every codepoint is simply encoded as 4 bytes. However, regular text mostly uses the lower reaches of the Unicode range: most of the world’s modern languages are representable with codepoints from U+0 to U+FFFF, the so-called “basic multilingual plane” (BMP). Occasionally you will use a character that’s not in the BMP, but rarely; I estimate that 99% of this post’s text consists of characters from the BMP. Thus, it’s much more efficient to use a variable-length encoding, such as UTF-16 (used by JavaScript) or UTF-8 (used everywhere).
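A quick sketch (with a sample string of my own; whether the UTF-32 charset is available depends on your runtime) makes the size trade-off concrete:

```clojure
;; Encoded size of a short sample string: one ASCII character, one Latin-1
;; character, one other BMP character, one non-BMP emoji.
(def s "a\u00e4\u20ac\ud83d\ude00")   ; "aä€😀"

(doseq [cs ["UTF-8" "UTF-16BE" "UTF-32"]]
  (println cs "->" (count (.getBytes s cs)) "bytes"))
;; UTF-8    -> 10 bytes  (1 + 2 + 3 + 4)
;; UTF-16BE -> 10 bytes  (2 + 2 + 2 + 4)
;; UTF-32   -> 16 bytes  (4 * 4)
```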
UTF-16 encodes every codepoint from the BMP as a single two-byte code unit. Codepoints that fall outside the BMP are split into two two-byte code units (a “surrogate pair”) whose values fall into a region of the BMP that is specifically reserved for this purpose. When decoding, if you encounter a code unit in this reserved region, you read the next code unit and combine the two to recover the original codepoint. Quite simple, really.
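Incidentally, this is exactly what JVM strings are made of, so you can poke at a surrogate pair from a Clojure REPL (a small sketch of my own):

```clojure
;; JVM strings (and therefore Clojure strings) are sequences of UTF-16 code
;; units, so a non-BMP character like 😀 (U+1F600) shows up as a surrogate pair:
(def smiley "\ud83d\ude00")   ; 😀

(println (count smiley))                             ; 2 code units
(println (.codePointCount smiley 0 (count smiley)))  ; 1 codepoint
(println (format "U+%X" (.codePointAt smiley 0)))    ; U+1F600
(println (map #(format "%X" (int %)) smiley))        ; (D83D DE00) - the surrogate pair
```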
UTF-8 encodes all ASCII codepoints (so 0–127) as a single byte whose most significant bit is 0. Every higher codepoint is encoded as multiple bytes: the first byte starts with as many 1-bits as there are bytes in the whole sequence, followed by a zero, and all following bytes have the prefix 10. The codepoint is zero-padded and packed into the remaining free bits. It’s easy to see that some byte sequences are not valid UTF-8: any byte starting with 11 must be followed by the right number of bytes with a 10 prefix, and a byte with a 10 prefix must not appear on its own.
ISO-8859-1, on the other hand, is a single-byte encoding, where every byte corresponds to a codepoint. There’s no “invalid encoding” as such, only undefined codepoints.
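To make those bit patterns visible, here’s a short sketch (mine, not taken from the script) that prints the bytes of ‘ä’ in both encodings:

```clojure
;; The same 'ä' (codepoint U+E4), encoded once as UTF-8 and once as ISO-8859-1.
(defn bits [b]
  (Integer/toBinaryString (bit-and b 0xFF)))

(println (map bits (.getBytes "\u00e4" "UTF-8")))       ; (11000011 10100100)
(println (map bits (.getBytes "\u00e4" "ISO-8859-1")))  ; (11100100)
;; Read as UTF-8, the lead byte 110..... announces a two-byte sequence and
;; 10...... marks the continuation byte. The single ISO-8859-1 byte 11100100,
;; read as UTF-8, would announce a three-byte sequence instead - which is
;; exactly why a stray ISO-8859-1 character trips up a UTF-8 decoder.
```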
And now we’re slowly getting back to our original problem.
Mixed-encoding files: �
The problem with mixed-encoding files is that text editors generally presume a single valid encoding per file. Which makes sense, because it’s impossible to reliably determine the character set of just one or a few bytes; for whole files, it’s much easier. So if you’ve somehow generated a mixed-encoding file, your best option is to open it in one encoding, replace all “incorrect” characters with some ASCII stand-in, save it, reopen it in the other encoding, replace the ASCII stand-ins with the correct characters in that encoding, and save it again.
Then I had the revelation that it should be possible to decode mixed-encoding files automatically when you know the likely fallback encoding: anything that fails to decode as UTF-8, you try to decode in the fallback encoding instead, and print the result. This works for the simple pairing of UTF-8 and ISO-8859-1, and I think other “fallback encodings” besides ISO-8859-1 should work as well with the way I’ve implemented it.
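As a rough illustration of that idea, here’s a minimal sketch using java.nio’s CharsetDecoder with CodingErrorAction/REPORT. To be clear, this is not the two-reader implementation described next, just the same fallback logic expressed differently; it runs under plain Clojure, though I haven’t checked whether Babashka exposes every java.nio class used here, and the file name in the usage comment is hypothetical.

```clojure
(import '(java.nio ByteBuffer CharBuffer)
        '(java.nio.charset Charset CodingErrorAction))

(defn decode-with-fallback
  "Decode data as UTF-8; decode any malformed byte sequences with the
  charset named by fallback-name (e.g. \"ISO-8859-1\") instead."
  [^bytes data fallback-name]
  (let [bb       (ByteBuffer/wrap data)
        utf8     (doto (.newDecoder (Charset/forName "UTF-8"))
                   (.onMalformedInput CodingErrorAction/REPORT)
                   (.onUnmappableCharacter CodingErrorAction/REPORT))
        fallback (Charset/forName fallback-name)
        out      (CharBuffer/allocate (* 2 (alength data)))]
    (while (.hasRemaining bb)
      (let [result (.decode utf8 bb out true)]
        (when (.isError result)
          ;; grab the offending bytes and run them through the fallback charset
          (let [bad (byte-array (.length result))]
            (.get bb bad)
            (.put out (.decode fallback (ByteBuffer/wrap bad)))))))
    (.flip out)
    (.toString out)))

;; Hypothetical usage, e.g. on the mixed.properties file from above:
;; (print (decode-with-fallback
;;          (java.nio.file.Files/readAllBytes (java.nio.file.Path/of "mixed.properties"))
;;          "ISO-8859-1"))
```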
The way I’ve actually approached it is very dirty: I open the file as a byte stream and create two readers on that stream, one with UTF-8 encoding and one with the fallback encoding. I read from the UTF-8 reader character by character until I receive the UTF-8 replacement character �, which indicates an invalid UTF-8 sequence. I then reset the stream to the position it was at before I read the replacement character, read one character from the fallback reader in the fallback encoding, and continue reading UTF-8 data happily.
Clojure implementation
Finally, the source! This should also be runnable as regular Clojure source code, but I’ve added Babashka as the shebang to make it start faster. I’m using the Clojure command line parsing library, which is nice, except that you can’t specify required arguments.
I was also surprised that the Java readers return the UTF-8 replacement character instead of throwing an exception, but that was easy to adapt to. The magic 15 argument to the mark calls in lines 28 and 39 is just the number of bytes we want the stream to cache: it should be large enough to cover the largest character we expect in the fallback encoding, and 15 is plenty here.
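You can see that replacement behaviour in isolation by feeding a reader a single stray ISO-8859-1 byte (a small demo of my own, not part of the script):

```clojure
;; A stray ISO-8859-1 'ü' byte (0xFC) is silently turned into U+FFFD by a
;; reader with the default error handling, instead of throwing:
(import '(java.io ByteArrayInputStream InputStreamReader))

(with-open [r (InputStreamReader.
                (ByteArrayInputStream. (byte-array [(unchecked-byte 0xFC)]))
                "UTF-8")]
  (println (format "U+%04X" (.read r))))   ; => U+FFFD
```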
Besides the usual ramp-up cost of coding in a language that isn’t part of my core toolkit, this little exercise was surprisingly pleasant. I also find it surprising that a similar utility doesn’t already exist.
1. See Stackoverflow. ↩︎
2. To make it more confusing, Ö can also be represented by the single codepoint U+D6. This serves to ease conversion from legacy character sets. ↩︎
3. I’m unsure how this works with emoji standardisation, as emoji often consist of multiple characters. Specification for the result of composition seems to be outside the bailiwick of the Unicode consortium, but 🤷. ↩︎