
[Scheme-reports] dealing with non-BMP characters [was: Reformulated numeric-tower ballot]

On 05/16/2014 02:01 PM, John Cowan wrote:
> Here's the current list of Schemes that support the full numeric and
> character towers:  Racket, Gauche, MIT, Gambit, Chicken (with eggs),
> Scheme48/scsh, Kawa, Chibi, Guile, Chez, Vicare, Larceny, Ypsilon, Mosh,
> IronScheme, STklos, KSi.  In addition, the Java/CLR based Schemes (other
> than Kawa) almost do: they support characters up to U+FFFF.

You're giving Kawa *slightly* more credit than it deserves when it comes
to handling non-BMP characters: It's somewhat schizophrenic in how it
deals with surrogates.  The character type handles non-BMP characters,
and read-char, peek-char, and write-char convert these properly.

However, string-ref and string-set! just work on 16-bit code units -
including raw surrogates.  The substring operations use code unit
offsets too.  This is IMO a bug.
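
To make the mismatch concrete, here is a minimal sketch in plain Java (not Kawa code; the class and method names are made up for illustration).  It shows that indexing a java.lang.String by code unit yields a raw surrogate where indexing by code point would yield the full character.

```java
final class CodeUnitDemo {
    // "a", then U+1D11E (MUSICAL SYMBOL G CLEF, a non-BMP character stored
    // as a surrogate pair), then "b".
    static final String S = "a\uD834\uDD1Eb";

    // Length in 16-bit code units: what a code-unit-based string-length sees.
    static int codeUnitLength(String s) {
        return s.length();
    }

    // Length in code points: what a character-based string-length should see.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnitLength(S));    // 4
        System.out.println(codePointLength(S));   // 3
        // Code-unit indexing at position 1 returns only the high surrogate.
        System.out.println(Character.isHighSurrogate(S.charAt(1)));  // true
        // Code-point-aware access at the same offset returns the whole character.
        System.out.println(Integer.toHexString(S.codePointAt(1)));   // 1d11e
    }
}
```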

Fixing this while maintaining Java compatibility isn't easy.  string-ref
is easy if you don't mind giving up O(1) performance.  Since Kawa's string
type is the java.lang.CharSequence interface, one could define new string
type(s) that remain compatible with CharSequence, but have the needed extra
tables to allow O(1) indexing of non-BMP strings.  However, many Java APIs
assume or return java.lang.String (which does implement CharSequence),
and you don't know a priori if these contain non-BMP characters. Also,
Kawa's "immutable string" type is just java.lang.String, and I'd hate to
give that up.

One idea is to accept O(N) string-ref (at least on java.lang.String), but have
the compiler optimize iteration using string-ref, i.e. when the string is
loop-invariant but the index is an iteration variable.  In that case the compiler
can add a parallel index variable that tracks the code unit offset in the char
array.  This isn't trivial, and it doesn't help with substring operations.
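
A rough Java sketch of both halves of this idea, written by hand rather than generated by any compiler (the class and method names are hypothetical): stringRef is the O(N) lookup, and toCodePoints shows the kind of loop the optimization would aim for, with a second variable tracking the code unit offset alongside the code point index.

```java
final class CodePointIndexing {
    // O(N) string-ref: scan from the start to find the code unit offset of
    // the index-th code point, then decode the code point found there.
    static int stringRef(CharSequence s, int index) {
        int offset = Character.offsetByCodePoints(s, 0, index);
        return Character.codePointAt(s, offset);
    }

    // Iteration with a parallel index: i counts code points while offset
    // counts code units, so each step is O(1) and the whole loop is O(N).
    static int[] toCodePoints(CharSequence s) {
        int n = Character.codePointCount(s, 0, s.length());
        int[] out = new int[n];
        int offset = 0;
        for (int i = 0; i < n; i++) {
            int cp = Character.codePointAt(s, offset);
            out[i] = cp;
            offset += Character.charCount(cp);  // advance by 1 or 2 code units
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";
        System.out.println(Integer.toHexString(stringRef(s, 1)));
        System.out.println(toCodePoints(s).length);
    }
}
```

Calling stringRef inside a loop over all indexes would be quadratic; the parallel-offset loop avoids that, but only for the simple iteration pattern.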

Another idea is to have a small cache that maps codepoint indexes to 16-bit
code unit offsets, for immutable strings.  This should perhaps be thread-local.
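
One possible shape for such a cache, as a sketch only: a single thread-local entry remembering the last string looked up and the last (code point index, code unit offset) pair, so that nearby lookups scan only the distance between the two indexes.  The one-entry design, the class name, and the use of identity comparison are all assumptions for illustration.

```java
final class IndexCache {
    // Last lookup made by the current thread.
    private static final class Entry {
        String str;      // string the cached position belongs to
        int cpIndex;     // code point index of the cached position
        int unitOffset;  // matching code unit offset
    }

    private static final ThreadLocal<Entry> CACHE =
        ThreadLocal.withInitial(Entry::new);

    // string-ref that scans relative to the cached position when possible.
    static int stringRef(String s, int index) {
        Entry e = CACHE.get();
        if (e.str != s) {
            // Different string object: restart from the beginning.
            e.str = s;
            e.cpIndex = 0;
            e.unitOffset = 0;
        }
        // offsetByCodePoints accepts a negative delta, so this also scans
        // backwards from the cached position.
        int offset = s.offsetByCodePoints(e.unitOffset, index - e.cpIndex);
        e.cpIndex = index;
        e.unitOffset = offset;
        return s.codePointAt(offset);
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb\uD834\uDD1Ec";
        for (int i = 0; i < 5; i++) {
            System.out.println(Integer.toHexString(stringRef(s, i)));
        }
    }
}
```

Sequential access costs O(1) per step with this scheme, while random access is still O(N) in the worst case.  Note also that the entry keeps a strong reference to the last string, which a real implementation might want to avoid.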

With string-set! we have the further complication that it can change the length
of the underlying char array: replacing a BMP character with a non-BMP one
takes two code units instead of one, and vice versa.  The solution is to just
drop the type of mutable fixed-length strings, and make all mutable strings be
variable-length, perhaps using a gap-buffer.
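
A small Java sketch of the length change, using java.lang.StringBuilder as a stand-in for a variable-length mutable string (it is not a gap buffer, and the stringSet helper is hypothetical):

```java
final class MutableStringSet {
    // string-set! by code point index: replaces the index-th code point,
    // growing or shrinking the underlying storage as needed.
    static void stringSet(StringBuilder sb, int index, int codePoint) {
        int start = sb.offsetByCodePoints(0, index);
        int oldUnits = Character.charCount(sb.codePointAt(start));
        String replacement = new String(Character.toChars(codePoint));
        sb.replace(start, start + oldUnits, replacement);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("abc");
        stringSet(sb, 1, 0x1D11E);            // BMP -> non-BMP: one unit longer
        System.out.println(sb.length());      // 4
        stringSet(sb, 1, 'x');                // non-BMP -> BMP: one unit shorter
        System.out.println(sb.length());      // 3
    }
}
```

The string stays three characters long throughout; only the number of code units changes, which is why a fixed-length char array can't back it.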

Kawa's XQuery implementation correctly indexes strings in terms of codepoints,
but of course performance is hurt - and it doesn't have to deal with updates.
	--Per Bothner
per@x   http://per.bothner.com/

Scheme-reports mailing list