[Scheme-reports] Regarding the String & Character Spans Library

Hello, I recently took an interest in the direction R7RS-large would be taking with the strings library and so I began to investigate CharacterSequencesCowan ( http://trac.sacrideo.us/wg/wiki/CharacterSpansCowan ) and its predecessor. After mulling it over, I've come to the conclusion that the spans library isn't really something I want nor need as a language user. While I absolutely agree that R7RS-large needs a string library, and SRFI-13 needs to be updated, I don't think it's the right primitive for the job, and instead think using string cursors directly when needed is a better approach.

Consider the added costs of type-checking to the string cursor procedures in order to determine if the first input argument is either a span or a string (string-cursor-start, string-cursor-end, string-cursor-ref, string-cursor-next, string-cursor-prev, etc.). These procedures are going to be called from your inner most loops and so it would be best if they were as efficient as possible and didn't need to check types. The original proposal of course had separate cursor procedures for both strings and span, but even if this were the case, I still don't think spans are a good idea.

Whenever you need to do some string processing, you're going to need to start with actual string objects. Whether you use indices, cursors, or spans, the results of this processing is likely going to need to be converted back to string objects eventually so that it can be displayed to the terminal, or written to a file/network port, or perhaps stored back in some sort of runtime data structure. For this later case, your data structures could simply store span objects that wrap newly created strings, but this is unneeded baggage when say you only need to mutate strings on occasion. And if I really want to store a cursor or pair of cursors into a string, I could just just store the string and the cursors into my data structure. I don't think it's worth the cost of adding a new span type and a host of span-related functions to the language to support this edge case.

Now, let's consider how spans might work when doing some string processing. Consider a pathname-split procedure that takes a string containing a file path and splits it into its directory path, file name, and extension components. Let's say the file path I'm splitting is "~/src/utilities/frobnicate.scm", and so the expected components would be "~/src/utilities/", "frobnicate" and ".scm". If I'm implementing this procedure using spans, I would end up with three span objects, but the end cursor from one span would need to be be duplicated as the start cursor for the subsequent span. This requires more storage space for the duplicate string (it's the same string for all spans) and cursor cons cells, than if I were to just have variables for the cursors that I needed. It's just not as tuned memory-wise as it could be. And in order to duplicate these cursors across spans, I may have to call string-cursor-end on the previous span, which then requires the previously mentioned type check and dispatch. I can avoid all this if I use only string cursors, and the string cursor procedures always expect string objects and so don't need to do any type checking. Now, you might consider there might be some performance gains by using spans so as not to create new substrings of the file path components. But what am I going to do with them? I'm probably going to append the filename and extension to a new directory path, or I'm going to take the take directory path by itself, and I'm going to fire it off to a file system procedure, or I'm going to save the string out to a file. So I need to convert it to a string anyway. It would be better if the string library just offered a string-join/cursor procedure that took six arguments, one for each string, and two for each pair of cursors, for the substrings I want to join for this use case.

If I'm working with very large strings, and want to have views into this string to avoid creating copies, again I'm probably just going to use string cursors directly to avoid the overhead of working with spans, and so I would want a string library highly tuned for cursors. Alternatively, if I'm mutating subsections of a very large string, I'd consider using a different data structure altogether, such as a rope.

Now, I don't think the concept of a span is a bad idea, I just don't think it's a good fit for Scheme. For example, the string_view template class was recently accepted for standardization into a post-C++14 library TR, and it's pretty much the same idea as the character span. The reason it works well in C++ is largely because of C++'s compile-time polymorphism for templates, the so the generic template functions you write will work with any type that implements the same interface or concept, with essentially zero runtime overhead. Furthermore, the compilers are very mature and can inline everything, optimize out loop invariants, merge duplicated pointer variables into the strings within objects allocated on the stack, and so on.

As was mentioned in the previous thread, Scheme is more monomorphic. And so adding a new span type that acts like a string and trying to make it a first-class citizen that interoperates well with the rest of the libraries just puts more burden on language implementors.

I think a better approach would be along one of two paths:

1) Bring in most of SRFI-13, update it for consistency with the rest of the R7RS language, and then create variants of the procedures with a suffix of "/cursor" appended to their identifier names that work with string cursors in place of indices. For example, you would have string-take/cursor and string-drop/cursor procedures, but no string-take-right/cursor and string-drop-right/cursor procedures, as the later two only make sense with character counts from the end of a string.

Overall, this path would also help ease porting old code.

2) If the above is too large of an API surface, or if you really want to force people to uses cursors over indices, bring in SRFI-13 but change all of the procedures to work with cursors instead of indices. This is more or less what Chibi Scheme's string library does or is in the process of becoming from what I've seen.

Some final thoughts on string cursors: while I think they're significantly better over the old paradigm, I still think there is room for improvement. Cursors eliminate the O(n) overhead of using indices to be sure, but there can still be a significant amount of constant time overhead due to the decoding of the UTF-8 or UTF-16 code units into code points. This entirely unnecessary when doing string or substring comparisons--you can just compare the code unit sequences directly. I'm not decided on if it would make sense to expose a couple of lower-level procedures to give users access to the underlying code units pointed to by string cursors, or if that's too much effort for the payoff. After all, if you really need the performance for the string library, maybe it's better for implementors to just write such procedures as builtin native code functions through the FFI?

Anyway, sorry for the long rant, but I hope it offers a constructive critique.