These are the changes within CSL to allow for utf-8 encoded strings.

(1) output streams now have a character position and a byte position.
The character position is not incremented for a 10xxxxxx byte (second
or subsequent byte of a utf-8 sequence) but the byte position is. However
the character position is incremented by several if a TAB is sent.
I am not going to cope with Unicode combining characters or cases such
as zero-width space: they will all be counted as having one character
position to fill. Ha ha backspace as well!
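
As an illustration of (1), here is a minimal sketch in C of how the two
positions could be kept in step. The names out_pos, note_byte and TAB_WIDTH
are hypothetical and not taken from the CSL sources. Tab stops every 8
columns and resetting both counts at a newline are assumptions.

```c
#include <stddef.h>

/* Hypothetical stream state: the names are illustrative, not CSL's own. */
typedef struct {
    size_t byte_pos;   /* bytes written on the current line */
    size_t char_pos;   /* character positions used on the current line */
} out_pos;

#define TAB_WIDTH 8    /* assumed spacing of tab stops */

/* Update both positions for one byte sent to the stream. */
void note_byte(out_pos *p, unsigned char c)
{
    if (c == '\n') {                   /* assumed: a newline resets both */
        p->byte_pos = 0;
        p->char_pos = 0;
        return;
    }
    p->byte_pos++;
    if ((c & 0xc0) == 0x80) return;    /* 10xxxxxx: continuation byte */
    if (c == '\t')                     /* advance to the next tab stop */
        p->char_pos += TAB_WIDTH - (p->char_pos % TAB_WIDTH);
    else
        p->char_pos++;                 /* everything else counts as one */
}
```

Combining characters, zero-width space and backspace all fall into the last
branch, so each of them is counted as one position.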

(2) lengthc and widelengthc both exist. The first counts the number of
bytes in the item presented, the second the number of characters.
All widelengthc actually does is to ignore 10xxxxxx bytes.
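
The character count described in (2) can be sketched as below. The function
name utf8_char_count is hypothetical, and the sketch works on a raw byte
buffer rather than on a Lisp item as the real widelengthc would.

```c
#include <stddef.h>

/* Count characters in a utf-8 buffer by ignoring 10xxxxxx bytes. */
size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; i++)
        if ((s[i] & 0xc0) != 0x80) n++;   /* not a continuation byte */
    return n;
}
```

For the four bytes ce b1 62 63 (alpha followed by "bc") the byte count is 4
and this function returns 3.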

(3) for prin2 style printing and the calculation as to whether a string
or symbol will go beyond end of line it is necessary to disregard any
10xxxxxx bytes.

(4) for prin1/print printing within strings if a character that is
represented by a two-byte utf-8 sequence is found it gets displayed as
#xxx;. A 3-byte sequence becomes #xxxx; while 4-byte sequences all
end up as #xxxxxx;. Well doing things this way is ultra conservative and
guarantees re-inputability even if the external file goes via a path where
only plain old ASCII is supported. An alternative stance is that multi-
byte sequences can remain as they are and are then single multi-byte
characters in the output. Also when mapping I will turn 3-byte utf-8
sequences into #xxxxxxx; not #xxxx; in part because that is easier for me
when I need to deal with case folding (case folding may change the number
of hex digits needed to denote a character).
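
A minimal sketch of the conservative scheme in (4) follows. It assumes that
each x in the patterns quoted above stands for one hex digit, so that 2, 3
and 4 byte sequences get 3, 4 and 6 digits respectively. The widths live in
one table so that a different padding is a one-line change. The names
utf8_decode and escape_utf8 are hypothetical, and the input is assumed to be
well-formed utf-8.

```c
#include <stdio.h>
#include <stddef.h>

/* Decode one utf-8 sequence starting at s, store the code point in *cp and
   return the number of bytes used. Assumes well-formed input. */
size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xe0) == 0xc0) {
        *cp = ((unsigned long)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
        return 2;
    }
    if ((s[0] & 0xf0) == 0xe0) {
        *cp = ((unsigned long)(s[0] & 0x0f) << 12) |
              ((unsigned long)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
        return 3;
    }
    *cp = ((unsigned long)(s[0] & 0x07) << 18) |
          ((unsigned long)(s[1] & 0x3f) << 12) |
          ((unsigned long)(s[2] & 0x3f) << 6) | (s[3] & 0x3f);
    return 4;
}

/* Write an ASCII-only form of the NUL-terminated utf-8 string s into out.
   Output is silently truncated if out is too small. */
void escape_utf8(const unsigned char *s, char *out, size_t outsize)
{
    /* assumed hex digit counts, indexed by sequence length in bytes */
    static const int digits[5] = {0, 0, 3, 4, 6};
    size_t used = 0;
    while (*s != 0 && used + 9 < outsize) {
        unsigned long cp;
        size_t n = utf8_decode(s, &cp);
        if (n == 1) out[used++] = (char)cp;
        else used += (size_t)sprintf(out + used, "#%0*lx;", digits[n], cp);
        s += n;
    }
    out[used] = 0;
}
```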

(5) Also when printing strings, if the string contains a sub-string of the
form #WORD; (which might be read back as encoding a special character) that
sub-string will be displayed as #hash;WORD; so as to avoid corruption on
re-input. This is required even
if I allow myself to put multi-byte characters into the output. I will do
this regardless of whether the WORD is one that is the name of a character.
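
The protection in (5) might be sketched as follows. What counts as a WORD is
not pinned down here, so the sketch assumes one or more ASCII letters or
digits. The names looks_like_escape and protect_hash are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if s points at "#WORD;" where WORD is one or more ASCII
   alphanumerics (that definition of WORD is an assumption). */
int looks_like_escape(const char *s)
{
    if (*s != '#') return 0;
    const char *p = s + 1;
    while (isalnum((unsigned char)*p)) p++;
    return p > s + 1 && *p == ';';
}

/* Copy s to out, turning the '#' of any "#WORD;" into "#hash;".
   Output is silently truncated if out is too small. */
void protect_hash(const char *s, char *out, size_t outsize)
{
    size_t used = 0;
    while (*s != 0 && used + 7 < outsize) {
        if (looks_like_escape(s)) {
            memcpy(out + used, "#hash;", 6);
            used += 6;
        } else {
            out[used++] = *s;
        }
        s++;
    }
    out[used] = 0;
}
```

No table of character names is consulted, which matches the decision to
protect every #WORD; whether or not WORD names a character.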

(6) I have options in the print code for folding case (consider the
Common Lisp *print-case* variable). When case folding is requested then
utf-8 sequences will need to be extracted, wide-character case folding
performed and the result then displayed as in the cases dealt with above.
A question arises as to whether folding the case of a Unicode character can
ever change the number of utf-8 bytes needed to encode it. The first example
that makes
me worry (even though it does not breach the ideal) is "latin letter y
with diaeresis" where the lower case version is u+00ff and the upper case
is u+0178: distinctly not adjacent. But both of them use the same number
of utf-8 bytes. A program utf8check.c looked for problems and in the
default locales on Linux, Macintosh and native Windows it showed up no
worries, but under cygwin it notes that u+023a (latin capital letter a with
stroke, two utf-8 bytes) might fold down into u+2c65 (latin small letter a
with stroke, three utf-8 bytes). There are half a dozen other similar cases!
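
The kind of check described in (6) can be sketched as below. This is not the
text of utf8check.c, just a guess at the shape of such a test, and the names
utf8_length and count_length_changes are hypothetical. The result depends on
the locale in force and on the width of wchar_t, so no expected count is
given. A caller wanting the environment's locale would call
setlocale(LC_ALL, "") first.

```c
#include <wchar.h>
#include <wctype.h>

/* Number of bytes needed to encode code point cp in utf-8. */
int utf8_length(unsigned long cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Count code points below limit whose upper or lower case version needs
   a different number of utf-8 bytes from the original. */
int count_length_changes(unsigned long limit)
{
    int n = 0;
    for (unsigned long c = 0; c < limit; c++) {
        unsigned long lo = (unsigned long)towlower((wint_t)c);
        unsigned long up = (unsigned long)towupper((wint_t)c);
        if (utf8_length(lo) != utf8_length(c) ||
            utf8_length(up) != utf8_length(c)) n++;
    }
    return n;
}
```

With this helper utf8_length(0xff) and utf8_length(0x178) are both 2, so the
y with diaeresis pair is harmless, while a fold from a code point below
0x800 to one at or above it changes the length from 2 to 3.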

(7) I note to myself that case folding and re-inputability are orthogonal
issues: printing a string containing, say, a c-cedilla might need to
end up displaying a C-cedilla if case folded, #xe7; if reinputable or
#xc7; if both case folding and re-inputability are required.

(8) printing symbols needs width calculations similar to those used for
strings, however issue (5) above does not arise since the relevant text
would end up rendered as ...!#WORD!;... and the escape before the semicolon
keeps that unambiguous. So the main issue is that multi-byte case folding
may be needed, multi-byte characters must be displayed as #xxxx; sequences
and exclamation mark escapes are needed before any multi-byte character. Or
on an alternative reading I can just display a "!" followed by the multi-byte
sequence. In the CSL code this second behaviour is almost already available
controlled by a flag bit called escape_exploding.

(9) In character handling the tolower and toupper functions have to be
multi-byte aware. Conversion between integers and character objects and
symbols ditto. I am making the value of !$eof!$ a symbol whose name is
a string built up from the utf-8 sequence f7/bf/bf/bf (as it were U+1fffff)
which is well outside the valid range of Unicode characters. I treat it
as having the integer representation -1 as well. I still cache symbols
for u+00 through to u+ff, but codes above that go through intern each time
they are needed. The changes to character functions (such as case folding
etc) end up reasonably extensive even though none seem terribly deep.
A question that is raised is the extent to which this model will make it
reasonable to have strings identified with arrays of characters, in that
the offset to the nth character now becomes dependent on previous characters
and the prospects for writing characters (as distinct from bytes) into
strings seem horrid.
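
To make the indexing concern in (9) concrete, here is a minimal sketch of
finding the nth character of a utf-8 string. The name utf8_offset is
hypothetical. The point is that the lookup has to scan from the start of
the string, where a byte string could simply index.

```c
#include <stddef.h>

/* Byte offset of character number n (counting from 0) in a utf-8 buffer,
   or nbytes if the buffer holds fewer than n+1 characters. */
size_t utf8_offset(const unsigned char *s, size_t nbytes, size_t n)
{
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xc0) != 0x80) {   /* first byte of a character */
            if (n == 0) return i;
            n--;
        }
    }
    return nbytes;
}
```

In the four bytes 61 ce b1 62 ("a", alpha, "b") character 2 sits at byte
offset 3. Overwriting character 1 with a plain ASCII letter would shrink the
string by a byte, which is why writing characters into strings looks
awkward.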

(10) The family of functions "explode" need to split things into characters
not bytes... I will implement that by first converting to a utf-8 encoded
stream of bytes and then collecting byte sequences for characters. This is
not quite the same as merely using princ/prin but redirecting output to
a list of items! Well explodec/explodecn can be like princ but consolidating
multi-byte characters into single integers or big characters: that is not too
bad. But explode/exploden will want to collect (eg) !#alpha; as (!! !#alpha)
or (33 945) rather than (eg) (!! !# a l p h a !;). Now the case that
gives me pause is a symbol whose name is !#word!; - that is OK it should
explode to (!! !# w o r d !! !;) but the string variant is harder in that
(!" !# w o r d !; !") could compress back into something odd, so the
unexpected result needs to be (!" !# h a s h !; w o r d !; !").
In a string prin1 takes a string "#alpha;" into "#03b1;" because (at present?)
I do not use an inverse code-to-name lookup. But explode needs to
turn it into a list of length 3, (!" !#alpha; !") rather than (!" !#
!0 !3 b !1 !;). It seems probable that this behaviour might be sensible for
print as well as explode if there is a stage where I believe that all the
editors that people use can cope with multi-byte characters.
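
The second stage of the plan in (10), collecting byte sequences into whole
characters, might look like this for the explodecn style of result where
each character becomes an integer. The name utf8_to_codes is hypothetical,
the output is a plain C array rather than a Lisp list, and the input is
assumed to be well-formed utf-8.

```c
#include <stddef.h>

/* Gather the bytes of a utf-8 buffer into code points, storing at most max
   of them in codes. Returns the number of code points stored. */
size_t utf8_to_codes(const unsigned char *s, size_t nbytes,
                     unsigned long *codes, size_t max)
{
    size_t n = 0, i = 0;
    while (i < nbytes && n < max) {
        unsigned long cp = s[i++];
        int extra = 0;                  /* continuation bytes expected */
        if (cp >= 0xf0)      { cp &= 0x07; extra = 3; }
        else if (cp >= 0xe0) { cp &= 0x0f; extra = 2; }
        else if (cp >= 0xc0) { cp &= 0x1f; extra = 1; }
        while (extra-- > 0 && i < nbytes)
            cp = (cp << 6) | (s[i++] & 0x3f);
        codes[n++] = cp;
    }
    return n;
}
```

Fed the three bytes 21 ce b1 this gives the two values 33 and 945, matching
the (33 945) example above. The bytes f7 bf bf bf come out as the single
value 0x1fffff.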

(11) I need to work out how to cope with compress and then more general
input from utf-8 sources.

(12) on Windows I suspect that reading a wide character will sometimes
return a high surrogate, and I will need to read the following low
surrogate and combine the data.
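
Combining the two halves as in (12) is standard UTF-16 arithmetic and can be
sketched as below. The function names are hypothetical, and how the wide
characters are actually read on Windows is left aside.

```c
/* Nonzero if w is the first (high) half of a UTF-16 surrogate pair. */
int is_high_surrogate(unsigned int w)
{
    return w >= 0xd800 && w <= 0xdbff;
}

/* Nonzero if w is the second (low) half of a UTF-16 surrogate pair. */
int is_low_surrogate(unsigned int w)
{
    return w >= 0xdc00 && w <= 0xdfff;
}

/* Combine a high and a low surrogate into one code point. The caller is
   expected to have checked both halves with the tests above. */
unsigned long combine_surrogates(unsigned int high, unsigned int low)
{
    return 0x10000UL + ((unsigned long)(high - 0xd800) << 10)
                     + (unsigned long)(low - 0xdc00);
}
```

For example the pair d834 dd1e combines to U+1d11e.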





