-
Notifications
You must be signed in to change notification settings - Fork 645
Egg #5: Unicode Support
This post is a first draft of a proposal for adding Unicode support for strings in Idris.
The goal of this document is to outline different approaches along with their advantages and disadvantages. Hopefully this will lead to a discussion and the best alternative being chosen.
There are two possible ways to add Unicode capabilities to Idris strings:
-
Unicode strings are defined in Idris source code as in
Data.Text
by @ziman. -
A new primitive type for Unicode strings is introduced and primitive operations are extended to this new string type.
The first approach has the advantage of requiring fewer primitive
operations (no new ones will have to be added) and that more string
operations could be directly implemented in Idris code. Furthermore,
Data.Text
already implements a number of String operations, and these
could be reused.
The main advantage of the second method is that the different backends would be free to choose how they implement Unicode strings. This allows them to re-use their platforms native Unicode string libraries if these exist (e.g. Java and JavaScript). Furthermore, interoperability with other code on the platform becomes easier, because the conversion step between Idris' internal representation of Unicode strings and the platform's representation is avoided.
A brief overview of how strings currently work in Idris, to put the proposed changes in context.
-
Literal strings in the Idris source code are represented as Haskell
String
s during compilation. The C back-end escapes non-ASCII characters when these are written out as C-string literals. -
Literal characters in the Idris source code are represented as Haskell
Char
s during compilation. These are mapped toint
s in the C back-end. -
Idris has built-in types
String
andChar
. Occurrences of these types in Idris code are mapped in TT to the constructorsStrType
andAType ITChar
of the data typeConst
.AType
stands (as far as I understand) for arithmetic type. -
There are the following primitive functions on
String
s andChar
s which may or may not be implemented by the different back-ends. These are given insrc/IRTS/Lang.hs
.-- Primitive operators. Backends are not *required* to implement all -- of these, but should report an error if they are unable data PrimFn = -- ---- 8< ---- | LStrHead | LStrTail | LStrCons | LStrIndex | LStrRev | LStrConcat | LStrLt | LStrEq | LStrLen | LIntStr IntTy | LStrInt IntTy | LFloatStr | LStrFloat | LPrintNum | LPrintStr | LReadStr | LChInt IntTy | LIntCh IntTy -- ---- >8 ----
-
libs/prelude/Prelude/Strings.idr
defines functions onString
in terms of the primitive string functions described above.libs/prelude/Prelude/Chars.idr
uses the primitive functionLChInt
andLIntCh
and defines additional functions for characters based on these. -
Functions for reading from and writing to file handles are defined in
libs/prelude/Prelude.idr
andlibs/prelude/IO.idr
. These are defined by calls via the foreign function interface and make use of the fact thatString
s andChar
s are easily convertible toFString
andFInt
.
This alternative consists of the following tasks:
-
Change the current
String
type toByteString
. -
Move
Data.Text
to the main Idris repo and make theString
type whatData.Text.Text
is now.Data.Text.Text
represents Unicode strings as UTF-8 encoded bytestrings. -
Let literal strings be polymorphic in the same way that literal numbers are.
-
Let characters represent Unicode codepoints and change primitive functions which use characters (e.g. consing characters onto strings) accordingly.
-
The FFIs must handle strings as UTF-8 sequences and marshall these to their plattform's strings when needed.
-
Idris source files will be expected to be UTF-8 encoded (everything else is a compile time error).
-
Literal strings and characters are thus UTF-8 encoded.
-
Unicode escapes: It may be necessary to give characters by escaping some code points. Such code points can be given by
'\uXXXX'
whereXXXX
is the code point given as hexadecimal number.
-
The Idris built-in type
String
will be Unicode capable. To make this as obvious as possible it will be renamed toUStrType
in the internal representation. -
A new built-in type
ByteString
will be introduced. Internally this will map to theBStrType
. -
A
Char
will represent a Unicode code point.
-
The built-in primitive functions will need versions for Unicode and byte strings. Some will only need to be implemented for one of the two.
-
Both Unicode and byte string versions:
LStrHead
,LStrTail
,LStrCons
,LStrIndex
,LStrRev
,LStrConcat
,LStrLt
,LStrEq
,LStrLen
-
Only Unicode (byte string versions may be done in user space if needed):
LIntStr IntTy
,LStrInt IntTy
,LFloatStr
,LStrFloat
,LPrintNum
, -
Only byte string:
LPrintStr
,LReadStr
It is expected that we will have more user space functions that wrap the respective primitive versions for convenience.
-
-
Characters will work with the new Unicode characters.
LChInt IntTy
,LIntCh IntTy
No such functions are required to work with byte characters, these are just bytes ;)
-
Conversion functions: Additionally, the following primitive functions will be introduced to convert between Unicode and byte strings:
LFromUtf8
: Takes a byte string which must be legal UTF-8 and returns a Unicode string (or produces an error if the byte string is not legal UTF-8).LToUtf8
: Takes a Unicode string and returns the UTF-8 representation as a byte string.Other conversion functions (e.g.
LFromUtf16
) are possible but need not be primitives (although this may be more efficient for back-ends which use these encodings internally for their strings).
I think that conversion between Unicode strings and FFI strings should
also be handled by the respective back-ends, because conversion could be
necessary (e.g. C) or not (e.g. JavaScript). This could be done implicitly
by the back-end or with new built-in conversion functions LUStrToFStr
and LFStrToUStr
.
Additionally functions in the prelude, especially in Prelude.idr
and
IO.idr
and Prelude/Char.idr
will have to be updated. The functions
dealing with IO
in Prelude.idr
and IO.idr
will need to work with
ByteString
s and convert to and from Unicode correctly. The character
functions will have to be beefed up to deal with all of unicode. (I
hope that I can steal most of these things from Rust's
implementation.) I believe that most functions Prelude/String.idr
can stay the way they are.
This still leaves us with the question of how we will encode Unicode strings in the back-ends we control (C and LLVM).
The choice of the internal representation is fairly orthogonal to the rest of this proposal: Since the representation is opaque in user-space it can be swapped out later.
That said, I believe that Unicode strings should be UTF-8 encoded. UTF-8 has the following advantages:
-
It is the most space efficient encoding.
-
It is the standard in the *nix world and on the net. This avoids re-encodings.
-
It is not byte-order sensitive.
-
It is backward compatible: 7-bit ASCII strings are valid UTF-8. No null bytes.
The main disadvantage is that UTF-8 does not allow O(1) string indexing.
I think this might actually be an advantage, because indexing does not seem to be a very common operation with strings. Furthermore, giving up O(1) indexing allows us to choose more advanced internal representations (e.g. Ropes (PDF) or something similar) which will perform other common string operations (e.g. appending and prepending) more efficiently.
Rust, Go and Ruby currently use UTF-8 as their internal representation.
-
Operations on strings should be abstracted into one or more typeclasses. This will allow us to swap in a different string type (e.g. a lazy string) where appropriate. Many properties are probably already covered by common typeclasses (e.g. Foldable, Traversable, Monoid).
-
We may want to have more (non-primitive) string types (e.g. lazy unicode and byte strings) in the future.
-
We probably want to have more efficient primitive string operations. This will likely include destructive update, I am not sure what the right way to do this is in Idris (Haskell's ST Monad?).
-
Heavier string and unicode operations (e.g. transcoding, normalization, ...) will be implemented in a separate library (and may be backend specific - in C/LLVM we may want to depend on ICU).
-
- Raw memory handling for Idris' effects library
-
- Raw memory blocks with copy-on-write semantics.
-
-
Uses Idris's build in Strings for storing byte strings
-
Uses its byte strings to store encoded strings (with an encoding)
-
Can encode code points to utf-8 and back.
-
Doesn't use Data.Buffer
-
Some additional thoughts on the common unicode string encodings out there:
UTF-8 has the following advantages (see also UTF-8 Everywhere):
-
Space efficient: Common ASCII characters only take one byte. This is very common, because text processed by a computer (textfiles, XML, HTML, ...) often contain a large number of control characters. UTF-8 characters take at most 4 bytes, this is the same 'worst-case' as other encodings.
-
Prevalent: UTF-8 is the default encoding on *nix systems and also used for much of HTML and XML and other text-based file formats. This means such text must not be encoded when read or decoded when written, which is makes this format more efficient.
-
Not byte-order sensitive.
-
Backwards compatible: 7-bit ASCII is valid UTF-8. No null bytes.
And the following disadvantages:
- No constant time string indexing (see separate point why I think this is acceptable).
UTF-8 is now used as the internal encoding for strings by Rust, Go and Ruby.
UTF-16 was initially the same as UCS-2. It was intended as a fixed-byte-width character encoding but is now also variable width because unicode code-points have 21 bits.
This is what Java and C# use. At least in the case of Java this was motivated by historical reasons, in C# I am not so sure (maybe it was chosen because this encoding is prevelant in Windows).
It can be argued that this encoding has the worst of variable-width and fixed-width encodings: It requires a byte-order-mark and is less memory efficient than UTF-8 for much text that is found on a computer. One also loses constant time character indexing. However, 4-byte characters appear sufficiently infrequently that many implementations don't get them right. (UTF-8 Everywhere, Should UTF-16 be considered harmful?).
UTF-32 is fixed width encoding which uses four bytes per character.
This has the advantage of giving strings which allow O(1) access.
However, this encoding is very space-inefficient. Unicode needs at most 21 bits to represent all characters, which means that 9 of the 32 bits of each character will never be used. This is in addition to the fact that many characters are in the 7-bit ASCII charset and will only need one byte in UTF-8 and two bytes in UTF-16.
Python represents unicode strings in one of three ways: Either
as latin-1
encoded, UCS-2
(if there is at least one character
that is not in latin-1
, or UCS-4
(same as UTF-32) if there is
at least one characer that is not in UCS-2
.
The advantage of this encoding is that all characters have the same width. However, this encoding has the following disadvantages:
-
UCS-4 is very space inefficient (see above). If there is a single character that is not in the base plane (such as, for example, an emoji smile), then the entire string has to be UCS-4 encoded.
-
Re-encoding: if a character is appended to a string and this new character is outside of the previously used character set the entire string has not only to be copied but must also be re-encoded.
-
Byte-ordering: Multi-byte encodings other than UTF-8 have the byte ordering issue which increases the error potential.
-
Re-encoding of strings from elsewhere: A lot of text nowadays is UTF-8 encoded and a lot of protocols have UTF-8 as their standard encoding. Reading these strings requires an additional decoding step and writing them back requires an encoding step.
-
Complex: The implementation seems quite complex. The main benefit of this approach seems to be that one has constant time indexing.
-
UTF-8 Everywhere lists arguments for using UTF-8 as the universal string encoding.
-
Should UTF-16 be considered harmful? Discusses some disadvantages of UTF-16
Binary Packages
Tool Support
Community
- Libraries, available elsewhere
- Idris Developer Meetings
- The Zen of Idris
- Non English Resources
Development