Change character units from UTF-16 code unit to Unicode codepoint #376
Comments
I would suggest going even one step further. Why should editors and servers know which bytes form a Unicode codepoint? Right now the specification states it supports only utf-8 encoding, but with the Content-Type header I guess there is an idea of supporting other encodings in the future too. I think it would then be even better to use the number of bytes instead of UTF-16 code units or Unicode codepoints. |
@MaskRay we need to distinguish between the encoding used to transfer the JSON-RPC message. We currently use The column offset in a document assumes that, after the JSON-RPC message has been decoded, the parsed string document content is stored in UTF-16 encoding. We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support. If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change or would need to be a capability the client announces. Regarding byte offsets: there was another discussion about whether the protocol should be offset based. However, the protocol was designed to support tools and their UI; for example, a reference match in a file could not be rendered in a list using byte offsets. So the client would need to read the content of the file and convert the offset into line / column. We decided to let the server do this since the server has very likely read the file before anyway. |
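To make the offset-to-position conversion mentioned above concrete, here is a minimal sketch of what either side would have to do once it has the file content in memory. The `Position` shape mirrors the protocol's line / character pair; the function name `offsetToPosition` is hypothetical, and the sketch assumes the offset is counted in UTF-16 code units of a TypeScript string and that lines end in `\n` (a lone `\r` is not treated as a line break here).

```typescript
// Shape of a protocol position: zero-based line and column.
interface Position {
  line: number;
  character: number;
}

// Hypothetical helper: convert an offset from the start of the text
// (in UTF-16 code units) into a zero-based line / character pair.
// Assumption: only "\n" terminates a line ("\r\n" works because it ends in "\n").
function offsetToPosition(text: string, offset: number): Position {
  const end = Math.max(0, Math.min(offset, text.length));
  let line = 0;
  let lineStart = 0;
  for (let i = 0; i < end; i++) {
    if (text.charCodeAt(i) === 0x0a) {
      line++;
      lineStart = i + 1;
    }
  }
  return { line: line, character: end - lineStart };
}
```

The whole prefix of the file has to be scanned, which is why it matters which side already has the content loaded.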
Source? Isn't the only reason for this that Java/Javascript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice for the string type in those languages as well, which makes it dubious to optimize for that case. The source code itself is usually UTF-8 (or just ASCII) and, as has been said, this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16. |
Citation needed? ;) Of the 7 downstream language completers we support in ycmd:
* full disclosure, I think these use code points, else we have a bug! The last is a bit of a fib, because we're integrating Language Server API for java. However, as we receive byte offsets from the client, and internally use unicode code points, we have to reencode the file as utf 16, do a bunch of hackery to count the code units, then send the file, encoded as utf8 over to the language server, with offsets in utf 16 code units. Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets. I don't know for certain all of them, but certainly the main one (Vim) is not able to provide utf 16 code units; they would have to be calculated. Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is such, and changing it would be breaking. Just my 2p |
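As an illustration of the "hackery" described above, here is a rough sketch of converting a UTF-8 byte column on one line into a UTF-16 code unit column. This is not ycmd's code; the function names are hypothetical, and the sketch assumes the line is already available as a TypeScript (UTF-16) string and that the byte column falls on a character boundary.

```typescript
// Number of UTF-8 bytes needed to encode one code point below U+10000.
function utf8LengthOfBmpUnit(unit: number): number {
  if (unit < 0x80) return 1;
  if (unit < 0x800) return 2;
  return 3;
}

// Hypothetical helper: walk the line, adding up UTF-8 byte lengths until
// the requested byte column is reached, and return the UTF-16 index there.
function byteColumnToUtf16Column(line: string, byteColumn: number): number {
  let bytes = 0;
  let i = 0;
  while (i < line.length && bytes < byteColumn) {
    const unit = line.charCodeAt(i);
    const next = i + 1 < line.length ? line.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      // A surrogate pair is one code point above U+FFFF: 4 bytes in UTF-8.
      bytes += 4;
      i += 2;
    } else {
      bytes += utf8LengthOfBmpUnit(unit);
      i += 1;
    }
  }
  return i;
}
```

For pure ASCII lines the two columns are identical, so the conversion only changes the result when the line contains non-ASCII text before the position.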
Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion
|
Emacs uses some extended UTF-8 and its functions return numbers in units of Unicode codepoints. https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657 @vibhavp for Emacs lsp-mode internal representation |
I am sorry in advance if I am saying something stupid right now. I have a question for you guys. My thought process is that if there is a file in an encoding other than any utf, and we use an encoding other than utf in JSON-RPC (which can happen in the future), then why would there be any need for the client and server to know what Unicode is at all?
That's it. It is easy to provide a line-byte offset. So why would it be better to use Unicode codepoints instead of bytes? Let's say, for example, we have a file encoded in iso-8859-1 and we use the same encoding for JSON-RPC communication. There is a character ä (0xE4) that can be represented in at least two ways in Unicode: U+00E4 (ä) or U+0061 (a) U+0308 (¨ - combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally good and correct. If the client uses one and the server another, we have a problem. Simply using a line-byte offset here would avoid these problems. @dbaeumer I think we misunderstood each other, or at least I did. I didn't mean to use a byte offset from the beginning of the file, which would require the client to convert it, but to still use a {line, column} pair, counting the column in bytes instead of utf-16 code units or unicode codepoints. |
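The two Unicode spellings of ä can be compared directly. The sketch below counts the same text in the three units discussed in this thread; the `countUnits` helper and `UnitCounts` type are hypothetical names, and the byte counts are for UTF-8, not for the iso-8859-1 file in the example (where ä is the single byte 0xE4).

```typescript
interface UnitCounts {
  utf16CodeUnits: number;
  codePoints: number;
  utf8Bytes: number;
}

// Hypothetical helper: count a string in UTF-16 code units, Unicode code
// points, and UTF-8 bytes by walking its UTF-16 code units once.
function countUnits(text: string): UnitCounts {
  let codePoints = 0;
  let utf8Bytes = 0;
  let i = 0;
  while (i < text.length) {
    const unit = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      utf8Bytes += 4;
      i += 2;
    } else {
      utf8Bytes += unit < 0x80 ? 1 : unit < 0x800 ? 2 : 3;
      i += 1;
    }
    codePoints++;
  }
  return { utf16CodeUnits: text.length, codePoints: codePoints, utf8Bytes: utf8Bytes };
}

// Precomposed form: U+00E4.
const precomposed = countUnits("\u00e4");
// Decomposed form: U+0061 followed by U+0308 (combining diaeresis).
const decomposed = countUnits("a\u0308");
```

Here `precomposed` is 1 code unit, 1 code point, 2 bytes, while `decomposed` is 2 code units, 2 code points, 3 bytes. The counts differ in every unit, so whichever unit is chosen, client and server need to be looking at the same sequence of code points.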
Are you serious? UTF-16 is one of the worst choices from the old days, made for lack of alternative solutions. Now we have UTF-8, and to choose UTF-16 you need a really good reason rather than a very brave assumption about the implementation details of every piece of software in the world, especially if we consider future software. This assumption holds on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable as the leadership of this project is from Microsoft, but this is too much. This reminds me of the Embrace, extend, and extinguish strategy. If this is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision making forever. |
Just to be clear, I don't work for Microsoft, and generally haven't been a big fan of them (being a Linux user myself). But I feel compelled to defend the LSP / vscode team here. I really don't think there's a big conspiracy here. From where I stand, it looks to me like the Vscode and LSP teams are doing their very best to be inclusive and open. The UTF-8 vs UTF-16 choice may seem like a big and important point to some, but to others, including myself, the choice probably seems somewhat arbitrary. For decisions like these, it is natural to write into the spec something that conforms to your current prototype implementation, and I think this is perfectly reasonable. Some may think that is a mistake. As this is an open spec and subject to change / revision / discussion, everyone is free to voice their opinion and argue what choice is right and whether it should be changed... but I think such discussions should stick to technical arguments; there's no need to resort to insinuations of a Microsoft conspiracy (moreover, these insinuations are really unwarranted here, in my opinion). |
I apologize for bringing my political views into my comment. I was over-sensitive due to traumatic memories of Microsoft from the old days. Now I see this spec is in progress and subject to change.
I think this is fine. An optional field which designates encoding mode of indices beside the index numbers. If the encoding mode is set to |
This is causing us some implementation difficulty in clangd, which needs to interop with external indexes. |
Yup, same problem here working on reproto/reproto#34. This would be straightforward if "Line/character can be measured in units of Unicode codepoints" as stated in the original description. |
As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:
If no common encoding can be found, the server will not function with the client. So in the end such a change will force clients to support the union of commonly used encodings. Given this, I am actually not sure the LSP server ecosystem will profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we only have a limited number of clients compared to a large number of servers, so it might not be too difficult for the clients to adopt this. I would appreciate a PR for this that for example does the following:
|
What about using byte indices directly? Using codepoints still requires to go through every single character. |
@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice: Either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the others requirements. Since clients are less numerous, it would seem the more economic choice for this responsibility to fall on them.
This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all characters up until that row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows. Some transcoding work will happen unless the internal encoding of both server and client matches. Edit: The reason I have a preference for codepoints over bytes is that they are inherently unambiguous. All languages dealing with unicode must have ways for traversing over strings and relating the number of codepoints to indexes regardless of what specific encodings are well-supported. |
I think every problem here arises from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is never actually defined on its own.
In my opinion, the first thing we have to do is define the term "character" precisely, or replace it with something else. The lack of a precise definition increases ambiguity and potential bugs. As far as I know, Unicode defines three relevant concepts of text units.
And the closest concept to human's perceived "character" is "Grapheme Cluster" as it counts number of glyphs rather than code. As @udoprog pointed out, transcoding cost is negligible, so accept the cost and choose logically ideal one -- Grapheme Cluster counting. This is better than Code Point and less ambiguous in my opinion. Furthermore, Grapheme Cluster count is very likely being tracked by code editors to provide precise line/column(or character offset) information to end users. Tracking of Grapheme Cluster count wouldn't be a problem for them. There will be two distinctive position/offset counting mode (1)
In LSP3, servers should support both of If Grapheme Cluster counting is unacceptable, UTF-8 Code Unit (==encoded byte count) counting can be considered instead. Character offset becomes irregular indexing number, but it'll be consistent with other part of the spec. |
@eonil Regarding grapheme clusters: The exact composition of clusters is permitted to vary across (human) languages and locales (Tailored grapheme clusters). They naturally vary from one revision to another of the unicode spec as new clusters are added. Finally, iterating over grapheme clusters is not commonly found in standard libraries in my experience. |
@udoprog I see. If grapheme clusters are unstable across implementations, I think it should not be used. |
Hi, thank you for your great work. As one of the client implementers, I think it would be easier to implement by referencing it in the form of This is just one opinion, not a strong request. |
Unless I'm missing something, Position.encoding would be transferred on every message in a number of places (redundantly) and make it possible for it to differ for positions in the same message, so I think (as another client author..) I'd prefer a capability and fixing it for the length of the LSP session. |
Thank you for moving this issue forward.
I'm glad to hear code points is still a consideration. You are correct they are not an encoding, but rather what is being encoded. As far as naming goes, this new property is basically quantifying the atomic unit of text during communication so a name that conveys that would work. Alternatively, you could keep the name
|
@hrsh7th as @puremourning correctly pointed out this makes it more complicated since the encoding can change from request to request. So, I am actually against it. |
Aren't code points just UTF-32 code units? Why would we need another name? |
I think that actual clients often have the following functions.

```ts
function getRangeOfText(range: LSP.Range): string[] {
  ...
}
```

What I wanted to say was that passing the server information that created the Position to every implementation would be a lot of changes. |
I will add utf-32 as well. |
@michaelmesser A Unicode code point is a 21-bit integer and UTF-32 is a standardization for using a 32-bit integer as its storage mechanism. If they wanted to, the Unicode Consortium could introduce UTF-24 (fixed width, 3 byte/24-bit encoding). Even today, without being formally standardized, a server could choose to use UTF-24. The "issue" is that if the language server spec now or in the future ever transmits a byte offset then it will be 4-byte aligned. If the server implements UTF-24 then its code point implementation would be 3-byte aligned. As long as the LSP maintainers guarantee never transmitting a byte offset under any circumstance now or in the future, then this mismatch won't ever occur and you're correct stating that the UTF-32 code unit offset would be equivalent to the code point offset. |
That's not a thing. Alignment is a power of 2. |
Because it's confusing and requires some smarts on the part of the server implementor. A nice thing about code points is that it doesn't matter which unicode encoding you use for your document, you can always index with code points (perhaps inefficiently) without having to re-encode. So if I see that I'm being sent code points I know: great, I don't have to worry about what text encoding I've chosen. On the other hand, if I see that I'm being sent UTF-32 code units I might think that I need to re-encode my document as UTF-32 in order to handle that position! This isn't true: you just interpret them as code points in your existing document and you're good to go, but it's confusing and isn't saying what you mean. I would much prefer to have code points as an explicit option. |
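A small sketch of the point above: a code point column can be resolved against a document held in some other encoding without re-encoding it. Here the document is a TypeScript string (UTF-16 in memory), and the hypothetical helper `codePointColumnToUtf16Index` just walks the line, stepping over surrogate pairs as single code points.

```typescript
// Hypothetical helper: resolve a column counted in Unicode code points to
// an index into a UTF-16 string, without converting the string to UTF-32.
function codePointColumnToUtf16Index(line: string, column: number): number {
  let i = 0;
  let seen = 0;
  while (i < line.length && seen < column) {
    const unit = line.charCodeAt(i);
    const next = i + 1 < line.length ? line.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    // A surrogate pair occupies two code units but counts as one code point.
    i += isPair ? 2 : 1;
    seen++;
  }
  // Columns past the end of the line clamp to the line length.
  return i;
}
```

The same walk works over UTF-8 bytes by stepping over continuation bytes instead of low surrogates, which is the sense in which code points are independent of the storage encoding. The cost is a linear scan of the line.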
@dbaeumer, would it be possible to extend the current text with a clarifying explanation of the different encodings? The messages above show it is not always easy to understand the connection between an encoding and a column position. Thank you. This is what clangd has:
|
Added to 3.17, which shipped today. |
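For readers landing here later: in LSP 3.17 the client lists the encodings it supports in `general.positionEncodings` and the server answers with a single `positionEncoding` in its capabilities, with `utf-16` as the default when nothing is announced. The sketch below shows one way a server might pick. The function `choosePositionEncoding` and its server-preference parameter are hypothetical and not part of the protocol.

```typescript
// The three position encodings defined by LSP 3.17.
type PositionEncodingKind = "utf-8" | "utf-16" | "utf-32";

// Hypothetical server-side helper: return the first encoding the server
// prefers that the client also offers, falling back to the "utf-16"
// default when there is no overlap or the client announced nothing.
function choosePositionEncoding(
  clientEncodings: PositionEncodingKind[] | undefined,
  serverPreference: PositionEncodingKind[]
): PositionEncodingKind {
  const offered: PositionEncodingKind[] =
    clientEncodings !== undefined ? clientEncodings : ["utf-16"];
  for (let i = 0; i < serverPreference.length; i++) {
    if (offered.indexOf(serverPreference[i]) !== -1) {
      return serverPreference[i];
    }
  }
  return "utf-16";
}
```

Because the choice is made once during initialization, every position in the session uses the same unit, which is the capability-based approach argued for earlier in the thread.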
Am I right in saying that the vscode client does not actually support anything other than UTF-16, even though the protocol can now carry this info? I was able to find |
@eric-wieser It's going to support it real fast as soon as enough servers decide to not do the UTF-16 dance. |
There is a bit of a chicken and egg problem there though. Servers don't have motivation to implement UTF-8 support if it doesn't lead to any improvement for users because the client doesn't support it, and they can't test it anyway so it would be irresponsible to ship. Of course VS Code isn't the only client out there, but I generally expect it to be the first adopter when it comes to new LSP features since LSP and VS Code are developed by the same people, more or less. There should probably be another issue opened to track implementation of UTF-8 support in VS Code. |
@digama0 as I tried to point out, the conversion should, when possible, be done where the file lives (see #376 (comment)). Doing this generically on the client will have some bad performance implications since it would require the client to read the file to do the conversion. This could be many files in the case of a reference result or when reporting many diagnostics. |
I'm a little confused here; It seems to me that vscode stores editor contents in codepoints internally, as seems to print out the correct column number for inputs in which the extension API would return the wrong position. |
Supporting code-points (i.e. UTF-32 code units) at least would be super, since that would at least give us a format that doesn't require anyone to do re-encoding. |
Text document offsets are based on a UTF-16 string representation. This is strange enough in that text contents are transmitted in UTF-8.
Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.
Is it more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the solely used encoding? Line/character can be measured in units of Unicode codepoints, instead of UTF-16 code units.
A line cannot be too long and thus doing extra computing to get the N'th Unicode codepoint would not lay too much burden on editors and language servers.
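To give a sense of the extra computing involved, here is a minimal sketch of converting a UTF-16 column on one line into a code point column. The name `utf16ColumnToCodePointColumn` is hypothetical, and the choice to count a column that falls in the middle of a surrogate pair as lying after that code point is an assumption of this sketch.

```typescript
// Hypothetical helper: convert a column counted in UTF-16 code units into
// a column counted in Unicode code points, scanning only the given line.
function utf16ColumnToCodePointColumn(line: string, utf16Column: number): number {
  const end = Math.max(0, Math.min(utf16Column, line.length));
  let i = 0;
  let codePoints = 0;
  while (i < end) {
    const unit = line.charCodeAt(i);
    const next = i + 1 < line.length ? line.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    // Step over both halves of a surrogate pair as one code point.
    i += isPair ? 2 : 1;
    codePoints++;
  }
  return codePoints;
}
```

The work is proportional to the length of the line up to the column, not to the size of the document.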
jacobdufault/cquery#57
Survey: counting method of Position.character offsets supported by language servers/clients
https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0