Proposal: Language Packs #39178

Closed
dbaeumer opened this issue Nov 27, 2017 · 15 comments

@dbaeumer
Member

dbaeumer commented Nov 27, 2017

Language Packs (under construction)

A while ago we opened VS Code's language set to contributions from the community by moving the translation database to Transifex. Since then quite a few languages have been added. However, there is currently no vehicle to install these languages with a stable version of VS Code. The stable version still ships with only the 9 core VS Code languages. Two extra languages (pt_br and hu) have been added to the insider build.

Instead of pre-bundling all new languages with VS Code we should come to a model where these languages can be installed later on, the same way users install additional features via extensions.

How is VS Code localized

I will first outline how VS Code is localized today. The localization consists of the following parts:

  • the developer tagging strings to be translated.
  • automatic extraction of the strings to be translated.
  • pushing the strings to be translated to Transifex.
  • pulling translation from Transifex.
  • building translation bundles.

Tagging strings to be translated

VS Code uses a tagging approach to mark strings to be translated directly in the source code. For this it provides a translation function, nls.localize. Strings passed to that function as an argument are tagged for translation. Strings in single quotes are in general treated as 'technical' strings which don't require any translation. Strings in double quotes outside a localize function call are treated as strings that need translation but have not been tagged yet, and a linter rule flags them as untranslated. A typical nls.localize call looks like this:

nls.localize('TaskService.ignoredFolder', 'The following workspace folders are ignored since they use task version 0.1.0: ');

We also maintain an npm module that allows for the same approach in extension code. The npm module is called vscode-nls.

During normal compile time the strings inside the nls.localize call stay as they are. This ensures quick turnaround cycles during development. Furthermore it is important to note that the source of truth for the strings is the source code (TypeScript and JavaScript). VS Code doesn't maintain resource bundles or property files in other formats.
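As a minimal sketch of what "the strings stay as they are" means during development: a localize function can simply ignore the key and return the English default. The function below is a hypothetical stand-in, not the actual nls.localize implementation, and the {0}-style positional placeholders are an assumption made for illustration.

```typescript
// Hypothetical stand-in for nls.localize in a development build:
// the key is only a tag for the extraction tooling, so it is ignored
// here and the English default message is returned directly.
function localize(_key: string, message: string, ...args: unknown[]): string {
  // Assumption: arguments are referenced as {0}, {1}, ... in the message.
  return message.replace(/\{(\d+)\}/g, (placeholder: string, position: string) => {
    const index = Number(position);
    // Leave the placeholder untouched when no argument was supplied for it.
    return index < args.length ? String(args[index]) : placeholder;
  });
}
```

Because nothing is looked up, a changed string shows up after a plain recompile; no metadata or bundle has to be regenerated first.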

Extracting strings

Strings to be translated are automatically extracted from the source code during build time. This extraction process does the following things:

  • extracts the key and the value and puts them into a special meta data file (json format).
  • replaces the key with an index
  • removes the value from the call.

The meta data file contains all strings with their key / value pair that are used inside VS Code. It is named nls.metadata.json and it is produced during the build process and ships with VS Code.

The above example looks like this in a version we ship:

nls.localize(17, null);
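The extraction step can be sketched roughly as follows. This is a simplified illustration, not the build tooling itself: it uses a regular expression that only handles the simplest call shape (two single-quoted literals), and the names extractStrings and ExtractionResult are hypothetical.

```typescript
interface ExtractionResult {
  code: string;       // source with key replaced by an index and the value removed
  keys: string[];     // keys[i] is the key of the i-th localize call in the file
  messages: string[]; // messages[i] is the English default of the i-th call
}

function extractStrings(source: string): ExtractionResult {
  const keys: string[] = [];
  const messages: string[] = [];
  // Matches only nls.localize('key', 'message' ... with single-quoted literals.
  // Escape sequences inside the message are kept verbatim, not unescaped.
  const pattern = /nls\.localize\(\s*'([^']*)'\s*,\s*'((?:[^'\\]|\\.)*)'/g;
  const code = source.replace(pattern, (_match: string, key: string, message: string) => {
    keys.push(key);
    messages.push(message);
    // The index is the sequence number of the call within this source file.
    return `nls.localize(${keys.length - 1}, null`;
  });
  return { code, keys, messages };
}
```

The keys and messages arrays are the kind of information that would end up in the metadata file, while code is what ships.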

Pushing to Transifex

The content of the nls.metadata.json is then used to upload the strings to be translated to Transifex. Since VS Code has thousands of strings, the translation is grouped into smaller projects to make them easier to handle in Transifex. The following source file describes how these strings are grouped into projects: https://github.com/Microsoft/vscode/blob/master/build/lib/i18n.resources.json#L1

Pulling translations from Transifex

Translations are pulled from Transifex and stored alongside the source in the VS Code GitHub repository. They are all under the i18n folder. Storing the translations together with the source code is necessary to be able to version source code and translations together. Otherwise it would, for example, be very hard to do a recovery build on an older version with exactly the same translations. The first sub folder under the i18n folder is the translated language. The structure underneath the language folder is isomorphic to the source code folder structure under the src folder. However, ts/js source files which don't contain any translatable strings will not have a corresponding i18n.json file. The files in the i18n folder are all machine generated and should never be edited by a developer.

Building translation Bundles

During build time (when we build a shippable version of VS Code) the build process also generates translation bundles per supported language (currently the 9 core languages). These translation bundles have the same granularity as the source bundles. For example, there is a workbench.main.js (which bundles most of our workbench code), so there are corresponding workbench.main.nls.${lang}.js files which contain the translated strings.

These translation bundles are optimized for memory footprint and low CPU consumption when looking up strings. This is achieved using the following two techniques:

  • all key values from the source code are replaced with index lookups in arrays (see above). The index used is a sequence number for the nls.localize call per TS/JS source file.
  • unknown translations are replaced with their default value using the normal language lookup rules (de_ch -> de -> en). As a result the translation bundles are always complete at runtime, only one set needs to be loaded, and no dynamic lookup happens.

A translation bundle is statically linked to a VS Code version and will very likely not function correctly with a different VS Code version.
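The two techniques above can be sketched together as a build-time step that produces one complete array per language. This is a sketch under assumptions: the function names, the key-to-message input shape, and the way the fallback chain is derived from the language tag are all hypothetical.

```typescript
type Translations = Record<string, string>; // key -> translated message

// Derives the lookup chain, for example 'de_ch' -> ['de_ch', 'de', 'en'].
function fallbackChain(language: string): string[] {
  const chain = [language];
  const separator = language.indexOf('_');
  if (separator !== -1) {
    chain.push(language.substring(0, separator));
  }
  if (chain.indexOf('en') === -1) {
    chain.push('en');
  }
  return chain;
}

// keys[i] is the key of the i-th localize call; the result holds the
// message for call i at position i, so the runtime needs no key lookup.
function buildBundle(
  language: string,
  keys: string[],
  available: Record<string, Translations>
): string[] {
  const chain = fallbackChain(language);
  return keys.map(key => {
    for (const candidate of chain) {
      const table = available[candidate];
      if (table !== undefined && table[key] !== undefined) {
        return table[key];
      }
    }
    // English is assumed to be complete, so this should not happen.
    throw new Error(`No message found for key ${key}`);
  });
}
```

Because the fallback is resolved while building, the resulting array is only valid for the exact key order it was built against, which is why such a bundle is tied to one VS Code version.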

This optimization happens for all VS Code core code and our built in extensions. The mechanism and the necessary build tools are also available for outside extensions via the vscode-nls-dev npm module.

Language Packs

It is desirable that language packs come as extensions and are managed by the Marketplace. We don't want to add another channel to host language packs, nor do we want to ship all languages in the box (size, language deprecation, ...). To provide such language pack extensions we need to explore two things:

  • how would we build such a language pack extension
  • do we have special extension related requirements for such a language pack extension

Building Language Packs

Especially the optimization we do when building translation bundles (key -> index replacement and default language lookup during build time) makes it harder for third parties to produce language pack extensions. In general we have three choices:

  1. we implement a second translation bundle solution for non core languages which doesn't make use of the optimization described above.
  2. we publish all our nls related build tools as standalone npm packages so that third parties can publish translations as their own extensions using the same optimizations.
  3. we publish all language extensions.

Option 1

Basically we would still replace the key by an index during build time. However during runtime we would do the following for non core languages:

  • load the nls.metadata.json file into memory
  • reverse translate index to key (this information is part of the nls.metadata.json)
  • load non core languages bundle files into memory and build up hash tables (these bundles would be key/value maps)
  • look up the message using the key from a message bundle installed as an extension.
  • if no message is found we follow the language lookup rules (de_ch -> de -> en) and load other language bundles until we find a valid value for a given key (which we do at least for English)
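A rough sketch of the runtime lookup these steps describe, assuming the metadata exposes the keys per module in call order and each language bundle is a plain key/value map per module. The class name and both data shapes are hypothetical.

```typescript
// module id -> key -> message, as a key/value language bundle might look
type KeyedBundle = Record<string, Record<string, string>>;

class DynamicLocalizer {
  constructor(
    // Assumed to come from nls.metadata.json: module id -> keys in call order.
    private readonly keysByModule: Record<string, string[]>,
    // Bundles ordered by the lookup rules, for example [de_ch, de, en].
    private readonly lookupChain: KeyedBundle[]
  ) {}

  localize(moduleId: string, index: number): string {
    // Reverse translate the index found in the shipped code to its key.
    const key = this.keysByModule[moduleId][index];
    for (const bundle of this.lookupChain) {
      const messages = bundle[moduleId];
      if (messages !== undefined && messages[key] !== undefined) {
        return messages[key];
      }
    }
    // English is expected to be complete, so reaching this is an error.
    throw new Error(`No message for ${moduleId}#${key}`);
  }
}
```

Every call pays for the index-to-key translation and up to three map lookups, which is the runtime cost the statically linked bundles avoid.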

We could think about giving up on the current solution later on and only use a dynamic runtime solution instead of the optimized statically linked build time solution. This would avoid loading the nls.metadata.json file but would leave the key in the generated JS file.

Option 2

We publish all build tools that are currently inside the core VS Code repository as standalone npm packages so that third parties can run the same scripts to bundle translation files.

Option 3

We publish all language extensions to the Marketplace during build time. The translations themselves would still come from Transifex. However, the translations would also go into the i18n folder like our core languages do, and our build scripts would generate language extensions and publish them to the Marketplace.

Option 4

We either include all available languages in the normal VS Code build or we have two different VS Code builds: a first which includes the nine core languages and a second called VS Code International which includes all languages currently available in Transifex. The advantage would be no changes to build scripts, loader / start up code or extension installation. However, we would need to create and manage these additional builds.

Proposal

I am favoring Option 3. Pros are:

  • we can treat core languages the same way and basically ship VS Code only with English out of the box.
  • only minor changes to how we read and manage translation bundles during runtime (we only need additional lookup locations).

Cons are:

  • the language pack must match the VS Code version number. So a language pack produced for version 1.18 will not work for version 1.19.
  • we will add more translation files to our GitHub repository (under the i18n folder). However that folder can fully be ignored during development time. The average size of the files for a language is currently 500KB.

I don't like Option 1 since it will add a second translation bundle runtime story with its own set of bugs (at least for a while; I am convinced that the optimizations we do are worth it, especially during startup). If we stick with two different solutions (core / contributed languages), core languages and contributed languages will look different and we will always ship all core languages in the box. The only advantage I can see with option 1 is that it would allow starting VS Code with an outdated language pack since missing strings would dynamically fall back to English during runtime.

Option 2 would be doable and would allow us to treat core and contributed languages the same. However, from our experience with maintaining a separate repository and npm module, it might not be worth it compared to option 3. In addition we would end up with more outdated language packs when we ship a new version of VS Code.

Language Pack Extensions

Since with option 3 (as with option 2) language packs are statically 'linked' against a VS Code version, we would need the following features for extensions (if not already present):

  • exact version matching
  • auto updating. This means when a user installs a new version of VS Code we should automatically update all language packs to the matching VS Code version. If no language pack version is available we would disable the language pack.
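The two requirements could be combined into a single decision made after a VS Code update, sketched below. This is an illustration only: the function and type names are hypothetical, and comparing on major.minor (so that a 1.18 pack does not match 1.19) is an assumption about what "exact version matching" would mean in practice.

```typescript
type PackAction =
  | { kind: 'keep' }
  | { kind: 'update'; version: string }
  | { kind: 'disable' };

// Assumption: packs are matched on major.minor, for example '1.19.2' -> '1.19'.
function releaseOf(version: string): string {
  const [major, minor] = version.split('.');
  return `${major}.${minor}`;
}

function resolveLanguagePack(
  codeVersion: string,
  installedPack: string,
  publishedPacks: string[]
): PackAction {
  const wanted = releaseOf(codeVersion);
  if (releaseOf(installedPack) === wanted) {
    return { kind: 'keep' };
  }
  // Look for a published pack built against the new VS Code release.
  const match = publishedPacks.filter(version => releaseOf(version) === wanted)[0];
  if (match !== undefined) {
    return { kind: 'update', version: match };
  }
  // No matching pack is available, so the pack is disabled.
  return { kind: 'disable' };
}
```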
@vscodebot vscodebot bot added the l10n-platform Localization platform issues (not wrong translations) label Nov 27, 2017
@dbaeumer dbaeumer self-assigned this Nov 27, 2017
@dbaeumer dbaeumer added the under-discussion Issue is under discussion for relevance, priority, approach label Nov 27, 2017
@dbaeumer dbaeumer added this to the November 2017 milestone Nov 27, 2017
@dbaeumer
Member Author

dbaeumer commented Dec 4, 2017

@egamma added Option 4
/cc @chrisdias

@joaomoreno
Member

joaomoreno commented Dec 4, 2017

@dbaeumer Another thought for option 1:

Let's call that index to key translation indexing. Could we index a Marketplace extension at the time the extension is installed/updated? Possibly also after Code itself is updated? Also, we wouldn't load the translation until that indexing has been done? That way, indexing only happens once and loading up the extension would be fast and behave exactly the same as the core languages.

Code can do the indexing in its core at startup and cache it in the user data dir. The first launch would be slow, but consecutive launches would be fast, as long as the extension and Code itself are the same versions.
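The caching idea can be sketched like this. An in-memory Map stands in for a cache stored in the user data dir, and the function name and cache key format are hypothetical.

```typescript
// Stand-in for a cache persisted in the user data dir. The key combines
// both versions, so updating either Code or the extension invalidates it.
function loadIndexedBundle(
  cache: Map<string, string[]>,
  codeVersion: string,
  packVersion: string,
  buildIndex: () => string[]
): string[] {
  const cacheKey = `${codeVersion}/${packVersion}`;
  const cached = cache.get(cacheKey);
  if (cached !== undefined) {
    // Fast path: indexing already happened for this version pair.
    return cached;
  }
  // Slow path: taken on the first launch after an install or update.
  const bundle = buildIndex();
  cache.set(cacheKey, bundle);
  return bundle;
}
```

Only the first launch after a version change takes the slow path; later launches read the already indexed array.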

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2017

@joaomoreno nice idea.

@sandy081 sandy081 modified the milestones: November 2017, December 2017 Dec 6, 2017
@sandy081
Member

sandy081 commented Dec 6, 2017

Moved to December to continue discussion

@sandy081
Member

@joaomoreno I liked the idea of indexing at run time. Adding to it, can we make languages selectable the way themes are? Users could install multiple language extensions and select the language using a picker. Could we do the indexing when the language is selected and then reload the window?

@joaomoreno
Member

That's trickier since UI labels are present in the main process too (menus).

@dbaeumer
Member Author

We do have some sort of picker: F1 > Configure Language

The picked language is a special setting since we need to read it very early in the startup phase to configure the nls plugin of the loader correctly. So it is not stored in the usual settings.

@sandy081
Member

> That's trickier since UI labels are present in the main process too (menus).

How about prompting to quit and restart VS Code for the changes to apply?

> The picked language is a special setting since we need to read it very early in the startup phase to configure the nls plugin of the loader correctly. So it is not stored in the usual settings.

Can we write the selected language into that file after selection?

@dbaeumer
Member Author

Yes, that is what is currently happening when you use F1 > Configure Language.

@sandy081
Member

Right, F1 > Configure Language is currently opening the file.

My thought is to show a picker of languages to select from. Once a language is selected, we can write it into the file and ask the user to restart. Not sure if this functionality already exists.

@dbaeumer
Member Author

Yes, that could directly write to the file and restart VS Code. I had an item for this but we closed it since users don't change language often. Usually only once.

@dbaeumer
Member Author

//cc @aeschli

@dbaeumer
Member Author

@aeschli and I discussed how to best structure the content of a language pack (at least for translations inside the VS Code core, not extensions). The conclusion was to pre-compute as much as possible and to have files as large as possible. For the core I therefore propose the following structure:

{
	"data": {
		"vs/code/electron-main/main": [
			"Guten Tag",
			"Gute Nacht"
		]
	},
	"keys": {
		"vs/code/electron-main/main": {
			"goodMorning": 0,
			"goodNight": 1
		}
	},
	"hashes": {
		"vs/code/electron-main/main": "Key sequence hash"
	}
}

As seen, the file already contains the messages stored as an array. If the hash value matches the one we will store in nls.metadata.json then we can simply take the array. The assumption behind this is that we usually do not change strings in existing files; most of the time we add new strings in new files. If the hashes don't match we use the key information to recompute the array. We also decided to fall back directly to English if a string is missing. So if a user requests de_ch and we are missing strings we will insert the English string and not the German string. This will avoid too many lookups during startup in case we need to recreate the nls bundle.

I did some first performance testing and we should be able to fully regenerate a bundle in ~200ms. This is a price we only pay when a new release is installed. A second startup will not pay that price. And it will only happen for translations we don't ship in the box.

A big unknown is currently translations for bundled extensions. The reason is that we don't do any bundling there and therefore there are quite a few files to write, which is not good performance-wise. To not block the startup for too long we could do this in parallel when the workbench already loads. We would only need to synchronize with the extension host startup.
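The linking rule described above can be sketched as follows. The LanguagePack interface mirrors the proposed structure; the CoreMetadata shape (keys, English messages and hashes per module) and the function name linkModule are assumptions made for illustration.

```typescript
interface LanguagePack {
  data: Record<string, string[]>;               // module -> translated messages
  keys: Record<string, Record<string, number>>; // module -> key -> position in data
  hashes: Record<string, string>;               // module -> key sequence hash
}

// Assumed shape of the information available from nls.metadata.json.
interface CoreMetadata {
  keys: Record<string, string[]>;     // module -> keys in call order
  messages: Record<string, string[]>; // module -> English defaults in call order
  hashes: Record<string, string>;     // module -> key sequence hash
}

function linkModule(moduleId: string, pack: LanguagePack, metadata: CoreMetadata): string[] {
  const english = metadata.messages[moduleId];
  const translated = pack.data[moduleId];
  if (translated === undefined) {
    // The pack knows nothing about this module: use English throughout.
    return english;
  }
  if (pack.hashes[moduleId] === metadata.hashes[moduleId]) {
    // Same key sequence: the precomputed array can be taken as is.
    return translated;
  }
  // Key sequence changed: recompute the array from the key information
  // and fall back directly to English for keys the pack does not have.
  const positions = pack.keys[moduleId] || {};
  return metadata.keys[moduleId].map((key, index) => {
    const position = positions[key];
    return position !== undefined ? translated[position] : english[index];
  });
}
```

The fast path costs one string comparison per module, so the recomputation is only paid for modules whose key sequence actually changed.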

@dbaeumer
Member Author

Actually with the work @joaomoreno is doing we could generate the files during the install process at least for Windows and Mac.

@dbaeumer
Member Author

@aeschli, @sandy081

I coded the language pack generation for the core and the extensions. Since the language pack linking now happens inside the vscode-nls module it would be preferable to have the following file structure in a language pack:

  • one file for the VS Code core (main, renderer, ...). Basically what is defined in nls.metadata.json
  • one file per extension. I changed the build process for the extensions so that they generate a nls.metadata.json as well, which can be used as a basis for the language pack generation for extensions.

@aeschli let's discuss this in more detail when you are back in the office.

@dbaeumer dbaeumer mentioned this issue Jan 29, 2018
@vscodebot vscodebot bot locked and limited conversation to collaborators Mar 17, 2018