(DRAFT)
SCORPION PROTOCOL/FILE-FORMAT
Table of contents:
Protocol
Status codes
Detailed status codes
Client certificates
Receive subprotocol
Send subprotocol
Interactive subprotocol
Hashed URI scheme
Document file format
Extended link attributes
Metadata blocks
Data/text sub-blocks
Languages
Automated crawling
Conversion
Unordered Labels File Identification
X.509 extensions
Favicon formats
Superseding certificates
Recommendations and other notes
Dynamic files
Security issues
Missing details
FAQ
=== Protocol ===
The protocol can be TLS or non-TLS. Implementation of TLS is optional but
is recommended; implementation of non-TLS is mandatory. The same port
number can be used for TLS and non-TLS; this is distinguished by the first
byte that the client sends (TLS if it is 0x16 or non-TLS otherwise (in
which case it is the subprotocol byte)).
The default port number (for both TLS and non-TLS) is 1517.
Unless otherwise specified, the text in the protocol is ASCII. (This only
applies to the protocol; the contents of files are not limited to ASCII.)
A request consists of the subprotocol byte, then an optional subprotocol
parameter, then a space, then the absolute URL (the scheme is mandatory),
and then a carriage return and a line feed. (See the sections below for the
definitions of the send and receive subprotocols. "Subprotocols" are similar
to the different methods of HTTP, such as GET and PUT.)
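
As an illustration of the request layout described above, here is a minimal
Python sketch. The function names are hypothetical and are not part of this
specification; the sketch assumes that the URL and the parameter are ASCII
and contain no spaces.

```python
from typing import Tuple


def looks_like_tls(first_byte: int) -> bool:
    """A first byte of 0x16 means TLS; anything else is a subprotocol byte."""
    return first_byte == 0x16


def build_request(subprotocol: str, url: str, parameter: str = "") -> bytes:
    """Subprotocol byte, optional parameter, space, absolute URL, CR LF."""
    if len(subprotocol) != 1:
        raise ValueError("the subprotocol is exactly one byte")
    if " " in parameter or " " in url:
        raise ValueError("a space would make the request ambiguous")
    url = url.split("#", 1)[0]  # the fragment is only for the client
    return f"{subprotocol}{parameter} {url}\r\n".encode("ascii")


def parse_request(line: bytes) -> Tuple[str, str, str]:
    """Return (subprotocol, parameter, url) from one request line."""
    if not line.endswith(b"\r\n"):
        raise ValueError("a request must end with CR LF")
    head, sep, url = line[:-2].decode("ascii").partition(" ")
    if not sep or not head:
        raise ValueError("malformed request")
    return head[0], head[1:], url
```

A server that shares one port between TLS and non-TLS could peek at the
first byte with looks_like_tls before deciding how to read the rest.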
The URL is supposed to have a slash after the host name (or after the port
number if present); the client MUST NOT send a URL that follows the host
name or port number immediately by # or ? or a carriage return. If it does
anyways, the server MUST either treat it as though the slash is present or
issue a redirect to / or to the full URL with / added on the end or to a
URL that those URLs would redirect to; this SHOULD be a permanent redirect.
(The client also should add / if it is missing according to the above in
the case of a link or a redirect to such a URL, too.)
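
The slash rule above can be sketched as follows. This is only an
illustration under the assumption that the URL has the form
"scheme://authority..." and that the authority itself contains no "/", "?"
or "#"; the function name ensure_root_slash is hypothetical.

```python
def ensure_root_slash(url: str) -> str:
    """Insert "/" after the host name or port number when it is missing."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        return url  # not a scheme://authority URL; leave it alone
    # The authority ends at the first "/", "?" or "#", or at the end.
    positions = [rest.find(c) for c in "/?#"]
    end = min((i for i in positions if i != -1), default=len(rest))
    if end < len(rest) and rest[end] == "/":
        return url  # the slash is already present
    return f"{scheme}://{rest[:end]}/{rest[end:]}"
```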
The server then sends the status line, which consists of a two-digit status
code, followed by a space and then the parameters (which may be empty), and
then another carriage return and line feed. Parameters are separated by
spaces, although the last parameter may include spaces (the client will
know how many parameters according to the major status code).
The URL may include a username/password (the usual format of username and
password in a URL is used); if so, the client should not display the
password (unless the user has enabled an option to tell it to not hide
passwords). The client software should also include a command to discard
any existing username/password, in order to log out.
If the URL contains # then the client should not send the # or anything
that comes after it, since that part is only for the client.
If the host name of the URL does not match any host name that this server
serves, or if the scheme is neither "scorpion:" nor "scorpions:", then it
is a proxy request (which the server may refuse by whatever criteria it
chooses). Proxies should normally avoid converting the
files into a different format. (Note that a proxy to another service on
the same server might be implemented without needing to make another
connection, although this is not mandatory and not something that the
client needs to worry about.)
If it is not a proxy request, then the TLS and non-TLS variants of the
same protocol URI scheme should be treated as equivalent by the server
(but not by the client, which MUST treat them differently). (Proxied
requests will treat the scheme in the request like the client does.)
The recommended use of SNI is:
* Clients SHOULD use SNI when connecting to the server, and MAY have an
option to disable SNI (in order to mitigate some types of spying).
* Servers SHOULD NOT require SNI, and SHOULD ignore any provided SNI and
use the host name in the request instead. (Exceptions are possible, e.g.
if it needs to use a different protocol with the same port number and IP
address for some reason, or if a proxy server is used to forward requests
to another server without needing the proxy to handle encryption, etc.)
* If possible, clients SHOULD allow the use of the system's DNS services to
implement encrypted Client Hello; if implemented, there MUST be an option
to disable this feature (although, depending on the implementation, this
option might be a part of a different component of the operating system).
* A server is not required to present a valid certificate if an incorrect
SNI (or no SNI) is provided by the client. Clients that wish to verify the
server's certificate should avoid using incorrect SNI.
=== Status codes ===
The first digit is the major code, and the second digit is the minor code.
Clients may ignore the minor code.
0x = Interactive mode; used only for the "I" subprotocol. After this is
sent to the client then arbitrary two-way communication is possible. The
parameter is optional and is the capability codes if it is not blank.
1x = Requires input; the parameter is the prompt text, which it should
display to the user, and then the user enters any text, and the client
redirects to the same as the current URL but with ? added and then the text
entered by the user, which should be percent-encoded if necessary. If the
existing URL already has a query string, then the result is unspecified
(it is probably best to delete the existing query string, due to how
relative URLs are working); servers MUST NOT use this code in that case.
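
A client's handling of a 1x response might look like the sketch below. It
assumes that the entered text is percent-encoded as UTF-8 and that the
fragment is dropped first; both are assumptions of this example rather than
requirements stated here, and the name input_url is hypothetical.

```python
from urllib.parse import quote


def input_url(current_url: str, user_text: str) -> str:
    """Build the URL to request after the user answers a 1x prompt."""
    base = current_url.split("#", 1)[0]
    if "?" in base:
        # The result is unspecified in this case; this sketch refuses.
        raise ValueError("the current URL already has a query string")
    # safe="" makes quote() escape every reserved character, including "/".
    return base + "?" + quote(user_text, safe="")
```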
2x = The response is OK and the contents of the file will follow; after
this line is the data of the file. The parameters are:
* The file size in decimal notation, or ? if the size is unknown (e.g. if
it is a dynamic file).
* The file type/format. MIME has some problems and ULFI is better, but for
now, for compatibility you can use MIME; however, spaces are not allowed.
If the "charset" parameter is not included then the default according to
the file format should be used; "us-ascii" is a recommended default if it
is not otherwise known. (If there is a possibility of the character set
being unknown, servers SHOULD explicitly specify them; this is unnecessary
if the file contains only ASCII characters.)
* Optionally, the version, which if present must consist only of uppercase
and lowercase letters, digits, and forward slash and plus sign. (Clients
and servers are not required to verify that the version code uses this
restricted character set.)
3x = Redirect. The parameter is another URL which it should redirect to,
which can be temporary or permanent. If the original URL contains a
fragment part and the target URL does not, then the client SHOULD add the
fragment part of the original URL to the target URL, too (unless the client
somehow knows that the fragment part is not useful). If the number of
consecutive redirects exceeds the limit (which MUST be not more than five by
default, although it may be configurable by the user), then the client MUST
NOT automatically follow any further redirects.
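
The redirect rules above can be sketched as follows. The callable fetch is
a hypothetical stand-in for making one request and returning the status
code and its parameter; the sketch also assumes that redirect targets are
absolute URLs.

```python
MAX_REDIRECTS = 5  # default limit; the user may be allowed to change it


def redirect_target(original_url: str, target_url: str) -> str:
    """Carry the original fragment over when the target has none."""
    _, has_fragment, fragment = original_url.partition("#")
    if has_fragment and "#" not in target_url:
        return target_url + "#" + fragment
    return target_url


def follow_redirects(start_url, fetch, limit=MAX_REDIRECTS):
    """Follow 3x responses until another status arrives or the limit is hit.

    fetch(url) is a hypothetical callable returning (status, parameter).
    """
    url = start_url
    for _ in range(limit + 1):
        status, parameter = fetch(url)
        if not status.startswith("3"):
            return url, status, parameter
        url = redirect_target(url, parameter)
    raise RuntimeError("redirect limit exceeded; ask the user first")
```

Raising an error at the limit leaves the decision to continue with the
user, instead of following further redirects automatically.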
4x = Temporary error. Has two parameters:
* The number of seconds which is recommended to wait before trying again;
it must be an unsigned 31-bit number (encoded as decimal in ASCII, not as
binary), or it can be ? if it is unable to estimate the required time
before trying again. (TODO: Possibly remove this; maybe it is not useful.)
(An implementation may have a maximum amount of time that it is willing to
wait; e.g. one week with a random variation of up to 24 hours either way.
Furthermore, if this error is received multiple times in succession, a
client should increase the amount of waiting time.)
* The error message text (optional).
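
One possible way for a client to pick its waiting time is sketched below.
The doubling policy, the 60-second fallback used for "?", and the name
retry_delay are assumptions of this example; the text above only asks that
the wait increases on repeated errors and allows a client-side maximum.

```python
ONE_WEEK = 7 * 24 * 3600


def retry_delay(seconds_param: str, failures: int,
                fallback: int = 60, cap: int = ONE_WEEK) -> int:
    """Seconds to wait after the failures-th consecutive 4x response."""
    if seconds_param == "?":
        seconds = fallback  # the server could not estimate a time
    else:
        seconds = int(seconds_param)
        if not 0 <= seconds < 2**31:
            raise ValueError("wait time must be an unsigned 31-bit number")
    # Double the wait for each repeated failure, up to the client's maximum.
    return min(cap, seconds * 2 ** max(0, failures - 1))
```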
5x = Permanent error. The parameter is an error message. The request should
not be repeated by automatic tasks, unless such tasks are manually reset by
the operator of the computer that controls such tasks.
6x = Client certificate required. The parameters are listed below. The
below section about client certificates has more details.
* The first parameter specifies what URLs the certificate applies to; see
the below section about client certificates. This is only a hint and is not
a requirement; clients MAY ignore it, and MUST allow the user to override
this specification with their own.
* The second parameter is arbitrary text which should be displayed to the
user, to explain what kind of client certificate is needed and/or why it is
needed, etc.
7x = Ready to receive; used only for the "S" subprotocol. The parameter is
an arbitrary text.
8x = Received data accepted; used only for the "S" subprotocol. The
parameters are the same as for 2x but the data of the file is omitted. If
the file has been deleted then the parameters are omitted.
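
The parameter layouts listed above can be turned into a small status line
parser. This is a sketch only: the table of parameter counts is read off
the list above, the text is assumed to be ASCII, and the name
parse_status_line is hypothetical.

```python
from typing import List, Tuple

# Maximum number of parameters for each major code (first digit).
MAX_PARAMS = {"0": 1, "1": 1, "2": 3, "3": 1, "4": 2,
              "5": 1, "6": 2, "7": 1, "8": 3}


def parse_status_line(line: bytes) -> Tuple[str, List[str]]:
    """Split a status line into its two-digit code and its parameters."""
    if not line.endswith(b"\r\n"):
        raise ValueError("a status line must end with CR LF")
    text = line[:-2].decode("ascii")
    code, sep, rest = text.partition(" ")
    if len(code) != 2 or not code.isdigit() or not sep:
        raise ValueError("expected a two-digit code followed by a space")
    limit = MAX_PARAMS.get(code[0], 1)
    # Only the last parameter may contain spaces, so split at most
    # limit - 1 times.
    params = rest.split(" ", limit - 1) if rest else []
    return code, params
```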
=== Detailed status codes ===
00 = Beginning of arbitrary two-way communication. (This is only valid for
the "I" subprotocol.)
10 = Requires input. (This protocol does not have the 11 code that Gemini
has; putting the password in the query string isn't very good because then
it is a part of the URL and might not be hidden.)
20 = Response is OK.
21 = Response is OK, but it is only a part of the file and not the entire
file; this is used as the response of a range request. The file size
parameter is the entire file size, and not only the requested part.
30 = Temporary redirect.
31 = Permanent redirect. A client may automatically update bookmarks, etc
if this feature has not been disabled by the user.
40 = Temporary error with no more specific code.
41 = Down for maintenance.
42 = A dynamic file has an unexpected temporary error such as time out.
43 = Proxy error; this request requires the server to make another
connection, but it is unable to do so, or is able to connect but cannot
receive a valid response.
44 = Slow down; the client is sending requests too fast and should wait
before trying again.
45 = Temporarily locked file; used with the "S" subprotocol.
50 = Permanent error with no more specific code.
51 = File not found (maybe it is at Area 51).
52 = The file does not exist (probably because it has been deliberately
removed) and is not expected to exist again; clients should remember this
and not request it again automatically. (This code also means a permanently
locked file, if used with the "S" subprotocol.)
53 = A proxied request was refused by the server. A server MAY also use
this status code if a username/password have been provided in the URL but
are not required to access any files on this server (this is to simplify
the server implementation, so that it can check that it is not its own name
and therefore refuse the request). A server MAY also use this status code
if a URL with no scheme has been provided, for the same reason.
54 = Forbidden request. A username and/or password probably won't help.
Other conditions might or might not help, depending on the implementation
(e.g. it might only permit access to LAN addresses or only to 127.0.0.1).
55 = Edit conflict; used with the "S" subprotocol.
56 = A username and/or password are required. Either none have been
provided, or the username and/or password that have been provided are
incorrect. The error message SHOULD NOT distinguish between an unknown
username and an incorrect password for a known username.
59 = Bad request.
60 = A client certificate is required to access this file, but none has
been provided.
61 = The supplied client certificate is not authorized to access this file.
The certificate may be valid to access a different file, though (possibly
but not necessarily on the same server).
62 = The supplied client certificate is not valid (e.g. because it has
expired or because the signature is not valid).
70 = Ready to receive new file. (If a server sends this response but the
file is created before it has been fully received from the client for any
reason, then the server SHOULD NOT overwrite the existing file and SHOULD
instead send a response with an appropriate 4x or 5x status code.)
71 = Ready to receive to replace an existing file.
72 = Ready to receive data for a use other than a new or modified file.
80 = Accepted received data and created a new file.
81 = Accepted received data to modify an existing file.
82 = Accepted received data for a use other than a new or modified file.
=== Client certificates ===
Implementation of client certificates is optional, but it is recommended
to be implemented if TLS is implemented.
If a 6x response is received on a non-TLS connection, then it should change
the scheme from "scorpion:" to "scorpions:" when a client certificate is
available, before retrying the request (it MUST NOT do this automatically
if no client certificate is available).
The first parameter of a 6x response specifies the suggested set of URLs
that the client certificate is applicable to. It has one character followed
by a URL. The URL may be empty to mean the current URL, or can be any URL
that lacks a scheme and authority. The first character can be:
* "=" = Only that exact URL (as well as any that differ only by the
fragment part; the fragment part is always ignored for all of these modes).
* "+" = The specified URL as well as any that have a different query string
or no query string.
* "*" = Discard the query string of the specified URL, and then it means
the current URL as well as any one consisting of the current URL followed
by / or ? and then anything. If the URL already ends with / then it means
that URL followed by anything.
* "-" = An unspecified set of URLs which includes the specified URL.
(If Gemini is implemented as well as Scorpion, then a 6x response from a
Gemini server SHOULD use the "*" URL set hint (with the current URL) by
default, since that is what the Gemini specification says.)
The set of URLs MUST include the URL that has been requested, and MUST NOT
include any URL that differs by scheme, host, and/or port, than the URL
that has been requested. If this is not the case, clients SHOULD treat it
as an unrecognized hint.
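
A client could evaluate the URL set hint roughly as in the sketch below. It
makes several simplifying assumptions that are not stated above: URLs are
already normalized (with a slash after the authority), origins are compared
as plain strings, the hint URL is either empty or an absolute path, and "-"
is treated conservatively as covering only the named URL. The function
names are hypothetical.

```python
from typing import Optional, Tuple


def split_url(url: str) -> Tuple[str, str, Optional[str]]:
    """Return (origin, path, query) with the fragment dropped."""
    url = url.split("#", 1)[0]
    scheme, _, rest = url.partition("://")
    authority, slash, tail = rest.partition("/")
    path, qmark, query = (slash + tail).partition("?")
    return f"{scheme}://{authority}", path or "/", (query if qmark else None)


def hint_covers(hint: str, current_url: str, candidate_url: str) -> bool:
    """Does the hint, issued for current_url, cover candidate_url?"""
    mode, ref = hint[:1], hint[1:]
    origin, cur_path, cur_query = split_url(current_url)
    if ref == "":
        spec_path, spec_query = cur_path, cur_query
    elif ref.startswith("/") and not ref.startswith("//"):
        spec_path, qmark, query = ref.split("#", 1)[0].partition("?")
        spec_query = query if qmark else None
    else:
        raise ValueError("sketch handles only empty or absolute-path URLs")
    cand_origin, path, query = split_url(candidate_url)
    if cand_origin != origin:
        return False  # never crosses scheme, host, or port
    if mode in ("=", "-"):
        # "-" names an unspecified set; cover only the named URL here.
        return path == spec_path and query == spec_query
    if mode == "+":
        return path == spec_path  # any query string, or none
    if mode == "*":
        if spec_path.endswith("/"):
            return path.startswith(spec_path)
        return path == spec_path or path.startswith(spec_path + "/")
    raise ValueError("unrecognized hint; ask the user for the URL set")
```

Whatever the function returns is still only a suggestion; the user remains
free to choose a different set of URLs.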
Clients MUST allow the user to override the specifications above; those
specifications are merely a hint, and are not mandatory to be implemented.
Unrecognized hints shouldn't prevent the user from specifying a client
certificate anyways, but in that case the client MAY require that the user
explicitly specify which set of URLs it applies to; if it does not then
there will be an implementation-dependent default setting.
Client certificates normally apply to all subprotocols used with that URL,
although the user may be allowed to override this. Alternatively, a client
might apply it to all subprotocols if the subprotocol in the request is R,
but limit it to the requested subprotocol if it is S or I.
If the server requires a client certificate and a user name, it should use
a 56 response first, and then 60 once a valid username has been provided.
If the server requires either a client certificate or a user name (of the
client's choice), but does not need both, then it should provide a 56
response if the connection is not using TLS, or 60 if it is using TLS. A
server SHOULD NOT require both a password and a client certificate at the
same time. (TODO: Possibly change this paragraph.)
A client may have any UI its author wishes, but an example of a GUI which
might be used to manage client certificates would be:
URL: ________________________________
Match: (*) Exact (*) File (*) Path
[X] Restrict to current username
Subprotocols: [X] Receive [X] Send [X] Interactive
Duration: (*) Session (*) Permanent (*) ___ hours (*) ___ days
[X] Remember TLS options
<Select...> <Import...> <Create...> <Manage...>
<Set> <Set and go> <Cancel>
(A client may differ from the above in whatever ways its author wishes. A
similar UI may be used for Gemini, except that "Interactive" and
"Restrict to current username" are not applicable for Gemini.)
In the menu for selecting existing certificates, any existing certificate
for the same domain should be easy to find, and should allow changing the
scope of existing certificates. (This is more important for Gemini than it
is for Scorpion, but it would work with Scorpion too.)
A client MUST NOT automatically generate and send a client certificate
without first asking the user.
When creating a new certificate, the default settings for the new
certificate should be made up as follows (although the user might be
able to override them; the ability to override is not mandatory, since
an external program can be used to make up your own certificates if you
do not want to use the ones included in the browser):
* The key type, signature type, number of key bits, etc should match
that of the server's certificate. This is the most likely to be compatible.
* The "extended key usage" (2.5.29.37) extension should be included, and it
should specify "TLS client auth" (1.3.6.1.5.5.7.3.2) as the key usage.
=== Receive subprotocol ===
This is the usual subprotocol, coded as "R". (It is the only subprotocol
which is mandatory to be implemented.)
The subprotocol parameter can be blank or can be a range request. Servers
are not required to support range requests, and can respond with a 59 code
if it is not implemented. (It is also possible that range requests will be
possible only with some files and not with other files.)
A range request consists of two nonnegative integers in decimal notation
with - in between; these are zero-based file offsets, of the first byte to
receive, and of the first byte to not receive (e.g. "3-9" means the six
bytes, being the fourth, fifth, ..., ninth bytes of the file). The end
address can be omitted in which case it means up to the end of the file.
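
As a worked example of the range syntax, the following sketch parses the
subprotocol parameter and applies it to a file held in memory. The function
names are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_range(parameter: str) -> Tuple[int, Optional[int]]:
    """Parse "start-end" or "start-"; offsets are zero-based, end exclusive."""
    match = re.fullmatch(r"([0-9]+)-([0-9]*)", parameter)
    if match is None:
        raise ValueError("a range must look like 3-9 or 3-")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    if end is not None and end < start:
        raise ValueError("the end offset precedes the start offset")
    return start, end


def apply_range(data: bytes, parameter: str) -> bytes:
    """Return the requested part of the file contents."""
    start, end = parse_range(parameter)
    return data[start:end]
```

Because the end offset is exclusive, it maps directly onto a Python slice.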
=== Send subprotocol ===
This subprotocol is coded as "S".
The subprotocol parameter can be omitted, but if not omitted then it is the
version of the file being replaced; this can check for edit conflicts. (The
server MAY require the version to be specified.)
Optionally, the subprotocol parameter can be a HMAC followed by @ and then
the version (which may be empty). The HMAC is of everything that follows
the at sign, including the entire request and everything the client will
send after the server responds the first time.
The response code should not be 2x nor 8x, but can be any other code. If it
is 7x then this means it is ready for the client to upload the file. If it
is 1x or 3x then they modify the URL and mean that the upload is required to
be made to a different URL instead of the one that was initially requested.
For the client to upload the file, it sends a status line, and if the code
is 2x then the line is followed by the data. The possible status codes are:
* 20 = Upload a file. The size is mandatory and you cannot substitute a
question mark instead. The version is optional, and the server may override
it with its own version specification.
* 30 or 31 = Make the file into a redirect to a different file.
* 51 or 52 = Delete a file.
A server does not need to implement all of the above possibilities.
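
What the client sends after a 7x response, in the upload case (code 20),
could be assembled as in this sketch. The name upload_message is
hypothetical, and the file type and version are assumed to be ASCII.

```python
from typing import Optional


def upload_message(data: bytes, file_type: str,
                   version: Optional[str] = None) -> bytes:
    """Status line 20 with a mandatory size, followed by the file data."""
    if " " in file_type:
        raise ValueError("the file type must not contain spaces")
    fields = ["20", str(len(data)), file_type]  # "?" is not allowed as size
    if version is not None:
        fields.append(version)
    return " ".join(fields).encode("ascii") + b"\r\n" + data
```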
After the client finishes sending, the server sends the status line, which
will be a 8x code if successful or a different code if it is an error.
Note that if the client disconnects before sending its own status line or
before sending the amount of data of the file that it said it was going to
send to the server, then the upload is aborted; if the file is locked then
the server can unlock it, etc. (For example, a client might not want to
overwrite an existing file, but the server says 71 and the client wants to
add a new file and not overwrite an existing file, then the client can
disconnect, in which case the files on the server will be unchanged.)
=== Interactive subprotocol ===
This subprotocol is coded as "I", and is used for two-way communication
(usually terminal emulation, although it can be used with any kind of
two-way communication). (The main reason for this is if you have
multiple programs and you want to be able to specify the URL of each one.)
The subprotocol parameter is optional but if present it is the requested
capability codes; see below.
If it is acceptable, then the server sends a 00 response; its parameter is
the actual capability codes or can be blank. After that, it will continue
with ordinary two-way TCP communication. (Other valid status codes are 4x,
5x, and 6x, in which case the server closes the connection after that, like
it does with the other subprotocols. The 2x codes MUST NOT be used.)
If the client disconnects after receiving the 00 response (and possibly any
further data) but without sending any further data to the server, then the
server should not change the state; it should assume that the client had
connected by mistake and did not intend to do so.
Capability codes have no delimiters between them, and each one has the
following format (similar to CSI codes, in order that implementations
can use the same subroutines to parse them if they do terminal emulation):
* Zero or one byte in range 0x3C to 0x3F.
* Zero or more bytes in range 0x30 to 0x39 or 0x3B.
* Zero or one byte in range 0x21 to 0x2F.
* One byte in range 0x40 to 0x7E.
(Note that the protocol before the actual data will be ASCII only, so that
it is possible to use even with terminal emulators that do not implement
this subprotocol.)
The ABNF of the capability codes is:
capabilities = *capability
capability = prefix parameters middle suffix
prefix = %x3C-3F / ""
parameters = *(DIGIT / ";")
middle = %x21-2F / ""
suffix = %x40-7E
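
A parser for a run of capability codes can be sketched with one regular
expression. The sketch follows the byte-level description (each code ends
with exactly one byte in range 0x40 to 0x7E); the name parse_capabilities
and the tuple layout of the result are choices of this example.

```python
import re

# prefix, parameters, middle byte, final byte
CAPABILITY = re.compile(
    rb"([\x3c-\x3f]?)([0-9;]*)([\x21-\x2f]?)([\x40-\x7e])")


def parse_capabilities(codes: bytes) -> list:
    """Split undelimited codes into (prefix, numbers, middle, suffix)."""
    result, pos = [], 0
    while pos < len(codes):
        match = CAPABILITY.match(codes, pos)
        if match is None:
            raise ValueError(f"invalid capability code at offset {pos}")
        prefix, params, middle, suffix = match.groups()
        # An empty field between semicolons is reported as None.
        numbers = ([int(p) if p else None for p in params.split(b";")]
                   if params else [])
        result.append((prefix.decode(), numbers,
                       middle.decode(), suffix.decode()))
        pos = match.end()
    return result
```

For example, the bytes "2;3a1L80;24x" would be read as three codes: "a"
with the numbers 2 and 3, "L" with 1, and "x" with 80 and 24.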
Possible capability codes:
* a = The client uses this if it wishes to send command-line arguments
and/or environment variables to the server. If so, the server will send
back the same capability code, possibly with different numbers, and then
the client sends the command-line arguments and environment variables,
each of which is null-terminated. The numbers are first the number of
command-line arguments, and then optionally a semicolon and then the
number of environment variables. Each environment variable name is not
allowed to contain an equal sign, since the equal sign is used to separate
the name from the value. The server will not send back this capability
code if the requested file cannot use this capability; if it does send
it back, then the numbers might be different than those that the client
requested; the number of arguments/variables that the client sends must
match those specified by the server. The server MUST NOT send this
capability code if the client did not request it.
* L = Line mode. The value is 0 for no local line editing (and no local
echo), or 1 for local line editing so that the text is not sent until an
entire line is entered and the user pushes send. Even in line mode it is
still possible for the server to send data to the client while the user
has partially entered a line of text; the client should handle that by
displaying the user's text entry separately.
* T = Terminal type (not currently defined).
* x = Screen size. The value is two numbers being the number of columns
and the number of rows; in both cases zero can mean unlimited. Optionally
can have a third number; 1 means the client handles pagination.
(Note that it is not required to implement the above capability codes. In
some cases, some or all of them might be not applicable, e.g. if it is not
a terminal emulation.)
=== Hashed URI scheme ===
The "hashed:" URI scheme has the format "hashed:X/Y,Z" where X is the
hash algorithm, Y is the hash (in hexadecimal format), and Z is another
URL (which can be absolute or relative, and can be of any scheme, including
another "hashed:" URL, in case you want to specify multiple hashes which
are using different hashing algorithms).
It refers to the same file as Z if X/Y is that file's hash, or is an error
if that file's hash is not X/Y.
If "[X|Y]" means the absolute URL corresponding to the URL Y treated as
relative to the absolute URL X, then the rules for resolving relative URLs
of this scheme are as follows:
"[hashed:A/B,C|D]" = "[C|D]" if "D" does not start with "#"
"[hashed:A/B,C|#D]" = "hashed:A/B,[C|#D]"
"[A|hashed:B/C,D]" = "hashed:B/C,[A|D]"
In case the notation is confusing: "[http://example.org/files/1.txt|2.txt]"
= "http://example.org/files/2.txt" is a valid equation, and the uppercase
letters in the above are placeholders.
For example, if the current URL is "hashed:0/ab8974,file:///tmp/help.txt"
and you want to access the relative URL "help2.txt", then the new absolute
URL is "file:///tmp/help2.txt" and not
"hashed:0/ab8974,file:///tmp/help2.txt", because "0/ab8974" is the hash of
"help.txt" and not of "help2.txt". However, if the relative URL starts with
# then it is a link to another part of the same file, so it is the same
file and therefore has the same hash and therefore it should not strip out
the hash in this case.
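
The three resolution rules can be sketched as a short recursive function.
Ordinary URLs are resolved here with Python's urllib.parse.urljoin, which
only understands a fixed set of schemes (such as http and file), so a real
client would need its own resolver for scorpion: URLs. The sketch also
assumes the scheme is written in lowercase; the name resolve is
hypothetical.

```python
from urllib.parse import urljoin


def resolve(base: str, ref: str) -> str:
    """Resolve ref against base, following the rules for hashed: URLs."""
    if ref.startswith("hashed:"):
        # [A|hashed:B/C,D] = hashed:B/C,[A|D]
        head, _, inner = ref.partition(",")
        return head + "," + resolve(base, inner)
    if base.startswith("hashed:"):
        head, _, inner = base.partition(",")
        if ref.startswith("#"):
            # Same file, so the hash is kept: hashed:A/B,[C|#D]
            return head + "," + resolve(inner, ref)
        # A different file, so the hash no longer applies: [C|D]
        return resolve(inner, ref)
    return urljoin(base, ref)
```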
The hash algorithms are specified as hexadecimal numbers without a leading
zero, which are multicodec numbers (but not encoded as varint). The hash
values are specified as an even number of hexadecimal digits, which will
include leading zeros if any. List of hash algorithms:
11 SHA-1
12 SHA2-256
13 SHA2-512
14 SHA3-512
15 SHA3-384
16 SHA3-256
17 SHA3-224
d5 MD5
b250 BLAKE2s (128 bits)
b260 BLAKE2s (256 bits)
(Note that some hash algorithms are deprecated because they are insecure.)
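
Checking a file against a "hashed:" URL might look like the following
sketch, which uses Python's hashlib. The function names are hypothetical,
and comparing the hexadecimal digits without regard to case is an
assumption of this example.

```python
import hashlib

# Multicodec number (as written in the URL) -> hash constructor.
HASHES = {
    "11": hashlib.sha1,
    "12": hashlib.sha256,
    "13": hashlib.sha512,
    "14": hashlib.sha3_512,
    "15": hashlib.sha3_384,
    "16": hashlib.sha3_256,
    "17": hashlib.sha3_224,
    "d5": hashlib.md5,
    "b250": lambda data=b"": hashlib.blake2s(data, digest_size=16),
    "b260": lambda data=b"": hashlib.blake2s(data, digest_size=32),
}


def split_hashed(url: str):
    """Return (algorithm, hex_digest, inner_url) from "hashed:X/Y,Z"."""
    if not url.startswith("hashed:"):
        raise ValueError("not a hashed: URL")
    head, comma, inner = url[len("hashed:"):].partition(",")
    algorithm, slash, digest = head.partition("/")
    if not comma or not slash or len(digest) % 2:
        raise ValueError("malformed hashed: URL")
    return algorithm, digest, inner


def matches(url: str, content: bytes) -> bool:
    """True if content has the hash named in the hashed: URL."""
    algorithm, digest, _ = split_hashed(url)
    return HASHES[algorithm.lower()](content).hexdigest() == digest.lower()
```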
(A request of the Scorpion protocol that sends a URL using the hashed:
scheme is considered to be a proxied request (even if it is a URL of a file
on that server), and may be refused. Clients MUST NOT send such requests,
unless the user configured it to be a proxy for the hashed: scheme (the
ability to do so is not required to be implemented, though).)
=== Document file format ===
The file format consists of a sequence of blocks, each of which has the
format (there is no global header, delimiters, etc):
* One byte being the block type and character encoding.
* Big-endian 16-bit attribute length.
* Attribute data.
* Big-endian 24-bit body length.
* Body data.
The block types are:
* 0x00 = Normal paragraph. The attribute is unused and MUST be empty. The
body is the text of the paragraph.
* 0x01 to 0x06 = Heading levels 1 (outermost) to 6 (innermost). The
attribute is the part after # in the URL to refer to this section (empty
if it cannot be referred to by the URL), and the body is the heading text.
* 0x08 = Normal hyperlink. The attribute is the URL (in ASCII encoding)
and the body is the link text. The URL can be relative or absolute. If the
attribute is empty then it means the same as the current URL (which isn't
very useful for type 0x08, but may be useful with types 0x09 and 0x0A). If
the attribute contains a null character, then only the part before the null
character is the URL, and the null character itself and anything afterward
will be ignored. Other control characters are not allowed in the URL.
* 0x09 = Hyperlink requesting input. Like 0x08 but it is treated like a
10 status code (with an implementation-defined prompt; it may be the same
as the text of the link) without making the request. This link type is not
to be used for gopher links (if it occurs anyways, a client SHOULD treat
it as a normal hyperlink but with type 7 instead of 1; however, authors
should be aware that a client might incorrectly use a question mark instead
of a tab if this block is used for gopher links).
* 0x0A = Interactive hyperlink. Like 0x08 but with the "I" subprotocol.
Implementation is optional. (Some implementations may wish to use an
external program which is an existing terminal emulator, if they can
add initial input.)
* 0x0B = Alternate service (e.g. mirrors, etc) for the previous block
(which MUST be a link block; if it is also 0x0B then it is an additional
alternate service), or, if there is no previous block, the current file.
The attribute is the URL of the alternate service. The body is not normally
used, but may contain text explaining the alternate service. Clients SHOULD
normally hide this block, although it might have a way to display some kind
of "alternate service" menu, to have an option to display them, to have an
option to automatically select for load balancing, etc. (This is similar
than the "+" type in Gopher menus.)
* 0x0C = Blockquote. The attribute is unused and MUST be empty. The body
is the text of the paragraph.
* 0x0D = Preformatted text. Valid control codes are tab and line feed. The
attribute SHOULD be blank; see below for its meaning (although a client is
allowed to ignore the attribute). The client MUST display this text with a
fixpitch font.
* 0x0F = This block is used for optional metadata such as a digital
signature of the rest of the document. Clients that do not understand it
will ignore this block.
The possible character encodings are:
* 0x00 = TRON-8 (left to right)
* 0x10 = PC (left to right)
* 0x80 = TRON-8 (right to left)
The control codes are:
* 0x02 = Whatever comes before it is some kind of section number or item
number or a bullet indicating a list item. (This may also be used to
separate the word from the definition in a definition list.)
* 0x05 = Followed by one or more bytes indicating a type of contents (e.g.
a word or phrase in a foreign language compared with the surrounding text,
or a measurement of a specific type (length, mass, etc)), and then 0x06
and then the text and then 0x07. If not implemented (or if this feature is
disabled by the user; and it SHOULD be disabled by default), then the client
MUST skip up to the next 0x06 byte. You cannot include any other control
characters in the data part. A data/text sub-block cannot be inside of
either part of a furigana sub-block.
* 0x06 = Separates the data part (before) and the text part (after) of a
data/text sub-block. You are not allowed to nest data/text sub-blocks.
* 0x07 = Ends a data/text sub-block. The text part can contain other
control characters including furigana, but any changes to the formatting
are required to be reset before this 0x07 code.
* 0x09 = Tab; only in a preformatted block. This should not be used if
exact spacing is required, since the way that it is displayed is
implementation-dependent (and possibly configurable by the user).
* 0x0A = Line break; only in a preformatted block.
* 0x10 = Only with PC character code; followed by one byte in range 0x41
to 0x5F, from which you must subtract 0x40 to get the code of the graphic
character to display. This is allowed in preformatted blocks as well
as in other blocks.
* 0x11 = Normal style.
* 0x12 = Strong style.
* 0x13 = Emphasis style.
* 0x14 = Fixpitch style. This style MUST be displayed by fixpitch fonts
(but it is acceptable to display everything by fixpitch fonts, which
would mean that a special handling is not required).
* 0x15 = Forward text direction.
* 0x16 = Reverse text direction.
* 0x17 = Begin the main text of furigana. This should be followed by
the main text, then 0x18, then the furigana text, and then 0x19.
* 0x18 = Begin the furigana text of the furigana. If furigana is not
implemented (or if the user disabled it), then it should display the
main text of a furigana block but should not display the furigana text.
(Alternatively, it might have an option which causes the furigana text
to be displayed in parentheses or some other kind of delimiters, which
would effectively make 0x18 and 0x19 aliases for graphic characters.)
* 0x19 = End of the furigana. You are not supposed to nest any other
control codes inside of the furigana blocks (although 0x10 is allowed,
if this block uses the PC character code; however, furigana probably
would not be common when using PC character codes).
* 0x1B = Used for SGR codes. The next byte MUST be 0x5B, and then zero
or more bytes in range 0x30 to 0x3B except 0x3A, and then one byte
which is 0x6D. This is allowed only in preformatted blocks, although its
use is discouraged. Clients should skip over the SGR code entirely, but
MAY have an option to interpret them.
It is not required to implement most of the control codes, except as
specified above.
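As an illustration of the fallback behaviour described above for clients
that do not implement data/text sub-blocks, furigana, or SGR codes, here is
a minimal sketch in Python. The function name strip_optional_codes is
hypothetical. It assumes that control code bytes never occur inside a
multi-byte character of the block's encoding, and it leaves all other
control codes (styles, text direction, etc) in place for the caller.

```python
def strip_optional_codes(body: bytes) -> bytes:
    """Drop data parts, furigana text, and SGR codes from a block body."""
    out = bytearray()
    i = 0
    n = len(body)
    while i < n:
        b = body[i]
        if b == 0x05:
            # Data/text sub-block: skip up to and including the next 0x06.
            j = body.find(b"\x06", i + 1)
            i = n if j < 0 else j + 1
        elif b == 0x07 or b == 0x17:
            # End of a data/text sub-block, or start of the furigana main
            # text: drop the marker itself and keep the text around it.
            i += 1
        elif b == 0x18:
            # Furigana text: skip up to and including the closing 0x19.
            j = body.find(b"\x19", i + 1)
            i = n if j < 0 else j + 1
        elif b == 0x1B:
            # SGR code: 0x1B 0x5B, bytes 0x30..0x3B except 0x3A, then 0x6D.
            j = i + 1
            if j < n and body[j] == 0x5B:
                j += 1
                while j < n and 0x30 <= body[j] <= 0x3B and body[j] != 0x3A:
                    j += 1
                if j < n and body[j] == 0x6D:
                    j += 1
            i = j
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

A client that displays furigana in parentheses instead would replace 0x18
and 0x19 with delimiters rather than skipping the text between them.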
Stateful encodings MUST shift the state at the beginning of each block.
It is also required after 0x18 or 0x19 if the state before such a code
does not match the state before the furigana block, and similarly also
for 0x06 and 0x07. Any document which does not satisfy these criteria may
result in an unreliable display on some clients.
If the attribute of a preformatted block is not empty, then a client
program MAY be able to use it to implement syntax highlighting, equations,
simple diagrams, etc. The client MUST have an option to ignore the attribute
if the user wants to display all preformatted blocks as plain text, and MUST
treat unrecognized attributes the same as a blank attribute. Authors should write
the document with the expectation of the client not recognizing it. Clients
MAY display the attribute text of preformatted blocks.
(If data tables are required, you can link to a separate file that
contains the data; you cannot have inline data tables.)
=== Extended link attributes ===
Extended link attributes are optional, both to specify by the author
and to implement by the client.
* 0x20 = The file size (if known; it can be ? if not known) and file format
of the file that the link refers to, in the same format as the 2x response
code. This is not valid for block type 0x0A.
* 0x49 = A hint for the capability string to use with interactive mode.
This is only valid for block type 0x0A.
* 0x72 = Relation types. The value is any number of bytes; bytes 0x01 to
0x7F are the relation type from this file to the target file, and bytes
0x81 to 0xFF are the same relation types but in reverse (from the target
file to this file). The browser should not prevent the display of links or
automatically handle links due to the relation types; however, features
such as shortcut keys, queries, and user-defined styles might be affected
by these relation types.
Valid relation types are:
0x01 = Next page
0x02 = Citation
0x03 = Cross-reference
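The following Python sketch shows one way a client might decode the value
of a 0x72 attribute. The function name decode_relations and the name table
are hypothetical. It assumes that a reverse relation byte maps to its
forward type by subtracting 0x80 (so 0x81 is the reverse of 0x01), and that
the bytes 0x00 and 0x80, which have no stated meaning, are skipped.

```python
# Relation types listed in this section; other codes are reported as unknown.
RELATION_NAMES = {0x01: "next page", 0x02: "citation", 0x03: "cross-reference"}

def decode_relations(value: bytes) -> list[tuple[str, bool]]:
    """Return (relation name, is_reverse) for each byte of the value."""
    result = []
    for b in value:
        if 0x01 <= b <= 0x7F:
            code, reverse = b, False
        elif 0x81 <= b <= 0xFF:
            # Assumption: reverse relations use the forward code plus 0x80.
            code, reverse = b - 0x80, True
        else:
            continue  # 0x00 and 0x80 have no stated meaning
        result.append((RELATION_NAMES.get(code, "unknown"), reverse))
    return result
```

For example, the value bytes 0x01 0x82 would decode to "next page" (from
this file to the target) and "citation" (from the target to this file).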
=== Metadata blocks ===
(These might be changed in future)
The first byte of the attribute of a type 0x0F block (a metadata block)
identifies the type of metadata in this block.
* 0x00 to 0x3F = (Reserved)
* 0x40 to 0x7F = (Reserved)
* 0x80 to 0xBF = Applies to everything up to the next metadata block of
the same type.
* 0xC0 to 0xFF = Applies to the entire document.
The meaning of the rest of the attribute, and of the body, depends on the
type; their meaning is described below. The character set is used only if
the body has a meaning and is otherwise usually not meaningful, but
exceptions will be noted in the below specifications.
The possible metadata types are:
* 0x80 = The language of the text of further blocks. The attribute
specifies the language. The body may be empty; if it is not empty then
it contains the name of the language (and should not contain any control
characters).
* 0x81 = Type of article of further blocks. The rest of the attribute is
one byte as follows: 0x00=undefined, 0x41=article, 0x4E=navigation.
* 0xC0 = Modification date/time. The rest of the attribute will be the
big-endian signed 64-bit number of seconds past January 1, 1985, 00:00:00,
UTC, excluding leap seconds. The body is not used.
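As a sketch of how the type ranges and the 0xC0 modification date/time
might be decoded, consider the following Python code. The function names
and the scope labels are hypothetical; the epoch and the big-endian signed
64-bit layout are taken from the descriptions above.

```python
from datetime import datetime, timedelta, timezone

# Epoch used by metadata type 0xC0: January 1, 1985, 00:00:00 UTC.
EPOCH_1985 = datetime(1985, 1, 1, tzinfo=timezone.utc)

def metadata_scope(type_byte: int) -> str:
    """Classify a metadata type byte by the ranges listed above."""
    if 0xC0 <= type_byte <= 0xFF:
        return "entire document"
    if 0x80 <= type_byte <= 0xBF:
        return "until next block of same type"
    return "reserved"

def decode_modification_time(attribute: bytes) -> datetime:
    """Decode the attribute of a 0xC0 metadata block."""
    if len(attribute) != 9 or attribute[0] != 0xC0:
        raise ValueError("expected 0xC0 followed by eight bytes")
    seconds = int.from_bytes(attribute[1:], "big", signed=True)
    # timedelta arithmetic does not count leap seconds, which matches
    # the "excluding leap seconds" rule.
    return EPOCH_1985 + timedelta(seconds=seconds)
```

Very large or very small values are outside the range that datetime can
represent and would raise OverflowError in this sketch.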
Metadata blocks are not supposed to affect the display of the document
except for the display of the metadata itself (if metadata display is
enabled). However, there are cases where it may affect features such as
multilingual search (if the user has specified that an inexact match is
satisfactory), speech synthesis, etc.
=== Data/text sub-blocks ===
The first byte of the data part of a data/text sub-block indicates the
type; the rest of the bytes are the parameter (which may be empty for
some types). The possible types are:
* 0x30 to 0x37 = Languages and/or pronunciation. Bit2 means that the
language is specified. The low 2 bits mean: 0=none, 1=phonetic, 2=phonemic.
If the language is present, 0x20 separates the language code from the
pronunciation code. (The codes 0x33 and 0x37 are not currently meaningful.)
(TODO: Specify how the languages and pronouncing are encoded.)
(TODO: this should include: languages, pronouncing (which may be combined
with languages or used independently), date/time, and SI units. Possibly
also other things, but also possibly not.)
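Although the encoding of the codes themselves is not yet specified, the
flag bits of types 0x30 to 0x37 can already be decoded. The Python sketch
below is only an illustration; the function name is hypothetical, and the
codes "en" and "ipa" in the usage note are placeholders, not real values.

```python
PRONUNCIATION = {0: "none", 1: "phonetic", 2: "phonemic"}

def parse_language_data(data: bytes):
    """Split the data part of a 0x30..0x37 data/text sub-block.

    Returns (has_language, pronunciation, language_code, pronunciation_code).
    pronunciation is None for the low bits value 3, which is not meaningful.
    """
    if not data or not 0x30 <= data[0] <= 0x37:
        raise ValueError("not a language/pronunciation sub-block")
    type_byte, param = data[0], data[1:]
    has_language = bool(type_byte & 0x04)          # bit2
    pronunciation = PRONUNCIATION.get(type_byte & 0x03)
    if has_language:
        # 0x20 separates the language code from the pronunciation code.
        language, _, pron_code = param.partition(b"\x20")
    else:
        language, pron_code = b"", param
    return has_language, pronunciation, language, pron_code
```

For instance, parse_language_data(b"\x35en ipa") would report that a
language is present, that the pronunciation is phonetic, and would split
the parameter into b"en" and b"ipa".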
=== Languages ===
The language specifications are made up according to the criteria:
* If one code is a prefix of another then the shorter one is considered
to be a more general specification that includes the longer one (e.g.
English vs Canadian English). (This improves simplicity of implementation,
since an implementation will not need to have complicated rules to figure
out which language is meant.)
* Both written languages (for e.g. text documents) and spoken languages
(for e.g. audio files) should be considered. Although in many cases the
same codes can be used, they should still be considered separately since
the requirements may be different in each case. Note that some languages
may be purely written languages (e.g. Blissymbols).
(TODO: Possibly use a variant of ISO 639 or Glottolog or something else.)
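The prefix rule above makes the test for "is this language included in
that one" a single comparison. A minimal Python sketch follows; the
function name and the codes used in the usage note are hypothetical, since
the actual codes are not yet specified.

```python
def language_includes(general: bytes, specific: bytes) -> bool:
    """True if `general` is the same as, or a prefix of, `specific`."""
    return specific.startswith(general)
```

With placeholder codes, language_includes(b"en", b"enCA") would be true,
while language_includes(b"enCA", b"en") would be false.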
=== Automated crawling ===
(This section is currently a draft and may be changed in future. There are
some disputes about some of the below, so it is likely to be changed.)
Please note that none of this section is actually enforceable, nor is it
intended to be. It is intended as a set of guidelines for bots, which are
likely to be implemented correctly when they are implemented. It is a
similar idea to "robots.txt", but is meant to be less ambiguous.
The recommendation for automated crawling/indexing is described here. This
specification applies to recursive crawlers that automatically download
files, especially if they use recurring intervals, and to public search
engines and mirrors that work automatically. It does not apply to users
that manually download files or that only download a single list of files
once, nor does it apply to proxies, gateway services, etc (but see below
about proxies that are themselves available to be crawled).
Note that this does not prohibit anyone from making links to any files
regardless of whether or not crawling is allowed.
A crawler should first try to download the file named "/.special/crawl" to
find the policy set up by the server administrator (it should not do this
more than once per crawling interval). Depending on the status code returned
by the server:
* 2x = Read and parse the file according to the below specifications.
* 4x = Do not access the server for at least the specified amount of time,
possibly plus some random number. After that, the crawler MAY try again,
and will again try to download the /.special/crawl file.
* 5x = No crawling policy is available. (The behaviour of a crawler in such
a case is not specified by this document.)
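The status handling above could be sketched in Python as follows. The
function name, the action labels, and the max_jitter parameter are
hypothetical; the sketch assumes that status codes are two-digit decimal
numbers and that the caller has already extracted the waiting time from a
4x response.

```python
import random

def policy_action(status: int, wait_seconds: float = 0.0,
                  max_jitter: float = 60.0) -> tuple[str, float]:
    """Map the status of a /.special/crawl request to (action, delay)."""
    kind = status // 10
    if kind == 2:
        return ("parse", 0.0)
    if kind == 4:
        # Wait at least the specified time, plus some random amount.
        return ("wait", wait_seconds + random.uniform(0.0, max_jitter))
    if kind == 5:
        return ("no-policy", 0.0)
    # Other status classes are not covered by the list above.
    return ("unspecified", 0.0)
```

After a "wait" action has elapsed, the crawler would request the crawling
policy file again before accessing anything else on that server.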
Note that you cannot assume that the crawling policy file will not be
changed in future. If you start over the crawling then you should try to
download the crawling policy file again.
The format of the file is lines ending with line feeds; each line starts
with a one-byte command code, followed by the parameter. If the command code
has bit5 set then the crawler SHOULD skip that line if it is not understood.
If it has bit5 clear then the crawler MUST treat the entire file as not
understood and should not proceed with crawling (and it might abort with an
error message in this case, explaining what the problem is).
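A minimal Python sketch of splitting the file into commands and testing
bit5 is shown below. The function names are hypothetical, and skipping
empty lines is an assumption, since this document does not say how they
are handled. Bit5 has the value 0x20, so in ASCII the lowercase command
letters and "`" have it set, while uppercase letters and "@" have it clear.

```python
def parse_policy(data: bytes) -> list[tuple[str, str]]:
    """Split a crawling policy file into (command, parameter) pairs."""
    pairs = []
    for line in data.decode("latin-1").split("\n"):
        if line:  # assumption: empty lines carry no command and are skipped
            pairs.append((line[0], line[1:]))
    return pairs

def may_skip(command: str) -> bool:
    """True if a line with this command may be skipped when not understood."""
    return bool(ord(command) & 0x20)  # bit5 of the command code
```

The bit5 test should only be applied to lines that are effective for the
crawler in question, since ineffective lines are ignored regardless.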
Each crawler also has zero or more names, which are sequences of printable
ASCII characters.
The commands are:
* "`" = A comment that has no meaning. It may contain information which is
useful for users, search engine operators, mirror operators, etc, such as
downloading an archive file that contains all of the data.
* "@" = The parameter is the name of the crawler. Any lines preceding the
first line with @ are effective, and anything from the first @ line
matching the crawler's name up to the next line with @ is effective; all
other lines are ineffective. Ineffective lines MUST be ignored even if
bit5 of the command code is clear.
* "i" = Suggests indexing the specified prefix.
* "v" = Suggests not indexing the specified prefix.
* "d" = Means that the files with the specified prefix are probably dynamic
so it might not be useful to mirror them.
* "c" = Allows crawling the specified prefix.
* "C" = Disallows crawling the specified prefix.
* "n" = Estimated number of files to download.
* "t" = Estimated total size of files to download.
* "N" = Maximum number of files to download.
* "P" = Maximum number of simultaneous downloads.
* "D" = Minimum delay (in seconds) after downloading one file before
proceeding with the next one.
* "R" = Minimum delay (in seconds) after starting to download the first
file before starting over from the beginning.
* "w" = Suggested time (in seconds) to wait before downloading the crawling
policy file again. Note that the crawling policy file still counts as a
file for the purpose of the D and R commands, too.
* "a" = (This command is intended to be used for archiving, but the
specification of the archiving has not been written yet.)
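The rule for the "@" command, which decides the lines that are effective
for a given crawler, can be sketched in Python as below. The function name
and the (command, parameter) pair representation are hypothetical. The
empty string is always added to the set of names, since every crawler has
it as one of its names.

```python
def effective_lines(lines, names):
    """Return the (command, parameter) pairs that are effective.

    Lines before the first "@" are effective, as are the lines from the
    first "@" that matches one of the crawler's names up to the next "@".
    """
    names = set(names) | {""}
    result = []
    state = "preamble"  # then "matched", "unmatched", or "done"
    for command, parameter in lines:
        if command == "@":
            if state == "matched":
                state = "done"  # the first matching section has ended
            elif state != "done":
                state = "matched" if parameter in names else "unmatched"
            continue
        if state in ("preamble", "matched"):
            result.append((command, parameter))
    return result
```

A crawler named "bot" reading the lines D5, @bot, c/a, @other, c/b would
therefore see only D5 and c/a as effective.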
Numbers are given in decimal notation, and are always nonnegative integers,
using only digits 0 to 9. URL prefixes start with / and are relative to the
root directory of the server; a prefix matches all URLs that it is a prefix
of (including that URL itself).
Once a command is found that matches the URL being accessed, the crawler
should ignore all further commands for the purpose of accessing that URL.
However, it still must keep those commands in memory or on disk so that it
can refer to them again later for another access. (It is also possible to
work in an alternative way, by converting the data into an internal format
on the client that can more efficiently determine the access policy, as
long as the behaviour matches that described here.) If no command is found
that matches the URL that the crawler wants to access, then assume that an
implicit "C/" follows (meaning it is disallowed).
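The first-match rule with the implicit "C/" could look like the following
Python sketch. The function name is hypothetical, and the sketch assumes
that only the "c" and "C" commands take part in the decision of whether
crawling is allowed; this document does not state whether a matching "i",
"v", or "d" line also ends the search.

```python
def crawl_allowed(commands, path: str) -> bool:
    """Decide whether `path` may be crawled, given the effective commands.

    `commands` is a sequence of (command, parameter) pairs in file order.
    """
    for command, prefix in commands:
        if command in ("c", "C") and path.startswith(prefix):
            return command == "c"  # first matching command wins
    return False  # implicit "C/" at the end: disallowed
```

Because the first match wins, a more specific "C" line has to come before
a more general "c" line in order to carve out an exception.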
A crawler may have a name with <> around it (in addition to its other
names) if it has the purposes described below:
* "<MIRROR>" = Mirrors and backups.
* "<SEARCH>" = Public search engines and indexing. This also applies to
proxies which do not themselves have a policy to prohibit indexing.
* "<STUDY>" = Programs that are intended to study statistical properties
such as number of files, average file sizes, broken links, etc.
All crawlers MUST have an empty string as one of their names.
If a proxy service is available to be crawled/indexed, then the proxy
service should also check the above policies, and either refuse to proxy,
or set up its own policy which prohibits access to the proxied files
by automated crawlers (either conditionally (according to which Scorpion
server is being accessed) or unconditionally (for all proxied files,
regardless of which server it is accessing through the proxy)).
Clients MAY add a query string when requesting the crawling policy file
which identifies the crawler. (This is only for the crawling policy file;
it is not supposed to do that for any other file. Also, the identification
does not necessarily match the crawler's name as described above.)
Crawlers that receive a 41 response when downloading any file other than
the crawling policy file SHOULD try to download the crawling policy file
again after waiting for the minimum time specified in the response
(whether or not it has previously downloaded the crawling policy file
successfully in the past). If it receives such a response too many times,
then it SHOULD stop for a longer amount of time than specified.
The crawling policy file is not allowed to be retroactive.
=== Conversion ===
This part of the specification is optional.
Conversion between file formats is possible, and can be specified by the
file called "/.special/conversion". Clients MUST NOT try to download this
file unless the user explicitly commands the computer to do so; a client is
not supposed to do so merely because it found a file that it does not know
how to handle.
(Client software also MUST allow the user to override them, and to remove
any such files that have already been downloaded.)
Furthermore, any client that is able to understand it MUST NOT require that
it comes from the same server as the file; it can also be a local file
which has been written by the end user, a file from another server (or the
same server but a different path) etc, and once a conversion has been
enabled by the user then the same conversion might be used with other