Multilingualization of w3m
                                                              2003/03/08
                                                              H. Sakamoto

Introduction

  I have worked on the multilingualization of w3m (w3m-m17n).
  The patch for w3m-0.4.1 is available at the following site.

    http://www2u.biglobe.ne.jp/~hsaka/w3m/index.html#m17n
                                          patch/w3m-0.4.1-m17n-20030308.tar.gz
                                          patch/README.m17n

  This is a development version, and it has not been tested enough because
  I understand only Japanese. Please use it, test it, and report bugs.

  At present, w3m-m17n provides the following features.

Supported encoding schemes (character sets)

  * Japanese
      EUC-JP           - US_ASCII, JIS X 0208, JIS X 0201, JIS X 0212
      (EUC-JISX0213)     (JIS X 0213)
      ISO-2022-JP      - US_ASCII, JIS X 0208, JIS X 0201, JIS X 0212, etc.
      ISO-2022-JP-2    - US_ASCII, JIS X 0208, JIS X 0201, JIS X 0212,
                         GB 2312, KS X 1001, ISO 8859-1, ISO 8859-7, etc.
      ISO-2022-JP-3    - US_ASCII, JIS X 0208, JIS X 0201, JIS X 0213, etc.
      Shift_JIS(CP932) - US_ASCII, JIS X 0208, JIS X 0201, CP932 extension
      Shift_JISX0213   - US_ASCII, JIS X 0208, JIS X 0201, JIS X 0213
  * Chinese (simplified)
      EUC-CN(GB2312) - US_ASCII, GB 2312
      ISO-2022-CN    - US_ASCII, GB 2312, CNS-11643-1,..7, etc.
      GBK(CP936)     - US_ASCII, GB 2312, GBK
      GB18030        - US_ASCII, GB 2312, GBK, GB18030, Unicode
      HZ-GB-2312     - US_ASCII, GB 2312
  * Chinese (Taiwan, traditional)
      EUC-TW        - US_ASCII, CNS 11643-1,..16
      ISO-2022-CN   - US_ASCII, CNS-11643-1,..7, GB 2312, etc.
      Big5          - Big5
      HKSCS         - Big5, HKSCS
  * Korean
      EUC-KR        - US_ASCII, KS X 1001 Wansung
      ISO-2022-KR   - US_ASCII, KS X 1001 Wansung, etc.
      Johab         - US_ASCII, KS X 1001 Johab
      UHC(CP949)    - US_ASCII, KS X 1001 Wansung, UHC
  * Vietnamese
      TCVN-5712 VN-1, VISCII 1.1, VPS, CP1258
  * Thai
      TIS-620 (ISO-8859-11), CP874
  * Other
      US_ASCII, ISO-8859-1 to 10, 13 to 15,
      KOI8-R, KOI8-U, NeXT, CP437, CP737, CP775, CP850, CP852, CP855, CP856,
      CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP869, CP1006,
      CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257
  * Unicode (UCS-4)
      UTF-8, UTF-7

  NOTE:
    * The left half of JIS X 0201 and GB 1988 (Chinese ASCII) are both
      treated as US_ASCII because they are used in the tags of HTML documents.
      Other variants of US_ASCII are handled without change.
    * JIS C 6226(old JIS) is treated as JIS X 0208.
    * The sequence '~\n' of HZ is not supported.
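
  For readers unfamiliar with HZ, the following Python sketch shows the
  framing that HZ-GB-2312 uses and what the unsupported '~\n' sequence
  looks like. It uses Python's built-in "hz" codec, not w3m's converter.
  The helper name has_hz_line_continuation is made up for illustration,
  and its check is deliberately naive.

```python
# Illustration only: Python's built-in "hz" codec, not w3m's converter.

def has_hz_line_continuation(data: bytes) -> bool:
    """Naively report whether HZ data contains the '~' + newline sequence.

    In HZ, '~' followed by a newline is a line-continuation marker.
    Hypothetical helper: it does not track the ASCII/GB mode.
    """
    return b"~\n" in data


text = "中文 ok"
encoded = text.encode("hz")
# GB 2312 text is wrapped in "~{" ... "~}"; ASCII passes through unchanged.
print(encoded)

wrapped = b"abc~\ndef"
print(has_hz_line_continuation(encoded))  # sample has no continuation
print(has_hz_line_continuation(wrapped))  # contains '~' + newline
```

  A document that contains the '~' + newline sequence may need to be
  unfolded by other means before it is given to w3m-m17n.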

Display

  There are three methods for multilingual display.

  (1) kterm + ISO-2022-JP/CN/KR

    * kterm can handle JIS X 0213 and CNS 11643 if the following patch
      is applied.
        http://www.st.rim.or.jp/~hanataka/kterm-6.2.0.ext02.patch.gz

    * Specify the fontList for kterm with the -fl option or in ~/.Xdefaults.
    
        -fl "*--16-*-jisx0213.2000-*,\
             *--16-*-jisx0212.1990-0,\
             *--16-*-ksc5601.1987-0,\
             *--16-*-gb2312.1980-0,\
             *--16-*-cns11643.1992-*,\
             *--16-*-iso8859-*"

      JIS X 0213 fonts are available at
        http://www.mars.sphere.ne.jp/imamura/jisx0213.html

    * Set "display_charset" to ISO-2022-JP (or ISO-2022-JP-2, KR, CN)
      and "strict_iso2022" to OFF on the option panel (see below).

  (2) xterm + UTF-8

    * Use the xterm (xterm-140 or later) from XFree86.
        http://www.clark.net/pub/dickey/xterm/xterm.html

    * Unicode fonts are available at
        http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html
        http://openlab.ring.gr.jp/efont/index.html.en

    * Use xterm with the -u8 option.
      The fonts can be specified as follows, for example:
        -fn "*-medium-*--13-*-iso10646-1" \
        -fb "*-bold-*--13-*-iso10646-1" \
        -fw "*-medium-*-ja-13-*-iso10646-1"

    * Set "display_charset" to UTF-8.
      It is also better to set "pre_conv" to ON.

  (3) mlterm + ISO-2022-JP/KR/CN

    * Homepage
        http://mlterm.sourceforge.net/

    * Set the encoding of mlterm to ISO-2022-JP/KR/CN or UTF-8.

    * Set the "display_charset" to ISO-2022-JP/KR/CN or UTF-8.

Command line options

   -I <document charset>
   -O <display/output charset>

        j(p):      ISO-2022-JP
        j(p)2:     ISO-2022-JP-2
        j(p)3:     ISO-2022-JP-3
        cn:        ISO-2022-CN
        kr:        ISO-2022-KR
        e(j):      EUC-JP
        ec,g(b):   EUC-CN(GB2312)
        et:        EUC-TW
        ek:        EUC-KR
        s(jis):    Shift_JIS
        sjisx0213: Shift_JISX0213
        gbk:       GBK
        gb18030:   GB18030
        h(z):      HZ-GB-2312
        b(ig5):    Big5
        hk(scs):   HKSCS
        jo(hab):   Johab
        uhc:       UHC
        l?:        ISO-8859-?
        t(is):     TIS-620(ISO-8859-11)
        tc(vn):    TCVN-5712 VN-1
        v(iscii):  VISCII 1.1
        vp(s):     VPS
        ko(i8r):   KOI8-R
        koi8u:     KOI8-U
        n(ext):    NeXT
        cp???:     CP???
        w12??:     CP12??
        u(tf8):    UTF-8
        u(tf)7:    UTF-7
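
   As a usage sketch, the Python snippet below builds a w3m command line
   that combines -I and -O with short names from the table above and with
   -dump. The helper build_w3m_command and the file name page.html are
   hypothetical. The snippet only prints the command and does not run w3m.

```python
import shlex


def build_w3m_command(path, document_charset="e", output_charset="u"):
    """Build an argument list for w3m using -I and -O (hypothetical helper).

    The defaults are short names from the table above:
    "e" for EUC-JP and "u" for UTF-8.
    """
    return ["w3m", "-I", document_charset, "-O", output_charset,
            "-dump", path]


cmd = build_w3m_command("page.html")
print(shlex.join(cmd))
# To actually run it (requires a w3m binary built with the m17n patch):
#   import subprocess; subprocess.run(cmd, check=True)
```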

Option panel

   display_charset
       Display charset.
   document_charset
       Default document charset.
   auto_detect
       Detect the charset automatically when loading. (Default: ON)
   system_charset
       System charset. It is used for configuration files and file names.
   follow_locale
       System charset follows locale($LANG). (Default: ON)
   ext_halfdump
       Output in the display charset when -halfdump is used.
   search_conv
       Adjust search string for document charset. (Default: ON)
   use_wide
       Use multi-column characters. (Default: ON)
   use_combining
       Use combining characters. (Default: ON)
   use_language_tag
       Use Unicode language tags. (Default: ON)
   ucs_conv
       Charset conversion using Unicode map. (Default: ON)
   pre_conv
       Charset conversion when loading. (Default: OFF)
   fix_width
       Fix the character width when converting. (Default: ON)
       If it is OFF, the rendering may break.
   use_gb12345_map
       Use GB 12345 Unicode map instead of GB 2312's. (Default: OFF)
       If it is ON, GB2312 can be converted to Big5, EUC-TW, or EUC-JP.
   use_jisx0201
       Use JIS X 0201 Roman for ISO-2022-JP. (Default: OFF)
   use_jisc6226
       Use JIS C 6226:1978 for ISO-2022-JP. (Default: OFF)
   use_jisx0201k
       Use JIS X 0201 Katakana. (Default: OFF)
   use_jisx0212
       Use JIS X 0212:1990 (Supplemental Kanji). (Default: OFF)
   use_jisx0213
       Use JIS X 0213:2000 (2000JIS). (Default: OFF)
   strict_iso2022
       Strict ISO-2022-JP/KR/CN. (Default: ON)
       If it is OFF, all ISO 2022 based character sets can be displayed
       with ISO-2022-JP/KR/CN.
   east_asian_width
       Use double width for some Unicode characters. (Default: OFF)
       If it is ON, treat East Asian Ambiguous characters as double width.
   gb18030_as_ucs
       Treat 4-byte characters of GB18030 as Unicode. (Default: OFF)
   simple_preserve_space
       Simple space preservation.
       If it is ON, a space is preserved in Japanese and some other languages.

   alt_entity
       Use alternate expression with ASCII for entities. (Default: ON)
       If it is OFF, entities are treated as ISO 8859-1.
   graphic_char
       Use DEC special graphics for border of table and menu.
       If it is OFF, ruled-line characters are used with CJK charsets or UTF-8.
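
   To illustrate what "use_wide", "use_combining", and "east_asian_width"
   are about, here is a rough Python sketch of a column-width estimate
   based on the Unicode East Asian Width property. It is not w3m's code.
   The function display_width is hypothetical, and its parameter names
   only borrow the option names as labels.

```python
import unicodedata


def display_width(text, east_asian_width=False, use_combining=True):
    """Estimate terminal columns for text (illustrative, not w3m's code)."""
    width = 0
    for ch in text:
        if use_combining and unicodedata.combining(ch):
            continue  # combining marks take no column of their own
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            width += 2  # Wide and Fullwidth: always two columns
        elif eaw == "A" and east_asian_width:
            width += 2  # Ambiguous: two columns only when enabled
        else:
            width += 1
    return width


print(display_width("abc"))
print(display_width("日本語"))
print(display_width("α"), display_width("α", east_asian_width=True))
```

   Greek alpha is an East Asian Ambiguous character, so its estimated
   width changes with the flag, while CJK ideographs are always counted
   as two columns.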

Code conversion

  The following special code conversions are supported.
    * EUC-JP <-> ISO-2022-JP <-> Shift-JIS
    * EUC-CN <-> ISO-2022-CN <-> HZ-GB-2312
    * EUC-TW <-> ISO-2022-CN
    * EUC-KR <-> ISO-2022-KR <-> Johab (only Symbol and Hanja)

  Other conversions are based on Unicode.
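
  The Unicode-based path can be pictured as "decode to Unicode, then
  encode to the target charset". The Python sketch below shows that idea
  with Python's own codecs. It is only an illustration of the approach,
  not w3m's conversion tables, and the function name convert is made up.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert between charsets by pivoting through Unicode."""
    return data.decode(src).encode(dst)


sample = "日本語"
euc = sample.encode("euc_jp")

# EUC-JP -> Big5 and EUC-JP -> GBK, both via Unicode.
big5 = convert(euc, "euc_jp", "big5")
gbk = convert(euc, "euc_jp", "gbk")
print(euc, big5, gbk)

# The same pivot also reaches ISO-2022-JP, which starts with ESC $ B here.
jis = convert(euc, "euc_jp", "iso2022_jp")
print(jis)
```

  A character that has no mapping in the target charset makes the encode
  step fail in this sketch. A real converter has to decide how to
  substitute such characters.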

Change document charset

   Press '=' (show document information), and select the document charset.

   If you specify the following keymaps,
     keymap C CHARSET
     keymap M-c DEFAULT_CHARSET
   you can press `C' to change the current document charset,
   and `M-c' to change the default document charset.

Line editing

  The input coding system follows the display coding system.

  NOTE:
    * HZ cannot be used as the input coding system.
    * Input with ISO-2022-CN or ISO-2022-KR will probably fail, because
      SI(\017) and SO(\016) are already assigned to other command keys
      (SO is assigned to `next-history'). If you want to use SI and SO,
      press C-@(^@). After that, SI, SO, SS2, SS3, LS2, and LS3 of
      7bit ISO-2022 are recognized. When you press C-@ again, the default
      bindings are restored.
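
  The Python sketch below shows why SO and SI matter: text encoded with
  Python's "iso2022_kr" codec contains both control bytes, which are the
  same codes as C-n and C-o on a terminal. It assumes that Python's
  encoder output is representative of what an input method would send.

```python
SO, SI = 0x0E, 0x0F  # octal \016 and \017

# Hangul followed by ASCII, so the encoder has to shift out and back in.
encoded = "한글 ok".encode("iso2022_kr")
print(encoded)
print(SO in encoded, SI in encoded)
```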

Regular expression

   Multilingual regular expressions are supported.

-----------------------------------
Change log

2003/03/08      w3m-0.4.1-m17n-20030308
 * Based on w3m-0.4.1

2003/02/24      w3m-0.4-m17n-20030224
 * Based on w3m-0.4

2003/02/11      w3m-0.4rc1-m17n-20030211
 * Based on w3m-0.4rc1

2003/02/07      w3m-0.3.2.2-m17n-20030207
 * Based on w3m-0.3.2.2+cvs-1.742

2003/02/01      w3m-0.3.2.2-m17n-20030201
 * Based on w3m-0.3.2.2+cvs-1.734

2003/01/31      w3m-0.3.2.2-m17n-20030131
 * Based on w3m-0.3.2.2+cvs-1.732

2003/01/23      w3m-0.3.2.2-m17n-20030123
 * Based on w3m-0.3.2.2+cvs-1.705

2003/01/22      w3m-0.3.2.2-m17n-20030122
 * Based on w3m-0.3.2.2+cvs-1.699

2003/01/01      w3m-0.3.2.2-m17n-20030101
 * Based on w3m-0.3.2.2+cvs-1.655

2002/12/22      w3m-0.3.2.2-m17n-20021222
 * Based on w3m-0.3.2.2+cvs-1.640

2002/12/19      w3m-0.3.2.2-m17n-20021219
 * Based on w3m-0.3.2.2+cvs-1.635

2002/12/07      w3m-0.3.2.2-m17n-20021207
 * Based on w3m-0.3.2.2+cvs-1.599
 * Fixed a problem on systems where int != long

2002/11/27	w3m-0.3.2.1-m17n-20021127
 * Based on w3m-0.3.2.1+cvs-1.562

2002/11/20	w3m-0.3.2-m17n-20021120
 * Based on w3m-0.3.2+cvs-1.538

2002/11/18
 * Added UTF-7 to auto detection of charset.

2002/11/16	w3m-0.3.2-m17n-20021116
 * Based on w3m-0.3.2+cvs-1.526

2002/11/13	w3m-0.3.2-m17n-20021113
 * Based on w3m-0.3.2+cvs-1.506

2002/11/12	w3m-0.3.2-m17n-20021112
 * Based on w3m-0.3.2+cvs-1.498

2002/11/09	w3m-0.3.2-m17n-20021109
 * Based on w3m-0.3.2+cvs-1.490

2002/11/07	w3m-0.3.2-m17n-20021107
 * Based on w3m-0.3.2
 * Applied [w3m-dev 03371]

2002/10/22	w3m-0.3.1-m17n-20021022
 * Based on w3m-0.3.1+cvs-1.444

2002/07/17	w3m-0.3.1-m17n-20020717
 * Based on w3m-0.3.1

2002/05/29	w3m-0.3-m17n-20020529
 * Based on w3m-0.3+cvs-1.379.

2002/03/16	w3m-0.3-m17n-20020316
 * Based on w3m-0.3+cvs-1.353.

2002/03/11	w3m-0.3-m17n-20020311
 * Based on w3m-0.3+cvs-1.342.
 * Some bug fixes.

2002/02/16	w3m-0.2.5-m17n-20020216
 * Based on w3m-0.2.5+cvs-1.319.
 * Added an option "use_wide"

2002/02/05	w3m-0.2.5-m17n-20020205
 * Based on w3m-0.2.5+cvs-1.302.

2002/02/02	w3m-0.2.5-m17n-20020202
 * Based on w3m-0.2.5+cvs-1.291.

2002/01/31	w3m-0.2.4-m17n-20020131
 * Based on w3m-0.2.4+cvs-1.278.

2002/01/29	w3m-0.2.4-m17n-20020129
 * Based on w3m-0.2.4+cvs-1.268.
 * Some bug fixes.

2002/01/28	w3m-0.2.4-m17n-20020128
 * Based on w3m-0.2.4+cvs-1.265.

2002/01/08	w3m-0.2.4-m17n-20020108
 * Based on w3m-0.2.4.

2002/01/07
 * Replaced some wc_conv,wc_Str_conv with wc_conv_strict,wc_Str_conv_strict.

2001/12/31
 * Added the conversion between HKSCS and Unicode.
 * Changed the conversion table between Big5 and Unicode.
 * Deleted the special conversion between Big5 and CNS11643.
 * Fixed HKSCS.

2001/12/30	w3m-0.2.3.2-m17n-20011230
 * Based on w3m-0.2.3.2+cvs-1.196.

2001/12/22	w3m-0.2.3.2-m17n-20011222
 * Based on w3m-0.2.3.2.
 * [w3m-dev-en 00660] can't compile if INET6 is defined
 * [w3m-dev-en 00663] double meanings for WC_N_??? 

2001/12/21	w3m-0.2.3.1-m17n-20011221
 * Based on w3m-0.2.3.1.
 * Support of HKSCS, KOI8-U, UTF-7.
   The conversion table between HKSCS and Unicode is not yet available.
 * Add the conversion between ISO 8859-16 and Unicode.
 * Add option 'ext_halfdump'.

2001/04/14	w3m-(0.2.1)-m17n-0.20
 * Support of UTF-7.
 * [w3m-dev 01913] ([w3m-dev-en 00452])

2001/04/12	w3m-(0.2.1)-m17n-0.19
 * TILDE of JISX0212, JISX0213 -> FULLWIDTH TILDE of Unicode.
 * MICRO SIGN of Unicode -> GREEK SMALL MU of JISX0208.
 * [w3m-dev 01892], [w3m-dev 01894], [w3m-dev 01898], [w3m-dev 01902]

2001/03/31
 * Changed the implementation of <_SYMBOL> again.
 * With the -dump option, "pre_conv" is false by default.

2001/03/29
 * Support combining characters of TCVN 5712.
 * [w3m-dev 01873], [w3m-dev-en 00411].

2001/03/28
 * Setting -suffix="" is now accepted by configure. (thanks to naddy!)
 * Bugfix: when #define USE_SSL and #undef USE_SSL_VERIFY, rc.c
   doesn't compile. (thanks to naddy!)
 * [w3m-dev 01859].
 * Bugfix: 0xA0 is error in Shift-JIS.
 * Changed the implementation of <_SYMBOL> ([w3m-dev 01852]).

2001/03/24	w3m-(0.2.1)-m17n-0.18
 * Based on w3m-0.2.1.
 * [w3m-dev 01703], [w3m-dev 01814], [w3m-dev 01823]
 * Separated ISO-2022-JP-3 from ISO-2022-JP.
 * Improved auto detection.

2001/03/23
 * Based on w3m-0.2.0.

2001/03/21
 * Added functions (CHARSET and DEFAULT_CHARSET).
 * Improved document charset detection of frame HTML.

2001/03/20
 * Conversion from FULL WIDTH variant except ASCII to normal character.

2001/03/18	w3m-(0.1.11-pre-hsaka24)-m17n-0.17
 * Based on "[w3m-dev 01779] w3m-0.1.11-pre-hsaka24".
 * Prefer JIS X 0213 over JIS X 0212.

2001/03/14      w3m-(0.1.11-pre-kokb23)-m17n-0.16
 * Add the conversion between JIS X 0213 and Unicode Extension B.
 * Bugfix: conversion between JIS X 0213 and Unicode.
 * Bugfix: treat UHC as Hangul.
 * Ignore "search_conv" if "pre_conv" is ON.

2001/03/09	w3m-(0.1.11-pre-kokb23)-m17n-0.15
 * Improvement of wc_wchar_t (mainly for Unicode).
 * Some bugfixes for Unicode.
 * Ignore "use_gb12345_map" option when output with GBK or GB18030.
 * With the -dump option, "pre_conv" is always true.
 * With the -dump or -halfdump option, some processing is skipped.
 * Get system charset from the environment variable LC_CTYPE -> LANG -> LC_ALL.
 * Bugfixes: [w3m-dev 01724], [w3m-dev 01726], [w3m-dev 01752],
   [w3m-dev 01753], [w3m-dev 01754]

2001/03/06	w3m-(0.1.11-pre-kokb23)-m17n-0.14
 * Support of Language tag (UTR#7).
 * Bugfix: conversion between GB18030, Johab and Unicode.

2001/03/04	w3m-(0.1.11-pre-kokb23)-m17n-0.13
 * Support of GBK(CP936), GB18030, UHC(CP949) !
 * Unicode mapping table of GB2312 and GB12345 became compatible with
   CP936, GB18030. (Code point: 0xA1A4, 0xA1AA)
 * Allow 0xFFFE and 0xFFFF in Unicode (due to compatibility with GB18030).
 * Bugfix: code point of NBSP in Unicode.

2001/03/03	w3m-(0.1.11-pre-kokb23)-m17n-0.12
 * I wrote English README.m17n.

-------------------------------------------
Hironori Sakamoto <hsaka@mth.biglobe.ne.jp>
 http://www2u.biglobe.ne.jp/~hsaka/