





                     [1mCompound Text Encoding[0m
                           Version 1.1
                      X Consortium Standard
                    X Version 11, Release 6.4
                       Robert W. Scheifler



Copyright (C) 1989 by X Consortium

Permission  is  hereby granted, free of charge, to any person ob-
taining a copy of  this  software  and  associated  documentation
files  (the  ``Software''),  to  deal in the Software without re-
striction, including without limitation the rights to use,  copy,
modify,  merge,  publish,  distribute,  sublicense,  and/or  sell
copies of the Software, and to permit persons to whom  the  Soft-
ware is furnished to do so, subject to the following conditions:

The  above  copyright  notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO  THE  WARRANTIES
OF  MERCHANTABILITY,  FITNESS FOR A PARTICULAR PURPOSE AND NONIN-
FRINGEMENT.  IN NO EVENT SHALL THE X CONSORTIUM BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION  OF  CON-
TRACT,  TORT  OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of the X  Consortium
shall  not  be  used  in  advertising or otherwise to promote the
sale, use or other dealings in this Software without prior  writ-
ten authorization from the X Consortium.



[1m1.  Overview[0m

Compound  Text  is a format for multiple character set data, such
as multi-lingual text.  The format is based on ISO standards  for
encoding and combining character sets.  Compound Text is intended
to be used in three main contexts: inter-client communication us-
ing selections, as defined in the [4mInter-Client[24m [4mCommunication[24m [4mCon-[0m
[4mventions[24m  [4mManual[24m (ICCCM); window properties (e.g., window manager
hints as defined in the ICCCM); and resources (e.g.,  as  defined
in Xlib and the Xt Intrinsics).

Compound  Text  is intended as an external representation, or in-
terchange format, not as an internal representation.  It  is  ex-
pected (but not required) that clients will convert Compound Text
to some internal representation for processing and rendering, and
convert  from  that internal representation to Compound Text when
providing textual data to another client.










                               -2-


[1m2.  Values[0m

The name of this encoding is ``COMPOUND_TEXT''.  When text values
are used in the ICCCM-compliant selection mechanism or are stored
as window properties in the server, the type used should  be  the
atom for ``COMPOUND_TEXT''.

Octet values are represented in this document as two decimal num-
bers in the form col/row.  This means the value (col * 16) + row.
For example, 02/01 means the value 33.

For  our  purposes, the octet encoding space is divided into four
ranges:

     l l.  C0   octets from 00/00 to 01/15 GL   octets from 02/00
     to 07/15 C1   octets from 08/00 to  09/15  GR   octets  from
     10/00 to 15/15

C0  and  C1  are  ``control character'' sets, while GL and GR are
``graphic character'' sets.  Only a subset of C0  and  C1  octets
are  used in the encoding, and depending on the character set en-
coding defined as GL or GR, a subset of GL and GR octets  may  be
used; see below for details.  All octets (00/00 to 15/15) may ap-
pear inside the text of extended segments (defined below).

[For  those familiar with ISO 2022, we will use only an 8-bit en-
vironment, and we will always use G0 for GL and G1 for GR.]

[1m3.  Control Characters[0m

In C0, only the following values will be used:

     l     l     l.      00/09     HT   HORIZONTAL     TABULATION
     00/10     NL   NEW LINE 01/11     ESC  (ESCAPE)

In C1, only the following value will be used:

     l l l.  09/11     CSI  CONTROL SEQUENCE INTRODUCER

[The alternate 7-bit CSI encoding 01/11 05/11 is not used in Com-
pound Text.]

No  control  sequences  are defined in Compound Text for changing
the C0 and C1 sets.

A horizontal tab can be represented with the octet 00/09.  Speci-
fication of tabulation width settings is  not  part  of  Compound
Text  and  must  be obtained from context (in an unspecified man-
ner).

[Inclusion of horizontal tab is for consistency with  the  STRING
type currently defined in the ICCCM.]











                               -3-


A newline (line separator/terminator) can be represented with the
octet 00/10.

[Note  that  00/10 is normally LINEFEED, but is being interpreted
as NEWLINE.  This can be thought of as using the (deprecated) NEW
LINE mode, E.1.3, in ISO 6429.  Use  of  this  value  instead  of
08/05  (NEL,  NEXT  LINE) is for consistency with the STRING type
currently defined in the ICCCM.]

The remaining C0 and C1 values (01/11 and 09/11) are only used in
the control sequences defined below.

[1m4.  Standard Character Set Encodings[0m

The default GL and GR sets in Compound  Text  correspond  to  the
left  and right halves of ISO 8859-1 (Latin 1).  As such, any le-
gal instance of a STRING type (as defined in the ICCCM) is also a
legal instance of type COMPOUND_TEXT.

[The implied initial state in ISO 2022 is defined with the sequence:
 01/11 02/00 04/03  GO and G1 in an 8-bit environment only.  Designation also invokes.
 01/11 02/00 04/07  In an 8-bit environment, C1 represented as 8-bits.
 01/11 02/00 04/09  Graphic character sets can be 94 or 96.
 01/11 02/00 04/11  8-bit code is used.
 01/11 02/08 04/02  Designate ASCII into G0.
 01/11 02/13 04/01  Designate right-hand part of ISO Latin-1 into G1.
]

To define one of the approved standard character set encodings to
be the GL set, one of the following control sequences is used:

     l l.  01/11 02/08 {I} F   94 character set 01/11 02/04 02/08
     {I} F   94N character set

To define one of the approved standard character set encodings to
be the GR set, one of the following control sequences is used:

     l l.  01/11 02/09 {I} F   94 character set 01/11  02/13  {I}
     F   96  character set 01/11 02/04 02/09 {I} F  94N character
     set

The ``F''in the control sequences above stands for ``Final  char-
acter'',  which  is  always  in  the  range  04/00 to 07/14.  The
``{I}'' stands for  zero  or  more  ``intermediate  characters'',
which  are always in the range 02/00 to 02/15, with the first in-
termediate character always in the range  02/01  to  02/03.   The
registration authority has defined an ``{I} F'' sequence for each
registered character set encoding.

[Final  characters  for  private encodings (in the range 03/00 to
03/15) are not permitted here in Compound Text.]

For GL, octet 02/00 is always defined as SPACE, and  octet  07/15
(normally  DELETE) is never used.  For a 94-character set defined









                               -4-


as GR, octets 10/00 and 15/15 are never used.

[This is consistent with ISO 2022.]

A 94N character set uses N octets (N >  1)  for  each  character.
The value of N is derived from the column value for F:

     l l.  column 04 or 05     2 octets column 06 3 octets column
     07 4 or more octets

In  a  94N encoding, the octet values 02/00 and 07/15 (in GL) and
10/00 and 15/15 (in GR) are never used.

[The column definitions come from ISO 2022.]

Once a GL or GR set has been defined, all further octets in  that
range (except within control sequences and extended segments) are
interpreted  with  respect  to that character set encoding, until
the GL or GR set is redefined.  GL and GR sets can be defined in-
dependently, they do not have to be defined in pairs.

Note that when actually using a character set encoding as the  GR
set,  you  must  force  the  most significant bit (08/00) of each
octet to be a one, so that it falls in the range 10/00 to 15/15.

[Control sequences to specify character  set  encoding  revisions
(as in section 6.3.13 of ISO 2022) are not used in Compound Text.
Revision  indicators  do not appear to provide useful information
in the context of Compound Text.  The most  recent  revision  can
always be assumed, since revisions are upward compatible.]

[1m5.  Approved Standard Encodings[0m

The following are the approved standard encodings to be used with
Compound Text.  Note that none have Intermediate characters; how-
ever,  a good parser will still deal with Intermediate characters
in the event that additional encodings are later  added  to  this
list.

     l l l.  _
     [1m{I} F     94/96     Description[0m
     _
     4/02 94   7-bit    ASCII    graphics    (ANSI    X3.4-1968),
               Left half of ISO  8859  sets  04/09     94   Right
     half  of  JIS  X0201-1976 (reaffirmed 1984),           8-Bit
     Alphanumeric-Katakana Code 04/10     94   Left half  of  JIS
     X0201-1976  (reaffirmed 1984),           8-Bit Alphanumeric-
     Katakana Code
     04/01     96   Right half of ISO 8859-1, Latin alphabet  No.
     1  04/02     96   Right  half  of ISO 8859-2, Latin alphabet
     No. 2 04/03     96   Right half of ISO 8859-3, Latin  alpha-
     bet No. 3 04/04     96   Right half of ISO 8859-4, Latin al-
     phabet  No.  4  04/06     96   Right  half  of  ISO  8859-7,
     Latin/Greek  alphabet  04/07     96   Right  half   of   ISO









                               -5-


     8859-6,  Latin/Arabic  alphabet 04/08     96   Right half of
     ISO 8859-8, Latin/Hebrew alphabet 04/12     96   Right  half
     of  ISO 8859-5, Latin/Cyrillic alphabet 04/13     96   Right
     half of ISO 8859-9, Latin alphabet No. 5
     04/01     942  GB2312-1980,      China      (PRC)      Hanzi
     04/02     942  JIS  X0208-1983,  Japanese  Graphic Character
     Set 04/03     942  KS C5601-1987, Korean  Graphic  Character
     Set
     _

The  sets listed as ``Left half of ...'' should always be defined
as GL.  The sets listed as ``Right half of ...'' should always be
defined as GR.  Other sets can be defined either as GL or GR.

[1m6.  Non-Standard Character Set Encodings[0m

Character set encodings that are not  in  the  list  of  approved
standard  encodings  can be included using ``extended segments''.
An extended segment begins with one of the following sequences:

     l l.  01/11 02/05  02/15  03/00  M  L   variable  number  of
     octets  per  character 01/11 02/05 02/15 03/01 M L   1 octet
     per character 01/11 02/05 02/15 03/02  M  L   2  octets  per
     character 01/11 02/05 02/15 03/03 M L   3 octets per charac-
     ter 01/11 02/05 02/15 03/04 M L   4 octets per character

[This uses the ``other coding system'' of ISO 2022, using private
Final characters.]

The ``M'' and ``L'' octets represent a 14-bit unsigned value giv-
ing the number of octets that appear in the remainder of the seg-
ment.   The  number is computed as ((M - 128) * 128) + (L - 128).
The most significant bit M and L are always set to one.  The  re-
mainder  of  the  segment  consists of two parts, the name of the
character set encoding and the actual text.  The name of the  en-
coding  comes  first  and is separated from the text by the octet
00/02 (STX, START OF TEXT).  Note that the length  defined  by  M
and L includes the encoding name and separator.

[The encoding of the length is chosen to avoid having zero octets
in  Compound  Text when possible, because embedded NUL values are
problematic in many C language routines.  The use of zero  octets
cannot  be  ruled  out entirely however, since some octets in the
actual text of the extended segment may have to be zero.]

The name of the encoding should be registered with the X  Consor-
tium  to  avoid  conflicts  and should when appropriate match the
CharSet Registry and Encoding registration used in the X  Logical
Font  Description.   The  name itself should be encoded using ISO
8859-1 (Latin 1), should not use question mark (03/15) or  aster-
isk  (02/10),  and  should  use hyphen (02/13) only in accordance
with the X Logical Font Description.











                               -6-


Extended segments are not to be used for any character set encod-
ing that can be constructed from a GL/GR pair of  approved  stan-
dard  encodings.  For example, it is incorrect to use an extended
segment for any of the ISO 8859 family of encodings.

It should be noted that the contents of an extended  segment  are
arbitrary;  for example, they may contain octets in the C0 and C1
ranges, including 00/00, and octets comprising a given  character
may differ in their most significant bit.

[ISO-registered ``other coding systems'' are not used in Compound
Text;  extended  segments are the only mechanism for non-2022 en-
codings.]

[1m7.  Directionality[0m

If desired, horizontal text direction can be indicated using  the
following control sequences:

     l  l.   09/11  03/01  05/13   begin left-to-right text 09/11
     03/02 05/13     begin right-to-left text 09/11  05/13    end
     of string

[This  is  a subset of the SDS (START DIRECTED STRING) control in
the Draft Bidirectional Addendum to ISO 6429.]

Directionality can be nested.  Logically, a stack  of  directions
is  maintained.  Each of the first two control sequences pushes a
new direction on the stack, and the third sequence (revert)  pops
a  direction  from  the stack.  The stack starts out empty at the
beginning of a Compound Text string.  When the  stack  is  empty,
the directionality of the text is unspecified.

Directionality applies to all subsequent text, whether in GL, GR,
or an extended segment.  If the desired directionality of GL, GR,
or  extended  segments  differs,  then directionality control se-
quences must be inserted when switching between them.

Note that definition of GL and GR sets is independent  of  direc-
tionality;  defining  a new GL or GR set does not change the cur-
rent directionality, and pushing or popping a directionality does
not change the current GL and GR definitions.

Specification of directionality is entirely optional; text direc-
tion should be clear from context in  most  cases.   However,  it
must  be  the  case that either all characters in a Compound Text
string have explicitly specified direction or that all characters
have unspecified direction.  That is, if  directionality  control
sequences  are used, the first such control sequence must precede
the first graphic  character  in  a  Compound  Text  string,  and
graphic  characters are not permitted whenever the directionality
stack is empty.











                               -7-


[1m8.  Resources[0m

To use Compound Text in a resource,  you  can  simply  treat  all
octets  as  if they were ASCII/Latin-1 and just replace all ``\''
octets (05/12) with the two octets  ``\\'',  all  newline  octets
(00/10)  with the two octets ``\n'', and all zero octets with the
four octets ``\000''.  It is up to the client making use  of  the
resource  to  interpret  the data as Compound Text; the policy by
which this is ascertained is not constrained by the Compound Text
specification.

[1m9.  Font Names[0m

The following CharSet names for the standard character set encod-
ings are registered for use in font names  under  the  X  Logical
Font Description:

     l l l.  _
     [1mName Encoding Standard   Description[0m
     _
     ISO8859-1 ISO  8859-1     Latin alphabet No. 1 ISO8859-2 ISO
     8859-2     Latin    alphabet     No.     2     ISO8859-3 ISO
     8859-3     Latin     alphabet     No.     3    ISO8859-4 ISO
     8859-4     Latin    alphabet     No.     4     ISO8859-5 ISO
     8859-5     Latin/Cyrillic       alphabet       ISO8859-6 ISO
     8859-6     Latin/Arabic        alphabet        ISO8859-7 ISO
     8859-7     Latin/Greek         alphabet        ISO8859-8 ISO
     8859-8     Latin/Hebrew        alphabet        ISO8859-9 ISO
     8859-9     Latin   alphabet  No.  5  JISX0201.1976-0     JIS
     X0201-1976 (reaffirmed 1984)    8-bit  Alphanumeric-Katakana
     Code  GB2312.1980-0  GB2312-1980,  GL  encoding China  (PRC)
     Hanzi   JISX0208.1983-0     JIS   X0208-1983,   GL    encod-
     ing   Japanese   Graphic   Character  Set  KSC5601.1987-0 KS
     C5601-1987, GL encoding    Korean Graphic Character Set
     _

[1m10.  Extensions[0m

There is no absolute requirement for a parser to deal  with  any-
thing but the particular encoding syntax defined in this specifi-
cation.   However,  it  is possible that Compound Text may be ex-
tended in the future, and as such it may  be  desirable  to  con-
struct the parser to handle 2022/6429 syntax more generally.

There are two general formats covering all control sequences that
are expected to appear in extensions:

01/11 {I} F

     For  this  format,  I is always in the range 02/00 to 02/15,
     and F is always in the range 03/00 to 07/14.

09/11 {P} {I} F










                               -8-


     For this format, P is always in the range 03/00 to 03/15,  I
     is  always  in  the range 02/00 to 02/15, and F is always in
     the range 04/00 to 07/14.

In addition, new (singleton) control characters (in the C0 and C1
ranges) might be defined in the future.

Finally, new kinds of ``segments'' might be defined in the future
using syntax similar to extended segments:

01/11 02/05 02/15 F M L

     For this format, F is in the range 03/05 to 3/15.  M  and  L
     are  as  defined  in extended segments.  Such a segment will
     always be followed by the number of octets defined by M  and
     L.  These octets can have arbitrary values and need not fol-
     low the internal structure defined for current extended seg-
     ments.

If  extensions  to  this specification are defined in the future,
then any string incorporating instances of such  extensions  must
start with one of the following control sequences:

     l  l.   01/11  02/03 V 03/00 ignoring extensions is OK 01/11
     02/03 V 03/01    ignoring extensions is not OK

In either case, V is in the range 02/00 to  02/15  and  indicates
the  major  version  minus  one  of the specification being used.
These version control sequences are for use by clients  that  im-
plement  earlier versions, but have implemented a general parser.
The first control sequence indicates that it is acceptable to ig-
nore all extension control sequences;  no  mandatory  information
will  be  lost in the process.  The second control sequence indi-
cates that it is unacceptable to ignore any extension control se-
quences; mandatory information would be lost in the process.   In
general, it will be up to the client generating the Compound Text
to decide which control sequence to use.

[1m11.  Errors[0m

If  a  Compound Text string does not match the specification here
(e.g., uses undefined control characters,  or  undefined  control
sequences,  or  incorrectly  formatted  extended segments), it is
best to treat the entire string as invalid, except  as  indicated
by a version control sequence.















