ASCII Characters

Operating systems, programming/scripting languages, protocols and text processing systems use characters in different ways. This page summarizes the character set and some of the special uses of and restrictions on characters. The 7-bit ASCII (American National Standard Code for Information Interchange) code set is defined in ANSI X3.4. Extended 8-bit codes, as defined in ISO 8859-1 (Latin 1), can also be used in HTML.

Text data: ASCII

See also: Special Character Names
          Character Usage
 There are two main codes in use for
character data: ASCII and EBCDIC. EBCDIC is used almost exclusively on IBM
machines and their clones. On most other computer systems ASCII is used,
and it is by far the more common of the two, so it is the main subject here.

ASCII stands for American Standard Code for Information Interchange. It
contains a binary code for all the characters generated by the keyboard, and
a few others that are not generated by all keyboards.

 The standard ASCII set consists of 128 binary codes, from 000 0000 to 111
1111. The most significant bit (msb) of the byte is not written because it is
sometimes reserved for a parity bit (an error check: see later), and on some
microcomputers another 128 special symbols (graphic characters or mathematical
symbols) are defined using this eighth bit. Since its use varies from one
system to another, we will explicitly write only the first 7 bits.
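
As a rough illustration of the 7-bit code and the optional parity bit, the Python sketch below prints the 7-bit pattern for a character and sets the eighth bit for even parity. Even parity is an assumption made for the example (some systems use odd parity or none), and the function names are hypothetical.

```python
def seven_bit(code: int) -> str:
    """Return the 7-bit binary pattern of an ASCII code (0-127)."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    return format(code, "07b")


def with_even_parity(code: int) -> int:
    """Set bit 7 so the whole byte holds an even number of 1 bits.

    Assumption: even parity; other systems may use odd parity or no parity.
    """
    data = code & 0x7F
    parity = bin(data).count("1") % 2   # 1 when the data bits hold an odd count
    return (parity << 7) | data


print(seven_bit(ord("A")))                         # 1000001
print(format(with_even_parity(ord("A")), "08b"))   # 01000001
print(format(with_even_parity(ord("C")), "08b"))   # 11000011
```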

 HTML Character References use the decimal code, e.g. &#64; = '@'.
 URL Encoding uses hex codes (e.g. %40 = @).
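
A short Python sketch of both notations, using only the standard library (html.unescape, urllib.parse.quote and unquote). The helper name char_reference is hypothetical.

```python
import html
from urllib.parse import quote, unquote


def char_reference(ch: str) -> str:
    """Build the decimal HTML character reference for one character."""
    return f"&#{ord(ch)};"


print(char_reference("@"))       # &#64;
print(html.unescape("&#64;"))    # @
print(quote("@", safe=""))       # %40
print(unquote("%40"))            # @
```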

Control Characters

                    CTRL   (^D means to hold the CTRL key and hit d)
Oct  Dec Char  Hex  Key     Comments
00   0  NUL  \x00  ^@    (Null byte)
01   1  SOH  \x01  ^A    (Start of heading)
02   2  STX  \x02  ^B    (Start of text)
03   3  ETX  \x03  ^C    (End of text) (see: UNIX keyboard CTRL)
04   4  EOT  \x04  ^D    (End of transmission) (see: UNIX keyboard CTRL)
05   5  ENQ  \x05  ^E    (Enquiry)
06   6  ACK  \x06  ^F    (Acknowledge)
07   7  BEL  \x07  ^G    (Ring terminal bell)
10   8   BS  \x08  ^H \b (Backspace)  (\b matches backspace inside [] only)
                                        (see: UNIX keyboard CTRL)
11   9   HT  \x09  ^I \t (Horizontal tab)
12  10   LF  \x0A  ^J \n (Line feed)  (Default UNIX NL) (see End of Line below)
13  11   VT  \x0B  ^K    (Vertical tab)
14  12   FF  \x0C  ^L \f (Form feed)
15  13   CR  \x0D  ^M \r (Carriage return)  (see: End of Line below)
16  14   SO  \x0E  ^N    (Shift out)
17  15   SI  \x0F  ^O    (Shift in)
20  16  DLE  \x10  ^P    (Data link escape)
21  17  DC1  \x11  ^Q    (Device control 1) (XON) (Default UNIX START char.)
22  18  DC2  \x12  ^R    (Device control 2)
23  19  DC3  \x13  ^S    (Device control 3) (XOFF)  (Default UNIX STOP char.)
24  20  DC4  \x14  ^T    (Device control 4)
25  21  NAK  \x15  ^U    (Negative acknowledge)  (see: UNIX keyboard CTRL)
26  22  SYN  \x16  ^V    (Synchronous idle)
27  23  ETB  \x17  ^W    (End of transmission block)
30  24  CAN  \x18  ^X    (Cancel)
31  25  EM   \x19  ^Y    (End of medium)
32  26  SUB  \x1A  ^Z    (Substitute character)
33  27  ESC  \x1B  ^[    (Escape)
34  28  FS   \x1C  ^\    (File separator, Information separator four)
35  29  GS   \x1D  ^]    (Group separator, Information separator three)
36  30  RS   \x1E  ^^    (Record separator, Information separator two)
37  31  US   \x1F  ^_    (Unit separator, Information separator one)
\177 127  DEL  \x7F  ^?    (Delete)  (see: UNIX keyboard CTRL)
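
The CTRL Key column in the table follows a simple pattern: the control code and the key character differ only in bit 6 (hex 40). The Python sketch below reproduces the column under that observation; the function names are hypothetical, and real terminals may map some keys differently.

```python
def caret(code: int) -> str:
    """Caret notation for a control code (0-31 or 127): flip bit 6."""
    if not (0 <= code <= 0x1F or code == 0x7F):
        raise ValueError("not a control code")
    return "^" + chr(code ^ 0x40)


def ctrl(key: str) -> int:
    """Code produced by CTRL plus a key in the range '@' to '_'."""
    upper = key.upper()
    if not 0x40 <= ord(upper) <= 0x5F:
        raise ValueError("key must be in the range '@' to '_'")
    return ord(upper) & 0x1F   # keep only the low five bits


print(caret(0x00))   # ^@
print(caret(0x1B))   # ^[
print(caret(0x7F))   # ^?
print(ctrl("d"))     # 4  (EOT)
print(ctrl("["))     # 27 (ESC)
```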

Printable Characters

Specials (32-47)

                    (See: Special Character Names)
40  32 " " \x20               (space)
41  33  !  \x21    EXCLAMATION POINT (bang)
42  34  "  \x22    QUOTATION MARK, DIAERESIS
43  35  #  \x23    NUMBER SIGN (Pound sign) (see: UNIX keyboard CTRL)
44  36  $  \x24    DOLLAR SIGN
45  37  %  \x25    PERCENT SIGN
46  38  &  \x26    AMPERSAND
47  39  '  \x27    APOSTROPHE, RIGHT SINGLE QUOTATION MARK, ACUTE ACCENT (single quote)
50  40  (  \x28    LEFT PARENTHESIS  (open parenthesis)
51  41  )  \x29    RIGHT PARENTHESIS (close parenthesis)
52  42  *  \x2A    ASTERISK
53  43  +  \x2B    PLUS SIGN
54  44  ,  \x2C    COMMA, CEDILLA
55  45  -  \x2D    HYPHEN, MINUS SIGN
56  46  .  \x2E    PERIOD, DECIMAL POINT, (Full Stop)
57  47  /  \x2F    SLANT (SOLIDUS), slash

Digits

60  48  0  \x30
61  49  1  \x31
62  50  2  \x32
63  51  3  \x33
64  52  4  \x34
65  53  5  \x35
66  54  6  \x36
67  55  7  \x37
70  56  8  \x38
71  57  9  \x39

Specials (58-64)

72  58  :  \x3A    COLON
73  59  ;  \x3B    SEMICOLON
74  60  <  \x3C    LESS-THAN SIGN  (left angle bracket)
75  61  =  \x3D    EQUALS SIGN
76  62  >  \x3E    GREATER-THAN SIGN  (right angle bracket)
77  63  ?  \x3F    QUESTION MARK
\100  64  @  \x40    COMMERCIAL AT † (see: UNIX keyboard CTRL)

Latin Capital Letters

\101  65  A  \x41	\112  74  J  \x4A	\123  83  S  \x53
\102  66  B  \x42	\113  75  K  \x4B	\124  84  T  \x54
\103  67  C  \x43	\114  76  L  \x4C	\125  85  U  \x55
\104  68  D  \x44	\115  77  M  \x4D	\126  86  V  \x56
\105  69  E  \x45	\116  78  N  \x4E	\127  87  W  \x57
\106  70  F  \x46	\117  79  O  \x4F	\130  88  X  \x58
\107  71  G  \x47	\120  80  P  \x50	\131  89  Y  \x59
\110  72  H  \x48	\121  81  Q  \x51	\132  90  Z  \x5A
\111  73  I  \x49	\122  82  R  \x52

Specials (91-96)

\133  91  [  \x5B    LEFT (SQUARE) BRACKET (open bracket)  †
\134  92  \  \x5C    REVERSE SLANT (REVERSE SOLIDUS) (backslash, backslant)  †
\135  93  ]  \x5D    RIGHT (SQUARE) BRACKET (closing bracket)  †
\136  94  ^  \x5E    CIRCUMFLEX ACCENT  †
\137  95  _  \x5F    UNDERLINE (LOW LINE)
\140  96  `  \x60    LEFT SINGLE QUOTATION MARK, GRAVE ACCENT  †

Latin Small Letters

\141  97  a  \x61	\152 106  j  \x6A	\163 115  s  \x73
\142  98  b  \x62	\153 107  k  \x6B	\164 116  t  \x74
\143  99  c  \x63	\154 108  l  \x6C	\165 117  u  \x75
\144 100  d  \x64	\155 109  m  \x6D	\166 118  v  \x76
\145 101  e  \x65	\156 110  n  \x6E	\167 119  w  \x77
\146 102  f  \x66	\157 111  o  \x6F	\170 120  x  \x78
\147 103  g  \x67	\160 112  p  \x70	\171 121  y  \x79
\150 104  h  \x68	\161 113  q  \x71	\172 122  z  \x7A
\151 105  i  \x69	\162 114  r  \x72
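
Comparing the capital and small letter tables, each small letter is exactly 32 (hex 20) above its capital. The Python sketch below uses that single bit to switch case for ASCII letters only; the function name is hypothetical.

```python
def swap_case_ascii(ch: str) -> str:
    """Flip bit 5 (hex 20) for ASCII letters; leave other characters alone."""
    code = ord(ch)
    if 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:
        return chr(code ^ 0x20)
    return ch


print(swap_case_ascii("A"))   # a
print(swap_case_ascii("z"))   # Z
print(swap_case_ascii("1"))   # 1
```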

Specials (123-126)

\173 123  {  \x7B  LEFT BRACE (LEFT CURLY BRACKET) (open brace) †
\174 124  |  \x7C  VERTICAL LINE (pipe) †
\175 125  }  \x7D  RIGHT BRACE (RIGHT CURLY BRACKET) (closing brace) †
\176 126  ~  \x7E  TILDE (OVERLINE) (squiggle) †

Control (127)

\177 127 DEL \x7F ^?            (Delete)  (see: UNIX keyboard CTRL)

 † The characters following the letters may be used for additional
letters in countries with alphabets containing more than 26 letters. These
characters should not be used in international interchange
without determining that there is agreement between sender and recipient.

Usage of Special Characters

End of Line character

End of Line varies depending on the operating system:
    DOS/Windows:  <CR><LF>
    Macintosh:    <CR>
    UNIX:         <LF>
(See File Format Notes for more information.)
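
A minimal Python sketch for working with the three conventions: one helper guesses which convention a block of bytes uses, the other converts everything to the UNIX form. The function names are hypothetical, and files that mix conventions are reported by the first match only.

```python
def detect_eol(data: bytes) -> str:
    """Guess the end-of-line convention used in a block of bytes."""
    if b"\r\n" in data:
        return "DOS/Windows"
    if b"\r" in data:
        return "Macintosh"
    if b"\n" in data:
        return "UNIX"
    return "none"


def to_unix_newlines(data: bytes) -> bytes:
    """Convert <CR><LF> and bare <CR> line ends to <LF>."""
    # Replace the two-byte form first so its <CR> is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(detect_eol(b"one\r\ntwo\r\n"))         # DOS/Windows
print(to_unix_newlines(b"one\r\ntwo\rend"))  # b'one\ntwo\nend'
```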

UNIX Keyboard Control Characters

The default keyboard control characters vary depending on the UNIX system. Most people change them with the stty command in their .profile.

                           SysV    Sun/Solaris  HP/UX
Erase (character delete)   #       <DEL>        <BS> (^H)
Kill (line delete)         @       ^U           @
Intr (Interrupt process)   <DEL>   ^C           <DEL>
EOF (End of File)          ^D      ^D           ^D

EOF signals End of File for characters input from the terminal. It also causes the shell to terminate.

Special Characters allowed in names and addresses:

 Note: The only characters other than letters and digits which appear to
       be universally acceptable are - (dash) and _ (underscore), and you
       have to watch out for '-', which can be interpreted as minus when
       used in a name in certain perl scripts.

         (1)      (2) (3)
Octal   UNIX DOS SMTP URL (HTML - allows all but <, >, &,and  ")
11 TAB
40 " "          -     Spaces can be used in mail addresses if the addr. is quoted.
41  !       *   *   *     ! can cause problems in csh in UNIX.
42  "
43  #   *   *   *      (see: UNIX keyboard CTRL)
44  $       *   *   *
45  %   *   *   *
46  &       *   *
47  '       *   *   *
50  (       *
51  )       *
52  *           *   *
53  +   *       *   *   (URL's sometimes use + for space)
54  ,   *
55  -   *   *   *   *
56  .   *
57  /           *
72  :   *
73  ;
74  <
75  =   *       *
76  >
77  ?           *
\100  @   *   *         (see: UNIX keyboard CTRL)
\133  [
\134  \
\135  ]
\136  ^       *   *
\137  _   *   *   *   *
\140  `       *   *
\173  {       *   *
\174  |           *
\175  }       *   *
\176  ~   *   *   *

(1) UNIX - Any character except "/" (slash) is allowed
      in a UNIX file name, but many are not recommended
      because they cause problems in scripting and/or
      programming languages dealing with the files.
(2) SMTP - (Simple Mail Transfer Protocol)
(3) URI/URL - Uniform Resource Identifier/Locator. Other characters can
be used but require encoding with % and the HEX value (e.g. @ = %40).
(Space is sometimes encoded as "+".)
(4) HTML - HyperText Markup Language requires 4 ASCII characters to be
encoded as character or entity references (escape sequences).
These ASCII characters have special meaning in HTML, so they must be encoded:
              Character  Entity
 Character    Reference  Reference
    <          &#60;      &lt;
    >          &#62;      &gt;
    &          &#38;      &amp;
    "          &#34;      &quot;
Other common character encodings for HTML (mostly non-ASCII):
Description           Code             Entity name       Octal Code
 e, acute accent      &#233;  --> é    &eacute; --> é    \351 (octal) = é
 ampersand            &#38;   --> &    &amp;    --> &
 registered trademark &#174;  --> ®    &reg;    --> ®
 copyright            &#169;  --> ©    &copy;   --> ©
 trademark            &#8482; --> ™    <SUP><FONT SIZE=-1>TM</FONT></SUP> --> TM
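
The encodings above can be checked with Python's standard html module, as in the sketch below. html.escape and html.entities.name2codepoint are standard library names; the helper numeric_reference is hypothetical. Note that html.escape also encodes the apostrophe by default, which goes beyond the four characters listed here.

```python
import html
from html.entities import name2codepoint


def numeric_reference(entity_name: str) -> str:
    """Decimal character reference for a named entity, e.g. 'copy'."""
    return f"&#{name2codepoint[entity_name]};"


# The four characters that must be encoded in HTML text and attributes.
print(html.escape('<a href="x">Q&A</a>'))
# &lt;a href=&quot;x&quot;&gt;Q&amp;A&lt;/a&gt;

print(numeric_reference("eacute"))       # &#233;
print(numeric_reference("reg"))          # &#174;
print(numeric_reference("copy"))         # &#169;
print(oct(name2codepoint["eacute"]))     # 0o351
```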

Other HTML Character Reference Tables

ISO 8859-1 (Latin 1) notes and Character List at Best Business Solutions (BBS).
Extended ASCII (same as ISO 8859-1) at emory.edu

 ISO (International Organization for Standardization) defines several character sets,
e.g. the ISO 8859 series.
HTML Character Entity names are defined at targnet.org and uni-passau.

IBM
IBM uses EBCDIC (Extended Binary Coded Decimal Interchange Code)
 (8-bit) coding on most of their systems.
They use code pages to specify character sets for keyboards, displays,
printers, ... for DOS, AIX, Mainframes, ....
 Standard DOS code pages are:
    437  United States
    850  Multilingual (Latin 1)
    852  Slavic (Latin 2)
    863  Canadian-French
    865  Nordic (Norwegian, Danish)
    860  Portuguese
 See:
 IBM OS/390 Code Pages

 General Info. on Code Pages

See also: BYTE article 'Organizing Babylon' on international character sets.
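
To see what a code page changes in practice, the Python sketch below encodes one letter in ASCII and in an EBCDIC code page, then decodes the same 8-bit value under two DOS code pages. It assumes Python's bundled codecs cp037 (one of several EBCDIC code pages, chosen here only as an example), cp437 and cp850; the function name is hypothetical.

```python
def decode_under(byte_value: int, pages: list[str]) -> dict[str, str]:
    """Decode one 8-bit value under each named code page."""
    raw = bytes([byte_value])
    return {page: raw.decode(page) for page in pages}


# The same letter has different codes in ASCII and EBCDIC.
print("A".encode("ascii").hex())   # 41
print("A".encode("cp037").hex())   # c1

# The same byte above 127 can mean different characters per code page.
for page, ch in decode_under(0x9B, ["cp437", "cp850"]).items():
    print(page, f"U+{ord(ch):04X}")
# cp437 U+00A2  (cent sign)
# cp850 U+00F8  (o with stroke)
```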

Netscape Character Sets
MIME Charset parameter in HTTP. If the server includes this parameter in its
response, Netscape Navigator will change its character set appropriately.
 For example:

              Content-Type: text/html;charset=iso-8859-1
              Content-Type: text/html;charset=iso-2022-jp

       The charset names recognized by Netscape Navigator 1.1 are specified in
RFC 1700 (except for the names that begin with "x-".) These include:
              us-ascii
              iso-8859-1
              iso-2022-jp
              x-sjis
              x-euc-jp
              x-mac-roman

       Additionally, the following aliases are recognized for us-ascii:

              ansi_x3.4-1968
              iso-ir-6
              ansi_x3.4-1986
              iso_646.irv:1991
              ascii
              iso646-us
              us
              ibm367
              cp367
