Operating systems, programming/scripting languages, protocols and text processing systems use characters in different ways. This page summarizes the ASCII character set and some of the special uses of and restrictions on characters. The 7-bit ASCII (American National Standard Code for Information Interchange) code set is defined in ANSI standard X3.4. Extended 8-bit codes, as defined in ISO 8859-1 (Latin 1), can also be used in HTML.
- Control Characters
- Printable Characters
- Usage of Special Characters
- Special Characters allowed in names and addresses
- ISO Latin and extended ASCII Character References
Text data: ASCII
See also: Special Character Names, Character Usage
There are two main codes in use for character data: ASCII and EBCDIC. EBCDIC is used almost exclusively on IBM machines and their clones. On most other computer systems ASCII is used, so that is all we will discuss here. ASCII is by far the more common of the two.
ASCII stands for American Standard Code for Information Interchange. It contains a binary code for all the characters generated by the keyboard, and a few others that are not generated by all keyboards. The standard ASCII set consists of 128 binary codes, from 000 0000 to 111 1111.
The most significant bit (msb) of the byte is not written because it is sometimes reserved for a parity bit (an error check), and on some microcomputers another 128 special symbols (graphic characters or mathematical symbols) are defined using this eighth bit. Since its use varies from one system to another, we will explicitly write only the first 7 bits.
HTML character references use the decimal code, e.g. &#64; = '@'. URL encoding uses the hex code, e.g. %40 = '@'.
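As a rough illustration of the code values described above, the following Python sketch prints the decimal, octal, hex and 7-bit binary forms of '@', together with its HTML character reference and URL encoding. The helper names char_reference and seven_bit are made up for this example; ord, html.unescape, quote and unquote are from the Python standard library.

```python
import html
from urllib.parse import quote, unquote


def char_reference(ch: str) -> str:
    """Return the HTML decimal character reference for one character."""
    return f"&#{ord(ch)};"


def seven_bit(ch: str) -> str:
    """Return the 7-bit ASCII code of ch as a string such as '100 0000'."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    bits = format(code, "07b")          # seven binary digits, msb omitted
    return f"{bits[:3]} {bits[3:]}"


code = ord("@")
print(code, oct(code), hex(code))       # decimal, octal, hex forms
print(seven_bit("@"))                   # 7-bit binary form
print(char_reference("@"), "->", html.unescape("&#64;"))
print(quote("@"), "->", unquote("%40"))
```

The same helpers work for any of the 128 codes in the tables on this page.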
Control Characters
CTRL (^D means to hold the CTRL key and hit d)

Oct   Dec  Char  Hex   Key  Comments
00      0  NUL   \x00  ^@   (Null byte)
01      1  SOH   \x01  ^A   (Start of heading)
02      2  STX   \x02  ^B   (Start of text)
03      3  ETX   \x03  ^C   (End of text) (see: UNIX keyboard CTRL)
04      4  EOT   \x04  ^D   (End of transmission) (see: UNIX keyboard CTRL)
05      5  ENQ   \x05  ^E   (Enquiry)
06      6  ACK   \x06  ^F   (Acknowledge)
07      7  BEL   \x07  ^G   (Ring terminal bell)
10      8  BS    \x08  ^H   \b (Backspace) (\b matches backspace inside [] only) (see: UNIX keyboard CTRL)
11      9  HT    \x09  ^I   \t (Horizontal tab)
12     10  LF    \x0A  ^J   \n (Line feed) (Default UNIX NL) (see: End of Line below)
13     11  VT    \x0B  ^K   (Vertical tab)
14     12  FF    \x0C  ^L   \f (Form feed)
15     13  CR    \x0D  ^M   \r (Carriage return) (see: End of Line below)
16     14  SO    \x0E  ^N   (Shift out)
17     15  SI    \x0F  ^O   (Shift in)
20     16  DLE   \x10  ^P   (Data link escape)
21     17  DC1   \x11  ^Q   (Device control 1) (XON) (Default UNIX START char.)
22     18  DC2   \x12  ^R   (Device control 2)
23     19  DC3   \x13  ^S   (Device control 3) (XOFF) (Default UNIX STOP char.)
24     20  DC4   \x14  ^T   (Device control 4)
25     21  NAK   \x15  ^U   (Negative acknowledge) (see: UNIX keyboard CTRL)
26     22  SYN   \x16  ^V   (Synchronous idle)
27     23  ETB   \x17  ^W   (End of transmission block)
30     24  CAN   \x18  ^X   (Cancel)
31     25  EM    \x19  ^Y   (End of medium)
32     26  SUB   \x1A  ^Z   (Substitute character)
33     27  ESC   \x1B  ^[   (Escape)
34     28  FS    \x1C  ^\   (File separator, Information separator four)
35     29  GS    \x1D  ^]   (Group separator, Information separator three)
36     30  RS    \x1E  ^^   (Record separator, Information separator two)
37     31  US    \x1F  ^_   (Unit separator, Information separator one)
\177  127  DEL   \x7F  ^?   (Delete) (see: UNIX keyboard CTRL)
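The Key column above follows a simple pattern: each control code differs from its key character only in the bit with value 0x40 (^D is 'D' = 0x44 with that bit cleared, and ^? is '?' = 0x3F with it set). A minimal Python sketch of that relationship follows; the function names caret_notation and ctrl_code are invented for this example.

```python
def caret_notation(code: int) -> str:
    """Return the ^X form for a control code (0-31 or 127)."""
    if not (0 <= code < 0x20 or code == 0x7F):
        raise ValueError(f"{code} is not an ASCII control code")
    # Toggling bit 0x40 maps 0x04 -> 'D', 0x1B -> '[', 0x7F -> '?'
    return "^" + chr(code ^ 0x40)


def ctrl_code(key: str) -> int:
    """Return the control code produced by CTRL plus the given key."""
    # Lower-case letters give the same code as upper-case ones
    return ord(key.upper()) ^ 0x40


for code in (0, 4, 27, 127):
    print(code, caret_notation(code))

print(ctrl_code("d"), ctrl_code("["), ctrl_code("?"))
```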
Printable Characters
Specials (32-47)
(See: Special Character Names)
Oct  Dec  Char  Hex   Name
40   32   " "   \x20  (space)
41   33   !     \x21  EXCLAMATION POINT (bang)
42   34   "     \x22  QUOTATION MARK, DIAERESIS
43   35   #     \x23  NUMBER SIGN (Pound sign) (see: UNIX keyboard CTRL)
44   36   $     \x24  DOLLAR SIGN
45   37   %     \x25  PERCENT SIGN
46   38   &     \x26  AMPERSAND
47   39   '     \x27  APOSTROPHE, RIGHT SINGLE QUOTATION MARK, ACUTE ACCENT (single quote)
50   40   (     \x28  LEFT PARENTHESIS (open parenthesis)
51   41   )     \x29  RIGHT PARENTHESIS (close parenthesis)
52   42   *     \x2A  ASTERISK
53   43   +     \x2B  PLUS SIGN
54   44   ,     \x2C  COMMA, CEDILLA
55   45   -     \x2D  HYPHEN, MINUS SIGN
56   46   .     \x2E  PERIOD, DECIMAL POINT (Full Stop)
57   47   /     \x2F  SLANT (SOLIDUS), slash
Digits
60  48  0  \x30
61  49  1  \x31
62  50  2  \x32
63  51  3  \x33
64  52  4  \x34
65  53  5  \x35
66  54  6  \x36
67  55  7  \x37
70  56  8  \x38
71  57  9  \x39
Specials (58-64)
72    58  :  \x3A  COLON
73    59  ;  \x3B  SEMICOLON
74    60  <  \x3C  LESS-THAN SIGN (left angle bracket)
75    61  =  \x3D  EQUALS SIGN
76    62  >  \x3E  GREATER-THAN SIGN (right angle bracket)
77    63  ?  \x3F  QUESTION MARK
\100  64  @  \x40  COMMERCIAL AT † (see: UNIX keyboard CTRL)
Latin Capital Letters
\101 65 A \x41    \112 74 J \x4A    \123 83 S \x53
\102 66 B \x42    \113 75 K \x4B    \124 84 T \x54
\103 67 C \x43    \114 76 L \x4C    \125 85 U \x55
\104 68 D \x44    \115 77 M \x4D    \126 86 V \x56
\105 69 E \x45    \116 78 N \x4E    \127 87 W \x57
\106 70 F \x46    \117 79 O \x4F    \130 88 X \x58
\107 71 G \x47    \120 80 P \x50    \131 89 Y \x59
\110 72 H \x48    \121 81 Q \x51    \132 90 Z \x5A
\111 73 I \x49    \122 82 R \x52
Specials (91-96)
\133 91 [ \x5B LEFT (SQUARE) BRACKET (open bracket) †
\134 92 \ \x5C REVERSE SLANT (REVERSE SOLIDUS) (backslash, backslant) †
\135 93 ] \x5D RIGHT (SQUARE) BRACKET (closing bracket) †
\136 94 ^ \x5E CIRCUMFLEX ACCENT †
\137 95 _ \x5F UNDERLINE (LOW LINE)
\140 96 ` \x60 LEFT SINGLE QUOTATION MARK, GRAVE ACCENT †
Latin Small Letters
\141  97 a \x61    \152 106 j \x6A    \163 115 s \x73
\142  98 b \x62    \153 107 k \x6B    \164 116 t \x74
\143  99 c \x63    \154 108 l \x6C    \165 117 u \x75
\144 100 d \x64    \155 109 m \x6D    \166 118 v \x76
\145 101 e \x65    \156 110 n \x6E    \167 119 w \x77
\146 102 f \x66    \157 111 o \x6F    \170 120 x \x78
\147 103 g \x67    \160 112 p \x70    \171 121 y \x79
\150 104 h \x68    \161 113 q \x71    \172 122 z \x7A
\151 105 i \x69    \162 114 r \x72
Specials (123-126)
\173 123 { \x7B LEFT BRACE (LEFT CURLY BRACKET) (open brace) †
\174 124 | \x7C VERTICAL LINE (pipe) †
\175 125 } \x7D RIGHT BRACE (RIGHT CURLY BRACKET) (closing brace) †
\176 126 ~ \x7E TILDE (OVERLINE) (squiggle) †
Control (127)
\177 127 DEL \x7F ^? (Delete) (see: UNIX keyboard CTRL)

† The characters following the letters may be used for additional letters in countries with alphabets containing more than 26 letters. These characters should not be used in international interchange without determining that there is agreement between sender and recipient.
Usage of Special Characters
End of Line character
End of Line varies depending on the operating system:
DOS/Windows:                  <CR><LF>
Macintosh (classic Mac OS):   <CR>
UNIX:                         <LF>
(See File Format Notes for more information.)
UNIX Keyboard Control Characters
The default keyboard control characters vary depending on the UNIX system. Most people change them with the stty command in their .profile.
                           SysV    Sun/Solaris   HP/UX
Erase (character delete)   #       <DEL>         <BS> (^H)
Kill (line delete)         @       ^U            @
Intr (Interrupt process)   <DEL>   ^C            <DEL>
EOF (End of File)          ^D      ^D            ^D
EOF signals End of File for characters input from the terminal. It also causes the shell to terminate.
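Returning to the End of Line conventions listed above, files moved between systems often need their line endings detected or converted. The following Python sketch is one possible way to do that on raw bytes; the function names detect_newline and normalize_newlines are assumptions made for this example, not part of any standard tool.

```python
def detect_newline(data: bytes) -> str:
    """Guess the End of Line convention used in data."""
    if b"\r\n" in data:
        return "CRLF"   # DOS/Windows
    if b"\r" in data:
        return "CR"     # classic Macintosh
    if b"\n" in data:
        return "LF"     # UNIX
    return "none"


def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Convert every End of Line sequence in data to newline."""
    # Handle CRLF first so its CR is not also counted as a lone CR
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


sample = b"dos line\r\nmac line\runix line\n"
print(detect_newline(sample))
print(normalize_newlines(sample))
print(normalize_newlines(b"a\nb\n", newline=b"\r\n"))
```

Working on bytes rather than text avoids any automatic newline translation by the I/O layer.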
Special Characters allowed in names and addresses:
Note: The only characters other than letters and digits which appear to
be universally acceptable are - (dash) and _ (underscore), and you
have to watch out for '-', which can be interpreted as minus when
used in a name in certain perl scripts.
(1) (2) (3)
Octal UNIX DOS SMTP URL (HTML - allows all but <, >, &,and ")
11 TAB
40 " " - Spaces can be used in mail addresses if the addr. is quoted.
41 ! * * * ! can cause problems in csh in UNIX.
42 "
43 # * * * (see: UNIX keyboard CTRL)
44 $ * * *
45 % * * *
46 & * *
47 ' * * *
50 ( *
51 ) *
52 * * *
53 + * * * (URL's sometimes use + for space)
54 , *
55 - * * * *
56 . *
57 / *
72 : *
73 ;
74 <
75 = * *
76 >
77 ? *
\100 @ * * (see: UNIX keyboard CTRL)
\133 [
\134 \
\135 ]
\136 ^ * *
\137 _ * * * *
\140 ` * *
\173 { * *
\174 | *
\175 } * *
\176 ~ * * *
(1) UNIX - Any character except "/" (slash) and the NUL byte is allowed
in a UNIX file name, but many are not recommended
because they cause problems in scripting and/or
programming languages dealing with the files.
(2) SMTP - (Simple Mail Transfer Protocol)
(3) URI/URL - Uniform Resource Identifier/Locator. Other characters can
be used but require encoding with % and the hex value (e.g. @ = %40).
(Space is sometimes encoded as "+".)
(4) HTML - HyperText Markup Language requires 4 ASCII characters to be
encoded as character or entity references (escape sequences).
ASCII characters with special meaning in HTML, so they must be encoded:
            Character   Entity
Character   Reference   Reference
<           &#60;       &lt;
>           &#62;       &gt;
&           &#38;       &amp;
"           &#34;       &quot;
Other common non-ASCII character encodings for HTML:
Description            Code            Entity name      Octal Code
e, acute accent        &#233; --> é    &eacute; --> é   \351 (octal) = é
ampersand              &#38; --> &     &amp; --> &
registered trademark   &#174; --> ®    &reg; --> ®
copyright              &#169; --> ©    &copy; --> ©
trademark              &#8482; --> ™   <SUP><FONT SIZE=-1>TM</FONT></SUP> --> TM
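The encodings in this section can be reproduced with Python's standard html module. The sketch below is only an illustration: html.escape and html.unescape are standard library functions, while escape_numeric and escape_non_ascii are hypothetical helpers written for this example.

```python
import html

# Decimal character references for the four characters HTML requires encoded
NUMERIC_REFS = {ord(ch): f"&#{ord(ch)};" for ch in '<>&"'}


def escape_numeric(text: str) -> str:
    """Encode <, >, & and double quote as decimal character references."""
    return text.translate(NUMERIC_REFS)


def escape_non_ascii(text: str) -> str:
    """Encode every character above 127 as a decimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


print(html.escape('Q&A <b>"x"</b>'))            # entity references
print(escape_numeric('<&>"'))                   # character references
print(escape_non_ascii("café ©"))
print(html.unescape("&eacute; &reg; &copy; &trade;"))
print(oct(ord("é")))                            # octal code of e acute
```

Note that html.escape also encodes the apostrophe by default (pass quote=False to leave both quote characters alone).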
Other HTML Character Reference Tables
ISO 8859-1 (Latin 1) notes and Character List at Best Business Solutions (BBS).
Extended ASCII (same as ISO 8859-1) at emory.edu.
ISO (International Organization for Standardization) defines several character sets, e.g. the ISO 8859 series.
HTML Character Entity names are defined at targnet.org and uni-passau.
IBM
IBM uses EBCDIC (Extended Binary Coded Decimal Interchange Code), an 8-bit coding, on most of their systems. They use code pages to specify character sets for keyboards, displays, printers, ... for DOS, AIX, Mainframes, ....
Standard DOS code pages are:
437 United States
850 Multilingual (Latin 1)
852 Slavic (Latin 2)
863 Canadian-French
865 Nordic (Norwegian, Danish)
860 Portuguese
See: IBM OS/390 Code Pages, General Info. on Code Pages
See also: BYTE article 'Organizing Babylon' on international character sets.
Netscape Character Sets
MIME Charset parameter in HTTP. If the server includes this parameter in its response, Netscape Navigator will change its character set appropriately. For example:
Content-Type: text/html;charset=iso-8859-1
Content-Type: text/html;charset=iso-2022-jp
The charset names recognized by Netscape Navigator 1.1 are specified in RFC 1700 (except for the names that begin with "x-"). These include:
us-ascii iso-8859-1 iso-2022-jp x-sjis x-euc-jp x-mac-roman
Additionally, the following aliases are recognized for us-ascii:
ansi_x3.4-1968 iso-ir-6 ansi_x3.4-1986 iso_646.irv:1991 ascii iso646-us us ibm367 cp367
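To show how a charset parameter like the ones above might be read and applied, here is a minimal Python sketch. It uses the standard library's email.message.Message to parse the header value; the function name charset_from_content_type and the us-ascii fallback are assumptions made for this example. The last lines use Python's cp437 and cp037 codecs as stand-ins for a DOS code page and an EBCDIC code page.

```python
from email.message import Message


def charset_from_content_type(value: str, default: str = "us-ascii") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset returns the lower-cased charset, or default if absent
    return msg.get_content_charset(default)


charset = charset_from_content_type("text/html;charset=iso-8859-1")
print(charset)
print(b"caf\xe9".decode(charset))            # byte 0xE9 is e acute in Latin 1

print(charset_from_content_type("text/html"))

# The same character has different byte values in other code pages
print("é".encode("cp437"))                   # DOS code page 437
print("A".encode("ascii"), "A".encode("cp037"))  # ASCII versus EBCDIC
```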
