ASCII

2024-10-04

What is ASCII?

ASCII is an old way to store text as bits in a computer, i.e. it's a character encoding. ASCII is obsolete for multiple reasons, but it's still used today in many ways. Unicode, the international standard that replaces it, is compatible with ASCII when encoded as UTF-8. All URLs are written using only characters that ASCII supports. There's even a type of digital art called ASCII art.

7-Bit ASCII

The original ASCII (American Standard Code for Information Interchange) was a 7-bit character encoding format. As it's 7-bit, it can only have 128 unique code points. ASCII supported only unaccented Latin letters, Arabic numerals, and punctuation used in American English. If you wanted to write "señor" for example in an operating system or program that only supported ASCII, you wouldn't be able to because the computer literally had no way to represent the text character ñ as bits in memory.
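
As a small illustration of that limit, here is a sketch in Python (the helper name try_ascii is made up for this example). It counts the 7-bit code points and shows that "señor" cannot be encoded with the built-in "ascii" codec:

```python
# Every ASCII code point fits in 7 bits, so there are 2**7 = 128 of them.
ascii_code_points = [chr(i) for i in range(2**7)]


def try_ascii(text):
    """Return the ASCII bytes for text, or None if it cannot be encoded.

    try_ascii is a hypothetical helper written for this example.
    """
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


print(len(ascii_code_points))  # 128
print(try_ascii("senor"))      # b'senor'
print(try_ascii("se\u00f1or")) # None, because ñ (U+00F1) is outside ASCII
```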

Why was ASCII 7-Bit?

You may be wondering why anyone would make a character encoding format that uses 7 bits, when bytes are 8 bits, and almost everything in modern computers is byte-aligned.

The main reason for this is that ASCII is extremely old: the version in use today was adopted in 1967. It seems that, although the term "byte" already existed at the time, different computers had different sizes for bytes. On modern CPUs, everything is byte-aligned, from machine code instructions to memory addresses. However, at the time, they weren't making programs for complex microcomputers with dozens of tiny components that had to interface with each other in a standard way. They were using punch cards!

[...] The version in use today is more completely called ASCII-1967 (it was adopted in 1967) [...]

ASCII uses only seven bits. Although it was communicated in eight-bit bytes, normal communication channels were unreliable. The 8th bit was used for error checking (parity). Typically the 8th bit was set to ensure that there was always an odd number of 1's in each byte transmitted (e.g. '$' is binary 0100100 which has an even number on 1's so is transmitted as 10100100; 'F' is binary 1000110 with an odd number of 1's so is transmitted as 01000110), but even parity systems were also used. The receiving equipment would simply check the parity of each byte; any single-bit inversion would be detected, and large errors were very likely to be noticed.

[...] The tape is decoded by treating the holes as 1's, the lacks-of-hole as 0's, converting from binary and looking up the resulting number in a standard ASCII table [...]. The least significant bit is at the bottom, and the parity bit is in the most significant position (just ignore it after checking the number of holes is odd). [...]
https://rabbit.eng.miami.edu/info/ascii.html (accessed 2024-10-04)
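
The odd-parity scheme described in the quote can be sketched in a few lines of Python. The function names add_odd_parity and check_odd_parity are made up for this example, and the sketch assumes the parity bit sits in the most significant position, as the quote says:

```python
def add_odd_parity(code):
    """Return the 8-bit value for a 7-bit code, with an odd-parity bit on top."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit code")
    ones = bin(code).count("1")
    # Set the 8th bit only when the 7 data bits hold an even number of 1's,
    # so the transmitted byte always has an odd number of 1's.
    parity_bit = 1 if ones % 2 == 0 else 0
    return (parity_bit << 7) | code


def check_odd_parity(byte):
    """Return the 7-bit code if the byte has odd parity, else None."""
    if bin(byte).count("1") % 2 == 1:
        return byte & 0x7F  # drop the parity bit
    return None


print(format(add_odd_parity(ord("$")), "08b"))  # 10100100
print(format(add_odd_parity(ord("F")), "08b"))  # 01000110
print(check_odd_parity(0b10100100))             # 36, the code for '$'
print(check_odd_parity(0b10100101))             # None, one bit was flipped
```

The two outputs for '$' and 'F' match the bytes given in the quote. Flipping any single bit changes the count of 1's from odd to even, which is why the check catches it.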

8-Bit Extended ASCII with Accented Letters

There's also something called "ASCII" which is an 8-bit format and includes accented letters. In fact, this is what "ASCII" often refers to. However, this isn't actually ASCII: it's usually either a proprietary Windows format called CP1252, or a standard format called ISO-8859-1, also known as Latin1.

ISO-8859 is an eight-bit character code which extends the well-known seven-bit code, ASCII.

There are many versions of ISO-8859, one for each major language group, called Code Pages. The only code page listed here is the first, ISO-8859-1, known as "Latin1". Latin1 is intended for English and Western European Languages, but does not do a very good job, as it does not even require the characters required for British spelling. The Microsoft character set (sometimes called CP1252) adds a few more characters and surprisingly does a much better job than ISO-8859.
https://rabbit.eng.miami.edu/info/iso8859.html (accessed 2024-10-04)
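
To see how the two 8-bit formats relate to ASCII and to each other, here is a short Python sketch using the standard "cp1252" and "latin-1" codecs. The byte values are picked only for illustration: 0x80 is one of the positions where the two formats differ, and 0xE9 is one where they agree.

```python
# The first 128 byte values decode identically in ASCII, Latin1, and CP1252.
low_half = bytes(range(128))
same_low_half = (
    low_half.decode("ascii")
    == low_half.decode("latin-1")
    == low_half.decode("cp1252")
)
print(same_low_half)  # True

# 0xE9 is é in both 8-bit formats.
print(bytes([0xE9]).decode("latin-1"))  # é
print(bytes([0xE9]).decode("cp1252"))   # é

# 0x80 is a control code in Latin1 but the euro sign in CP1252.
print(repr(bytes([0x80]).decode("latin-1")))  # '\x80'
print(bytes([0x80]).decode("cp1252"))         # €
```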

ASCII in UTF

Today, the character set that everybody uses is called Unicode, and it is usually stored with one of its UTF encodings, most commonly UTF-8. If you don't use UTF, you'd better have an exceptionally good reason for it, because practically everyone on the planet uses it.

UTF-8 is a superset of ASCII. In UTF-8, a code point takes at least 1 byte (8 bits) and at most 4 bytes. UTF-8 uses the 8th bit of a byte to tell whether that byte encodes an ASCII code point or not.

Code points: it's worth noting that in Unicode, a single code point isn't necessarily a single character, because some characters are made up of multiple code points. But in simple cases they mean the same thing.
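
A quick way to see the difference is to write the same visible character in two ways. This Python sketch uses "ã" as the example and the standard unicodedata module. Python's len() counts code points, not what a reader would call characters:

```python
import unicodedata

precomposed = "\u00e3"  # ã stored as a single code point
decomposed = "a\u0303"  # a followed by COMBINING TILDE, displayed as one character

print(len(precomposed))  # 1
print(len(decomposed))   # 2

# NFC normalization composes the pair back into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```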

If a code point is in ASCII, then it's encoded as a single byte: the low 7 bits hold the ASCII value, and the eighth bit is 0. If a code point isn't in ASCII, then the eighth bit of its first byte is 1, and the leading bits of that byte tell the decoder how many more bytes to read to figure out what code point it is.
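
The check on the first byte can be sketched like this in Python. The function name utf8_sequence_length is made up for this example, and the sketch only inspects the leading bits. It doesn't validate the bytes that follow.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence occupies, judging by its first byte."""
    if first_byte & 0b10000000 == 0:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte, which cannot start a sequence.
    raise ValueError("not a valid first byte")


print(utf8_sequence_length("a".encode("utf-8")[0]))           # 1
print(utf8_sequence_length("\u00e3".encode("utf-8")[0]))      # 2 (ã)
print(utf8_sequence_length("\u20ac".encode("utf-8")[0]))      # 3 (€)
print(utf8_sequence_length("\U0001F600".encode("utf-8")[0]))  # 4 (an emoji)
```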

For example, "a" is ASCII, so it's 1 byte, but "ã" has an accent, so it's not ASCII, and will take 2 bytes to encode in UTF-8. A text with two characters such as "aã" would take 3 bytes to encode, one byte for the first character, two bytes for the second.
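
The byte counts above are easy to confirm in Python. This sketch assumes "ã" is stored as the single code point U+00E3:

```python
text = "a\u00e3"  # the two-character text "aã"
encoded = text.encode("utf-8")

print(len(text))     # 2 code points
print(len(encoded))  # 3 bytes
print(encoded)       # b'a\xc3\xa3'

# The ASCII character is the same single byte in both encodings.
print("a".encode("utf-8") == "a".encode("ascii"))  # True
```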