How Computers Parse Text Codes


Things like hashtags on Instagram or Twitter (e.g. #happynewyear), username mentions and text formatting on Discord (@everyone, **bold**), emojis (or "emotes") on Twitch (:LUL:), filepaths (C:\folder\photo.jpg), URLs (https://www.example.com/), and hexadecimal color codes (#FF0000), among many, many other things, are actually code, even though most people don't call it that. These text codes that we see everywhere and use unknowingly all operate in a fundamentally identical way. In this article, we'll learn how computers parse (i.e. recognize, interpret, and understand) these codes, along with some common features of text codes that will help you understand phenomena you may encounter in programs you use now and in the future.

How Computers See Text

To understand how text codes work, first we need to understand how text works.

In a computer, everything is data (bits and bytes), and text is no different. With rare exceptions for applications that deal with text for publishing and printing, e.g. PDFs, most text you see on the computer screen works on a simple unidimensional principle. That is, the computer doesn't see text data as 2D, with left, right, up, and down, width and height. The text data is 1D: It has a start, a length, and an end; you can only go forward and backward.

It becomes 2D only when the text data is turned into an image to be displayed on the screen using a font. More specifically, only when the layout of the text is computed.

In its 1D form, the text data only contains which characters are written. A character is any letter, number (as text), or punctuation mark. Anything you can type is a character, including spaces. In programming, we call a text, i.e. a sequence of characters one after the other, a string of characters. For example, two words is a string of 9 characters: the substring two has 3 characters, the space ( ) between the two words counts as 1 character, and then words counts as 5 characters. 3 + 1 + 5 = 9.

The CPU doesn't know English. It can only understand bits and bytes: bits are 0's and 1's, while bytes are sequences of 8 bits. Consequently, in order to store text in the computer, we need to convert it into sequences of bits. This act of converting data from one representation into another is called encoding.

There are many ways to encode text data as bits. The standard way is the UTF-8 encoding, which encodes 1 character into as little as 8 bits (1 byte per character), although some characters can take 2 or more bytes. In this encoding, the character "A" would become 01000001, and the character "B" would become 01000010. If we interpret these bit sequences as binary numbers and convert them to decimal, the letter "A" would be the number 65 inside the computer, the letter "B" would be 66, "C" would be 67, and so on.

A 01000001 = 65
B 01000010 = 66
C 01000011 = 67
D 01000100 = 68
E 01000101 = 69
F 01000110 = 70
G 01000111 = 71
H 01001000 = 72

This means any time you see an "A" in text on your screen, there's a 01000001 sequence of bits in your computer's memory somewhere.

There are also codes for punctuation, and, most importantly, a separate binary code for lower case characters.

a 01100001 = 97
b 01100010 = 98
c 01100011 = 99

This means that inside the computer's memory, "A" and "a" are encoded as different bit sequences of same length: 01000001 and 01100001, respectively. Note that they're only different in the third bit.
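You can check these codes yourself in JavaScript, which is the language we'll use for examples in this article. This is just a quick sketch: charCodeAt gives the numeric code of a character, and toString(2) shows a number in binary.

// Look up the numeric code of a character, then show it as 8 bits.
console.log("A".charCodeAt(0)); // 65
console.log("a".charCodeAt(0)); // 97
console.log((65).toString(2).padStart(8, "0")); // "01000001"
console.log((97).toString(2).padStart(8, "0")); // "01100001"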

Let's see how a phrase would look inside the computer's memory:

A simple example.
└────────┬──────┘
01000001 A
00100000   (space)
01110011 s
01101001 i
01101101 m
01110000 p
01101100 l
01100101 e
00100000   (space)
01100101 e
01111000 x
01100001 a
01101101 m
01110000 p
01101100 l
01100101 e
00101110 . (dot)

Observe how 01100101 appears 3 times because we have 3 "e" characters. This may look complicated at first, but, besides the fact that we use binary codes to represent characters, nothing unexpected really happens.

Since we can interpret bits as numbers, we can also rewrite these bits using hexadecimal numbers for brevity. Observe:

A     s  i  m  p  l  e     e  x  a  m  p  l  e  .
41 20 73 69 6D 70 6C 65 20 65 78 61 6D 70 6C 65 2E

Note how, since simple and example end in mple, we get the code for mple twice as well: 6D 70 6C 65.

With 8 bits, we have 256 permutations of 0's and 1's, so we can encode at most 256 characters. UTF-8 seeks to encode the characters of all human languages, which means we need far more than that in order to encode, for example, the tens of thousands of Chinese characters used in Asia.

In UTF-8, a single character may take more than one byte. The format uses some bits of the first byte to signal whether a character is single-byte or multi-byte, which means it can't represent 256 characters using only 8 bits, as some permutations are needed to indicate that the character continues in the next byte. For example:

á 11000011 10100001

Above we have "á," which is "a" with an accent. It takes 16 bits to encode this character using UTF-8.

Because the first byte starts with 110, that means it's a multi-byte character.
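You can observe this in JavaScript with the standard TextEncoder, which always encodes strings to UTF-8. A minimal sketch:

var encoder = new TextEncoder(); // TextEncoder always produces UTF-8
var bytes = encoder.encode("á");
console.log(bytes); // Uint8Array [ 195, 161 ]
// 195 = 11000011 and 161 = 10100001, the two bytes shown above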

Before UTF-8, there were other ways to encode text. Many countries had their own encoding for their own language, such as ASCII from America for English, and SHIFT-JIS for Japanese. Because there were different encoding systems, 1 byte in ASCII meant something completely different from 1 byte in SHIFT-JIS. You wouldn't be able to tell which character the bits encoded unless you also knew which encoding was used to encode the text, and in many cases there was no way to tell, because programs just assumed all text would be in ASCII, or all text would be in SHIFT-JIS. This was a mess, so we learned our lesson and created Unicode, which has code points for all languages (and even emoji), and the Unicode Transformation Format (or UTF), which defines how to encode those Unicode code points as binary data. Now we just assume all text is in UTF-8.

The ASCII encoding was a 7-bit encoding. UTF-8 was designed so that when the first bit is 0, the next 7 bits match ASCII, for compatibility with systems older than UTF-8. For reference, the web and e-mail already existed before UTF-8 was standardized, which is why the domain names of websites, URLs, and e-mail addresses only have ASCII characters: no accented letters or Chinese characters allowed. Nowadays, we can have these things (e.g. with Punycode for domain names), but in general it's avoided because some programs may not support them, and you have no way of knowing which, so most developers just stick to ASCII characters, since everyone supports those.

Case and Accent Sensitivity

We say a program is case-sensitive when it cares about whether or not something is spelled in lower-case or upper-case or mixed case. Before computers, this concept would have made no sense, as humans couldn't care less. But for computers, because characters with different casings are encoded differently, this becomes an issue.

Similarly, whether a letter has an accent or not isn't a problem for humans, but it is a problem for computers: not just because it won't be the same binary code, but also because it can take more bytes to store. That's the reason, for example, why you often can't have usernames with accented letters. The engineer who creates the database where the usernames are stored configures it so a username can't have more than 20 bytes, but when you create an account, the form tells you that your username can't have more than 20 characters. This only works if 1 character = 1 byte.

On the other hand, programmers know that people don't understand this, so they often add extra programming to make their systems case-insensitive. For example, example.com and EXAMPLE.COM are the same website, because domain names were designed to be case-insensitive. Similarly, many things that people type, such as usernames and hashtags, are also case-insensitive. In general, what programs do in this case is simply turn the text into lower case or upper case before doing anything with it. For example, if you type #HappyNewYear, the program internally converts that into #happynewyear so that both hashtags share the same posts.
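A sketch of this normalization in JavaScript might look like this (the function name is made up for illustration):

// Normalize a hashtag so #HappyNewYear and #happynewyear match.
function normalizeHashtag(tag) {
    return tag.toLowerCase();
}
console.log(normalizeHashtag("#HappyNewYear")); // "#happynewyear"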

Naturally, these things have limits. For example, it wouldn't account for numbers, word order, or punctuation, so #newhappyyear, #happynewyear2024, or #happynewyear! would all be different things.

In particular, with URLs, while the domain name is case-insensitive, what comes after it (the path) is case-sensitive, e.g. https://www.example.com/some-page is different from https://www.example.com/SOME-PAGE. Whether these two URLs show the same page, or only one works, depends on the website.

Parsing Text Codes

Now that we have some understanding of how computers see text, let's begin understanding the algorithms to parse code.

Binary Data to Text Data to Numeric Data

In virtually every text-parsing algorithm, the computer has to check characters one by one in sequence. It can't even go byte by byte, because a single UTF-8 character may span multiple bytes. For example, imagine we have this binary data:

01101100 11000011 10100001

We assume this is UTF-8 encoded text data, and the first thing we do is start translating the binary code into text characters.

So we take the first byte, 01101100. As mentioned previously, if a byte starts with 110, then the character is longer than 1 byte. This byte starts with 0, so it's a single-byte character. The computer program then looks up which Unicode character this code point represents. It's l.

Then we look at the second byte. It starts with 110, so the character is more than 1 byte long. We take the next byte as well, look up what the two-byte code means in Unicode, and find it's á.

So the data above means lá.

But what does lá mean? Is that even code?

So whenever we have text code, the computer turns binary data into text data, then turns text data into some other data based on what the text code means. Let's see a practical example.

#FF0000

This is a text code called a color code, or color hex code. It always starts with #, so we can ignore the first character, even though that # is the only thing that tells us at first glance that this is a color code specifically and not just a bunch of hexadecimal digits. Let's focus on FF0000. This is a hexadecimal representation of the bytes that compose an RGB color.

Generally, an RGB color is a 3-byte data structure, where each byte contains a numeric value for one of the channels. The first byte has the value for "red," the second byte has the value for "green," and the third byte has the value for "blue." We can store a numeric value from 0 to 255 in one byte. These values represent the intensity with which the lights shine on your screen. When all 3 lights are at 100% intensity (value 255 out of 255), you get white light. When they're all at 0, you get black: the lights are turned off. So the value above is for the red color:

Red      Green    Blue
FF       00       00
255      0        0
11111111 00000000 00000000

So these 3 bytes form an RGB color data structure, which is completely different from the text data we have.

If we look at the text data #FF0000, we have completely different bytes:

#FF0000
└─────┴──┐
00100011 #
01000110 F
01000110 F
00110000 0
00110000 0
00110000 0
00110000 0

In particular, observe that "0" as numeric data for doing math is just 00000000, but "0" as a text character encoded in UTF-8 is 00110000. The way computers do basic math is by treating bits as binary numbers. Since the bits of the text "0" don't match the binary number 0, we can't do math with text directly. We need to convert one data type to the other.

But before that, a note: most programs that accept RGB hex codes are case-insensitive, so you can type #ff0000 instead of #FF0000. Again, the bits for f and F are different. This is why in some programs this doesn't actually work: they accept only upper case or only lower case, but not any case.

If we look back at the RGB data, FF equals 255, which is 11111111. This is one byte. But as text data, FF would be two bytes: 01000110 01000110. So the conversion isn't straightforward. We end up with three steps:

  1. Turn bits into characters.
  2. Parse characters as hexadecimal numbers.
  3. Store numeric values as bytes.

So in this case, the computer takes the first two characters, FF, treats them as a text representation of a hexadecimal number, converts that to actual numeric data whose value is 255, and stores it in one byte. Then it does the same thing for the next 2 characters, 00, and then for the last 2 characters, 00. This algorithm works with any color hex code, so long as it has exactly 6 characters.
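Here's a minimal sketch of those three steps in JavaScript, handling only the 6-character form (the function name is illustrative):

// Parse "#RRGGBB" into three numeric channel values.
function parseHexColor(code) {
    var hex = code.slice(1); // drop the leading "#"
    return {
        red:   parseInt(hex.slice(0, 2), 16), // "FF" -> 255
        green: parseInt(hex.slice(2, 4), 16), // "00" -> 0
        blue:  parseInt(hex.slice(4, 6), 16), // "00" -> 0
    };
}
console.log(parseHexColor("#FF0000")); // { red: 255, green: 0, blue: 0 }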

However, there are some other hex codes you may encounter. For example:

#F00
#FF0000FF
#F00F

The first hex code uses 1 character instead of 2 per color channel. Consequently, it has a total of 3 characters instead of 6.

The second hex code has an extra color channel called alpha, used for transparency. This is an RGBA hex code. Consequently, it has 8 characters in total.

The third hex code has 4 channels and 1 character per channel, so 4 characters in total.

In order to support all 4 of these different types of hex codes, there must be a distinguishing factor the program can use to tell which algorithm it should use, i.e. which character maps to which value in the RGB bytes. This factor is the number of characters each code has. So before parsing a code, the program just checks whether the code has 3, 4, 6, or 8 characters, and then executes the appropriate algorithm.
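One common way to implement this (a sketch, not necessarily how any given program does it) is to first expand the short forms into the long forms by doubling each character, and then parse as before:

// Expand shorthand hex codes: "#F00" -> "#FF0000", "#F00F" -> "#FF0000FF".
function expandHexCode(code) {
    var hex = code.slice(1);
    if (hex.length === 3 || hex.length === 4) {
        hex = hex.split("").map(function (c) { return c + c; }).join("");
    }
    return "#" + hex; // 6- and 8-character codes are already in long form
}
console.log(expandHexCode("#F00"));  // "#FF0000"
console.log(expandHexCode("#F00F")); // "#FF0000FF"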

If you try to type #FF00007F into a program that supports all four types, the program will parse it as different colors as you type the characters. When you type #FF0, it will think you want yellow; #FF00 will make it show a fully transparent yellow; #FF0000 will make it show a fully opaque red; and finally #FF00007F will make it show a semi-transparent red.

Separators

RGB codes are the easiest ones to parse, since an RGB code always has a fixed number of characters. In most cases, a text code will have a varying number of characters, which means we depend on things like separators to figure out where each component of the code starts and ends. For example, observe the two text codes below:

16/05/2024
16/5/2024

These two text codes represent dates. We've used this type of text code since before computers were widespread. Above we have three components, day/month/year, in this order (sorry, Americans). We know that 16/05/2024 is exactly the same date as 16/5/2024. It refers to the same day, of the same month, of the same year. However, these two text codes have different lengths. The computer can't just take the first 2 characters for the day, then skip the third which is a slash, take the 4th and 5th characters for the month, skip the 6th, and then take the rest. It needs to be able to figure out where each component of varying length begins and ends.

It can do this because the components are separated by a specific character: the forward slash (/). So the algorithm the computer program will execute is:

  1. Split the string of characters into 3 substrings according to where a forward slash is found.
  2. Parse each substring into a numeric value.
  3. Combine the 3 numbers into a date structure.

By doing this, we get the components 16, 05, and 2024 for the first code, and 16, 5, and 2024 for the second code.

In order for the program to work, all we need is an algorithm that can convert both the text values "05" (00110000 00110101) and "5" (00110101) into the same numeric value 5 (00000101).
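In JavaScript, the whole algorithm could be sketched like this (using a plain object in place of a real date structure):

// Split "16/05/2024" on "/" and parse each part as a base-10 number.
function parseDate(code) {
    var parts = code.split("/"); // ["16", "05", "2024"]
    return {
        day:   parseInt(parts[0], 10), // "16" -> 16
        month: parseInt(parts[1], 10), // "05" and "5" both -> 5
        year:  parseInt(parts[2], 10), // "2024" -> 2024
    };
}
console.log(parseDate("16/05/2024")); // { day: 16, month: 5, year: 2024 }
console.log(parseDate("16/5/2024"));  // the same result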

This is very important: every time you type a number into a computer program, you are typing text data, and a computer program needs to run in order to convert that text data into numeric data so that math can be done with it.

In particular, if you are working with fractional numbers, e.g. "0.5" (or "0,5" in some countries), things get more complicated, as there are several ways to store such numbers as bits inside the computer, and the way this works is different from how integer numbers work.

Filepaths are a common type of code that uses separators. On Windows, a filepath such as:

C:\folder\subfolder\photo.jpg

Would have C as the drive letter, and we know this is the drive letter because :\ comes after it. Then we have folders, subfolders, and the filename, separated by backward slashes (\). This detail is important.

On Linux, a filepath looks like this:

/home/john/Pictures/photo.jpg

It's almost the same thing, but there is no drive letter. The parts of the filepath are separated by a forward slash (/) instead of a backward one.

Another key difference between these text codes is that on Windows, a filepath is case-insensitive, while on Linux, it's not. This means that you can have two files in the same folder called photo.jpg and PHOTO.JPG on Linux, but on Windows you can't. In fact, previous versions of Windows had a bug where if you renamed a file using File Explorer but only changed the case, e.g. if you renamed photo.jpg to Photo.jpg, the file would be renamed as you expected, but File Explorer would still show the old filename to you, so it would look like nothing happened.

Another separator found in filepath codes is the dot (.) that comes before a file extension. If you have a filename such as photo.jpg, the jpg part, sometimes including the dot, .jpg, is called the file extension. Windows hides this file extension by default, so normally you would only see photo as the filename. The operating system needs this dot in the filename to be able to tell what type of file it is. If jpg comes after the dot, it's a JPG image file. If txt comes after the dot, it's a plain text file. And so on.

URLs are another type of code that uses separators. Normally, you see URLs like this:

https://www.example.com/webpage

The first colon character (:) separates the scheme or protocol of the URL. In this case, it's https, which is a protocol. Some other protocols found in URLs are http and ftp. There's also mailto: (for e-mail addresses), tel: (for telephone numbers), data: (for arbitrary data), irc: (for IRC servers), etc.

The double forward slashes (//) separate the authority, in this case a domain name, www.example.com.

The domain name itself is also a code. You can see it's separated by dots (.). The rightmost component is called the Top Level Domain (TLD), which here is com. The second level is example. And then we have subdomains, like www. So if you think about it, the logical order would be com.example.www, but when they designed this system they made it reversed for some reason.

After that, in HTTPS's case at least, we have a forward slash (/), which is where the path of the URL starts. In this case the path is webpage.

Another component a URL may have is called the "query," which comes after a question mark (?). For example:

https://www.google.com/search?q=an+example&hl=ja

Above, search is the path; then we have a question mark (?), so q=an+example&hl=ja is the query.

This query part, too, is its own code. It encodes a number of key-value pairs. Here, the separator is an ampersand (&). So when we split, we get the key-value pairs q=an+example and hl=ja. Each pair is then split again, this time on the equal sign (=). So for the key "q", we have the value "an+example", and for the key "hl", we have the value "ja".
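A sketch of this two-level splitting in JavaScript (a real program would also decode the + signs and percent-escapes we're about to discuss):

// Split a query string on "&", then split each pair on "=".
function parseQuery(query) {
    var result = {};
    var pairs = query.split("&"); // ["q=an+example", "hl=ja"]
    for (var i = 0; i < pairs.length; i++) {
        var pair = pairs[i].split("="); // e.g. ["q", "an+example"]
        result[pair[0]] = pair[1];
    }
    return result;
}
console.log(parseQuery("q=an+example&hl=ja")); // { q: "an+example", hl: "ja" }

Modern JavaScript also ships with a built-in URLSearchParams API that does this splitting (and decoding) for you, but the manual version shows what's going on underneath.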

But what do these codes mean? What is q? What is hl? What is ja? The meaning of these codes depends on the program running in Google's servers. It has nothing to do with the URL code itself. Different servers have their own codes that they put in their URLs. But in this case, it's at least easy to tell.

The hl key decides the language of the page. In this case, ja stands for the Japanese language. This specific two-letter code comes from a standard called ISO 639, which defines a list of two-letter codes for languages around the world.

The q key decides what is being searched for, in this case, its value is an+example, which looks rather odd. Where did this plus (+) come from?

This plus (+) is code as well. It means a space character ( ). So an+example means an example. But why do we have a single-character code for another single character? What is the point of this? Why not just use a space character?

Because space characters aren't allowed in URLs. URLs simply can't contain a space character at all.

So in order to give Google's program the text data "an example", we need to replace the space with something else in the URL. One way we can replace it is with the plus (+) character.

Escaping Special Characters

You may be wondering what happens if we want to send the plus (+) character itself as text data in the URL, instead of a space. What do we do then? If we send a plus character, the program will think it's a space, so we need to send something else. And if we send something else, wouldn't we have the same problem when we want to send that something else as-is? How do we solve this?

The general problem we face is that some characters have special meaning. These are special characters. They're different from characters that are interpreted as-is, i.e. interpreted literally: literal characters. So example is all literal, and we don't have to worry about it, but 2+2 has a special character in the middle of it, so we need to do something extra.

Of course, what counts as a special character or not varies from program to program. In URLs, things like forward slashes (/) and the question mark (?) are special characters. It's very common for punctuation to become a special character. But there could be some program out there that parses code, but that doesn't treat slashes or question marks specially, so they will be treated as literals. In URLs, though, they're special.

The solution for special characters is called escaping. Because the program treats the special characters specially, escaping lets them escape this special treatment. There are various ways to implement escaping, and we'll see this concept again in this article.

For URLs, though, escaping is done with the percent (%) character.

When the program encounters a percent (%) in a URL, the next 2 characters will be interpreted as a hexadecimal number, and its value in bits will be converted to an ASCII character.

For example, "an%20example" will be converted to "an example", because %20 is converted into the ASCII character whose hexadecimal value is 20.

You may remember this 20 value from the diagram we saw before, with the text data "A simple example".

A     s  i  m  p  l  e     e  x  a  m  p  l  e  .
41 20 73 69 6D 70 6C 65 20 65 78 61 6D 70 6C 65 2E

This means we don't even need to type letters in URL query values: we could just percent-encode everything.

https://www.google.com/search?q=%41%20%73%69%6D%70%6C%65%20%65%78%61%6D%70%6C%65

To answer the question: the plus character (+) encoded in UTF-8 is 00101011. The number 101011 in binary is 43 in decimal, which is 2B in hexadecimal. So to send the plus character, we would percent-encode it as %2B.

https://www.google.com/search?q=how+much+is+2%2B2%3F

The URL above asks Google "how much is 2+2?".

By the way, to encode a percent sign (%) using this percent-encoding, we would use %25, since 25 is the hexadecimal value of the percent character (%) in UTF-8 encoding.
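JavaScript has built-in functions for exactly this: decodeURIComponent and encodeURIComponent. Note that they handle percent-escapes but not the + convention, which is specific to query strings:

console.log(decodeURIComponent("an%20example")); // "an example"
console.log(encodeURIComponent("2+2?"));         // "2%2B2%3F"
console.log(encodeURIComponent("100%"));         // "100%25"

These two functions are why you rarely need to work out percent-encodings by hand.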

Characters Impossible to Type

Now that we know about escaping, it's a good time to introduce you to a cool feature of computer text: characters that are impossible to type.

As we know, each text character equals one byte in ASCII, but some ASCII codes don't actually become a visible character in text!

For example, 00001101 and 00001010 are the codes for Carriage Return (CR) and Line Feed (LF), respectively. On your keyboard, you may have a key that's labelled the "return" key. If you don't, it's probably labelled "enter" instead. The enter or return key inserts a new line in text. What happens in the text data is that Windows inserts these two characters, this CRLF code. Meanwhile Linux only inserts LF.

In programming languages, LF is generally represented by \n, while CR is represented by \r (so CRLF is \r\n). This \ is a special character (typically escaped as \\). For example:

var text_to_display = "First line\nSecond line";

In the JavaScript source code above, the variable called text_to_display is assigned a string value, which is the text "First line", a new line character (\n), and the text "Second line". When this is displayed somewhere, it will probably show as two separate lines. Observe that in the source code, we need to escape new lines, because if we just typed an actual new line in the source code, like this:

var text_to_display = "First line
Second line";

Then the program that parses this source code would think that's an error. For the program parsing the source code, a new line character in the source code is a special character, so we can't just type it anywhere. To have an actual new line character in text data, it has to be escaped.

Another cool character is 00000000, the Null Character, typically represented by the code \0. This has a special meaning in programming, and has been the cause of countless security exploits across entire decades of computer history, even today. The purpose of the null character is to mark where the text ends.

Essentially, any time you have a string of characters in a program, the last character is always \0. For example, if you have the string value "abc", the actual text data in memory would be this:

a        b        c        \0
01100001 01100010 01100011 00000000

The reason this happens is due to how a programming language called C works. In C, character strings don't store how long they are. You know where the string starts in memory, but you have no idea how many bytes long it is. Instead, in C, the string ends when the program encounters a null character.

This means that if the program never encounters a null character, a 00000000 byte, it would theoretically just go through every single byte in memory, which could be gigabytes, billions of bytes, and interpret all of it as if it were text. Of course, this is unlikely to happen, and modern computers don't allow a program to go through the entire memory, potentially accessing data of other programs, but it was theoretically possible in the past.
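We can mimic how a C program reads such a string with a small JavaScript sketch over an array of bytes (illustrative only; real C code reads memory directly):

// Read characters from a byte array until a null byte is found.
function readUntilNull(bytes, start) {
    var text = "";
    // Unlike C, we also stop at the end of the array; C would keep going.
    for (var i = start; i < bytes.length && bytes[i] !== 0; i++) {
        text += String.fromCharCode(bytes[i]);
    }
    return text;
}
var memory = new Uint8Array([97, 98, 99, 0]); // "abc" plus \0
console.log(readUntilNull(memory, 0)); // "abc"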

The real issue here is that in order for a program to store text data in memory, it must first reserve a specific number of bytes to put the data in. For example, since abc has 3 characters plus null, we would need 4 bytes to store it in memory, so we would need to reserve 4 bytes. But in some cases, a program doesn't know how many bytes it's going to need before it finishes processing all the data. For example, if every time you typed a letter, the program added that letter to the memory, then the more letters you typed, the more bytes it would need to have reserved.

The obvious solution to this is to just reserve more bytes the more data you need. But when the computer does this, it doesn't simply extend the reserved space it gave to a program; instead, it just finds another space for the program to put its bytes, and the program will need to copy its data from one space to the other. To give an extreme example, imagine you had 4 gigabytes of RAM and only 1 program running. This program is using 2 gigabytes for one thing, and now it needs 2.1 gigabytes. The operating system can't just increase the reserved space by 0.1 gigabytes. It needs to find 2.1 gigabytes free somewhere else in memory, and this space needs to be contiguous. In this simple scenario, it wouldn't be possible, because 2 + 2.1 is over 4.

Another problem is that you would need to keep copying data over and over every time the amount of data you need increases. The more data you have, the more data you would have to copy, and this could become seriously slow.

So what do programs do instead? When the amount of data they have may change, they just reserve some extra space, just to be safe. For example, the program starts, and assumes it won't need more than 8 bytes of data, so it reserves 8 bytes of space in memory.

This space in memory is uninitialized. That is, the byte values currently there are random leftovers. If you thought they would all be 00000000, you guessed wrong. That's because it costs processing power to change the value of a byte to 0's, so when a program doesn't need a portion of the memory anymore and frees its reservation, the computer just forgets about it instead of changing the bytes to zero. As a program keeps running, it will reserve spaces in memory that it had used before, and because the data is uninitialized, the old data will still be there.

For example, let's say the first time the program runs, we have this data in our 8 bytes:

Turtles
└─────┴──┐
01010100 T
01110101 u
01110010 r
01110100 t
01101100 l
01100101 e
01110011 s
00000000 \0

The second time, the program happens to reserve the same space in memory, but instead of "Turtles" our data is just "Z".

Z
└────────┐
01011010 Z
00000000 \0
01110010 r
01110100 t
01101100 l
01100101 e
01110011 s
00000000 \0

This kind of thing is why it's called overwriting. We can't erase or create bytes out of nothing. There's a fixed amount of bytes available in memory, and in our hard disks and SSDs, and the only thing we can do is overwrite old bytes with new bytes.

Before, we had the 8 bytes we needed for "Turtles". We overwrote that with 2 bytes for "Z". This means we didn't touch the other 6 bytes that "Turtles" used before, so the text data rtles is still in memory, but it's never going to be used by the computer program, because a null character comes before it, so the program stops processing the text data before it reaches the old data.
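We can simulate the whole scenario with a sketch (hypothetical buffer contents, just for illustration):

// An 8-byte buffer, first holding "Turtles\0", then overwritten with "Z\0".
var buffer = new Uint8Array(8);
var first = "Turtles";
for (var i = 0; i < first.length; i++) buffer[i] = first.charCodeAt(i);
buffer[7] = 0; // the null terminator

buffer[0] = "Z".charCodeAt(0); // second run: overwrite only 2 bytes
buffer[1] = 0;

console.log(buffer); // Uint8Array [ 90, 0, 114, 116, 108, 101, 115, 0 ]
// "rtles" is still in there, but a reader stops at the null after "Z".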

However, there's a bug, an error in the programming, that can occur and reveal data that should be inaccessible.

If we can trick the program into trying to put 9 bytes of text data into a space that is only 8 bytes, it might forget to include the null character at the end, as it would be the 9th byte and wouldn't fit in the 8 bytes of space. Then, the next time the program needs to display this text data, it wouldn't stop at the end of those 8 bytes, because there is no null character, and it would just keep displaying whatever is in memory until it finds a 00000000 byte.

Another case is the Heartbleed bug, which affected OpenSSL's heartbeat extension a decade ago. A computer program could send the server a piece of text along with its length, e.g. "bird", which is 5 bytes long (4 characters plus the null character). The server had to send back the same data. However, the request was able to specify a length larger than the data it was sending. So if the request said "bird" was 500 bytes long, the server would send back 500 bytes, which could include sensitive data it wasn't supposed to send.

By the way, your files work the same way as I explained above. When you delete a file, even "permanently" from the trash, what normally happens is that the operating system just forgets that the file existed, allowing the space it occupied to be used by another file. Its bits could still be there, unmodified, and, in some cases, it's even possible to recover a permanently deleted file by searching for byte patterns that match the way common file types start.

What happens if the text data is just 1 byte long, and that single byte is just the null character? That's called an empty string, or a zero-length string. In other words, it's technically text data, but there are zero characters in it.

Another character worth noting is \t. This is the code for the tab character. Some file formats, like TSV (tab-separated values, a close relative of CSV), use \t to separate cells of data, which means it's possible to write a simple regex to get rid of unwanted columns if you use \t as part of the search pattern.

Delimiters

Back to how text codes work, the last feature of text codes we'll look at is delimiters. An example of a delimiter is this:

That's so funny :LUL:

On some platforms, like Twitch and Discord, text typed between two colon characters (:) is turned into an emoji (or emote, or emoticon, or whatever you call it).

Before displaying the text, the computer searches the text for a colon character (:). If it finds one, it searches for the next colon character. If it can find 2 colon characters, it assumes the text between these two delimiters is the name of an emoji, which, too, is a kind of text code.

In this example, :LUL:, the name of the emoji is LUL, while the colons are only there so the computer knows where the name starts and ends.

Let's see another example:

My opinion: LUL is an emoji. Why: emote sounds weird.

According to our algorithm, we have an emoji above. Its name is " LUL is an emoji. Why". After all, emoji names start at one : and end at another :, so, naturally, " LUL is an emoji. Why" (with a space character at the start, too) has to be the name of an emoji, right?

As you may imagine, in this case the parsing algorithm needs to be more complicated, which means there are fewer things that count as a valid emoji name. More specifically, emoji names may not contain spaces. If the computer encounters a space character (or another invalid character) before it encounters the closing delimiter (:), it ignores the characters up until the invalid character it encountered.
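A sketch of such a stricter parser, using a regular expression where \w matches only letters, digits, and underscores (the bracketed replacement is just for illustration; a real site would insert an image):

// Replace :name: with a placeholder only when the name is valid.
function replaceEmotes(text) {
    return text.replace(/:(\w+):/g, function (match, name) {
        return "[emoji: " + name + "]";
    });
}
console.log(replaceEmotes("That's so funny :LUL:"));
// "That's so funny [emoji: LUL]"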

Let's see an example:

My opinion: :LUL: is an emoji. Why: emote sounds weird.

Above, we have 4 colons (:), which means if they were separators, we would split the text into five components:

  1. My opinion
  2. (a single space)
  3. LUL
  4. is an emoji. Why
  5. emote sounds weird.

However, we know the text from the start to the first colon (:) doesn't count, since we need a colon on both sides, so the 1st item above doesn't count. Similarly, the text from the last colon to the end of the text also doesn't count, so the 5th item doesn't count. Items 2 and 4 contain spaces, so they're invalid. Which means only item 3 is a valid emoji code.

Let's see another example, this one for text formatting:

*This text is italic* while **this is bold**.

This kind of code, found on Discord and several other social media platforms, is called Markdown, and is based on how people used to write in e-mails. In this case, the asterisk (*) acts as a delimiter, and spaces count as valid characters. The computer can distinguish between italic and bold formatting by the number of consecutive asterisks: one asterisk is italic, two asterisks is bold. In some implementations you can even have 3 asterisks for both italic AND bold.
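A very naive sketch of this in JavaScript, ignoring escaping and the many edge cases real Markdown parsers handle:

// Convert **bold** and *italic* markers into HTML tags.
function renderInline(text) {
    // Bold must be replaced first, or "**" would be read as two italics.
    return text
        .replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>")
        .replace(/\*(.+?)\*/g, "<em>$1</em>");
}
console.log(renderInline("*This text is italic* while **this is bold**."));
// "<em>This text is italic</em> while <strong>this is bold</strong>."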

Because any pair of asterisks triggers formatting, this causes a problem compared to the emoji name codes we saw before. It's not possible to type a pair of asterisks in a text without the text between them turning italic. The solution is something we saw before: escaping.

Markdown parsers generally allow escaping of the asterisk through a backward slash. For example:

Type \*italic\* to make text *italic*!

The text code above would be rendered as: Type *italic* to make text italic!

A confusing feature found in Markdown is that if you type an octothorpe (#) at the start of a line, it converts that into a heading, which makes the text bigger.

# BIG TEXT

This is confusing because it only happens if the octothorpe is the first thing in a line, which means you could have used a program that parses Markdown and written octothorpes in it without ever encountering this strange feature, until you accidentally type one at the very beginning of a line. Once again, we should be able to escape it using \#.

Another feature found in Markdown parsers that comes from e-mail formatting is the use of a greater-than sign (>) to quote someone. This one, too, is only valid when it's at the start of a line. Multiple > in sequence mean nested quotes. For example:

>>I think emoji is the right name.
>You're wrong, emote is the right name.
You two are wrong, emoticon is the right name.

One quirk of Markdown that you may encounter from time to time is that it ignores single new lines by default. This means if you type:

First line.
Second line.

The parser will turn this into: First line. Second line.

It will convert the new line character (\n) into a space character ( ).

If there is an empty line between two texts, Markdown interprets that as a paragraph separator.

First paragraph.

Second paragraph.

In some parsers, it is possible to render two lines in the same paragraph: all you need to do is type two spaces at the end of the first line.

First line.  
Second line.

In other words, "  \n" (with the two spaces) counts as a line break, but just "\n" does not. You can't tell the difference unless you select the text.

Next we have hashtags, which we can find in various social media websites. An example would look like this:

This year will be awesome! #happynewyear

In this case, a computer program will turn the text #happynewyear into a link to a webpage that shows other posts with that same hashtag. This computer program is running inside Instagram's servers, or Twitter's servers, or TikTok's servers, or wherever you are posting. It doesn't run on your computer.

When your web browser accesses a webpage, what it does is download a text file containing HTML code. Using this HTML code, the web browser is able to figure out how to render the webpage on your screen. Part of this HTML code tells the browser how to render links that take you to other webpages when you click on them. This means that when a computer program turns a hashtag into a link, what it's doing is turning some sort of code that allows hashtags, which the web browser doesn't understand, into HTML code that contains links, which the web browser does understand.

The HTML code for the above would look like this:

This year will be awesome! <a href="https://www.example.com/hashtag/happynewyear">#happynewyear</a>

We can see above that the text #happynewyear is inside the HTML opening tag <a> and closing tag </a>. In the opening tag, we also have href="..." containing the URL for the page that the web browser opens when you click the link.
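A sketch of how a server might do this conversion (the URL pattern is modeled on the example above; real sites differ):

// Turn #hashtags into HTML links, lowercasing the tag for the URL.
function linkifyHashtags(text) {
    return text.replace(/#(\w+)/g, function (match, name) {
        var tag = name.toLowerCase();
        return '<a href="https://www.example.com/hashtag/' + tag + '">' + match + '</a>';
    });
}
console.log(linkifyHashtags("This year will be awesome! #happynewyear"));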

Let's see another example.

This #year will be #awesome! #happynewyear

Above, we should have 3 hashtags: #year, #awesome!, and #happynewyear. Which brings us to a question.

What delimits a hashtag?

How does the computer know where a hashtag starts and where a hashtag ends?

We know that hashtags start at an octothorpe (#), so this is the starting delimiter, but what delimits the end of the hashtag? We don't type hashtags with two octothorpes. We don't type #happynewyear#. So it has to be something else.

Generally, the hashtag ends when the computer encounters the first space character, which is why in This #year will, the hashtag is #year. But this brings us a new question.

Why is the hashtag #year, and not #year plus the space at the end? Or, conversely, why does the hashtag include the octothorpe? Shouldn't it be just year, without either delimiter?

Most likely, the way Instagram's and others' programs handle it is really to just ignore the octothorpe. You can see in the URLs for hashtag pages that there is no # in them.

https://www.instagram.com/explore/tags/happynewyear/

More technically, there couldn't be a # in the URL even if they wanted to add one, because # is a special character in URL code: it comes before the fragment of the URL. For webpages, this is generally a code that specifies a part of the webpage, such as a heading, so the web browser automatically scrolls to that part when you access it.

Another question: is the hashtag #awesome! or #awesome? Does punctuation count as a valid character in a hashtag name? It could be that dots (.) are valid, but exclamation points (!) and question marks (?) and colons (:) are invalid. All of this can vary from program to program, from website to website.

The same principles employed in parsing hashtags are employed in parsing username mentions, although these are normally done with the @ character instead. For example, on Discord, @everyone sends a message addressed to literally everyone that joined a server, but it could be someone's username, like @john. On Reddit, /u/john is a mention of a username and links to the user's profile page, while /r/movies is a way to link to a subreddit.

A final important type of linking feature exists: creating links out of arbitrary URLs.

On most social media websites, when you type or paste a URL, any URL, that is converted into a link. How does that work?

Well, for starters, it's not really any URL, only URLs that start with https:// and http://. It's possible that it also works with domain names that fit some common patterns, like www.example.com and example.com. Nowadays, there are countless TLDs, like .xyz, .blog, .app, etc. Some of these are known to be used mainly by hackers to create fake and dangerous websites, so it would be a bad idea to just turn everything that looks like a domain name into a link. In particular, someone had the terrible idea of turning .zip into a TLD, which means typing photos.zip in an e-mail, for example, could be converted into a link to a shady and potentially dangerous website. With these things in consideration, while parsers could turn anything that looks like a URL into a link, it's probably a bad idea to do so, so they tend to be more conservative in the input they accept.

As mentioned previously, a URL doesn't contain spaces, which means social media websites can safely use spaces as delimiters to figure out where a URL starts and ends. However, URLs may contain punctuation, and that's where a problem starts. For example, consider the following message:

I think this website is cool: https://www.example.com/.

Observe above that the text ends in a dot (.). Would the program consider this dot part of the URL or not? If the link goes to https://www.example.com/, then that's likely a valid website, as it's the root URL of that website. However, if the dot (.) is included, then the URL becomes https://www.example.com/., which is probably not the URL of a real webpage. You'll probably get a 404 error in this case, no matter which URL you typed before the dot (.).

On the other hand, there are many URLs that do end in a dot-something, such as index.html, index.php, or even image.jpg, because the dot (.) is used to mark where the file extension starts in a filename.

The only way to solve this for sure is to type a space before the dot.

I think this website is cool: https://www.example.com/ .

Now the computer won't include the dot inside the URL link.

A similar problem occurs with the question mark (?). As we've seen before, the question mark is a special character in URLs, which means many URLs have a question mark in them. However, consider the following message:

Have you seen this website: https://www.example.com/?

In this case, the question mark shouldn't be part of the URL, but instead part of the normal text. Is the program smart enough to figure this out on its own? Generally, it's a better idea to surround URLs with spaces, just in case:

Have you seen this website: https://www.example.com/ ?

This does look weird, but at least we don't risk writing a URL that doesn't work.
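A sketch of a conservative linkifier that treats one run of trailing punctuation as prose rather than as part of the URL (a heuristic, for illustration only):

// Turn http(s) URLs into links, leaving trailing ".", "?", "!" as text.
function linkifyUrls(text) {
    return text.replace(/https?:\/\/[^\s]+/g, function (url) {
        var trimmed = url.replace(/[.?!]+$/, "");
        var rest = url.slice(trimmed.length);
        return '<a href="' + trimmed + '">' + trimmed + '</a>' + rest;
    });
}
console.log(linkifyUrls("I think this website is cool: https://www.example.com/."));
// the link points to https://www.example.com/ and the final "." stays as text

In the end, handling edge cases like this is just more text-code parsing: delimiters, special characters, and escaping, all the way down.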
