What is SEO?
SEO stands for Search Engine Optimization: optimizing websites and webpages to rank high on Google (mainly) and on other, less-used search engines. The term covers two very different things: the basic SEO a webpage needs just to show up at all, and the aggressive "SEO spam" that crafts webpages to game Google's algorithms and rank higher in the results.
Basic SEO
In order for Google (and other search engines) to show a webpage in its results, it must first crawl the webpage with its web spider bot. The bot is not a human, it's a computer program, so it does everything automatically and requires things to make sense technically in order to work properly.
In a happy scenario:
- The bot finds a URL for a webpage somehow.
- Using that URL, the bot connects to the website's server via the HTTP protocol.
- The web server returns an HTTP response with a 200 OK status code header and the HTML code of the webpage in the body of the response.
- The bot analyses the HTML code, categorizes it somehow, and adds it to its index.
- The bot finds, in the links in the HTML code, URLs to other webpages.
- Go to step 2.
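To make the second and third steps concrete, here's a minimal sketch of the kind of response the bot hopes to get back (the headers are trimmed and the page is made up for the example):

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
<html lang="en-US">
<title>My Blog's Homepage</title>
<p>Welcome to my blog!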
The first problem that can occur is if the HTTP server returns a response other than 200 OK, such as 404 NOT FOUND or 403 FORBIDDEN or any of the 4XX series. What we know as a "404 not found page" is actually the body of the HTTP response. The bot doesn't care about that. It cares about the hidden HTTP header. So it's possible to have a page that looks normal to humans but is a 404 page for a bot, or looks like a 404 page to a human but is a 200 page for a bot. The bot naturally only indexes 200 OK pages and will ignore URLs that are not found, in some cases de-indexing the URL if it was indexed previously.
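For example, here's a sketch of the second case, sometimes called a "soft 404": the visible body tells a human the page is missing, but the status line tells the bot everything is fine.

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
<html lang="en-US">
<title>Page not found</title>
<h1>Sorry, we couldn't find that page.</h1>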
The second problem that can occur is if the HTML code is lacking critical metadata, such as the title of the page, or is malformed somehow. Nowadays most websites are made with a CMS (Content Management System), such as WordPress, which generates the HTML code automatically, so this kind of mistake is unlikely, but if you're writing the HTML yourself in PHP (or you hired someone to do it), this sort of mistake can happen. For reference, a valid HTML page would look like this:
<!doctype html>
<html lang="en-US">
<meta charset="utf-8">
<title>My Blog's Homepage</title>
<h1>Welcome!</h1>
<p>I hope you like my blog!
You can use the W3C's validator (https://validator.w3.org/) to make sure the code is valid.
The code above contains 5 important things.
- The !doctype declaration, which means the HTML page is at least written by someone who knows HTML.
- The language of the webpage in the root element.
- The character set of the text that makes up the HTML code.
- The <title> tag, which is what appears on tabs in your web browser, and is also the title that appears on search engine results pages (SERPs).
- The <h1>, a heading that is part of the body of the HTML document and is displayed in the browser by default.
HTML has six heading levels (<h1>, <h2>, <h3>, <h4>, <h5>, <h6>). Search engines consider terms written in the title and headings specially when indexing a webpage. Naturally, having the word "Google" in a heading means the webpage is more about Google than a webpage that only mentions Google in passing in a paragraph.
It's possible to style any text in HTML to look like anything, which causes problems. By default, the heading levels look like large text, so a lot of people use tools that define headings in HTML as if they were just tools to make text larger. This means you could have a webpage whose headings are things like "Check this out!!!", which have zero keywords in them for search engines, and which would also be completely useless in a table of contents (which is used by assistive technologies, for example).
On the other hand, if the HTML tag used is a <p> (for paragraph) but it's styled large, it looks important to humans, but to the bot it will look like just any other paragraph of text.
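A contrived sketch of both mistakes (the heading text, paragraph text, and styles are all made up for the example):

<!-- A heading used only to make the text big: no keywords for the index, noise in the table of contents. -->
<h2>Check this out!!!</h2>

<!-- A paragraph styled to look like a heading: big for humans, just another paragraph for the bot. -->
<p style="font-size: 2em; font-weight: bold;">How to grow tomatoes at home</p>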
It's worth noting that this is the general idea. In practice, Google may be advanced enough by now to look at how the webpage actually renders and decide what is a heading and what isn't. But other search engines may not have the same technology.
Besides these, there are also many simple techniques, such as:
- Including a <meta name="description"> tag to describe the webpage. This isn't visible inside the web browser, but it may show up in search results under the title, and it's also used by social media when a link is shared.
- Including a <link rel="canonical"> in webpages to tell the search engine the correct URL for the webpage if it accesses it from another URL (this can happen with any webpage, since you can just add ?foo=bar to any URL and you'll see exactly the same page in most cases).
- Marking up sections of your webpage with tags that label those sections, such as <section>, <header>, and <footer>. It's worth noting that several of these tags, called "semantic" tags, have misleading names, e.g. <article>, which isn't only for articles, <aside>, which has nothing to do with sidebars, and <nav>, which should be used only for main navigation and not just any random collection of links.
- Adding schema.org markup to your webpages to gain access to special features provided mainly by Google.
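Put together, the head of a hypothetical blog post using these techniques might look something like this (the URL, description, and schema.org type are made up for the sketch):

<head>
<meta charset="utf-8">
<title>How to Grow Tomatoes</title>
<meta name="description" content="A beginner's guide to growing tomatoes at home.">
<link rel="canonical" href="https://example.com/grow-tomatoes">
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Grow Tomatoes"
}
</script>
</head>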
Abusive SEO Techniques
As Google uses some factors to decide which pages appear first, it's possible to game the algorithm to gain higher and higher visibility. This has been going on for as long as Google has existed. I'll give some examples here.
It's worth noting that among SEO experts, using abusive SEO techniques is called "black hat SEO," while doing things honestly is called "white hat SEO." These terms, black hat and white hat, are also found in other contexts in hacker culture. White hats always do things legally and morally, black hats do things that are either immoral or illegal, while gray hats may do something that is illegal or immoral for the greater good (e.g. a white hat will ask for your permission to try to hack you to check if you're secure, which is called pentesting (penetration testing), while a gray hat may try to hack you without permission, then warn you that you're vulnerable if successful).
Keyword Stuffing
In the past it was recommended to use <meta name="keywords"> to set keywords for your webpages so that search engines would be able to better understand how to categorize them. This went as badly as one would expect. Abusive webmasters started to "stuff keywords," just adding keywords for everything, even if they weren't relevant to the webpage, to the point that search engines had to start completely ignoring this metadata, as more people were using it abusively than honestly.
Similar things can happen in the content of the page. As mentioned previously, search engines treat the <title> tag and headings specially. Words contained in these tags have a stronger influence on the page's ranking. What's to stop a webmaster from just shoving as many keywords into those tags as possible?
Some particularly abusive webmasters would add keywords to a webpage in the text, but make the text invisible or the same color as the background. A search engine without access to the style information would index the page based on what words it sees in the HTML code, not knowing that those words would be invisible for human users when they access the page. This keyword stuffing was all for search engines.
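A sketch of what that kind of stuffing looked like (the keywords are invented for the example):

<meta name="keywords" content="cheap flights, cheap hotels, best recipes, celebrity news, seo tips">

<!-- White text on a white background: invisible to humans, plain keywords to a naive crawler. -->
<p style="color: #ffffff; background-color: #ffffff;">
cheap flights cheap hotels best recipes celebrity news seo tips
</p>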
Clearly the solution is to have some algorithm that figures out what counts as keyword stuffing and what does not.
What this means is that there will be a "degree" of keyword stuffing that is considered acceptable, and anything beyond that threshold impacts ranking negatively.
For example, a single webpage may feature three headings that say things like this:
- What is X?
- What is Y in X?
- How to do Z in X?
When the text would flow a lot more naturally if they were written like this:
- X
- Y
- Doing Z.
The reason why the headings are written like search queries is exactly because they're trying to appear for those search queries. A webpage that wants to appear as a result for as many search queries as possible ends up writing several search queries into its text.
This is the "SEO writing" that you can notice across so many websites you find on Google because their SEO, no matter how awkward it sounds, actually worked.
Don't blame the player, blame the game.
You may be asking yourself: "but couldn't Google just..." No. It couldn't.
Google isn't actually that smart. Google looks smart the same way someone who never says anything sounds smart. Google is a black box. In order to fight black hat SEO, Google can't tell anyone how its algorithms work (they're so complex at this point that their own engineers probably don't understand them fully). You can't see what it's doing, so when it blocks spam, it looks smart. But you especially can't see what Google is NOT doing.
Can Google find a webpage that just says "X" instead of "What is X"? Maybe. Maybe not. We don't know. They don't tell us. Maybe they don't even know themselves. Even if they could, would this work for other types of webpages? Other contexts? Other meanings? We don't know to what degree Google is intelligent. It's safer to assume it's really not that smart at all.
Websites that earn their money from ads shown to visitors who primarily come from Google need to rank high on Google. If users subscribed to their e-mail newsletter (they don't), donated to the website (they don't), or subscribed to their premium plan to access paywalled articles (they don't), this wouldn't be necessary, but users just want free content, and so what we currently have is ads and SEO, with the SEO engineered for Google, and the ads probably also coming from Google AdSense.
Google is the secret mastermind behind how webpages are written today!
Also note that writing clickbait titles, like "Top 10 Tips about X (Updated for 2099)" isn't SEO spam. It's just good old marketing. If you take a look at how magazines are designed, they're full of eye-catching keywords as well. That's just how you design, or in this case write, things to catch people's attention.
Spam Bot Link Farming
As webmasters can just stuff keywords, search engines can't rely on keywords alone to determine relevancy, so they seek other ways. One method is called "link juice." In essence, a webpage's value is determined by how many links point toward it on the Internet.
In principle, this makes sense, or at least it used to. A well-written, useful article will be linked by many websites, which means the most useful websites will have the highest number of links. Unfortunately, various factors ruined this whole idea.
First, consider that if we just need links, all we need to do is build a simple spam bot to spam comments all over the Internet with a link to our site. This is in fact what happens all the time today with WordPress websites. If a website is made with WordPress, all you need to do is craft an HTTP POST request to /wp-comments-post.php and you can post a comment. A few lines of Python can be a spam bot that will ruin a person's WordPress blog, and something more complex will ruin everyone's WordPress experience. There are so many WordPress spam bots nowadays that I can only imagine some developer is making the spam bots to get more people to buy their premium anti-spam WordPress plugin.
And this has nothing to do with SEO. Just spamming links everywhere is enough to get people to visit your website.
If you ever took a basic course on marketing, you may have heard of the term "foot traffic." The idea is that the physical placement of your store has direct consequences on sales. The store building itself is a vehicle to advertise the store to people who walk by. Having your store somewhere where lots of potential buyers pass by is enough to guarantee sales even if you don't do any other sort of advertising.
On the web, we don't walk on foot, we surf. Placing a link on a webpage with millions of eyeballs on it will get you some traffic even if no search engine will touch it. Doing it in an automated manner with spam bots is always profitable to spammers because the spamming cost is negligible and they can abuse countless vulnerable websites.
Needless to say, this link spam is one reason why link juice doesn't work as one would hope.
Another big problem is malicious link farming with non-spam methods. Again, the most vulnerable to this are unsecured WordPress websites. For example, a rogue WordPress plugin can change the content of articles to insert spam links into the webpages. In some cases, a rogue plugin will redirect from an article page to another website (redirects are 3XX HTTP status codes), which is understood by web crawlers as a webpage having moved from one URL to another.
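As a sketch, a hijacked article might answer with something like this instead of its own content (the destination is made up), and the crawler reads it as "this page has permanently moved there":

HTTP/1.1 301 Moved Permanently
Location: https://spam.example/casino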
There is even accidental link farming or spamming.
For example, if you have a signature in a forum, and that signature has a link to your website, then every post you make is a link to your website, and that may or may not be considered spam by Google (who knows!). Normally you wouldn't even think about that when adding a link to your signature.
In one case, there was a web developer who made a widget that you could place in your blog's articles. When placed, that widget included a link to the developer's website. That widget was extremely popular and the developer gained countless links in a short amount of time, catapulting him to the top of the ranks in the search pages, before Google took action about it.
I've personally farmed some links by accident. Oops. I made an article that linked to various articles of a WordPress website. That website had a feature called "pingback" enabled. The way it works is that when you link to one of its articles, your page shows up as a link under a "pingback" section on that article. I accidentally made them give me 100 links like this.
When Google decides a website is bad for some reason, it may start penalizing webpages that link to that website. This means that if spam bots fill an article's comments with links to bad websites, Google penalizes the article itself.
In order to make comments work on articles, a compromise was made, broken, and replaced.
First, you could mark a link as rel="nofollow", which would tell a web spider not to follow the link. CMS's could do this automatically for user comments. This kind of user-contributed content is normally called User Generated Content (UGC). UGC is third party, not editorialized, and untrustworthy, so this sort of measure has to be taken in order to allow UGC to coexist with the website's first-party, editorialized content.
As fewer and fewer people had websites of their own and more and more used social media and comment sections, Google had a new problem: too few people were linking. The whole link scheme assumes people are going to link to things. If they don't, it falls apart. If social media websites mark all of their users' links as nofollow, then the bot can't follow any links posted on social media.
So Google decided to start following nofollow links anyway. It also introduced rel="ugc" for user-generated links.
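As a sketch, this is roughly what a link inside a user comment ends up looking like in the HTML (the URL is made up); the two rel values can also be combined:

<!-- The classic hint: "don't follow this, it came from a user." -->
<a href="https://example.com/my-site" rel="nofollow">check out my site</a>

<!-- The newer hint for user-generated content. -->
<a href="https://example.com/my-site" rel="ugc nofollow">check out my site</a>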
In any case the damage was done. You'll notice that many websites nowadays just avoid linking to other websites completely, despite links being the foundation upon which the "web" is built. Because their SEO experts are afraid of sharing "link juice" with other webpages, or getting penalized by linking to a website that Google considers bad.
HTML5
An interesting victim of the SEO-engineered world was HTML5, which to be fair made some pretty terrible naming decisions.
HTML5 decided, out of nowhere, that the <i> and <b> tags for italic and bold text weren't semantic enough and pushed <em> and <strong> for "emphasis" and "strong emphasis" (no joke) instead, which were conveniently styled as italic and bold. (Strictly speaking, <em> and <strong> predate HTML5; what HTML5 did was double down on the "use the semantic tags" advice and redefine <b> and <i> to mean something else.)
Many so-called SEO experts immediately assumed all you had to do was put <strong> on your webpages for a free first place on Google, so that's what they did. Overnight, the B and I buttons on several WYSIWYG editors, like WordPress, CKEditor, etc., started inserting <strong> and <em> tags instead of <b> and <i>, making them, de facto, the same thing as the bold and italic tags, no matter what the W3C, WHATWG, or MDN try to tell you.
I'll never get over the fact that MDN calls it "<b>: The Bring Attention To element" now. It's okay to admit mistakes were made. Let's go back to when things made sense, okay?
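A small sketch of the gap between what the spec says and what editors actually do (the sentence is made up):

<!-- Per spec, these are supposed to mean different things: -->
<p><b>Warning:</b> wet paint.</p>          <!-- "bring attention to" -->
<p><strong>Warning:</strong> wet paint.</p> <!-- "strong importance" -->

<!-- In practice, clicking the B button in most WYSIWYG editors just emits <strong>,
     so the two end up being written and rendered as interchangeable bold text. -->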
Similarly, <figure> was originally for marking up figures in the sense of "see Figure 1." The idea is that you can take the figure and place it anywhere on the webpage and it should make sense, because it's labelled with a <figcaption> that includes its identifier. In practice, WordPress and other CMS's just wrap every single image or video in a <figure> element, and everyone who writes articles refers to them as "see image below," making moving them away from where they were placed impossible in practice.
Any expectation of using these elements the way they are in the spec goes out of the window immediately when you tell people that it's "good for SEO" to use <figure>. Then everyone is going to use it everywhere and none of them will bother to read what it's actually for, especially when it's named that way.
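For contrast, a sketch of the position-independent usage the spec had in mind (the file name and caption are made up):

<figure id="figure-1">
<img src="crawler-flow.png" alt="Flowchart of a crawler fetching, parsing, and indexing pages">
<figcaption>Figure 1: How a crawler turns URLs into index entries.</figcaption>
</figure>

<p>As shown in Figure 1, the whole process is automatic.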
These low-level technical decisions aren't explained at all to the writers who just want to add an image to their articles, so naturally the blame is entirely upon the developers who wrote the tools, and the committee who specified code for billions to use without thinking ahead enough about how most will actually use it.
The lessons that Google was forced to learn over the years seem ignored by the committee that designed the HTML language. This is important to note because there are many other technologies nowadays, startups, that are designing things now without thinking about the potential for abuse in the future.
Also observe that, despite HTML being over 30 years old by now, this disaster of a language still lacks proper markup to just caption a rectangle on the screen, which is part of the reason for <figure>'s and <figcaption>'s popularity. For example, imagine that you had two panels that said "danger" in a webpage. You can't really use <h2> for "danger," because then you would have two different headings with the exact same text. You can't use <figcaption>, because per spec (not that anybody cares) it's position-independent. <summary> is interactive, so it doesn't make sense here. <legend> is exactly what we want, but for some reason it's only available for forms. <caption> is also exactly what we want, but it's only for tables. There is a <label>, ironically, but that's only available for form fields, and it's also interactive. Why does HTML have six different ways to label things, none of which is actually just a label?
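A sketch of where each of those labelling elements is actually allowed to live, per the spec:

<!-- <legend> can only label a <fieldset>. -->
<fieldset>
<legend>Danger</legend>
<p>High voltage inside.</p>
</fieldset>

<!-- <caption> can only label a <table>. -->
<table>
<caption>Danger</caption>
<tr><td>High voltage inside.</td></tr>
</table>

<!-- <label> can only label a form field, and activating it focuses the field. -->
<label for="danger-level">Danger</label>
<input id="danger-level" type="checkbox">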