What is Link Rot?
Link rot refers to links that point to websites, webpages, or content that is no longer accessible. The idea is that when an author first creates a link, they're linking to the webpage as it exists at the time of writing. Due to the nature of the web, the content at the linked URL may change or even become unavailable after it was linked. When this happens, the link stops working: it has become rotten.
How to Deal with Link Rot
There are several techniques available for both authors and readers to deal with link rot.
Archives: if you encounter a link that is rotten, it may still be possible to access the content via the Internet Archive's Wayback Machine. It's the largest archive of its kind, archiving documents from the entire web, all the way back to 1996.
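For example, prefixing a dead URL with a Wayback timestamp path takes you to the capture closest to that date (example.com stands in for the rotten link):

    https://web.archive.org/web/20240101000000/https://example.com/dead-page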
Similar archives may exist that focus on specific websites. For example, a Wikipedia article may be modified at any time, but you can view old revisions of an article through its history feature, which serves as an archive of all content ever published on Wikipedia.
Quoting: for authors, it's a good idea to always include the relevant passages of the linked work in your document. Although there's a risk of this being considered infringement, quoting people is typically protected under fair use clauses or similar laws around the world. So long as you only quote the relevant passages, instead of, for example, copy-pasting an entire copyrighted article and posting it on Reddit, it should be okay.
Quoting provides redundancy in case the linked document disappears from the Internet, and it also helps identify whether a document changed after your article was written. For example, if you quote "John said this was unusual," and 5 years later that quote is nowhere to be found in the linked document, that's a sign that either you quoted it wrong or the quoted document changed. It's a good idea to have proper quoting discipline and mark literal quotes in a clear way, so that you can rely on the quote actually coming from the cited document in the future.
Timestamps: a practice that is required if you're citing links in some academic contexts (e.g. essays, dissertations, etc.) is to include the date when you accessed the URL that you are citing in your references. This may sound unnecessary since you should also include authorship information such as when the webpage was published (assuming this can be found on the webpage).
In many cases, a webpage or post on the Internet makes clear when it was first published, but doesn't make clear that it was modified afterwards, when it was modified, or what was modified. A URL should be seen as a live document that may change or become unavailable at any instant.
For instance, I forget which it was, but one time I found an article on a news website that had been acquired by another company, so its URLs now redirected to the parent company's website. The parent company still hosted the news articles, and you could read them if you typed the URL yourself, but the redirect always pointed to a page that said "X was acquired by Y." This is a terrible practice, and it's not the way HTTP redirects were designed to be used. A redirect should always point to the new URL of the same content if one exists, not to a message saying the old website is dead. Doing this correctly is trivial, and it's disappointing how many large websites fail to do it.
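A minimal sketch of the correct behavior in Node.js (the domain name and port are made up for illustration):

    // redirect-server.js: forward each old URL to the same path on the new
    // domain, instead of dumping everything onto a generic landing page.
    const http = require('http');

    http.createServer((req, res) => {
      // 301 tells browsers and crawlers that the content moved permanently.
      res.writeHead(301, { Location: 'https://parent-company.example' + req.url });
      res.end();
    }).listen(8080);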
Content Identifiers: another useful practice is to avoid relying on URLs as the sole means of identifying a resource, and to instead use an identifier that uniquely identifies the document.
For Wikipedia articles, every article has a unique revision identifier that you can find in the History tab. If you link to a specific revision, the link won't rot: even though anybody can edit the article afterwards, your link always points to the immutable version that was published when you accessed the article.
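For example, a permanent link to one specific revision looks like this (the oldid value below is made up; the real one appears in the article's History tab):

    https://en.wikipedia.org/w/index.php?title=Link_rot&oldid=123456789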
Some scholarly articles have a DOI (Digital Object Identifier). It's a good idea to link to the DOI URL on the DOI website [doi.org], because the DOI website will redirect visitors to the current URL of the article (e.g. if the university that hosts the article changes its URLs). Using the DOI prevents link rot on the university's side, although it won't help much if the DOI website itself becomes defunct in the future.
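For example, the DOI Handbook itself has the DOI 10.1000/182, so it can always be reached through:

    https://doi.org/10.1000/182

Wherever the document moves, doi.org keeps redirecting to its current home.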
Creating Archivable Content
Some tips for creating content that can be archived.
Use Static HTML: avoid relying on Javascript to load content or to make content usable. Your website should ideally be usable with Javascript disabled, with just HTML and CSS. For most websites it's not actually possible to make everything work perfectly without Javascript, and that's fine; nobody that matters browses the Internet with Javascript disabled. The only goal is to ensure that you can still use and navigate the website in some way or form even if the Javascript doesn't run. For example, if you want to display links in a dynamic manner at the top of the page, and having them there without Javascript would look weird, simply place the links at the bottom of the page in plain HTML and use Javascript to hide them when the dynamic version takes over.
One trick is to use Javascript to add the class javascript-is-loaded to document.documentElement, then use display: none; to hide elements that should not be visible once the Javascript is loaded. Although this sounds pointless considering <noscript> exists, the idea is that you should add the class in the same Javascript file that provides the dynamic functionality. That is, if Javascript is enabled but the script file fails to load, the <noscript> content will be hidden (because Javascript is enabled in the browser), yet the dynamic elements won't be added (because the script file that adds them never loaded). By adding the class ourselves, we ensure that the static content is only hidden when we're actually able to provide the dynamic alternative.
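A minimal sketch of this pattern (the class, file, and link names are just illustrative):

    <style>
      /* Hide the static fallback only once the dynamic script has run. */
      .javascript-is-loaded .static-fallback { display: none; }
    </style>

    <nav class="static-fallback">
      <a href="/archive">Archive</a> <a href="/about">About</a>
    </nav>

    <script src="dynamic-menu.js"></script>

And in dynamic-menu.js:

    // This line only runs if the file actually loaded, unlike <noscript>.
    document.documentElement.classList.add('javascript-is-loaded');
    // ...then build the dynamic navigation at the top of the page.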
Write URLs in Plain Text: an extreme recommendation is to write your URLs out in plain text instead of hiding them in the href attribute of anchor elements. While links should always survive in any archiver that preserves HTML, it's not impossible for content to be archived only in plain-text form. For example, if a human user simply copies the text and pastes it somewhere else, it becomes plain text, and the links won't be links any longer. If the user posts it in a comment one day, and years later the owner of the comment section decides to strip all links from comments for safety, the URL will be lost unless it was written out in the text.
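To illustrate the difference (example.com is a placeholder):

    <!-- The URL lives only in the href: a plain-text copy loses it. -->
    <a href="https://example.com/article">my article about link rot</a>

    <!-- The URL is also the visible text: it survives a plain-text copy. -->
    <a href="https://example.com/article">https://example.com/article</a>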
Keep Web Page Sizes Small: archivers are very good at preserving webpages that are small. If your webpage has lots of HTML, it's going to take a lot of storage space to archive, and in some cases an archive won't even keep a copy of a document that is too large.
One technique is to include in the webpage only the main content and navigation, and load sidebars and comments through Javascript. So long as you have a link that goes to a webpage that contains ONLY the navigation, or ONLY the comments, this doesn't conflict with the idea that your website should be viewable without Javascript.
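A minimal sketch of this approach, assuming a hypothetical /posts/123/comments page that contains only the comments:

    <div id="comments">
      <!-- Works without Javascript and keeps this page small. -->
      <a href="/posts/123/comments">View the comments</a>
    </div>

    <script>
      // Swap the fallback link for the comments themselves once Javascript runs.
      fetch('/posts/123/comments')
        .then((response) => response.text())
        .then((html) => { document.getElementById('comments').innerHTML = html; });
    </script>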
Compress Your Images: a very common mistake is to use huge, high-resolution, uncompressed or poorly compressed images on your webpages. Yes, they look good, but they incur costs. In most cases you do not need such ultra-high-quality images, as even text remains legible at higher compression levels. A single 500 KB JPEG consumes ten times as much bandwidth and storage as a 50 KB JPEG.
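For example, a single ImageMagick command along these lines (file names, size, and quality level are illustrative) usually shrinks a photo dramatically while keeping text legible:

    magick photo.png -resize 1600x -quality 75 photo.jpg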
Use Backups: perhaps more important than anything else, ensure that you have a backup of your website in case something goes wrong, and MAKE SURE THE BACKUPS ACTUALLY WORK. There have been many cases throughout history of a website going down permanently due to lack of backups, and in some cases, even when there were backups, they didn't actually work because nobody had ever tested them. The easiest way to keep content available is to not go down in the first place.
Avoid single points of failure. For example, if you depend on a proprietary tool to keep your website online, and one day it's no longer viable to keep using that tool, naturally you won't be able to publish your content online anymore.
This includes your registrar, through which you purchased your domain name; your web host, whom you pay to host your content; and any tools you use to publish it. In the worst-case scenario, these three things are one and the same, you can't export your content to migrate to another service, and even if you could, it would be very difficult to maintain the same link structure, creating a lot of link rot. WordPress, despite all of its security problems, is great in this regard because it works anywhere, and it's popular enough to have tools for importing content from other CMSs, like Blogger.
Since this is an article about link rot, it's of extreme importance to observe that just because you have a download link for a tool today, that doesn't mean the link is going to work in the future. Tools come and go. Today everyone uses Docker and package managers that just magically download packages from the Internet. But sometimes an update to a tool you use will break your site, so you need a specific version. If Docker disappeared tomorrow, would you have the exact version of the image you need to keep developing your website? Is pip going to work in 20 years?
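For example, with pip you can pin exact versions in a requirements.txt (the package names and versions below are only illustrative), so that a future install reproduces today's build instead of whatever happens to be newest:

    # requirements.txt
    flask==2.3.2
    markdown==3.4.4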
If you need to update workflows, that's maintenance you need to do. To make a website last longer, you want to keep it as low-maintenance and low-cost as possible.
Quotes
[...]
In some sense, weblogs sum up what’s so great about the Internet. Like fanzine editors before them, weblog editors embrace a topic or theme and run with it. Weblogs are a great indicator of what’s happening on the Internet and within the web community. As our weblogs grow and mature, let’s offer up some hope for those that follow in our footsteps. Pass along your tips for finding the best tidbits and links. Archive your site and make it searchable. Run a link-checking program against it to combat link-rot, and occasionally dig through your archives to find the truly great links, and feature them again.
[...]
Anatomy of a Weblog, Cameron Barrett at January 26, 1999 11:59 PM [https://camworld.org/1999/01/26/anatomy-of-a-weblog-2/] (accessed 2024-09-18)
Observations
According to Google Books Ngram Viewer, the term "rotten link" was used more often in English before the web existed (likely to refer to chain links?). The term "link rot" started being used around 1992, and is now far more common in literature than "rotten link."