Cool URIs don't change

The author is Sir Tim Berners-Lee, inventor of URIs, URLs, HTTP, HTML and the World Wide Web, and current head of the W3C. The article was written in 1998.

What URI can be considered "cool"?
One that doesn't change.
How do URIs change?
URIs don't change: people change them.

In theory, there is no reason for people to change URIs (or stop maintaining documents), but in practice there are millions of them.

In theory, the owner of a domain name space owns that space and therefore all the URIs in it. Other than insolvency, nothing prevents the owner of a domain name from keeping that name. And in theory, the URI space under your domain name is completely under your control, so you can make it as stable as you like. Pretty much the only good reason for a document to disappear from the Web is that the company that owned the domain name has gone out of business or can no longer afford to keep the server running. Then why are there so many dangling links in the world? In part, it is simply a lack of forethought. Here are some reasons you might hear:

We just reorganized the site to make it better.

Do you really feel that the old URIs cannot be kept working? If so, you chose them very badly. Choose the new ones so that you will be able to keep them working after the next redesign.

We have so much stuff that we can't keep track of what's outdated, what's private, what's still relevant, and so we thought it'd be best to just turn it all off.

I can sympathize with this. The W3C went through a period where we had to carefully sift archival material for confidentiality before making it public. The solution is forethought: make sure you record, with each document, its acceptable readership, its creation date, and ideally its expiration date. Keep this metadata.

Well, we found that we need to move the files...

This is one of the most pathetic excuses. What many people don't know is that web servers allow you to control the relationship between an object's URI and its actual location in the file system. Think of a URI space as an abstract space, perfectly organized. Then make a mapping to whatever reality you actually use to implement it. Then tell the web server about it. You can even write your own server snippet to get it right.
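As a rough illustration of the mapping described above, here is a minimal Python sketch of a prefix table that translates stable URI paths into current file locations. The table contents, directory names, and the function name uri_to_file are hypothetical; a real web server would normally do this through its own configuration rather than through code like this.

```python
from typing import Optional

# Hypothetical alias table: stable URI prefix -> directory where the files
# live today. Only this table changes when files are moved.
ALIASES = [
    ("/1998/", "/disk2/archive/y1998/"),
    ("/reports/", "/home/jane/public/reports/"),
]


def uri_to_file(uri_path: str) -> Optional[str]:
    """Translate a public URI path into the current file location."""
    # Refuse paths that try to climb out of the mapped directory.
    if ".." in uri_path.split("/"):
        return None
    # Try longer prefixes first so the most specific alias wins.
    for prefix, directory in sorted(ALIASES, key=lambda a: len(a[0]), reverse=True):
        if uri_path.startswith(prefix):
            return directory + uri_path[len(prefix):]
    return None
```

When the files move, only the right-hand side of an ALIASES entry needs to change; every published URI keeps working.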

John no longer maintains this file, Jane does now.

Whatever was John's name doing in the URI? The file just happened to be in his directory? I see.

We used to use a CGI script for this, but now we use a binary program.

There is a crazy idea that pages generated by scripts should be located in the "cgibin" or "cgi" area. This exposes the mechanics of how you run your web server. You change the mechanism (even while keeping the content), and oops, all your URIs change.

Take, for example, the National Science Foundation (NSF):

Online NSF Documents

http://www.nsf.gov/cgi-bin/pubsys/browser/odbrowse.pl

The first page to start browsing documents will obviously not be the same in a few years. cgi-bin, odbrowse and pl all give away bits of information about how-we-do-it-now. If you use the page to search for a document, the first result you get is equally bad:

Report of the working group on cryptology and coding theory

http://www.nsf.gov/cgi-bin/getpub?nsf9814

for the index page of the document, although the html document itself looks much better:

http://www.nsf.gov/pubs/1998/nsf9814/nsf9814.htm

Here, the heading pubs/1998 will give any future archive service a good clue that the old 1998 document classification scheme is in effect. Although the document numbers may look different in 2098, I can imagine that this URI will still be valid and it will not interfere with NSF or any other organization that will maintain the archive.

I didn't think URLs had to be permanent - that was what URNs were for.

This is probably one of the worst side effects of the URN discussion. Some people think that because there is research into a more permanent name space, they can be casual about dangling links because "URNs will fix it all". If you are one of these people, then allow me to disillusion you.

Most URN schemes I've seen look like an authority identifier followed by either a date and a string you choose, or just a string you choose. This is very similar to an HTTP URI. In other words, if you think your organization will be able to create long-lived URNs, then prove it now by using them for your HTTP URIs. There is nothing in HTTP itself that makes your URI unstable. Only your organization. Create a database that maps the document's URN to the current filename and let the web server use it to actually retrieve the files.
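The database suggested above could be sketched roughly as follows, using Python's built-in sqlite3 module. The table layout, the function names, and the sample identifier and file paths are assumptions made for illustration, not anything prescribed by the article.

```python
import sqlite3
from typing import Optional


def open_registry(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that maps persistent ids to files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " persistent_id TEXT PRIMARY KEY,"
        " current_file TEXT NOT NULL)"
    )
    return conn


def register(conn: sqlite3.Connection, persistent_id: str, current_file: str) -> None:
    """Record where a document lives now; re-registering moves it."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (persistent_id, current_file) VALUES (?, ?)",
        (persistent_id, current_file),
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, persistent_id: str) -> Optional[str]:
    """Return the current file for a persistent id, or None if unknown."""
    row = conn.execute(
        "SELECT current_file FROM documents WHERE persistent_id = ?",
        (persistent_id,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical usage: the id stays the same while the file moves.
registry = open_registry()
register(registry, "1998/nsf9814", "/data/pubs/nsf9814.htm")
register(registry, "1998/nsf9814", "/archive/1998/nsf9814.htm")
```

The web server would call lookup for each request, so moving a file becomes a single register call instead of a broken link.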

If you have gotten this far, then unless you have the time, money, and contacts to get some software developed, you might fall back on the next excuse:

We wanted to, but we just don't have the right tools.

Now here is one I can sympathize with. I agree entirely. What you need is for the web server to look up a persistent URI in an instant and return the file from wherever your current crazy file system has it stored at the moment. You would like to store the URI in the file itself as a check, and keep the database in step with reality at all times. You would like to record the relationships between different versions and translations of the same document, and to keep an independent record of each checksum to guard against accidental file corruption. Web servers just don't come out of the box with these features. When you want to create a new document, your editor asks you for a URI instead of telling you one.

You want the ability to change ownership, document access, archive-level security, and so on in the URI space without changing the URI.
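The independent checksum record mentioned above might look something like the following Python sketch. SHA-256 is an assumption here (the article does not name an algorithm), and the function names and the shape of the record (a plain dictionary of path to checksum) are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged(recorded: Dict[str, str]) -> List[str]:
    """List files that are missing or no longer match their recorded checksum.

    `recorded` maps each file path to the checksum taken when the
    document was stored.
    """
    damaged = []
    for path, expected in recorded.items():
        if not Path(path).is_file() or checksum(path) != expected:
            damaged.append(path)
    return damaged
```

Keeping the record somewhere other than alongside the files is what makes it independent: a fault that damages the files should not be able to damage the checksums with them.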

That is too bad. But we will get there. At the W3C, we are using Jigedit (Jigsaw editing server) functionality that keeps track of versions, and we are experimenting with document creation scripts. If you develop tools, servers, and clients, take note of this problem!

This excuse also applies to many W3C pages, including this one: so do what I say, not what I do.

Why should I care?

When you change the URI on your server, you can never fully tell who will have references to the old URI. These may be links from regular web pages. Bookmarks to your page. The URI might have been scrawled in the margin of a letter to a friend.

When someone follows a link and it's broken, they usually lose trust in the owner of the server. They are also frustrated, both emotionally and practically, at being unable to accomplish their goal.

A lot of people complain about broken links all the time and I hope the damage is clear. I hope that the reputational damage to the maintainer of the server where the document disappeared is also obvious.

So what should I do? URI design

It is the responsibility of the webmaster to allocate URIs that can be used in 2 years, in 20 years, in 200 years. This requires thoughtfulness, organization and purposefulness.

URIs change when there is some information in them that changes. It is very important how you design them. (What, URI design? Do I need to design URIs? Yes, you should think about it.) Designing mostly means leaving information out.

The date the document was created - the date the URI was issued - is something that will never change. It is very useful for separating requests that use a new system from those that use an old one. It is one thing that is good to start a URI with. If a document is in any way dated, even if it will remain of interest for generations, then the date is a good starting point.

The only exception is a page that is intentionally the "latest" version, for example, for the entire organization or a large part of it.

http://www.pathfinder.com/money/moneydaily/latest/

This is the latest Money Daily column in Money magazine. The main reason this URI doesn't need a date is that there is no reason for anyone to keep a URI that will outlast the magazine. The concept of Money Daily will disappear when Money disappears. If you want to link to the content, you should link to where it appears separately in the archives:

http://www.pathfinder.com/money/moneydaily/1998/981212.moneyonline.html

(Looks good. It assumes that "money" will mean the same thing for as long as pathfinder.com exists. There is a duplicate "98" and an unnecessary ".html", but otherwise it looks like a strong URI.)

What to leave aside

Everything! Other than the creation date, putting any information in a URI is asking for trouble one way or another.

  • Author's name. Authorship may change with new versions. People leave organizations and give things to others.
  • Subject. This one is tricky. It always looks good at first, but it changes surprisingly quickly. I will cover this in more detail below.
  • Status. Directories like 'old', 'draft', and so on, not to mention 'latest' and 'cool', appear on all filesystems. Documents change status - otherwise it would not make sense to create drafts. The latest version of a document needs a permanent identifier, regardless of its status. Keep the status out of the name.
  • Access. At the W3C, we divided the site into sections for staff, members, and the public. It sounds good, but of course documents start out as team ideas among the staff, are discussed with members, and then become public. It would be a shame if every time a document was opened up for wider discussion, all the old references to it broke! We are now switching to a simple date code.
  • File extension. A very common occurrence. "cgi", and even ".html", are things that will change in the future. You may not be using HTML for this page in 20 years, but today's links to it should still work. Canonical links on the W3C site do not use the extension (the additions below describe how this is done).
  • Software mechanisms. In the URI look for "cgi", "exec" and other terms that scream "look what software we're using". Anyone want to dedicate a lifetime to Perl CGI scripting? No? Then remove the .pl extension. Read the server manual on how to do this.
  • Disk name. Come on! But I have seen this.
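Some items in the list above can be checked mechanically. The following Python sketch flags status words, software-mechanism words, and file extensions in a URI path. The word lists are assumptions drawn from the examples in this article, and the function name uri_warnings is hypothetical. Author names and subjects cannot be detected this way and still need human judgment.

```python
import posixpath
from typing import List
from urllib.parse import urlsplit

# Assumed word lists, taken from the examples in the text above.
STATUS_WORDS = {"old", "draft", "latest", "cool"}
MECHANISM_WORDS = {"cgi", "cgi-bin", "cgibin", "exec"}


def uri_warnings(uri: str) -> List[str]:
    """Return warnings about information that is likely to change."""
    warnings = []
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    for segment in segments:
        lowered = segment.lower()
        if lowered in STATUS_WORDS:
            warnings.append(f"status word in path: {segment}")
        if lowered in MECHANISM_WORDS:
            warnings.append(f"software mechanism in path: {segment}")
    if segments:
        # Only the last segment can carry a file extension.
        extension = posixpath.splitext(segments[-1])[1]
        if extension:
            warnings.append(f"file extension: {extension}")
    return warnings
```

Run against the NSF browsing URI quoted earlier, this reports the cgi-bin segment and the .pl extension; run against a plain dated path with no extension, it reports nothing.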

So the best example from our website is simply

http://www.w3.org/1998/12/01/chairs

… a report on the minutes of the meeting of the W3C chairs.
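A URI of this shape can be allocated from nothing more than the creation date and a short name. A minimal Python sketch, assuming a hypothetical helper called dated_uri:

```python
from datetime import date


def dated_uri(base: str, created: date, name: str) -> str:
    """Build a URI from the creation date and a short, neutral name."""
    return f"{base.rstrip('/')}/{created:%Y/%m/%d}/{name}"


# Reproduces the example above from its creation date.
chairs = dated_uri("http://www.w3.org", date(1998, 12, 1), "chairs")
```

The date is fixed at creation time and never recomputed, which is what keeps the URI stable when everything else about the document changes.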

Topics and classification by topics

I'll go into more detail about this danger, as it is one of the hardest things to avoid. Typically, topics end up in URIs when you classify your documents according to a breakdown of the work you are doing. But that breakdown will change over time, and the names of areas will change with it. At the W3C, we wanted to change MarkUP to Markup and then to HTML to reflect the actual content of the section. Beware, too, that this is often a flat name space. In 100 years, are you sure you won't want to reuse anything? In our short life, we have already wanted to reuse "History" and "Style Sheets", for example.

It's a tempting way to organize a website - and a really tempting way to organize anything, including the entire Web. This is an excellent medium-term solution but has serious drawbacks in the long term.

Part of the reason lies in the philosophy of meaning. Every term in a language is a potential clustering subject, and each person may have a different idea of what it means. Because relationships between subjects are more like a web than a tree, even people who agree on the web may choose a different tree to represent it. These are my (often repeated) general remarks about the dangers of hierarchical classification as a general solution.

In effect, when you use a subject name in a URI, you are binding yourself to some kind of classification. You may prefer a different one in the future. Then the URI will be liable to break.

One reason for using a topic area as part of a URI is that responsibility for subsections of the URI space is usually delegated, and then you need the name of the organizational body (division, group, or whatever) that is responsible for that subspace. This binds the URI to an organizational structure. It is usually safe only when the part of the URI further to the left is protected by a date: 1998/pics can be taken by your server to mean "what we meant in 1998 by pics" rather than "what we did in 1998 with what we now call pics".

Don't forget the domain name

Remember that this applies not only to the path in the URI, but also to the server name. If you have separate servers for different things, remember that this division will be impossible to change without destroying many, many links. Some classic "look at what software we use today" domain names are "cgi.pathfinder.com", "secure", and "lists.w3.org". They are chosen to make server administration easier. Whether the domain represents a division within your company, document status, access level, or security level, be very, very careful before using more than one domain name for multiple document types. Remember that you can hide multiple web servers inside a single visible web server using forwarding and proxying.

Oh, and also think about your domain name. You don't want to be referred to as soap.com after you switch product lines and stop making soap (I apologize to whoever owns soap.com at the moment).

Conclusion

Keeping a URI alive for 2, 20, 200, or even 2000 years is clearly not as simple as it sounds. However, all over the Web, webmasters are making decisions that will make it really hard for themselves in the future. Often this is because they use tools whose job is only to present the best site at this moment, and no one has assessed what will happen to the links when things change. The point here, however, is that many, many things can change while your URIs can and should stay the same. That is only possible if you think about how you design them.

Additions

How to remove file extensions...

...from a URI in the current file-based web server?

If you are using Apache, for example, you can configure it for content negotiation. Keep the file extension (for example, .png) on the file itself (for example, mydog.png), but link to the web resource without it. Apache then checks the directory for all files with that name and any extension, and can choose the best one from the set (for example, GIF and PNG). Do not put different types of files in different directories; content negotiation will not work if you do that.

  • Set up your server for content negotiation
  • Always refer to URIs without extension

Links with extensions will still work, but will not allow your server to choose the best available format currently and in the future.

(In fact, mydog, mydog.png and mydog.gif are all valid web resources: mydog is a resource of generic content type, while mydog.png and mydog.gif are resources of a specific content type.)
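To make the idea concrete, here is a simplified Python sketch of how a server might pick among mydog.png and mydog.gif from the client's Accept header. It is not Apache's actual algorithm: it only handles media types and q-values, the extension table is a small assumed sample, and the function names are hypothetical.

```python
import posixpath
from typing import Iterable, List, Optional, Tuple

# Assumed sample of extension -> media type; a real server has a full table.
MEDIA_TYPES = {".png": "image/png", ".gif": "image/gif", ".html": "text/html"}


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media range, q-value) pairs."""
    prefs = []
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        if not fields[0]:
            continue
        quality = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        prefs.append((fields[0], quality))
    return prefs


def quality_for(media_type: str, prefs: List[Tuple[str, float]]) -> float:
    """Return the q-value of the most specific range matching media_type."""
    main_type = media_type.split("/")[0]
    best = (-1, 0.0)  # (specificity, q-value)
    for pattern, quality in prefs:
        if pattern == media_type:
            specificity = 2
        elif pattern == main_type + "/*":
            specificity = 1
        elif pattern == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best[0]:
            best = (specificity, quality)
    return best[1]


def choose_variant(name: str, filenames: Iterable[str], accept: str) -> Optional[str]:
    """Pick the file named `name` plus any extension that the client prefers."""
    prefs = parse_accept(accept)
    best_file, best_quality = None, 0.0
    for filename in sorted(filenames):
        stem, extension = posixpath.splitext(filename)
        media_type = MEDIA_TYPES.get(extension)
        if stem != name or media_type is None:
            continue
        quality = quality_for(media_type, prefs)
        if quality > best_quality:
            best_file, best_quality = filename, quality
    return best_file
```

A link to mydog therefore keeps working if a new format is added to the directory later, because the choice of file is made at request time.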

Of course, if you're writing your own web server, it's a good idea to use a database to bind persistent ids to their current form, though beware of unlimited DB growth.

Board of Shame - Story 1: Channel 7

During 1999, I tracked school closures due to snow on the page http://www.whdh.com/stormforce/closings.shtml. No more waiting for the information to scroll across the bottom of the TV screen! I put a link to it on my home page. The first big snowstorm of 2000 arrives and I check the page. It says:

- As of.
Nothing is currently closed. Please return in case of weather warnings.

It can't be that big a storm, then. It's funny that the date is missing. But if you go to the main page of the site, there is a big "Closed Schools" button, which leads to the page http://www.whdh.com/stormforce/ with a long list of closed schools.

Maybe they changed the system for getting the list - but they didn't need to change the URI.

Board of Shame - Story 2: Microsoft Netmeeting

With the growing dependence on the Internet came the clever idea that links to the manufacturer's website could be embedded in applications. This has been used and abused a lot, but it does mean you have to keep the URL the same. Just the other day I tried a link from the Microsoft Netmeeting 2/something client in the Help/Microsoft on the Web/Free stuff menu and got a 404 error: the server had no such page. Maybe it's already been fixed...

©1998 Tim BL

Historical note: In the late 20th century when this was written, "cool" was an epithet of approval, especially among young people, indicating fashion, quality, or appropriateness. In the rush, the URI path was often chosen for "coolness" rather than usefulness or durability. This note is an attempt to redirect the energy behind the search for coolness.

Source: habr.com
