Stop Putting AUTO_INCREMENT IDs in URLs

Recently someone posted about an online job:

Well, that looks interesting, except that what interested me wasn't the job, it was the information the ID tells us about weworkremotely.com/ (though I suspect it's not something they're worried about).

The exchange for that tweet continued as follows:

This is an incredibly common anti-pattern in Web-based software and causes all sorts of problems. It stems from something like this (a shortened snippet from Veure):

CREATE TABLE station (
    station_id     SERIAL PRIMARY KEY,
    star_id        integer NOT NULL,
    name           character varying(255) NOT NULL,
    affiliation_id integer NOT NULL,
    law            percentage NOT NULL,
    orwellian      percentage NOT NULL,
    is_jumpgate    boolean NOT NULL,
    -- constraints
);

The SERIAL PRIMARY KEY is sort of PostgreSQL's equivalent to MySQL's AUTO_INCREMENT, or an IDENTITY column in SQL Server. That type of non-identifying key on a table is called a surrogate key because it's often used as a substitute for the natural key (the fields upon which each record depends). Surrogate keys have their strengths and weaknesses. A huge strength is that if the data in your natural key changes, you don't have to go through all tables in your database and change the key. However, using an auto incrementing key has all sorts of downsides. The auto incrementing values should generally be used internally to join tables and not relied upon for anything else. However, people abuse them in all sorts of ways.

One of the first is SQL like this:

SELECT foo, bar
  FROM some_table
 WHERE some_table_id > $some_previous_table_id

See what's happening there? The developer is relying on an artifact of auto-incrementing values: get records from a table created after a certain record. Of course, there's nothing stopping someone from re-using an old ID (a bad practice), or from changing the IDs to a different strategy, such as a UUID (often a performance issue if not handled correctly), or (common) deleting and re-inserting a record. Thus, the SQL above has all sorts of failure modes. Worse: they're often rare failure modes, making the bugs intermittent and hard to track down.

In the case of Web applications, however, it gets even worse. There are many instances of session hijacking because someone used an auto increment id as the session id. You just bump that session ID up or down a bit and you get someone else's session. Whee!

However, it's also an information leak, like it is with weworkremotely.com (or it would be if the information couldn't be obtained by simply counting their jobs). Today, their current max job id is 1050 for a Ruby on Rails position. According to the Internet Archive, one year ago today, when the Internet Archive took their first snapshot, their job id of 1 was a customer support position for 37 signals. (Note that you can easily scan job URls to verify that this value is auto incrementing). That's about 87 jobs a month. Given that they advertise $200 per job posting, and ignoring that some posting are likely free, they're likely grossing an average of around $17K a month for the past year. You can spend an extra $50 per ad to highlight it and spend another $200 to keep it running, so it's hard to say how they're doing, though I was tempted to write a spider for the Internet Archive (a naughty thing to do) to see how they're really doing.

Adding hosting fees, salaries, marketing, and so on, that's not a huge amount of money. However, if you keep your costs low (and they do appear to have done so), that's a viable business model.

Frankly, I don't think that weworkremotely.com is that concerned about people seeing those IDs. After all, you can just count the jobs on a page and get an idea of their income. However, what if you represent other companies on your Web site and you want to appear larger than you are. Do you really want your "Featured Clients" to have URLs like /client/2, /client/5, and so on? That tells everyone that you're a small company. Instead, URLs like /client/acme-corp and /client/generic-inc are your friend.

Sometimes you want to hide that information in other ways. For example, in Veure, I don't want people seeing something like /area/50 and then manually changing that to /area/51 because I want them to explore the world, not discover it through URL hacking. The same goes for items in the game, so I've created this little PostgreSQL function:

CREATE OR REPLACE FUNCTION add_uri_name() RETURNS TRIGGER AS $$
DECLARE
    uri_name TEXT := NEW.name;
BEGIN
    uri_name = regexp_replace(uri_name, E'\\s+', ' ', 'g');
    uri_name = regexp_replace(uri_name, E'\\W+', '-', 'g');
    uri_name = lower(uri_name);
    uri_name = trim(both ' ' from uri_name);
    NEW.uri_name = uri_name;
    RETURN NEW;
END;
$$ LANGUAGE 'plpgsql';
CREATE TRIGGER item_uri_name
  BEFORE INSERT ON item
  FOR EACH ROW EXECUTE PROCEDURE add_uri_name();

(Note: item data is added by devs, not by characters, so the above, while limited, is relatively safe, but needs work).

What the above does is, when inserting an item in the database, it generates a unique uri_name. So an item called "French Bread" has the uri_name of french-bread. Now, instead of seeing /item/273, the player sees /item/french-bread and there's no information leak. Further, if you visit the archives, instead of seeing /archives/history/7, you'll see /archives/history/the-catastrophe. This has multiple advantages:

  • Less information is leaked
  • URLs using descriptive names like this are more user-friendly
  • URLs using descriptive names like this are more SEO-friendly

That's not to say that you should never put IDs in your Web apps, but be careful about them. They might tell people much more than you intended.

Have any more horror stories like this? Share them below!

14 Comments

The default permalink format for Wordpress is broken in exactly this way:

http://example.com/?p=N

First thing I do when configuring a new site is to change it to one of the other alternatives:

http://example.com/2012/12/30/post-name

Your solution makes sens.

However, at some point it is gonna raise new problems and considering the "Cool URIs don't change" rule (http://www.w3.org/Provider/Style/URI.html) they will hurt.

I mean, this scheme gets tricky with language support and sometimes you face the choice between keeping mistakes and killing (profitable) URLs. Not good.

I'm gradually switching towards this kind of stuff: http://site.domain/category-id/oid/article-id where category-id and article-id work more or less like your uri_name but only the /oid/ part is needed/used for page identification

oid are 20 carateres long, part random.

This way, URL can't die unless it's useless, category-id and article-id help humains to understand what they are handling

I understand it could be use for nasty url manipulations someway, but that's as far as I've made it for now

I don't know if I should be admitting to this, but it was over ten years ago now, so hopefully it wont reflect too badly on me.

For an old project, I didn't want people to know how few orders we'd had when we first started so I essentially had a cron job which ran this on our MySQL db:

alter table orders auto_increment=?

Where "?" was an integer based on how much time had passed. We also started off with an artificially high auto increment id as well, so that people didn't get an order with id of "4" etc.

I know this is bad, for numerous reasons. But on the other hand, it worked fine, took a few minutes to implement, and let me ship code. I wouldn't do it again though ;)

For example, in Veure, I don't want people seeing something like /area/50 and then manually changing that to /area/51 because I want them to explore the world, not discover it through URL hacking.
However, changing these ids to something less obvious is really just security through obscurity, since there is no such thing as a 'secret' URI. URIs leak all over the place and the design of the web is essentially that they are public. If you want to disallow exploring new places by URI-hacking, you need some real access control mechanism tracking the places visited, not just to make it more difficult (but still far from impossible) to guess what the other URIs are.

Clearly for an online game it doesn't matter much, but for anything more important it would be deluded to think that obfuscating the URIs adds any security. Better to work on the basis that URIs are publicly known, and have a real authentication and authorization mechanism to limit access.

You are right, though, that incrementing numeric ids leak information, and this may sometimes matter. I think the rule of thumb is: are you happy for your page to have 'previous' and 'next' links to move along the numbered sequence? If that wouldn't be something you want to reveal, perhaps best not to have the number in the URI either.

That does not "fix" anything you can still navigate via GET params like p or cat etc. with pretty permalinks on.

I agree with the sentiment of this article, but not with the majority of the points you make.

Most of the time, information catalogued and then URL-ised by AUTO_INCREMENT are public information that could easily be spidered another way - as you noted yourself. So most of the time AUTO_INCREMENT is just used as an arbitrary counter / identifier which is just as guessable with other arbitrarily chosen identifiers. It's little worse than the /CCYY/MM/ URL you're using to order articles on this blog. Granted you are then using a natural string to specify a specific record, but that method could just as easily work for the record ID in the same way /AUTO_INCREMENT/NATURAL_STRING. However the question of whether you'd want to hide (or make it harder to guess) URLs for articles published to the public still remains.

Where AUTO_INCREMENT really causes problems is when you have information you don't want public. Such as session IDs. But in those instances, it would be rare to URL-ise them in the first place (bar maybe some AJAX calls). And as far as I'm aware, nobody uses AUTO_INCREMENT for session IDs anyway (they'd have to be pretty useless developers if they did).

There might be some argument against AUTO_INCREMENT for some user-submitted / community services such as Paste Bin and URL shorteners. But even here, if the information isn't intended to be public then it either a) should be hidden behind some level of authentication or b) shouldn't have been published online to begin with.

Security aside, the other point you raised was with SEO. However even that's not really an issue for the same reason your /CCYY/MM/ URL isn't an issue: sites that need to be search engine optimised will support a human readable string as part of the URL.

So I don't think the headline "Stop Putting AUTO_INCREMENT IDs in URLs" and subsequent article really tacks the legitimate problems with incrementing IDs; and I think your primary example is very misleading too (and unfair to the guys you Tweeted to). The real issue is people who use predictable (which includes guessable hashes!) row ID's as identifiers for non-public data AND the issue with developers who choose to hide non-public data behind obscurity[1] instead of secure authentication. None of which has much to do with URLs in the first place since incrementing session ID's are just as bad regardless of whether they're encoded in a URL, HTTP "POST" body, a cookie, or whatever.

[1] https://en.wikipedia.org/wiki/Security_through_obscurity

We use Friendly ID https://github.com/norman/friendly_id for nice slugs in URLs over integer IDs.

I don't like that french-bread example counter-measure. It still leaks information and only adds the requirement of human imagination to guess and find out neighbouring pages.

Best practice: use random IDs instead of autoincrement. In PostgreSQL, you should use uuid_generate_v4(). In Perl, you can use Data::UUID::MT->new->create. Unlike the example in the article, these IDs are stable and unguessable. Expose IDs in URIs with urlsafe_b64encode, which is less unwieldy than the usual 36 hexdigits form.

To satisfy user and SEO friendliness, you may coin a descriptive name (slug) as explained above and append it to the URI, e.g. http://example.com/HXGYBHddS12461aiOHvSuQ/french-bread. Slugs can change, but the ID is permanent as the application will disregard the slug and look up via the ID only.

If you want to use random-like ids, you could use your convenient autoincrement internal application number, then do a simple encryption of the number. That makes more sense to me than a random UUID, in that it will look and act random for external users, but it is deterministic and reversible on your side in case you want that. No need to store what big random UUID mapped to what id, for instance.

That said, I like the human-readable bit more than random hex digits, but it's something to consider.

Guessable hashes are no better than having raw row IDs exposed.

laumers, by "guessable hashes" you mean 32 random hex digits? How is that guessable? Note I did not say to hash the id, as that would indeed be guessable. I said encrypt, e.g. $cbc->encrypt($id,$key,$iv);

I agree with Dana. Simply use a toy cipher like SKIP32, for example. I'm pretty sure YouTube uses something similar for their video ids.

@Dana: Actually you did say hash the ID: "If you want to use random-like ids, you could use your convenient autoincrement internal application number, then do a simple encryption of the number."

I appreciate now that you meant you'd use a salt - which would be required to make the hash non-guessable - but it didn't help your case that you were also talking about making the hash reversible (why? it would be easier just to dump the hash in the DB as the unique primary key).

You're still falling into the trap of thinking this would make your web application more secure. It doesn't. URLs are never going to be secure to begin with as URLs can be probed and published. By obfuscating the record number you're just adding to the length of the URL - all of the same rules apply for actually securing the content. I appreciate some sites like to us a hash for the URL but often this is just for aesthetic or SEO reasons (like why some blogs / news sites use the date in the URL).

I think a few people on here need to familiarise themselves with the term 'security through obscurity': https://en.wikipedia.org/wiki/Security_through_obscurity

About Ovid

user-pic Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/