Stop Putting AUTO_INCREMENT IDs in URLs
Recently someone posted about an online job:
Did u know I'm actively hiring experienced web developers to work remotely? https://t.co/krb5gep8V7 #rails #nodejs #angularjs #plsRT
— Obie (@obie) November 4, 2014
Well, that looks interesting, except that what interested me wasn't the job, it was the information the ID tells us about weworkremotely.com/ (though I suspect it's not something they're worried about).
The exchange for that tweet continued as follows:
@obie @brianwisti Love how they built their URLs with, apparently, auto increment IDs, showing you just how few job adverts they've ever had
— Curtis Ovid Poe (@OvidPerl) November 4, 2014
@OvidPerl @brianwisti that seems impossible
— Obie (@obie) November 4, 2014
@obie @brianwisti Take your job, subtract 1 from ID, check date on new job. Check past IDs on internet archive. The pattern holds.
— Curtis Ovid Poe (@OvidPerl) November 5, 2014
This is an incredibly common anti-pattern in Web-based software and causes all sorts of problems. It stems from something like this (a shortened snippet from Veure):
CREATE TABLE station (
station_id SERIAL PRIMARY KEY,
star_id integer NOT NULL,
name character varying(255) NOT NULL,
affiliation_id integer NOT NULL,
law percentage NOT NULL,
orwellian percentage NOT NULL,
is_jumpgate boolean NOT NULL,
-- constraints
);
The SERIAL PRIMARY KEY
is sort of PostgreSQL's equivalent to MySQL's AUTO_INCREMENT
, or an IDENTITY
column in SQL Server. That type of non-identifying key on a table is called a surrogate key because it's often used as a substitute for the natural key (the fields upon which each record depends). Surrogate keys have their strengths and weaknesses. A huge strength is that if the data in your natural key changes, you don't have to go through all tables in your database and change the key. However, using an auto incrementing key has all sorts of downsides. The auto incrementing values should generally be used internally to join tables and not relied upon for anything else. However, people abuse them in all sorts of ways.
One of the first is SQL like this:
SELECT foo, bar
FROM some_table
WHERE some_table_id > $some_previous_table_id
See what's happening there? The developer is relying on an artifact of auto-incrementing values: get records from a table created after a certain record. Of course, there's nothing stopping someone from re-using an old ID (a bad practice), or from changing the IDs to a different strategy, such as a UUID (often a performance issue if not handled correctly), or (common) deleting and re-inserting a record. Thus, the SQL above has all sorts of failure modes. Worse: they're often rare failure modes, making the bugs intermittent and hard to track down.
In the case of Web applications, however, it gets even worse. There are many instances of session hijacking because someone used an auto increment id as the session id. You just bump that session ID up or down a bit and you get someone else's session. Whee!
However, it's also an information leak, like it is with weworkremotely.com (or it would be if the information couldn't be obtained by simply counting their jobs). Today, their current max job id is 1050 for a Ruby on Rails position. According to the Internet Archive, one year ago today, when the Internet Archive took their first snapshot, their job id of 1 was a customer support position for 37 signals. (Note that you can easily scan job URls to verify that this value is auto incrementing). That's about 87 jobs a month. Given that they advertise $200 per job posting, and ignoring that some posting are likely free, they're likely grossing an average of around $17K a month for the past year. You can spend an extra $50 per ad to highlight it and spend another $200 to keep it running, so it's hard to say how they're doing, though I was tempted to write a spider for the Internet Archive (a naughty thing to do) to see how they're really doing.
Adding hosting fees, salaries, marketing, and so on, that's not a huge amount of money. However, if you keep your costs low (and they do appear to have done so), that's a viable business model.
Frankly, I don't think that weworkremotely.com is that concerned about people seeing those IDs. After all, you can just count the jobs on a page and get an idea of their income. However, what if you represent other companies on your Web site and you want to appear larger than you are. Do you really want your "Featured Clients" to have URLs like /client/2, /client/5, and so on? That tells everyone that you're a small company. Instead, URLs like /client/acme-corp and /client/generic-inc are your friend.
Sometimes you want to hide that information in other ways. For example, in Veure, I don't want people seeing something like /area/50 and then manually changing that to /area/51 because I want them to explore the world, not discover it through URL hacking. The same goes for items in the game, so I've created this little PostgreSQL function:
CREATE OR REPLACE FUNCTION add_uri_name() RETURNS TRIGGER AS $$
DECLARE
uri_name TEXT := NEW.name;
BEGIN
uri_name = regexp_replace(uri_name, E'\\s+', ' ', 'g');
uri_name = regexp_replace(uri_name, E'\\W+', '-', 'g');
uri_name = lower(uri_name);
uri_name = trim(both ' ' from uri_name);
NEW.uri_name = uri_name;
RETURN NEW;
END;
$$ LANGUAGE 'plpgsql';
CREATE TRIGGER item_uri_name
BEFORE INSERT ON item
FOR EACH ROW EXECUTE PROCEDURE add_uri_name();
(Note: item data is added by devs, not by characters, so the above, while limited, is relatively safe, but needs work).
What the above does is, when inserting an item in the database, it generates a unique uri_name
. So an item called "French Bread" has the uri_name
of french-bread
. Now, instead of seeing /item/273, the player sees /item/french-bread and there's no information leak. Further, if you visit the archives, instead of seeing /archives/history/7, you'll see /archives/history/the-catastrophe. This has multiple advantages:
- Less information is leaked
- URLs using descriptive names like this are more user-friendly
- URLs using descriptive names like this are more SEO-friendly
That's not to say that you should never put IDs in your Web apps, but be careful about them. They might tell people much more than you intended.
Have any more horror stories like this? Share them below!
The default permalink format for Wordpress is broken in exactly this way:
http://example.com/?p=N
First thing I do when configuring a new site is to change it to one of the other alternatives:
http://example.com/2012/12/30/post-name
Your solution makes sens.
However, at some point it is gonna raise new problems and considering the "Cool URIs don't change" rule (http://www.w3.org/Provider/Style/URI.html) they will hurt.
I mean, this scheme gets tricky with language support and sometimes you face the choice between keeping mistakes and killing (profitable) URLs. Not good.
I'm gradually switching towards this kind of stuff: http://site.domain/category-id/oid/article-id where category-id and article-id work more or less like your uri_name but only the /oid/ part is needed/used for page identification
oid are 20 carateres long, part random.
This way, URL can't die unless it's useless, category-id and article-id help humains to understand what they are handling
I understand it could be use for nasty url manipulations someway, but that's as far as I've made it for now
I don't know if I should be admitting to this, but it was over ten years ago now, so hopefully it wont reflect too badly on me.
For an old project, I didn't want people to know how few orders we'd had when we first started so I essentially had a cron job which ran this on our MySQL db:
alter table orders auto_increment=?
Where "?" was an integer based on how much time had passed. We also started off with an artificially high auto increment id as well, so that people didn't get an order with id of "4" etc.
I know this is bad, for numerous reasons. But on the other hand, it worked fine, took a few minutes to implement, and let me ship code. I wouldn't do it again though ;)
Clearly for an online game it doesn't matter much, but for anything more important it would be deluded to think that obfuscating the URIs adds any security. Better to work on the basis that URIs are publicly known, and have a real authentication and authorization mechanism to limit access.
You are right, though, that incrementing numeric ids leak information, and this may sometimes matter. I think the rule of thumb is: are you happy for your page to have 'previous' and 'next' links to move along the numbered sequence? If that wouldn't be something you want to reveal, perhaps best not to have the number in the URI either.
Ed, you're perfectly correct. I do have those access controls in place, with a fair amount of tests in place to ensure they don't get broken at some point in the future (though someone will find a way to cheat, I'm sure).
That does not "fix" anything you can still navigate via GET params like
p
orcat
etc. with pretty permalinks on.I agree with the sentiment of this article, but not with the majority of the points you make.
Most of the time, information catalogued and then URL-ised by
AUTO_INCREMENT
are public information that could easily be spidered another way - as you noted yourself. So most of the timeAUTO_INCREMENT
is just used as an arbitrary counter / identifier which is just as guessable with other arbitrarily chosen identifiers. It's little worse than the/CCYY/MM/
URL you're using to order articles on this blog. Granted you are then using a natural string to specify a specific record, but that method could just as easily work for the record ID in the same way/AUTO_INCREMENT/NATURAL_STRING
. However the question of whether you'd want to hide (or make it harder to guess) URLs for articles published to the public still remains.Where
AUTO_INCREMENT
really causes problems is when you have information you don't want public. Such as session IDs. But in those instances, it would be rare to URL-ise them in the first place (bar maybe some AJAX calls). And as far as I'm aware, nobody usesAUTO_INCREMENT
for session IDs anyway (they'd have to be pretty useless developers if they did).There might be some argument against
AUTO_INCREMENT
for some user-submitted / community services such as Paste Bin and URL shorteners. But even here, if the information isn't intended to be public then it either a) should be hidden behind some level of authentication or b) shouldn't have been published online to begin with.Security aside, the other point you raised was with SEO. However even that's not really an issue for the same reason your /CCYY/MM/ URL isn't an issue: sites that need to be search engine optimised will support a human readable string as part of the URL.
So I don't think the headline "Stop Putting
AUTO_INCREMENT
IDs in URLs" and subsequent article really tacks the legitimate problems with incrementing IDs; and I think your primary example is very misleading too (and unfair to the guys you Tweeted to). The real issue is people who use predictable (which includes guessable hashes!) row ID's as identifiers for non-public data AND the issue with developers who choose to hide non-public data behind obscurity[1] instead of secure authentication. None of which has much to do with URLs in the first place since incrementing session ID's are just as bad regardless of whether they're encoded in a URL, HTTP "POST" body, a cookie, or whatever.[1] https://en.wikipedia.org/wiki/Security_through_obscurity
We use Friendly ID https://github.com/norman/friendly_id for nice slugs in URLs over integer IDs.
I don't like that french-bread example counter-measure. It still leaks information and only adds the requirement of human imagination to guess and find out neighbouring pages.
Best practice: use random IDs instead of autoincrement. In PostgreSQL, you should use uuid_generate_v4(). In Perl, you can use Data::UUID::MT->new->create. Unlike the example in the article, these IDs are stable and unguessable. Expose IDs in URIs with urlsafe_b64encode, which is less unwieldy than the usual 36 hexdigits form.
To satisfy user and SEO friendliness, you may coin a descriptive name (slug) as explained above and append it to the URI, e.g. http://example.com/HXGYBHddS12461aiOHvSuQ/french-bread. Slugs can change, but the ID is permanent as the application will disregard the slug and look up via the ID only.
If you want to use random-like ids, you could use your convenient autoincrement internal application number, then do a simple encryption of the number. That makes more sense to me than a random UUID, in that it will look and act random for external users, but it is deterministic and reversible on your side in case you want that. No need to store what big random UUID mapped to what id, for instance.
That said, I like the human-readable bit more than random hex digits, but it's something to consider.
Guessable hashes are no better than having raw row IDs exposed.
laumers, by "guessable hashes" you mean 32 random hex digits? How is that guessable? Note I did not say to hash the id, as that would indeed be guessable. I said encrypt, e.g. $cbc->encrypt($id,$key,$iv);
I agree with Dana. Simply use a toy cipher like SKIP32, for example. I'm pretty sure YouTube uses something similar for their video ids.
@Dana: Actually you did say hash the ID: "If you want to use random-like ids, you could use your convenient autoincrement internal application number, then do a simple encryption of the number."
I appreciate now that you meant you'd use a salt - which would be required to make the hash non-guessable - but it didn't help your case that you were also talking about making the hash reversible (why? it would be easier just to dump the hash in the DB as the unique primary key).
You're still falling into the trap of thinking this would make your web application more secure. It doesn't. URLs are never going to be secure to begin with as URLs can be probed and published. By obfuscating the record number you're just adding to the length of the URL - all of the same rules apply for actually securing the content. I appreciate some sites like to us a hash for the URL but often this is just for aesthetic or SEO reasons (like why some blogs / news sites use the date in the URL).
I think a few people on here need to familiarise themselves with the term 'security through obscurity': https://en.wikipedia.org/wiki/Security_through_obscurity