Stop Putting AUTO_INCREMENT IDs in URLs
Recently someone posted about an online job:
Well, that looks interesting, except that what interested me wasn't the job, it was the information the ID tells us about weworkremotely.com/ (though I suspect it's not something they're worried about).
The exchange for that tweet continued as follows:
This is an incredibly common anti-pattern in Web-based software and causes all sorts of problems. It stems from something like this (a shortened snippet from Veure):
CREATE TABLE station ( station_id SERIAL PRIMARY KEY, star_id integer NOT NULL, name character varying(255) NOT NULL, affiliation_id integer NOT NULL, law percentage NOT NULL, orwellian percentage NOT NULL, is_jumpgate boolean NOT NULL, -- constraints );
SERIAL PRIMARY KEY is sort of PostgreSQL's equivalent to MySQL's
AUTO_INCREMENT, or an
IDENTITY column in SQL Server. That type of non-identifying key on a table is called a surrogate key because it's often used as a substitute for the natural key (the fields upon which each record depends). Surrogate keys have their strengths and weaknesses. A huge strength is that if the data in your natural key changes, you don't have to go through all tables in your database and change the key. However, using an auto incrementing key has all sorts of downsides. The auto incrementing values should generally be used internally to join tables and not relied upon for anything else. However, people abuse them in all sorts of ways.
One of the first is SQL like this:
SELECT foo, bar FROM some_table WHERE some_table_id > $some_previous_table_id
See what's happening there? The developer is relying on an artifact of auto-incrementing values: get records from a table created after a certain record. Of course, there's nothing stopping someone from re-using an old ID (a bad practice), or from changing the IDs to a different strategy, such as a UUID (often a performance issue if not handled correctly), or (common) deleting and re-inserting a record. Thus, the SQL above has all sorts of failure modes. Worse: they're often rare failure modes, making the bugs intermittent and hard to track down.
In the case of Web applications, however, it gets even worse. There are many instances of session hijacking because someone used an auto increment id as the session id. You just bump that session ID up or down a bit and you get someone else's session. Whee!
However, it's also an information leak, like it is with weworkremotely.com (or it would be if the information couldn't be obtained by simply counting their jobs). Today, their current max job id is 1050 for a Ruby on Rails position. According to the Internet Archive, one year ago today, when the Internet Archive took their first snapshot, their job id of 1 was a customer support position for 37 signals. (Note that you can easily scan job URls to verify that this value is auto incrementing). That's about 87 jobs a month. Given that they advertise $200 per job posting, and ignoring that some posting are likely free, they're likely grossing an average of around $17K a month for the past year. You can spend an extra $50 per ad to highlight it and spend another $200 to keep it running, so it's hard to say how they're doing, though I was tempted to write a spider for the Internet Archive (a naughty thing to do) to see how they're really doing.
Adding hosting fees, salaries, marketing, and so on, that's not a huge amount of money. However, if you keep your costs low (and they do appear to have done so), that's a viable business model.
Frankly, I don't think that weworkremotely.com is that concerned about people seeing those IDs. After all, you can just count the jobs on a page and get an idea of their income. However, what if you represent other companies on your Web site and you want to appear larger than you are. Do you really want your "Featured Clients" to have URLs like /client/2, /client/5, and so on? That tells everyone that you're a small company. Instead, URLs like /client/acme-corp and /client/generic-inc are your friend.
Sometimes you want to hide that information in other ways. For example, in Veure, I don't want people seeing something like /area/50 and then manually changing that to /area/51 because I want them to explore the world, not discover it through URL hacking. The same goes for items in the game, so I've created this little PostgreSQL function:
CREATE OR REPLACE FUNCTION add_uri_name() RETURNS TRIGGER AS $$ DECLARE uri_name TEXT := NEW.name; BEGIN uri_name = regexp_replace(uri_name, E'\\s+', ' ', 'g'); uri_name = regexp_replace(uri_name, E'\\W+', '-', 'g'); uri_name = lower(uri_name); uri_name = trim(both ' ' from uri_name); NEW.uri_name = uri_name; RETURN NEW; END; $$ LANGUAGE 'plpgsql'; CREATE TRIGGER item_uri_name BEFORE INSERT ON item FOR EACH ROW EXECUTE PROCEDURE add_uri_name();
(Note: item data is added by devs, not by characters, so the above, while limited, is relatively safe, but needs work).
What the above does is, when inserting an item in the database, it generates a unique
uri_name. So an item called "French Bread" has the
french-bread. Now, instead of seeing /item/273, the player sees /item/french-bread and there's no information leak. Further, if you visit the archives, instead of seeing /archives/history/7, you'll see /archives/history/the-catastrophe. This has multiple advantages:
- Less information is leaked
- URLs using descriptive names like this are more user-friendly
- URLs using descriptive names like this are more SEO-friendly
That's not to say that you should never put IDs in your Web apps, but be careful about them. They might tell people much more than you intended.
Have any more horror stories like this? Share them below!