Rule 17 - The “Rumsfeld Rule”

“You go to war with the army you have – not the army you might want or wish to have at a later time.”
― Donald Rumsfeld

This pithy remark from the former United States Secretary of Defense has an interesting application when designing enterprise systems.

There is a natural tendency when designing or adding to a database schema to keep things succinct. If the id of an object can be derived from the objects it is related to, the usual approach is to leave it that way and use JOINs or record filters within the code itself to drill down and get the collection of objects you want.

CREATE TABLE users (
  id int(11) NOT NULL auto_increment,
  username varchar(255),
  password varchar(255),
  PRIMARY KEY (id)
);

CREATE TABLE books (
  id int(11) NOT NULL auto_increment,
  user_id int(11),
  title varchar(255),
  PRIMARY KEY (id)
);

CREATE TABLE pages (
  id int(11) NOT NULL auto_increment,
  book_id int(11),
  title varchar(255),
  body text,
  view_count int(11),
  PRIMARY KEY (id)
);

This looks really good on paper. In the example above, if you wanted to find all the pages written by a given user, you’d simply bounce off the books table in a JOIN to get what you want. No problem, right?

Wrong. The problem comes when the system you are building starts to scale and you need to gather a large collection from a vast database. For instance, in the example above, what if there are 10,000 users, 50,000 books, and 5,000,000 pages? All of a sudden the JOINs are too slow, or the record filter times out entirely.
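The offending query is nothing exotic; it is just the JOIN described above, something like this (42 standing in for whichever user id you are after):

SELECT pages.*
FROM pages
JOIN books ON books.id = pages.book_id
WHERE books.user_id = 42;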

Now you’re probably thinking: “Yeah, I can see that, but how often would I need to check on something like how many pages a user has?” If your system has plan-based quotas for the number of pages a user can have, your code will be making that check every time a page is added. If your system shows the user statistics on their content on the console landing page when the user logs in, you’ll be making this kind of query every time a new session is started. It adds up.
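That quota check, for example, boils down to a count that has to walk the same JOIN on every single page insert, something along these lines:

SELECT COUNT(*)
FROM pages
JOIN books ON books.id = pages.book_id
WHERE books.user_id = 42;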

When you hit this problem - when your site suddenly is slower than spit sliding uphill on a cold winter’s day in January - it’s too late to change the database. You’re now forced to “go to war with the army you have” - or with the database you have.

Sure, you can start adding caching and recalculating counts in a background job. There are ways to “fix the problem”. But these aren’t really fixing the problem; they’re fixing the symptom. The real problem is still that traversing multiple levels deep to get owned collections is painful when your data set gets really big.

Meanwhile, adding an extra couple of columns to a table to keep track of the ids of related objects a couple of levels up the food chain is basically free. Below we’ve made a simple adjustment to the pages table: adding the id of the owning user.

CREATE TABLE pages (
  id int(11) NOT NULL auto_increment,
  book_id int(11),
  user_id int(11),
  title varchar(255),
  body text,
  view_count int(11),
  PRIMARY KEY (id)
);

Now getting the pages owned by a user is a straight query, with no joins or record selectors needed.
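Both of the earlier queries collapse to single-table lookups (again, 42 is just a stand-in user id):

SELECT * FROM pages WHERE user_id = 42;

SELECT COUNT(*) FROM pages WHERE user_id = 42;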

The bottom line here is that if you’re building a Big System, plan ahead for having to handle some massive data queries and put in extra cross-referencing columns before you need them. A few extra integer columns will not make a dramatic difference in database or record size, but having a direct, explicit linkage between objects will save you many headaches down the road.

Going a little further, having those back-references will make it easier to flip data around an axis to show more interesting analytics. For instance, suppose you want to show which users had the most popular pages. With the extra cross-references in place this becomes a quick and easy query. Again, the example here is kind of simple, but imagine a nested model tree something like:

Publisher -> Author -> Book -> Page

Now suppose you want to find out which Publisher has the most-read Pages. That would be three JOINs across four tables run over millions of records. But by adding back-references at each level, you can easily calculate “most-read” by Book, Author, or Publisher with equal speed.
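Assuming the pages table also carries author_id and publisher_id back-references, added the same way user_id was above, a “most-read by Publisher” report becomes a single-table aggregate, something like:

SELECT publisher_id, SUM(view_count) AS total_views
FROM pages
GROUP BY publisher_id
ORDER BY total_views DESC
LIMIT 10;

Swap publisher_id for author_id or book_id and the same query answers the other two questions.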

From the perspective of a data modeling purist, adding these extra back-reference integer fields is wasteful. You do not need them to perform logical operations at all. You are adding size to the table to hold data you can derive, right? What you are really doing is future-proofing your data models so that they will scale and so that you can respond to pivots in the business model. Imagine, for instance, that the “most read pages” analytics turns out to be what people like most about your dot-com. That can and does happen.

You want to build your application - your “army” - for wars you may not even know it will need to fight yet.