Justin's Linklog

Pete Hunt's contrarian RDBMS tips

He posted a thread containing this list of top tips for relational database use:
1. It's often better to add tables than alter existing ones. This is especially true in a larger company. Making changes to core tables that other teams depend on is very risky and can be subject to many approvals. This reduces your team's agility a lot. Instead, try adding a new table that is wholly owned by your team. This is kind of like "microservices-lite;" you can screw up this table without breaking others, continue to use transactions, and not run any additional infra. (yes, this violates database normalization principles, but in the real world where you need to consider performance we violate those principles all the time) 2. Think in terms of indexes first. Every single time you write a query, you should first think: "which index should I use?" If no usable index exists, create it (or create a separate table with that index, see point 1). When writing the query, add a comment naming the index. Before you commit any queries to the codebase, write a script to fill up your local development DB with 100k+ rows, and run EXPLAIN on your query. If it doesn't use that index, it's not ready to be committed. Baking this into an automated test would be better, but is hard to do. 3. Consider moving non-COUNT(*) aggregations out of the DB. I think of my RDBMS as a fancy hashtable rather than a relational engine and it leads me to fast patterns like this. Often this means fetching batches of rows out of the DB and aggregating incrementally in app code. (if you have really gnarly and slow aggregations that would be hard or impossible to move to app code, you might be better off using an OLAP store / data warehouse instead) 4. Thinking in terms of "node" and "edge" tables can be useful. Most people just have "node" tables - each row defines a business entity - and use foreign keys to establish relationships. Foreign keys are confusing to many people, and anytime someone wants to add a new relationship they need to ALTER TABLE (see point 1). Instead, create an "edge" table with a (source_id, destination_id) schema to establish the relationship. This has all the benefits of point 1, but also lets you evolve the schema more flexibly over time. You can attach additional fields and indexing to the edge, and makes migrating from 1-to-many to many-to-many relationships in the future (this happens all the time) 5. Usually every table needs "created_at" and/or "updated_at" columns. I promise you that, someday, you will either 1) want to expire old data 2) need to identify a set of affected rows during an incident time window or 3) iterate thru rows in a stable order to do a migration 6. Choosing how IDs are structured is super important. Never use autoincrement. Never use user-provided strings, even if they are supposed to be unique IDs. Always use at least 64 bits. Snowflake IDs (https://en.wikipedia.org/wiki/Snowflake_ID) or ULIDs (https://github.com/ulid/spec) are a great choice. 7. Comment your queries so debugging prod issues is easier. Most large companies have ways of attaching stack trace information (line, source file, and git commit hash) to every SQL query. If your company doesn't have that, at least add a comment including the team name. Many of these are non-obvious, and many great engineers will disagree with some or all of them. And, of course, there are situations when you should not follow them. YMMV!
Number 5 is absolutely, ALWAYS true, in my experience. And I love the idea of commenting queries... must follow more of these.

(tags: rdbms databases oltp data querying storage architecture)

Archives

Pete Hunt’s contrarian RDBMS tips