Suppose that you want to be completely over-the-top paranoid about making sure that when you execute some particular SQL statement on your Postgres database, you’re doing it in the safest and least risky way?
For example, suppose it’s the production database behind your successful startup’s main commercial website. If anything even causes queries to block/pause for a few minutes then people will quickly be tweeting about how they can’t place orders and it hurt both your company’s revenue and reputation.
You know that it’s really important to save regular snapshots and keep a history of important metrics. You’re writing some of your own code to capture a few specific stats like the physical size of your most important application tables or maybe the number of outstanding orders over time. Maybe you’re writing some java code that gets scheduled by quartz, or maybe some python code that you’ll run with cron.
Or another situation might be that you’re planning to make an update to your schema – adding a new table, adding a new column to an existing table, modifying a constraint, etc. You plan to execute this change as an online operation during the weekly period of lowest activity on the system – maybe it’s very late Monday night, if you’re in an industry that’s busiest over weekends.
How can you make sure that your SQL is executed on the database in the safest possible way?
Here are a few ideas I’ve come up with:
connect_timeoutto something short, for example 2 seconds.
lock_timeoutto something appropriate. For example, 2ms on queries that shouldn’t be doing any locking. (I’ve seen entire systems brown-out because a “quick” DDL had to get in line behind an app transaction, and then all the new app transactions piled up behind the DDL that was waiting!)
statement_timeoutto something reasonable for the query you’re running – thus putting an upper bound on execution time.
- Using an appropriate client-side timeout, for cases when the server fails to kill the query using
statement_timeout. For example, in Java the Statement class has native support for this.
- When writing SQL, fully qualify names of tables and functions with the schema/namespace. (This can be a security feature; I have heard of attacks where someone manages to change the search_path for connections.)
- Check at least one explain plan and make sure it’s doing what you would expect it to be doing, and that it seems likely to be the most efficient way to get the information you need.
- Don’t use system views that join in unneeded data sources; go direct to needed raw relation or a raw function.
- Access each data source exactly once, never more than once. In that single pass, get all data that will be needed. Analytic or window functions are very useful for avoiding self-joins.
- Restrict the user to minimum needed privileges. For example, the
pg_read_all_statsrole on an otherwise unprivileged user might be useful.
- Make sure your code has back-off logic for retries when failures or unexpected results are encountered.
- Prevent connection pile-ups resulting from database slowdowns or hangs. For example, by using a dedicated client-side connection pool with dynamic sizing entirely disabled or with a small max pool size.
- Run the query against a physical replica/hot standby (e.g. pulling a metric for the physical size of important tables) or logical copy (e.g. any query against application data), instead of running the query against the primary production database. (However, note that when
hot_standby_feedbackis enabled, long-running transactions on the PostgreSQL hot standby can still impact the primary system.)
- For all DDL, carefully check the level of locking that it will require and test to get a feel for possible execution time. Watch out for table rewrites. Many DDLs that used to require a rewrite no longer do in current versions of PostgreSQL, but there are still a few out there. ALTER TABLE statements must be evaluated very carefully. Frankly ALTER TABLE is a bit notorious for being unclear about which incantations cause table rewrites and which ones don’t. (I have a friend who just tests every specific ALTER TABLE operation first on an empty table and then checks if pg_class changes show that a rewrite happened.)
What am I missing? What other ideas are out there for executing SQL in Postgres with a “paranoid” level of safety?
Note: see also Column And Table Redefinition With Minimal Locking
Pingback: Postgres 上的偏执 SQL 执行 – 霸屏ing - February 25, 2022