>
Planet, PostgreSQL, Technical

Sanitized SQL

A couple times within the past month, I’ve had people send me a message asking if I have any suggestions about where to learn postgres. I like to share the collection of links that I’ve accumulated (and please send me more, if you have good ones!) but another thing I always say is that the public postgres slack is a nice place to see people asking questions (Discord, Telegram and IRC also have thriving Postgres user communities). Trying to answer questions and help people out can be a great way to learn!

Last month there was a brief thread on the public postgres slack about the idea of sanitizing SQL and this has been stuck in my head for awhile.

The topic of sensitive data and SQL is actually pretty nuanced.

First, I think it’s important to directly address the question about how to treat databases schemas – table and column names, function names, etc. We can take our cue from the many industry vendors with data catalog, data lineage and data masking products. Schemas should be internal and confidential to a company – but they are not sensitive in the same way that PII or PCI data is. It’s generally okay to share schema information with vendors (for example, while working together on a support ticket for database performance). Within a company, it’s desirable for most schemas to be discoverable by engineers across multiple development teams – this is worth the benefits of better collaboration and better architecture of internal software.

General Principle: Schema = Source Code

Unfortunately, the versatile SQL language does not cleanly separate things. A SQL statement is a string that can mix keywords and schema and data all together. As Benoit points out in the slack thread – there are prepared (parameterized) statements, but you can easily miss a spot and end up with literal strings in queries. And I would add that most enterprises have the occasional need for manual “data fixes” which often involve simple scripts where literal values are common.

Benoit’s suggestion was to run a full parse of the query text. This is a good idea – in fact PgAnalyze already maintains a standalone open-source library which can be used to directly leverage Postgres’ query parser in many languages. This is really the best solution. However it is worth noting that I’m interested in cases of post-processing query texts from pg_stat_activity and pg_stat_statements, both of which have maximum lengths and will truncate text that’s longer. So query parsing would need to still work with truncated texts that throw syntax errors.

The PgAnalyze library approach is interesting, but I think a simple regex-based approach actually has a lot of merit. This can give very useful sanitized SQL for developers to debug, it has very low risk of exposing sensitive data, and the code is incredibly simple… especially compared with importing the entire postgres parser and trying to link to compiled C libraries in other languages!

Tonight I finally got around to a POC for this.

My design choices here were very intentional:

  • I’m stripping out comments because libraries like sqlcommenter will add unique values via comments which break any ability to aggregate and summarize and report top queries or problem queries.
  • I would always include the query_id alongside the sanitized SQL text. A user can always go back to the database later and look directly at pg_stat_statements to get the full query text as long as they have the Query ID.
  • My decision to include the first three words (excluding comments) and two words following every occurrence of FROM is very strategic. In most cases (CTEs being the exception), the first three words will tell what kind of command is being executed – SELECT or DML or DDL or some utility/misc statement. By including two words after the command, we will generally see the table name for inserts and updates. By including words after FROM, we’ll know at least one of the tables being operated on for queries and deletes. This means we always know at a glance “it’s updating table X” or “it’s querying table Y”.
  • When wait events indicate lock contention or increasing IO time, it’s extremely useful to see which tables are being operated on.
  • There may be a few cases where this algorithm’s sanitized SQL isn’t as useful as it could be. But that’s why we include the Query ID for retrieving the full query text if needed – and my main goal here is just to have something that’s cheap/easy and helpful most of the time and that we can ensure is safe for developers and operators without requiring PII/PCI data controls.
  • The likelihood of this algorithm emitting sensitive data is next-to-zero. We shouldn’t get literals from INSERTs or UPDATEs. Function and procedure calls must always include parentheses, so that’s mitigated with a simple regex to nuke anything after an open-parenthesis.
  • If the string ‘FROM’ occurs in a string literal, then we aren’t going to distinguish that from a keyword. This is worth consideration; there is an injection vector here if you can spot it. But I don’t think it’s worthwhile to get fancy and attempt to parse SQL via regex. (As fun as that would be, simplicity/readability/maintainability wins here.) The SQL language is insanely sophisticated and if we’re going to parse then it’s the PgAnalyze way. But in practice, the actual surface area and exposure/leak risk with this regex-based function is very small and likely can be mitigated.
  • This does not lessen the importance of good coding practices like parameterized SQL. This is just an additional layer of defense on top of that. Values correctly passed through parameterized SQL will never appear in a query text in the first place.

Sanitize SQL PL/pgSQL Function

https://gist.github.com/ardentperf/44e94ac484e53ff8353f6c1dc0b8f272

Here’s what the code looks like:

CREATE OR REPLACE FUNCTION sanitize_sql(sql_text text) 
RETURNS text AS $$
DECLARE
    cleaned_text text;
    first_part_regex_3words text := '([^[:space:]]+)[[:space:]]+([^[:space:]]+)[[:space:]]+([^[:space:]]+)';
    first_part_regex_2words text := '([^[:space:]]+)[[:space:]]+([^[:space:]]+)';
    first_part_regex_1words text := '([^[:space:]]+)';
    first_part text;
    match_array text;
    from_parts_regex_3words text := '(FROM)[[:space:]]+([^[:space:]]+)[[:space:]]*([^[:space:]]*)';
    from_parts text := '';
BEGIN
    -- Remove multi-line comments (/* ... */)
    cleaned_text := regexp_replace(sql_text, '/\*.*?\*/', '', 'g');
    
    -- Remove single-line comments (-- to end of line)
    cleaned_text := regexp_replace(cleaned_text, '--.*?(\n|$)', '', 'g');
    
    -- Extract the first keyword and up to two words after it
    first_part := array_to_string(regexp_match(cleaned_text,first_part_regex_3words),' ');
    if first_part is null or first_part ILIKE '% FROM %' or first_part ILIKE '% FROM' then
      first_part := array_to_string(regexp_match(cleaned_text,first_part_regex_2words),' ');
      if first_part is null or first_part ILIKE '% FROM' then
        first_part := array_to_string(regexp_match(cleaned_text,first_part_regex_1words),' ');
      end if;
    end if;
    first_part := regexp_replace(first_part, '\(.*','(...)');
    
    -- Find all occurrences of FROM and two words after each
    FOR match_array IN 
        SELECT array_to_string(regexp_matches(cleaned_text,from_parts_regex_3words,'gi'),' ') 
    LOOP
        match_array := regexp_replace(match_array, '\(.*','(...)');
        from_parts := from_parts || '...' || match_array;
    END LOOP;
    
    -- Return combined result
    RETURN first_part || from_parts;
END;
$$ LANGUAGE plpgsql;

Test 1: Sensitive Data in a Function Call

postgres=# SELECT sanitize_sql($test$

SELECT pgp_sym_encrypt('123-45-6789', 'my_secret_key') AS encrypted_ssn;

$test$);

        sanitize_sql
-----------------------------
 SELECT pgp_sym_encrypt(...)
(1 row)

Test 2: Simple SELECT with Inline and Block Comments

postgres=# SELECT sanitize_sql($test$

-- Fetch active users only
SELECT id, name  -- user info
FROM users /* main table */
WHERE active = TRUE; /* status flag */

$test$);

            sanitize_sql
------------------------------------
 SELECT id, name...FROM users WHERE
(1 row)

Test 3: SELECT with Subquery and Mixed Comment Styles

postgres=# SELECT sanitize_sql($test$

SELECT id, name
FROM users
WHERE id IN (
    /* subquery for high-value customers */
    SELECT user_id  -- link to users.id
    FROM orders
    WHERE total > 100  -- filter expensive orders
);
-- end of query

$test$);

                      sanitize_sql
--------------------------------------------------------
 SELECT id, name...FROM users WHERE...FROM orders WHERE
(1 row)

Test 4: SELECT + CTE with Comments Inside and Outside

postgres=# SELECT sanitize_sql($test$

-- recent orders per user
WITH recent_orders AS (
    SELECT user_id, MAX(created_at) AS last_order
    FROM orders
    GROUP BY user_id  /* aggregation */
)
SELECT u.name, r.last_order
FROM users u
JOIN recent_orders r ON u.id = r.user_id;  -- join results

$test$);

                       sanitize_sql
----------------------------------------------------------
 WITH recent_orders AS...FROM orders GROUP...FROM users u
(1 row)

Test 5: INSERT with Comments in Values

postgres=# SELECT sanitize_sql($test$

INSERT INTO users (name, email, created_at)
VALUES (
    'Alice', -- first name
    'alice@example.com', /* email */
    NOW() /* timestamp */
);
-- new user inserted

$test$);

   sanitize_sql
-------------------
 INSERT INTO users
(1 row)

Test 6: UPDATE with Trailing and Embedded Comments

postgres=# SELECT sanitize_sql($test$

UPDATE users
SET last_login = NOW()  -- set current time
WHERE id = 42 /* target specific user */;  -- done

$test$);

   sanitize_sql
------------------
 UPDATE users SET
(1 row)

Test 7: DELETE with Multi-line Comment Block

postgres=# SELECT sanitize_sql($test$

/*
 * Delete old sessions.
 * Keep data from the last 30 days.
 * Be careful: irreversible.
 */
DELETE FROM sessions
WHERE last_access < NOW() - INTERVAL '30 days';

$test$);

         sanitize_sql
------------------------------
 DELETE...FROM sessions WHERE
(1 row)

Test 8: UPSERT (Insert … On Conflict) with Inline + Header Comments

postgres=# SELECT sanitize_sql($test$

-- Upsert settings
INSERT INTO user_settings (user_id, theme, notifications)
VALUES (
    1, /* user id */
    'dark',  -- theme
    TRUE  -- notifications on
)
ON CONFLICT (user_id)
DO UPDATE
SET theme = EXCLUDED.theme,  -- overwrite
    notifications = EXCLUDED.notifications;

$test$);

       sanitize_sql
---------------------------
 INSERT INTO user_settings
(1 row)

Test 9: CTE-Based UPDATE with Nested Comments

postgres=# SELECT sanitize_sql($test$

-- mark inactive users
WITH inactive_users AS (
    SELECT id
    FROM users
    WHERE last_login < NOW() - INTERVAL '1 year'  /* cutoff */
)
UPDATE users
SET active = FALSE
WHERE id IN (
    SELECT id FROM inactive_users  -- CTE reference
);

$test$);

                            sanitize_sql
--------------------------------------------------------------------
 WITH inactive_users AS...FROM users WHERE...FROM inactive_users );
(1 row)

Test 10: DDL with Comments Everywhere

postgres=# SELECT sanitize_sql($test$

-- create table if missing
CREATE TABLE IF NOT EXISTS audit_log (  /* audit records */
    id SERIAL PRIMARY KEY, -- identity column
    user_id INT REFERENCES users(id),  /* FK */
    action TEXT NOT NULL,  -- what happened
    created_at TIMESTAMP DEFAULT NOW() /* timestamp */
);

$test$);

  sanitize_sql
-----------------
 CREATE TABLE IF
(1 row)

Test 11: Complex Query with Multi-CTE, Inline + Block Comments

postgres=# SELECT sanitize_sql($test$

/*
   This query finds top customers.
   It uses multiple CTEs and subqueries.
*/
WITH order_totals AS (
    SELECT user_id, SUM(total) AS lifetime_value
    FROM orders
    GROUP BY user_id  -- one row per user
),
top_customers AS (
    SELECT user_id
    FROM order_totals
    WHERE lifetime_value > 10000  /* threshold */
)
SELECT u.id, u.name, o.lifetime_value  -- main output
FROM users u
JOIN order_totals o ON u.id = o.user_id
WHERE u.id IN (SELECT user_id FROM top_customers)
ORDER BY o.lifetime_value DESC  /* high to low */
LIMIT 10;  -- top 10

$test$);

                                                 sanitize_sql
---------------------------------------------------------------------------------------------------------------
 WITH order_totals AS...FROM orders GROUP...FROM order_totals WHERE...FROM users u...FROM top_customers) ORDER
(1 row)

Test 12: Function Call in the FROM Clause

postgres=# SELECT sanitize_sql($test$

SELECT * FROM generate_series(1,10);

$test$);

             sanitize_sql
--------------------------------------
 SELECT *...FROM generate_series(...)
(1 row)

Test 13: Anonymous Code Block

postgres=# SELECT sanitize_sql($test$

DO $$
DECLARE
    tbl RECORD;
BEGIN
    OPEN table_cursor;
    LOOP
        FETCH table_cursor INTO tbl;
        EXIT WHEN NOT FOUND;
        EXECUTE 'VACUUM ' || tbl.tablename;
    END LOOP;
    CLOSE table_cursor;
END $$;

$test$);

 sanitize_sql
---------------
 DO $$ DECLARE
(1 row)

Test 14: Declaring a Cursor

postgres=# SELECT sanitize_sql($test$

-- Declare a cursor for employees in Engineering
DECLARE emp_cursor CURSOR FOR
SELECT id, name, salary
FROM employees
WHERE department = 'Engineering';

$test$);

                   sanitize_sql
--------------------------------------------------
 DECLARE emp_cursor CURSOR...FROM employees WHERE
(1 row)

Test 15: Joining Multiple Tables and FROM in a String Literal

postgres=# SELECT sanitize_sql($test$

SELECT
    c.name AS customer_name,
    o.order_id,
    o.order_date,
    oi.product_name,
    oi.quantity,
    'Orders coming from customers are listed below' AS description
FROM customers c, orders o, order_items oi
WHERE c.customer_id = o.customer_id
  AND o.order_id = oi.order_id
ORDER BY c.name, o.order_date;

$test$);

                       sanitize_sql
-----------------------------------------------------------
 SELECT c.name AS...from customers are...FROM customers c,
(1 row)

.

Unknown's avatar

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider

Discussion

One thought on “Sanitized SQL

  1. Hettie D.'s avatar

    Could you summarize what would be the primary purpose for such sanitization?

    Like

    Posted by Hettie D. | October 18, 2025, 4:36 am

Leave a reply to Hettie D. Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Disclaimer

This is my personal website. The views expressed here are mine alone and may not reflect the views of my employer.

contact: 312-725-9249 or schneider @ ardentperf.com


https://about.me/jeremy_schneider

oaktableocmaceracattack

(a)

Enter your email address to receive notifications of new posts by email.

Join 76 other subscribers