@analytically
Contributor

Significant performance improvements to SQL sanitization:

  • SanitizeSQL: 1632ns → 923ns/op (-43.4%)
  • Sanitize: 362ns → 348ns/op (-3.9%)
  • Memory: unchanged (544 B/op, 9 allocs/op)

Key optimizations:

  1. ASCII fast-path in lexer state functions (rawState, singleQuoteState, doubleQuoteState, oneLineCommentState) - avoids UTF-8 decoding overhead for the 99%+ of SQL that is ASCII

  2. Direct byte checks for lookaheads (e', --, /*, '', "") instead of UTF-8 decoding when checking for ASCII characters

  3. Adaptive QuoteString allocation strategy:

    • Short strings (≤64 bytes): worst-case preallocate
    • Long strings (>64 bytes): scan-first for exact allocation
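
The allocation strategy in item 3 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `quoteString` and `shortThreshold` are hypothetical names, and the sketch assumes the only escaping needed is doubling single quotes.

```go
package main

import (
	"fmt"
	"strings"
)

// shortThreshold mirrors the 64-byte cutoff named in the description.
// The name is an assumption made for this sketch.
const shortThreshold = 64

// quoteString is a hypothetical stand-in for QuoteString. It wraps s in
// single quotes and doubles any embedded single quote.
func quoteString(s string) string {
	var capacity int
	if len(s) <= shortThreshold {
		// Short input: assume the worst case, where every byte is a
		// quote that doubles, plus the two enclosing quotes.
		capacity = 2*len(s) + 2
	} else {
		// Long input: scan first so the buffer is sized exactly.
		capacity = len(s) + strings.Count(s, "'") + 2
	}

	buf := make([]byte, 0, capacity)
	buf = append(buf, '\'')
	// Iterating bytes is safe here: 0x27 never occurs inside a
	// multi-byte UTF-8 sequence.
	for i := 0; i < len(s); i++ {
		if s[i] == '\'' {
			buf = append(buf, '\'')
		}
		buf = append(buf, s[i])
	}
	buf = append(buf, '\'')
	return string(buf)
}

func main() {
	fmt.Println(quoteString("it's")) // prints 'it''s'
}
```

The trade-off is that short strings skip the extra counting pass at the cost of a slightly oversized buffer, while long strings pay for one scan to avoid over-allocating by up to 2x.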

All optimizations maintain full UTF-8 safety and correctness. Benchmarked on Apple M1 Pro (darwin/arm64).
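
For readers unfamiliar with the pattern, items 1 and 2 can be sketched like this. It is a simplified illustration under stated assumptions: `nextRune`, `startsLineComment`, and `findLineComment` are hypothetical helpers, not functions from the lexer, and the scan deliberately ignores quote state.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// nextRune is a hypothetical helper. It returns the rune at pos and its
// width in bytes, taking a single-byte path for ASCII and falling back
// to full UTF-8 decoding otherwise.
func nextRune(s string, pos int) (rune, int) {
	if pos >= len(s) {
		return utf8.RuneError, 0
	}
	if b := s[pos]; b < utf8.RuneSelf {
		return rune(b), 1 // ASCII fast path: no decoding needed
	}
	return utf8.DecodeRuneInString(s[pos:])
}

// startsLineComment checks for "--" with direct byte comparisons
// instead of decoding the next rune.
func startsLineComment(s string, pos int) bool {
	return pos+1 < len(s) && s[pos] == '-' && s[pos+1] == '-'
}

// findLineComment returns the byte offset of the first "--", or -1.
// Unlike a real lexer, it does not track whether it is inside a string.
func findLineComment(s string) int {
	for pos := 0; pos < len(s); {
		if startsLineComment(s, pos) {
			return pos
		}
		_, width := nextRune(s, pos)
		pos += width
	}
	return -1
}

func main() {
	fmt.Println(findLineComment("select '\u00e9' -- note")) // prints 12
}
```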

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
@analytically
Contributor Author

The failing CI doesn't seem related to this change; it looks like it has more to do with CockroachDB?

@jackc
Owner

jackc commented Nov 28, 2025

I'm a bit concerned about making changes to a security-sensitive portion of the code.

How much does this impact real-world performance? This code path should only be hit when using the simple protocol, which I hope is fairly rare.

Also, if I understand the ASCII fast-path portions correctly, wouldn't it be impossible for the state characters (e.g. ', -, /) to be reachable in the UTF-8 path? I would think all that the UTF-8 path would do is consume characters.
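
The premise of that last question follows from the UTF-8 encoding itself: every byte of a multi-byte sequence is 0x80 or above, so an ASCII state character cannot appear inside one. A small check, using a hypothetical helper `containsASCIIByte` written only for illustration:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// containsASCIIByte reports whether the UTF-8 encoding of r includes
// any byte below 0x80. The helper name is an assumption for this sketch.
func containsASCIIByte(r rune) bool {
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)
	for _, b := range buf[:n] {
		if b < utf8.RuneSelf {
			return true
		}
	}
	return false
}

func main() {
	// Two-, three-, and four-byte runes: none contain an ASCII byte.
	for _, r := range []rune{'\u00e9', '\u4e16', '\U0001F600'} {
		fmt.Printf("%U contains an ASCII byte: %v\n", r, containsASCIIByte(r))
	}
}
```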

@analytically
Contributor Author

Sounds fair. It's a rare code path, and the stability risk isn't worth a near-zero real-world performance change.
