Making Encrypted Records Searchable (Without Losing Sleep Over Your Audit)

TL;DR

I needed full-text search across compliance records in Humadroid — some of which are encrypted at the application layer. The naive answer is “just decrypt everything into a search index.” The real answer involves understanding exactly what you’re trading, making that trade-off explicit and per-organization, and designing the index so it reveals as little as possible. Here’s the pattern I built, what I considered, and what I’d tell an auditor who asks about it.

Why Encrypted Records Exist in the First Place

Humadroid is a GRC (Governance, Risk, and Compliance) platform. Our customers use it to manage SOC 2 controls, store implementation notes, track evidence, and maintain security documentation. Some of that content is sensitive. Think: “Here’s how we configured our AWS CloudTrail logging” or “Our penetration test found these three critical vulnerabilities.”

We encrypt sensitive fields at the application layer using Rails’ built-in encrypts directive. Not just database-level encryption at rest (we have that too) — actual column-level encryption where the plaintext never touches the database.
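For readers who haven't used Active Record Encryption, the declaration is a one-liner. A hypothetical model sketch — the class and field names here are illustrative, not our actual schema:

```ruby
# Hypothetical model; assumes Rails 7+ Active Record Encryption is configured
# (config.active_record.encryption keys set up).
class ComplianceControl < ApplicationRecord
  # Plaintext columns (title, identifier) stay unencrypted so they can be
  # sorted, filtered, and shown in lists.

  # These columns store ciphertext in the database; plaintext exists only
  # in application memory after decryption.
  encrypts :implementation_notes
  encrypts :audit_findings
end
```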

This matters because a database breach doesn’t expose the content. An attacker with a SQL dump sees gibberish. Your DBA can’t read implementation notes. The data is genuinely protected at rest in a way that disk encryption alone doesn’t provide.

(If you’re building a compliance tool and you don’t encrypt this stuff… I have questions.)

In practice, the split looks something like this: record titles and identifiers live unencrypted because they need to be sortable, filterable, and displayable in lists. The actual substance — implementation details, policy content, audit findings — gets encrypted. Not every model has encrypted fields, but the ones that carry real security context do.

And here’s the problem. You can’t run WHERE content ILIKE '%cloudtrail%' on an encrypted column. The database doesn’t know what’s in there. That’s the entire point.

The Search Problem

When you have 200+ compliance controls, dozens of policy documents, and a growing pile of evidence artifacts, finding things matters. A lot. During an audit, someone asks “show me how you handle access reviews” and you need to find the relevant control, its implementation notes, and the supporting evidence. Fast.

Without search, people do what people always do: they scroll. They open tabs. They Ctrl+F inside individual documents. They message a colleague asking “where did we put the thing about the thing?”

I’ve been there. During our own SOC 2 prep, I watched myself doing exactly this — clicking through screens in my own product looking for records I knew existed. Not great when you’re building the tool that’s supposed to solve this exact workflow.

So. We need search. But some of the most valuable content lives in encrypted columns.

The Options (And Why Most of Them Suck)

I spent more time than I’d like to admit thinking about this. Here’s the landscape:

Option 1: Client-side search. Decrypt everything in the browser and search there. Works for small datasets, absolutely falls apart at scale. You’d need to load every record to search across them. Plus, the initial load time would be brutal, and you’re shipping all that decrypted data to the client just in case someone searches.

Option 2: Searchable encryption schemes. There’s academic work on this — order-preserving encryption, homomorphic encryption, encrypted search indexes. Fascinating stuff. Completely impractical for a startup that needs to ship features this quarter. The libraries are immature, the performance characteristics are unpredictable, and explaining to an auditor how your custom cryptographic search works sounds like a nightmare I don’t need.

Option 3: Blind indexes. Hash the content with HMAC and search against hashes. Works for exact matches (like email lookups), useless for full-text search. You can’t stem, rank, or fuzzy-match a hash.
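Option 3 is worth one code sketch, because it shows exactly why hashing only buys exact-match lookups. A minimal standalone version using Ruby's OpenSSL HMAC — the key handling and normalization rules are illustrative:

```ruby
require "openssl"

# Blind index: store HMAC(key, normalized_value) next to the ciphertext,
# then look records up by recomputing the HMAC for the query string.
def blind_index(value, key)
  OpenSSL::HMAC.hexdigest("SHA256", key, value.to_s.downcase.strip)
end

key = "a-secret-key-from-your-kms" # illustrative; use a real key manager

stored = blind_index("Alice@Example.com", key)

# Exact match works: same normalization produces the same digest.
blind_index("  alice@example.com ", key) == stored # => true

# But the digest of "alice" bears no relationship to the digest of
# "alice@example.com" — substring, stem, rank, and fuzzy queries are
# all impossible against hashes.
blind_index("alice", key) == stored # => false
```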

Option 4: Decrypt and index separately. Take the encrypted content, decrypt it at the application layer, tokenize it, and store the tokens in a dedicated search index table. Accept the trade-off explicitly.

I went with Option 4. But the devil — and the audit readiness — is in the details.

The Trade-Off (Let’s Be Honest About It)

Here’s what I’m not going to do: pretend this is a zero-cost decision.

When you decrypt content and write it into a search index, you’re creating a second representation of that data. It’s not plaintext — PostgreSQL tsvector data consists of stemmed, normalized word tokens, not readable sentences. But it’s not nothing either.

For the string “Annual penetration test report for AWS infrastructure,” the tsvector stores something like:

'annual':1 'aws':6 'infrastructur':7 'penetr':2 'report':4 'test':3

It’s not the original text. You can’t reconstruct sentences or context. But an attacker with database access could determine that a record mentions “penetration” and “AWS.” That’s more information than the fully encrypted column reveals.

The index also stores record titles in cleartext — used for rendering search results without loading and decrypting the actual records. Titles like “Q3 Penetration Test Results” or “AWS IAM Access Review Policy” are visible as-is. This is the most exposed piece.

This is a conscious trade-off. And I think the right approach is to make it:

  1. Explicit — organizations opt into search knowing what it means
  2. Reversible — toggle it off and the entire index is purged immediately
  3. Layered — the index still lives behind RBAC, account scoping, encryption at rest, and TLS in transit

Which brings us to the architecture.

The Architecture: Separate Table, Opt-In, Purgeable

The core design principle: the search index is a secondary, disposable data structure. It can be rebuilt from the source records at any time and destroyed without losing anything.

The Index Table

The search index lives in its own table, completely decoupled from the source records. Polymorphic references point back to whatever was indexed:

enable_extension "pg_trgm" # required by the trigram index on :title below

create_table :search_index, id: :uuid do |t|
  t.references :account, type: :uuid, null: false,
               foreign_key: { on_delete: :cascade }
  t.references :searchable, polymorphic: true, type: :uuid, null: false
  t.string  :title, null: false, default: ""
  t.string  :context_label
  t.string  :url_path
  t.tsvector :tsv

  t.timestamps
end

add_index :search_index, :tsv, using: :gin
add_index :search_index, %i[searchable_type searchable_id], unique: true
add_index :search_index, :title,
          using: :gin, opclass: :gin_trgm_ops

A few design decisions worth explaining.

The title and context_label columns exist purely for rendering. When someone types in a command palette, I need to show results fast — without loading each matching record and decrypting its fields just to display a label. Precomputing these at index time makes the UI feel instant.

The url_path is precomputed for the same reason. No route generation at query time.

The unique index on [searchable_type, searchable_id] gives us upsert semantics — one index entry per record, always overwritten on update, never duplicated.

We also enable pg_trgm for fuzzy matching on titles. Because people misspell “penetration” more often than you’d think.

Per-Organization Opt-In

This isn’t a global feature flag. Each organization explicitly enables search indexing through a boolean on their account record:

after_commit :handle_search_toggle,
             if: :saved_change_to_search_indexing_enabled?

def handle_search_toggle
  if search_indexing_enabled?
    SearchRebuildJob.perform_async(id)
  else
    SearchPurgeJob.perform_async(id)
  end
end

Toggle on → background job decrypts and indexes everything. Toggle off → one DELETE statement wipes every index entry for that organization. Gone.

The toggle itself is gated behind a Flipper flag for controlled rollout, and only account admins can flip it. Feature flag + explicit opt-in + admin-only access means nobody gets indexed by accident.
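The purge side is small enough to show whole. A sketch assuming Sidekiq-style jobs (as perform_async suggests) and a hypothetical SearchEntry model over the search_index table — neither name is from the actual codebase:

```ruby
# Sketch; SearchEntry is a hypothetical ActiveRecord model backed by
# the search_index table shown in the migration above.
class SearchPurgeJob
  include Sidekiq::Job

  def perform(account_id)
    # One DELETE, account-scoped. Idempotent: deleting zero rows on a
    # retry (or after a concurrent purge) is harmless.
    SearchEntry.where(account_id: account_id).delete_all
  end
end
```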

Making Models Searchable

Rather than scattering index logic across the codebase, I built a concern that each searchable model includes. The model defines a simple contract — what to index and at what weight — and the concern handles the lifecycle:

module Searchable
  extend ActiveSupport::Concern

  included do
    after_commit :reindex_search, on: %i[create update], if: :should_reindex?
    after_commit :remove_from_index, on: :destroy
  end

  private

  # reindex_search and remove_from_index are omitted here for brevity;
  # each including model supplies search_tracked_columns.
  def should_reindex?
    return false unless account&.search_indexing_enabled?

    previously_new_record? ||
      (self.class.search_tracked_columns & previous_changes.keys).any?
  end
end

The should_reindex? check is important. We only reindex when tracked columns actually change — updating a record’s date or status shouldn’t trigger decryption and reindexing of its content. This keeps the write overhead minimal.

Each model declares what to index and at what priority:

def self.search_content_definition
  [[:title, "A"], [:identifier, "A"],
   [:implementation_notes, "B"], [:description, "C"]]
end

The weight system (A through D) feeds PostgreSQL’s setweight() function, so a title match ranks higher than a match buried in a description. Identifiers like “CC-1.1” get weight A because when someone searches for that string, they want that exact record.

During upsert, the content parts are assembled into a single weighted tsvector:

weighted_sql = content_parts.map do |(text, weight)|
  "setweight(to_tsvector('english', coalesce(#{connection.quote(text)}, '')), '#{weight}')"
end.join(" || ")

This is where the encrypted fields get decrypted — at the application layer, in memory, just long enough to tokenize them. The plaintext never hits the database; only the tsvector does.
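To make the upsert concrete, here's a standalone sketch of the SQL that an index write produces. The quote helper stands in for ActiveRecord's connection.quote, and the column list mirrors the migration above; treat the exact statement as illustrative, not the shipped implementation:

```ruby
# Stand-in for ActiveRecord's connection.quote (doubles single quotes).
def quote(str)
  "'#{str.to_s.gsub("'", "''")}'"
end

# Assemble the weighted tsvector expression, as in the snippet above.
def weighted_tsvector_sql(content_parts)
  content_parts.map do |(text, weight)|
    "setweight(to_tsvector('english', coalesce(#{quote(text)}, '')), '#{weight}')"
  end.join(" || ")
end

parts = [["Penetration Test Report", "A"],
         ["Annual test of AWS infrastructure", "B"]]
tsv_sql = weighted_tsvector_sql(parts)

# The unique index on [searchable_type, searchable_id] is what turns
# ON CONFLICT ... DO UPDATE into upsert semantics instead of an error.
upsert_sql = <<~SQL
  INSERT INTO search_index
    (account_id, searchable_type, searchable_id, title, tsv, created_at, updated_at)
  VALUES
    ($1, $2, $3, $4, #{tsv_sql}, now(), now())
  ON CONFLICT (searchable_type, searchable_id)
  DO UPDATE SET title = EXCLUDED.title, tsv = EXCLUDED.tsv, updated_at = now()
SQL
```

Only the tsvector expression crosses into the database; the decrypted text lives inside the quoted literals of that expression for the duration of one statement.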

Smart Search: Full-Text First, Fuzzy Fallback

The search itself uses a two-pass strategy:

def self.smart_search(query, limit: 20)
  results = fulltext_search(query, limit: limit)
  results = fuzzy_search(query, limit: limit) if results.empty?
  results
end

First pass: PostgreSQL full-text search with prefix matching (:* operator) and ts_rank_cd for relevance scoring. This handles the “I know approximately what I’m looking for” case.

Second pass: trigram similarity on the title column. This catches typos and partial matches that full-text search misses. Searching for “Penetraton Tets” should still find “Penetration Test Report.” (Yes, I test with actual typos I’ve made.)

The prefix matching is particularly nice for command palettes. Typing “cloud” immediately matches “CloudTrail Configuration” before you finish the word.
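If you've never looked inside pg_trgm, the intuition behind the fuzzy fallback is simple: break both strings into three-character windows and compare the sets. A simplified pure-Ruby approximation — real pg_trgm has its own normalization and padding rules, which this only roughly mirrors:

```ruby
require "set"

# Simplified pg_trgm-style trigrams: lowercase, split into words, pad each
# word with two leading spaces and one trailing space, take 3-char windows.
def trigrams(str)
  str.downcase.scan(/\w+/).flat_map do |word|
    padded = "  #{word} "
    (0..padded.length - 3).map { |i| padded[i, 3] }
  end.to_set
end

# similarity = |A ∩ B| / |A ∪ B| over the trigram sets (Jaccard-style).
def similarity(a, b)
  ta, tb = trigrams(a), trigrams(b)
  return 0.0 if ta.empty? && tb.empty?

  (ta & tb).size.to_f / (ta | tb).size
end

similarity("Penetraton Tets", "Penetration Test Report") # noticeably > 0
similarity("Penetraton Tets", "Firewall Rules")          # => 0.0
```

Typos shift only a few trigrams, so the misspelled query still shares most of its windows with the correct title — which is exactly why the second pass catches what exact-token full-text search misses.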

What I’d Tell an Auditor

(Because I’m going through SOC 2 myself, this isn’t hypothetical.)

“What does the search index contain?”

PostgreSQL tsvector tokens — stemmed word fragments with positional information. Not the original text. Also: record titles and context labels in cleartext, used for rendering search results.

“Can you reconstruct the original content from the index?”

No. Tsvectors are one-way transformations. You can determine that a record contains the word stems “penetr” and “aws” but you cannot reconstruct the sentence “Our annual penetration test of AWS infrastructure revealed three findings.”

“What about the titles?”

Titles are stored in cleartext. This is the explicit trade-off: organizations that enable search accept that record titles are visible in the index table. Titles are generally less sensitive than implementation details, but we document this in our security practices.

“What protections exist on the index?”

Five layers: account-scoped multi-tenancy (data isolation), role-based access control checked before any results are rendered, encryption at rest on the database volume, TLS for all data in transit, and instant purge capability by disabling the feature.

“Can an organization opt out?”

Immediately. Toggling the setting triggers a background job that purges every index entry for that organization. The feature is disabled by default — organizations must explicitly enable it.

“Why not use a more sophisticated encryption scheme?”

Because the mature, production-ready options for full-text search over encrypted data don’t exist in a form I’d trust for a compliance product. Partially homomorphic encryption and order-preserving encryption are fascinating research areas, but deploying unproven cryptographic schemes in a security product is worse than a well-understood trade-off with clear documentation and user consent.

I’d rather have an auditable, understandable system than a clever one.

The Rebuild-Purge Lifecycle

One thing I wanted to get right: the index should be entirely rebuildable and entirely destroyable at any point. No orphaned entries, no stale data, no “well, we mostly cleaned up.”

The rebuild job iterates through every searchable model type, decrypts content at the application layer, tokenizes it into tsvectors, and upserts into the index. Each record is processed independently with error handling — one bad record doesn’t kill the whole rebuild.

The purge is simpler: one SQL statement. Done.

Both operations are idempotent. Run the rebuild twice and you get the same index. Run the purge when there’s nothing to purge and nothing breaks. This matters for reliability — background jobs fail, retry, and sometimes run out of order.
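The rebuild loop described above can be sketched like this — the model list, the upsert_search_entry! helper, and the job shape are all illustrative names, not the actual Humadroid internals:

```ruby
# Sketch only. Assumes Sidekiq-style jobs and a per-record helper
# (upsert_search_entry!) that decrypts, tokenizes, and upserts.
class SearchRebuildJob
  include Sidekiq::Job

  SEARCHABLE_MODELS = [ComplianceControl, PolicyDocument, EvidenceArtifact].freeze

  def perform(account_id)
    account = Account.find(account_id)
    return unless account.search_indexing_enabled? # toggled off mid-flight? stop

    SEARCHABLE_MODELS.each do |model|
      model.where(account_id: account_id).find_each do |record|
        record.upsert_search_entry!
      rescue StandardError => e
        # Per-record error handling: one bad record must not kill the rebuild.
        Rails.logger.error("search reindex failed: #{model}##{record.id}: #{e.message}")
      end
    end
  end
end
```

Because every write is an upsert keyed on [searchable_type, searchable_id], running this twice converges on the same index — which is the idempotency property the retry behavior depends on.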

Things I’m Still Thinking About

This solution works. It shipped. People are using it. But I’m not going to pretend it’s perfect.

Index staleness. If the rebuild job fails partway through, some records might have stale index entries until the next update triggers a reindex. I’m not losing sleep over this — search results being slightly outdated is a UX annoyance, not a security issue — but it’s not zero.

Scaling. Right now, the index lives in the same PostgreSQL database as everything else. For our current scale, this is fine. If we hit the point where the tsvector GIN index causes write amplification problems, I’ll probably move to a dedicated search service. But premature optimization is… well, you know.

Content vs. metadata sensitivity. I’m treating titles as “acceptable to store in cleartext” and implementation details as “must be tokenized.” But some titles might be sensitive too. A future iteration might let organizations configure sensitivity per field. For now, the current granularity matches what our customers have told us they’re comfortable with.

The philosophical question. Is it better to have no search and force people to manually browse through hundreds of records (which has its own security implications — people share screenshots, export to spreadsheets, ask colleagues on Slack), or to have fast search with a well-documented, auditable index that reveals word stems? I think the latter. But reasonable people can disagree.

The Boring Conclusion

There’s no cryptographic magic here. No breakthrough. Just a careful trade-off with clear boundaries: separate table, per-organization opt-in, instant purge, tsvector tokens instead of plaintext, documented threat model, layered defenses.

Sometimes the right engineering decision isn’t the clever one. It’s the one you can explain to an auditor in plain English, that your customers can enable or disable with a toggle, and that you can blow away in one SQL statement if something goes wrong.

That’s the kind of security I want in a compliance tool. Boring, understandable, and honest about what it does.

(Now if you’ll excuse me, I need to go search for that one control I wrote last month. With cmd+k, thankfully.)