March 1st 2023

Earlier this year, Cendyne wrote a blog post covering the use of HKDF, building partially upon my own blog post about HKDF and the KDF security definition, but moreso inspired by a cryptographic issue they identified in another company’s product (dubbed AnonCo).

Huawei Nova 12s Pre-Sales Commence in…
The Advantages Of Playing Online Casi…
Is the TP-Link Tapo C100 Security Cam…
A Comprehensive Guide to Freelance Wr…
Painting Stolen in a Heist 30 Years A…

At the bottom they teased:

Database cryptography is hard. The above sketch is not complete and does not address several threats! This article is quite long, so I will not be sharing the fixes.
Cendyne

If you read Cendyne’s post, you may have nodded along with that remark and not appreciate the degree to which our naga friend was putting it mildly. So I thought I’d share some of my knowledge about real-world database cryptography in an accessible and fun format in the hopes that it might serve as an introduction to the specialization.

Note: I’m also not going to fix Cendyne’s sketch of AnonCo’s software here–partly because I don’t want to get in the habit of assigning homework or required reading, but mostly because it’s kind of obvious once you’ve learned the basics.

I’m including art of my fursona in this post… as is tradition for furry blogs.

If you don’t like furries, please feel free to leave this blog and read about this topic elsewhere.

Thanks to CMYKat for the awesome stickers.

Database Cryptography?
Cryptography for Relational Databases
- The Perils of Built-in Encryption Functions
- Application-Layer Relational Database Cryptography
  - Confused Deputies
  - Canonicalization Attacks
  - Multi-Tenancy
Cryptography for NoSQL Databases
- NoSQL is Built Different
- Record Authentication
  - Bonus: A Maximally Schema-Free, Upgradeable Authentication Design
Searchable Encryption
- Order-{Preserving, Revealing} Encryption
- Deterministic Encryption
- Homomorphic Encryption
- Searchable Symmetric Encryption (SSE)
- You Can Have Little a HMAC, As a Treat
Intermission
Case Study: MongoDB Client-Side Encryption
- MongoCrypt: The Good
  - How is Queryable Encryption Implemented?
- MongoCrypt: The Bad
- MongoCrypt: The Ugly
Wrapping Up

Database Cryptography?

The premise of database cryptography is deceptively simple: You have a database, of some sort, and you want to store sensitive data in said database.

The consequences of this simple premise are anything but simple. Let me explain.

Art: ScruffKerfluff

The sensitive data you want to store may need to remain confidential, or you may need to provide some sort of integrity guarantees throughout your entire system, or sometimes both. Sometimes all of your data is sensitive, sometimes only some of it is. Sometimes the confidentiality requirements of your data extends to where within a dataset the record you want actually lives. Sometimes that’s true of some data, but not others, so your cryptography has to be flexible to support multiple types of workloads.

Other times, you just want your disks encrypted at rest so if they grow legs and walk out of the data center, the data cannot be comprehended by an attacker. And you can’t be bothered to work on this problem any deeper. This is usually what compliance requirements cover. Boxes get checked, executives feel safer about their operation, and the whole time nobody has really analyzed the risks they’re facing.

But we’re not settling for mere compliance on this blog. Furries have standards, after all.

So the first thing you need to do before diving into database cryptography is threat modelling. The first step in any good threat model is taking inventory; especially of assumptions, requirements, and desired outcomes. A few good starter questions:

What database software is being used? Is it up to date?
What data is being stored in which database software?
How are databases oriented in the network of the overall system?
- Is your database properly firewalled from the public Internet?
How does data flow throughout the network, and when do these data flows intersect with the database?
- Which applications talk to the database? What languages are they written in? Which APIs do they use?
How will cryptography secrets be managed?
- Is there one key for everyone, one key per tenant, etc.?
- How are keys rotated?
- Do you use envelope encryption with an HSM, or vend the raw materials to your end devices?

The first two questions are paramount for deciding how to write software for database cryptography, before you even get to thinking about the cryptography itself.

(This is not a comprehensive set of questions to ask, either. A formal threat model is much deeper in the weeds.)

The kind of cryptography protocol you need for, say, storing encrypted CSV files an S3 bucket is vastly different from relational (SQL) databases, which in turn will be significantly different from schema-free (NoSQL) databases.

Furthermore, when you get to the point that you can start to think about the cryptography, you’ll often need to tackle confidentiality and integrity separately.

If that’s unclear, think of a scenario like, “I need to encrypt PII, but I also need to digitally sign the lab results so I know it wasn’t tampered with at rest.”

My point is, right off the bat, we’ve got a three-dimensional matrix of complexity to contend with:

On one axis, we have the type of database.
- Flat-file
- Relational
- Schema-free
On another, we have the basic confidentiality requirements of the data.
- Field encryption
- Row encryption
- Column encryption
- Unstructured record encryption
- Encrypting entire collections of records
Finally, we have the integrity requirements of the data.
- Field authentication
- Row/column authentication
- Unstructured record authentication
- Collection authentication (based on e.g. Sparse Merkle Trees)

And then you have a fourth dimension that often falls out of operational requirements for databases: Searchability.

Why store data in a database if you have no way to index or search the data for fast retrieval?

Credit: Harubaki

If you’re starting to feel overwhelmed, you’re not alone. A lot of developers drastically underestimate the difficulty of the undertaking, until they run head-first into the complexity.

Some just phone it in with AES_Encrypt() calls in their MySQL queries. (Too bad ECB mode doesn’t provide semantic security!)

Which brings us to the meat of this blog post: The actual cryptography part.

Cryptography is the art of transforming information security problems into key management problems.
Former coworker

Note: In the interest of time, I’m skipping over flat files and focusing instead on actual database technologies.

Cryptography for Relational Databases

Encrypting data in an SQL database seems simple enough, even if you’ve managed to shake off the complexity I teased from the introduction.

You’ve got data, you’ve got a column on a table. Just encrypt the data and shove it in a cell on that column and call it a day, right?

But, alas, this is a trap. There are so many gotchas that I can’t weave a coherent, easy-to-follow narrative between them all.

So let’s start with a simple question: where and how are you performing your encryption?

The Perils of Built-in Encryption Functions

MySQL provides functions called AES_Encrypt and AES_Decrypt, which many developers have unfortunately decided to rely on in the past.

It’s unfortunate because these functions implement ECB mode. To illustrate why ECB mode is bad, I encrypted one of my art commissions with AES in ECB mode:

Art by Riley, encrypted with AES-ECB

The problems with ECB mode aren’t exactly “you can see the image through it,” because ECB-encrypting a compressed image won’t have redundancy (and thus can make you feel safer than you are).

ECB art is a good visual for the actual issue you should care about, however: A lack of semantic security.

ECB art isn’t to be confused with ECB poetry, which looks like this:

Oh little one, you’re growing up
You’ll soon be writing C
You’ll treat your ints as pointers
You’ll nest the ternary
You’ll cut and paste from github
And try cryptography
But even in your darkest hour
Do not use ECB

CBC’s BEASTly when padding’s abused
And CTR’s fine til a nonce is reused
Some say it’s a CRIME to compress then encrypt
Or store keys in the browser (or use javascript)
Diffie Hellman will collapse if hackers choose your g
And RSA is full of traps when e is set to 3
Whiten! Blind! In constant time! Don’t write an RNG!
But failing all, and listen well: Do not use ECB

They’ll say “It’s like a one-time-pad!
The data’s short, it’s not so bad
the keys are long–they’re iron clad
I have a PhD!”
And then you’re front page Hacker News
Your passwords cracked–Adobe Blues.
Don’t leave your penguins showing through,
Do not use ECB
— Ben Nagy, PoC||GTFO 0x04:13

Most people reading this probably know better than to use ECB mode already, and don’t need any of these reminders, but there is still a lot of code that inadvertently uses ECB mode to encrypt data in the database.

Also, SHOW processlist; leaks your encryption keys. Oops.

Credit: CMYKatt

Application-layer Relational Database Cryptography

Whether burned by ECB or just cautious about not giving your secrets to the system that stores all the ciphertext protected by said secret, a common next step for developers is to simply encrypt in their server-side application code.

And, yes, that’s part of the answer. But how you encrypt is important.

Credit: Harubaki

“I’ll encrypt with CBC mode.”
If you don’t authenticate your ciphertext, you’ll be sorry. Maybe try again?

“Okay, fine, I’ll use an authenticated mode like GCM.”
Did you remember to make the table and column name part of your AAD? What about the primary key of the record?

“What on Earth are you talking about, Soatok?”
Welcome to the first footgun of database cryptography!

Confused Deputies

Encrypting your sensitive is data, but not sufficient. You need to also bind your ciphertexts to the specific context in which they are stored.

To understand why, let’s take a step back: What specific threat does encrypting your database records protect against?

We’ve already established that “your disks walk out of the datacenter” is a “full disk encryption” problem, so if you’re using application-layer cryptography to encrypt data in a relational database, your threat model probably involves unauthorized access to the database server.

What, then, stops an attacker from copying ciphertexts around?

Credit: CMYKatt

Let’s say I have a legitimate user account with an ID 12345, and I want to read your street address, but it’s encrypted in the database. But because I’m a clever hacker, I have unfettered access to your relational database server.

All I would need to do is simply…

UPDATE table SET addr_encrypted = 'your-ciphertext' WHERE id = 12345

…and then access the application through my legitimate access. Bam, data leaked. As an attacker, I can probably even copy fields from other columns and it will just decrypt. Even if you’re using an authenticated mode.

We call this a confused deputy attack, because the deputy (the component of the system that has been delegated some authority or privilege) has become confused by the attacker, and thus undermined an intended security goal.

The fix is to use the AAD parameter from the authenticated mode to bind the data to a given context. (AAD = Additional Authenticated Data.)

- $addr = aes_gcm_encrypt($addr, $key);
+ $addr = aes_gcm_encrypt($addr, $key, canonicalize([
+     $tableName,
+     $columnName,
+     $primaryKey
+ ]);

Now if I start cutting and pasting ciphertexts around, I get a decryption failure instead of silently decrypting plaintext.

This may sound like a specific vulnerability, but it’s more of a failure to understand an important general lesson with database cryptography:

Where your data lives is part of its identity, and MUST be authenticated.
Soatok’s Rule of Database Cryptography

Canonicalization Attacks

In the previous section, I introduced a pseudocode called canonicalize(). This isn’t a pasto from some reference code; it’s an important design detail that I will elaborate on now.

First, consider you didn’t do anything to canonicalize your data, and you just joined strings together and called it a day…

function dumbCanonicalize(
    string $tableName,
    string $columnName,
    string|int $primaryKey
): string {
    return $tableName . '_' . $columnName . '#' . $primaryKey;
}

Consider these two inputs to this function:

dumbCanonicalize('customers', 'last_order_uuid', 123);
dumbCanonicalize('customers_last_order', 'uuid', 123);

In this case, your AAD would be the same, and therefore, your deputy can still be confused (albeit in a narrower use case).

In Cendyne’s article, AnonCo did something more subtle: The canonicalization bug created a collision on the inputs to HKDF, which resulted in an unintentional key reuse.

Up until this point, their mistake isn’t relevant to us, because we haven’t even explored key management at all. But the same design flaw can re-emerge in multiple locations, with drastically different consequence.

Multi-Tenancy

Once you’ve implemented a mitigation against Confused Deputies, you may think your job is done. And it very well could be.

Often times, however, software developers are tasked with building support for Bring Your Own Key (BYOK).

This is often spawned from a specific compliance requirement (such as cryptographic shredding; i.e. if you erase the key, you can no longer recover the plaintext, so it may as well be deleted).

Other times, this is driven by a need to cut costs: Storing different users’ data in the same database server, but encrypting it such that they can only encrypt their own records.

Two things can happen when you introduce multi-tenancy into your database cryptography designs:

Invisible Salamanders becomes a risk, due to multiple keys being possible for any given encrypted record.
Failure to address the risk of Invisible Salamanders can undermine your protection against Confused Deputies, thereby returning you to a state before you properly used the AAD.

So now you have to revisit your designs and ensure you’re using a key-committing authenticated mode, rather than just a regular authenticated mode.

Isn’t cryptography fun?

Art: CMYKat

We’re not quite done with relational databases yet, but we should talk about NoSQL databases for a bit. The final topic in scope applies equally to both, after all.

Cryptography for NoSQL Databases

Most of the topics from relational databases also apply to NoSQL databases, so I shall refrain from duplicating them here. This article is already sufficiently long to read, after all, an I dislike redundancy.

NoSQL is Built Different

The main thing that NoSQL databases offer in the service of making cryptographers lose sleep at night is the schema-free nature of NoSQL designs.

What this means is that, if you’re using a client-side encryption library for a NoSQL database, the previous concerns about confused deputy attacks are amplified by the malleability of the document structure.

Additionally, the previously discussed cryptographic attacks against the encryption mode may be less expensive for an attacker to pull off.

Consider the following record structure, which stores a bunch of data stored with AES in CBC mode:

{
  "encrypted-data-key": "",
  "name": "",
  "address": [
    "",
    ""
  ],
  "social-security": "",
  "zip-code": ""
}

If this record is decrypted with code that looks something like this:

$decrypted = [];
// ... snip ...
foreach ($record['address'] as $i => $addrLine) {
    try {
        $decrypted['address'][$i] = $this->decrypt($addrLine);
    } catch (Throwable $ex) {
        // You'd never deliberately do this, but it's for illustration
        $this->doSomethingAnOracleCanObserve($i);
        
        // This is more believable, of course:
        $this->logDecryptionError($ex, $addrLine);
        $decrypted['address'][$i] = '';
    }
}

Then you can keep appending rows to the "address" field to reduce the number of writes needed to exploit a padding oracle attack against any of the fields.

Art: Harubaki

This isn’t to say that NoSQL is less secure than SQL, from the context of client-side encryption. However, the powerful feature sets that NoSQL users are accustomed to may also give attackers a more versatile toolkit to work with.

Record Authentication

A pedant may point out that record authentication applies to both SQL and NoSQL. However, I mostly only observe this feature in NoSQL databases and document storage systems in the wild, so I’m shoving it in here.

Encrypting fields is nice and all, but sometimes what you want to know is that your unencrypted data hasn’t been tampered with as it flows through your system.

The trivial way this is done is by using a digital signature algorithm over the whole record, and then appending the signature to the end. When you go to verify the record, all of the information you need is right there.

This works well enough for most use cases, and everyone can pack up and go home. Nothing more to see here.

Except…

When you’re working with NoSQL databases, you often want systems to be able to write to additional fields, and since you’re working with schema-free blobs of data rather than a normalized set of relatable tables, the most sensible thing to do is to is to append this data to the same record.

Except, oops! You can’t do that if you’re shoving a digital signature over the record. So now you need to specify which fields are to be included in the signature.

And you need to think about how to model that in a way that doesn’t prohibit schema upgrades nor allow attackers to perform downgrade attacks. (See below.)

I don’t have any specific real-world examples here that I can point to of this problem being solved well.

Art: CMYKat

Furthermore, as with preventing confused deputy and/or canonicalization attacks above, you must also include the fully qualified path of each field in the data that gets signed.

As I said with encryption before, but also true here:

Where your data lives is part of its identity, and MUST be authenticated.
Soatok’s Rule of Database Cryptography

This requirement holds true whether you’re using symmetric-key authentication (i.e. HMAC) or asymmetric-key digital signatures (e.g. EdDSA).

Bonus: A Maximally Schema-Free, Upgradeable Authentication Design

Art: Harubaki

Okay, how do you solve this problem so that you can perform updates and upgrades to your schema but without enabling attackers to downgrade the security? Here’s one possible design.

Let’s say you have two metadata fields on each record:

A compressed binary string representing which fields should be authenticated. This field is, itself, not authenticated. Let’s call this meta-auth.
A compressed binary string representing which of the authenticated fields should also be encrypted. This field is also authenticated. This is at most the same length as the first metadata field. Let’s call this meta-enc.

Furthermore, you will specify a canonical field ordering for both how data is fed into the signature algorithm as well as the field mappings in meta-auth and meta-enc.

{
  "example": {
    "credit-card": {
      "number": /* encrypted */,
      "expiration": /* encrypted */,
      "ccv": /* encrypted */
    },
    "superfluous": {
      "rewards-member": null
    }
  },
  "meta-auth": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false, /* example.superfluous.rewards-member */
    true   /* meta-enc */
  ]),
  "meta-enc": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false  /* example.superfluous.rewards-member */
  ]),
  "signature": /* -- snip -- */
}

When you go to append data to an existing record, you’ll need to update meta-auth to include the mapping of fields based on this canonical ordering to ensure only the intended fields get validated.

When you update your code to add an additional field that is intended to be signed, you can roll that out for new records and the record will continue to be self-describing:

New records will have the additional field flagged as authenticated in meta-auth (and meta-enc will grow)
Old records will not, but your code will still sign them successfully
To prevent downgrade attacks, simply include a schema version ID as an additional plaintext field that gets authenticated. An attacker who tries to downgrade will need to be able to produce a valid signature too.

You might think meta-auth gives an attacker some advantage, but this only includes which fields are included in the security boundary of the signature or MAC, which allows unauthenticated data to be appended for whatever operational purpose without having to update signatures or expose signing keys to a wider part of the network.

{
  "example": {
    "credit-card": {
      "number": /* encrypted */,
      "expiration": /* encrypted */,
      "ccv": /* encrypted */
    },
    "superfluous": {
      "rewards-member": null
    }
  },
  "meta-auth": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false, /* example.superfluous.rewards-member */
    true,  /* meta-enc */
    true   /* meta-version */
  ]),
  "meta-enc": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false, /* example.superfluous.rewards-member */
    true   /* meta-version */
  ]),
  "meta-version": 0x01000000,
  "signature": /* -- snip -- */
}

If an attacker tries to use the meta-auth field to mess with a record, the best they can hope for is an Invalid Signature exception (assuming the signature algorithm is secure to begin with).

Even if they keep all of the fields the same, but play around with the structure of the record (e.g. changing the XPath or equivalent), so long as the path is authenticated with each field, breaking this is computationally infeasible.

Searchable Encryption

If you’ve managed to make it through the previous sections, congratulations, you now know enough to build a secure but completely useless database.

Art: CMYKat

Okay, put away the pitchforks; I will explain.

Part of the reason why we store data in a database, rather than a flat file, is because we want to do more than just read and write. Sometimes computer scientists want to compute. Almost always, you want to be able to query your database for a subset of records based on your specific business logic needs.

And so, a database which doesn’t do anything more than store ciphertext and maybe signatures is pretty useless to most people. You’d have better luck selling Monkey JPEGs to furries than convincing most businesses to part with their precious database-driven report generators.

Art: Sophie

So whenever one of your users wants to actually use their data, rather than just store it, they’re forced to decide between two mutually exclusive options:

Encrypting the data, to protect it from unauthorized disclosure, but render it useless
Doing anything useful with the data, but leaving it unencrypted in the database

This is especially annoying for business types that are all in on the Zero Trust buzzword.

Fortunately, the cryptographers are at it again, and boy howdy do they have a lot of solutions for this problem.

Order-{Preserving, Revealing} Encryption

On the fun side of things, you have things like Order-Preserving and Order-Revealing Encryption, which Matthew Green wrote about at length.

[D]atabase encryption has been a controversial subject in our field. I wish I could say that there’s been an actual debate, but it’s more that different researchers have fallen into different camps, and nobody has really had the data to make their position in a compelling way. There have actually been some very personal arguments made about it.
Attack of the week: searchable encryption and the ever-expanding leakage function

The problem with these designs is that they have a significant enough leakage that it no longer provides semantic security.

From Grubbs, et al. (GLMP, 2019.)
Colors inverted to fit my blog’s theme better.

To put it in other words: These designs are only marginally better than ECB mode, and probably deserve their own poems too.

Order revealing
Reveals much more than order
Softcore ECB

Order preserving
Semantic security?
Only in your dreams
Haiku for your consideration

Deterministic Encryption

Here’s a simpler, but also terrible, idea for searchable encryption: Simply give up on semantic security entirely.

If you recall the AES_{De,En}crypt() functions built into MySQL I mentioned at the start of this article, those are the most common form of deterministic encryption I’ve seen in use.

 SELECT * FROM foo WHERE bar = AES_Encrypt('query', 'key');

However, there are slightly less bad variants. If you use AES-GCM-SIV with a static nonce, your ciphertexts are fully deterministic, and you can encrypt a small number of distinct records safely before you’re no longer secure.

From Page 14 of the previous linked PDF

That’s certainly better than nothing, but you also can’t mitigate confused deputy attacks. But we can do better than this.

Homomorphic Encryption

In a safer plane of academia, you’ll find homomorphic encryption, which researchers recently demonstrated with serving Wikipedia pages in a reasonable amount of time.

Homomorphic encryption allows computations over the ciphertext, which will be reflected in the plaintext, without ever revealing the key to the entity performing the computation.

If this sounds vaguely similar to the conditions that enable chosen-ciphertext attacks, you probably have a good intuition for how it works: RSA is homomorphic to multiplication, AES-CTR is homomorphic to XOR. Fully homomorphic encryption uses lattices, which enables multiple operations but carries a relatively enormous performance cost.

This post first appeared on Dhole Moments - Software, Security, Cryptography, And The Furry Fandom, please read the originial post: here