New to AppSec?

AppSec is a diverse and complex field, but getting started doesn't mean mastering all the concepts at once. Check out my "Intro to AppSec" guide for a quick guide to understanding the basics.

Looking to Learn?

Already in the field? Check out the career curriculum to learn how to take your skills to the next level.

Real Training, Really Free

Tired of reading articles? Ready for real hands on training? Check out the Training section for tips on building a training lab and practicing your skills in a safe (legal) environment!

MySQL and Efficient Data Encoding

Posted on by Brooks Garrett - No Comments

As I’ve been working to expose the National Software Reference List via a new webservice, I’ve had to find ways to store data efficiently to avoid nuking my server. One of the biggest issues was the shear size of the database.

Each file record has 2 hashes, a SHA-1 and a MD5 hex-encoded value. Currently, there are 58,272,836 files hashed as part of the NSRL effort. This means 58,272,836 rows of data and 116,545,672 hash values.

So, let’s do some quick math. An MD5 hash is 32 Hexadecimal characters which is 32 bytes of data. SHA-1 is represented by 40 Hexadecimal characters which means 40 bytes of data. So combined, a string representation of the hashes will require 72 bytes for every row. Some quick math gives us the space requirements just the hashes!

  • 72B * 58,272,836 Records = 4,195,644,192 Bytes = 4.2 GB

That means nearly 4.2GB of storage JUST FOR THE HASHES!!! The solution lies in understanding what the hash actually represents and further, how MySQL can store that value. A hash is represented by a Hex string, but that is just a representation of an underlying number. That means we can store the hash as an INT or similar raw number, but we can shrink our requirements even more by storing the hash as a binary value. MySQL even provides a native method [hex() unhex()] for converting between a HEX string and binary.

By storing the hash as a binary object, we realize IMMENSE space savings. Warning: More Math! A SHA-1 hash is 160 bits while a MD5 hash is 128 bits. All told, that is 288 bits per row where previously we were using 576 bits.

  • 288 bits / 576 bits = 50% Savings
  • 36 B * 58,272,836 RecordsĀ = 2,097,822,096 B = 2.1GB

And there you have it, we reduced our storage utilization by 50% down to 2.1GB. Now that’s efficient!

blog comments powered by Disqus