It’s almost a year since I blogged about qdda (the Quick & Dirty Dedupe Analyzer).
qdda is a tool that lets you scan any Linux disk or file (or multiple disks) and predicts potential thin, dedupe and compression savings if you were to move that disk or file to an all-flash array like Dell EMC XtremIO or VMAX All Flash. In contrast to similar (usually vendor-provided) tools, qdda can run completely independently. It does NOT require registration or sending a binary dataset back to the mothership (which would be a security risk). Anyone can inspect the source code and run it, so there are no hidden secrets.
It’s based upon the most widely deployed database engine, SQLite, and uses MD5 hashing and LZ4 compression to produce data reduction estimates.
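To give an idea of what happens under the hood, here is a minimal sketch of the per-block processing idea. This is my own illustration, not qdda's actual source, and the function names are mine: compute an MD5 hash truncated to fit in a signed 64-bit SQLite integer (qdda keeps 60 bits) and LZ4-compress the block to estimate its compressed size.

// Minimal sketch (not qdda's actual code) of per-block processing.
// Requires OpenSSL (-lcrypto) and LZ4 (-llz4).
#include <openssl/md5.h>
#include <lz4.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Truncate an MD5 digest to 60 bits so it fits in a signed 64-bit SQLite
// integer while keeping collisions negligible at tens of terabytes.
uint64_t hash60(const char* block, size_t len) {
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(block), len, digest);
    uint64_t h = 0;
    std::memcpy(&h, digest, sizeof(h));   // take the first 8 bytes of the digest
    return h >> 4;                        // keep 60 bits
}

// LZ4-compress the block and return the compressed size; an array would
// round this up to its internal bucket sizes to compute allocated space.
int compressed_bytes(const char* block, int len) {
    std::vector<char> out(LZ4_compressBound(len));
    return LZ4_compress_default(block, out.data(), len, static_cast<int>(out.size()));
}

Counting how many hashes are unique gives the dedupe estimate; summing the (bucketed) compressed sizes gives the compression estimate.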
The reason it took a while to follow up is that I spent a lot of evening hours almost completely rewriting the tool. A summary of changes:
- Run completely as non-privileged user (i.e. ‘nobody’) to make it safe to run on production systems
- Increased the hash to 60 bits so it scales to at least 80 terabytes without compromising accuracy
- Decreased the database space consumption by 50%
- Multithreading so there are separate readers, workers and a single database updater which allows qdda to use multiple CPU cores
- Many other huge performance improvements (qdda has been shown to scan data at about 7 GB/s on a fast server; the bottleneck was I/O, and it could theoretically handle double that bandwidth before maxing out on database updates)
- Very detailed embedded man page (manual). The qdda executable itself can show its own man page (on Linux with ‘man’ installed)
- Improved standard reports and detailed reports with compression and dedupe histograms
- Option to define your own custom array definitions
- Removed dependencies on system libraries (SQLite, LZ4 and others) so qdda runs on almost any Linux system and can be downloaded as a single executable (no more need to install RPM packages)
- Many other small improvements and additions
- Completely moved to GitHub, where you can also download the binary
Read the overview and animated demo on the project homepage here: https://github.com/outrunnl/qdda
HTML version of the detailed manual page: https://github.com/outrunnl/qdda/blob/master/doc/qdda.md
As qdda is licensed under the GPL, it offers no guarantees on anything. My recommendation is to use it for learning purposes or a first what-if analysis; if you’re interested in data reduction numbers from the vendor, ask them for a formal analysis using their own tools. That said, I did a few comparison tests and the data reduction numbers were within 1% of the results from vendor-supported tools. The man page has a section on accuracy explaining the differences.
Example output
Standard run with a generated test data set:
qdda 2.0.4 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

Database info (/home/bart/qdda.db):
database size       =    1.13 MiB
array id            = XtremIO X2
blocksize           =      16 KiB

Overview:
total               = 2048.00 MiB (    131072 blocks)
free (zero)         =  512.00 MiB (     32768 blocks)
used                = 1536.00 MiB (     98304 blocks)
dedupe savings      =  640.00 MiB (     40960 blocks)
deduped             =  896.00 MiB (     57344 blocks)
compressed          =  451.93 MiB (     49.56 %)
allocated           =  483.25 MiB (     30928 blocks)

Details:
used                = 1536.00 MiB (     98304 blocks)
compressed raw      =  774.37 MiB (     49.59 %)
unique data         =  512.00 MiB (     32768 blocks)
non-unique data     = 1024.00 MiB (     65536 blocks)

Summary:
percentage used     =   75.00 %
percentage free     =   25.00 %
deduplication ratio =    1.71
compression ratio   =    1.85
thin ratio          =    1.33
combined            =    4.24
raw capacity        = 2048.00 MiB
net capacity        =  483.25 MiB
You can see the total data reduction (1:4.24) for an XtremIO X2.
A detailed report with histograms shows more insights:
qdda 2.0.4 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

File list:
file    blksz   blocks    MiB  date           url
   1    16384     8192    128  20180514_0659  workstation:/dev/urandom
   2    16384    16384    256  20180514_0659  workstation:/dev/urandom
   3    16384    32768    512  20180514_0659  workstation:/dev/zero
   4    16384    32768    512  20180514_0659  workstation:/dev/urandom

Dedupe histogram:
dup      blocks    perc      MiB
0         32768   25.00   512.00
1         32768   25.00   512.00
2         32768   25.00   512.00
4         32768   25.00   512.00
Total:   131072  100.00  2048.00

Compression Histogram (XtremIO X2):
size    buckets    perc   blocks      MiB
1          3360    5.86      210     3.28
2          3670    6.40      459     7.17
3          3526    6.15      662    10.34
4          3601    6.28      901    14.08
5          3629    6.33     1135    17.73
6          3621    6.31     1358    21.22
7          3498    6.10     1531    23.92
8          3474    6.06     1737    27.14
9          3530    6.16     1986    31.03
10         3582    6.25     2239    34.98
11         3582    6.25     2463    38.48
12         3533    6.16     2650    41.41
13         3651    6.37     2967    46.36
15         7319   12.76     6862   107.22
16         3768    6.57     3768    58.88
Total:    57344  100.00    30928   483.25
The detailed report gives deeper insight into the distribution of dedupe and compression ratios. This example uses a synthetic reference test set. Future posts will show actual reports using databases; for a real example, look at my previous blog post.
Things I have learned
Multithreading
qdda is written in C++, and making it multithreaded was quite a learning experience. Having multiple threads reading and writing a chunk of shared memory is tricky, and the devil is in the details. As a programmer you need to be very strict about making sure multiple threads don’t modify data at the same time, avoiding deadlocks and race conditions, and much more. I now have the deepest respect for system programmers who write operating systems or database engines that have to deal with all of this while still providing the best possible performance.
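For the curious, the reader/worker/updater split boils down to the classic bounded producer/consumer pattern. The sketch below is my own simplified illustration of that pattern (a mutex plus two condition variables), not qdda's actual queue implementation:

// Minimal producer/consumer sketch: readers push buffers, workers pop them,
// all protected by one mutex. Not qdda's actual code.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct BlockQueue {
    std::queue<std::vector<char>> q;
    std::mutex m;
    std::condition_variable not_empty, not_full;
    size_t capacity = 64;
    bool done = false;

    void push(std::vector<char> buf) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [this]{ return q.size() < capacity; });
        q.push(std::move(buf));
        not_empty.notify_one();           // wake one waiting worker
    }

    // Returns false when the queue is drained and producers are finished.
    bool pop(std::vector<char>& buf) {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [this]{ return !q.empty() || done; });
        if (q.empty()) return false;
        buf = std::move(q.front());
        q.pop();
        not_full.notify_one();            // free a slot for a reader
        return true;
    }

    void finish() {                       // called once all readers are done
        std::lock_guard<std::mutex> lk(m);
        done = true;
        not_empty.notify_all();
    }
};

Readers block when the queue is full, workers block when it is empty, and the done flag lets the consumers drain the queue and exit cleanly.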
Databases
I developed the first version of qdda with SQLite because it’s free, widely used and relatively simple to embed. My initial opinion was that it was only good for small, simple stuff, but I changed my mind. SQLite scales to very large amounts of data, is fully ACID-compliant and can be VERY fast. There are some limitations of course, but I think SQLite can sometimes replace or enhance (parts of) an enterprise RDBMS. For example, using SQLite for ETL or similar big data analytics processing could sometimes be more efficient than using a full-fledged RDBMS, which requires a client/server connection and careful storage management (such as creating tablespaces and setting parameters).
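As an illustration of why SQLite can be so fast for this kind of bulk work, the sketch below (my own example, not qdda's code; the table and column names are made up) loads rows into a staging table using one prepared statement inside a single transaction, so the database syncs to disk once instead of once per row:

// Sketch: bulk-load hash/size pairs into a SQLite staging table inside one
// transaction. Error handling trimmed for brevity. Link with -lsqlite3.
#include <sqlite3.h>
#include <cstdint>
#include <utility>
#include <vector>

void load_staging(const char* dbfile,
                  const std::vector<std::pair<int64_t, int64_t>>& rows) {
    sqlite3* db = nullptr;
    sqlite3_open(dbfile, &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS staging(hash INTEGER, bytes INTEGER)",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);   // one big transaction

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO staging VALUES (?,?)", -1, &stmt, nullptr);
    for (const auto& r : rows) {
        sqlite3_bind_int64(stmt, 1, r.first);    // 60-bit hash fits in int64
        sqlite3_bind_int64(stmt, 2, r.second);   // compressed size
        sqlite3_step(stmt);
        sqlite3_reset(stmt);                     // reuse the prepared statement
    }
    sqlite3_finalize(stmt);

    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);  // fsync once, not per row
    sqlite3_close(db);
}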
Performance
Tuning a C++ application was yet another thing I got my hands dirty on. Dealing with debuggers, inserting timestamps to find out where the process spends most of its time, learning a few things about compilers and profilers, and a lot of trial and error were a completely new experience as well. An example: on my home lab server with NVMe flash, qdda could easily achieve about 600-700 MB/s throughput, and I/O bandwidth was clearly the bottleneck. The multithreaded version running in one of our labs against a VMAX all-flash array achieved about 1.5 GB/s and was still limited by I/O. Then I got my hands on a Dell 940 completely filled with Intel NVMe SSDs (over 50 TB of flash!). Qdda maxed out at about 2 GB/s, but now the bottleneck shifted to CPU. Careful inspection showed that the way I updated the staging table in SQLite was not efficient; after improving that I achieved 3 GB/s, still maxed out on CPU (with 72 cores available, but only one thread running at 100%). It turned out to be the database updater thread again. Careful testing showed that clearing the data buffers after use with memset() was stalling the updater and driving it to 100%. Moving the memset() to the other workers solved the issue (later I eliminated the need to clear the buffers at all).
Now qdda could scan data at 7 GB/s and was again I/O-limited (the updater thread was running at less than 50%, so theoretically qdda could scan at 15 GB/s as long as there is enough bandwidth and CPU power available).
Note that this server could achieve much higher bandwidth, but that would require multiple reader threads. I did Oracle testing on this machine and achieved over 70 GB/s total bandwidth (!) with a multi-threaded Oracle Copy Table As Select script using SLOB data. It’s interesting that a single Dell server can reach such huge bandwidth numbers (FYI, the average CPU utilization was less than 35%, so there was plenty of CPU available to do CPU-intensive processing on the data at the same time).
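The "inserting timestamps" approach mentioned above is nothing more sophisticated than the sketch below: wrap each suspect stage in a steady-clock timer and compare the per-thread totals afterwards. This is illustrative only, not qdda's actual instrumentation:

// Sketch of poor-man's profiling: accumulate wall-clock time spent in a
// code section, one timer per stage per thread.
#include <chrono>
#include <cstdio>

struct SectionTimer {
    using clock = std::chrono::steady_clock;
    clock::duration total{0};
    clock::time_point start;

    void begin() { start = clock::now(); }
    void end()   { total += clock::now() - start; }

    void report(const char* name) const {
        double ms = std::chrono::duration<double, std::milli>(total).count();
        std::printf("%-20s %10.1f ms\n", name, ms);
    }
};

// Usage inside a worker loop (hash_block and update_staging are placeholders):
//   timer_hash.begin();  hash_block(buf);      timer_hash.end();
//   timer_db.begin();    update_staging(buf);  timer_db.end();
// Comparing the totals shows which stage stalls the pipeline.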
Real All-flash arrays
One of the things I have found is that modern CPUs can compress and hash data at an incredible rate. My modest Core i5-4440 can compress and hash at about 500 MB/s per CPU core. Now consider a fast dual-socket server (22 cores per socket is pretty standard these days). Such a machine has 44 cores, where each core can likely compress and hash faster than my workstation. Say each core can compress at 1 GB/s and hash at 1 GB/s; since every block needs to be both compressed and hashed, that works out to roughly 0.5 GB/s of data per core, or about 22 GB/s across 44 cores. Although the example is not accurate (for example, XtremIO uses different compression and hash algorithms and also needs to maintain hash tables, etc.), you still get an impression of why inline compression and dedupe have negligible overhead. There is no reason to delay such operations to a post-processing stage anymore; at least CPU performance is not a valid reason.
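If you want to check the arithmetic on your own hardware, a rough single-core benchmark like the one below (my own sketch, with a made-up data pattern; real data will compress differently) measures how fast one core can both LZ4-compress and MD5-hash 16 KiB blocks:

// Rough single-core benchmark: compress+hash throughput of one core.
// Link with -lcrypto -llz4. Numbers will vary with CPU and data.
#include <openssl/md5.h>
#include <lz4.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int blocksize = 16384;              // 16 KiB, like XtremIO X2
    const int blocks    = 65536;              // 1 GiB total
    std::vector<char> in(blocksize);
    for (int i = 0; i < blocksize; ++i) in[i] = char(i * 131 + 7);  // semi-random fill
    std::vector<char> out(LZ4_compressBound(blocksize));
    unsigned char digest[MD5_DIGEST_LENGTH];

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < blocks; ++i) {
        LZ4_compress_default(in.data(), out.data(), blocksize, static_cast<int>(out.size()));
        MD5(reinterpret_cast<const unsigned char*>(in.data()), blocksize, digest);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double mib  = double(blocks) * blocksize / (1024.0 * 1024.0);
    std::printf("compress+hash: %.0f MiB/s on one core\n", mib / secs);
    return 0;
}

Multiply the per-core result by the number of cores to get a ballpark aggregate figure; the loop already does both operations on every block, so no further halving is needed.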
Cross-compiling
I downloaded qdda from GitHub on a Raspberry Pi (just for the fun of it; yes, I’m a geek). Not for any real purpose, just to see if it would compile and run. I got a few warnings because the de facto Raspberry Pi OS (Raspbian Linux) is 32-bit and I use some 64-bit integer code, but it worked out of the box.
I also tried to compile it on Solaris x86 due to a customer request. I had to comment out a few Linux-specific syscalls, and then it compiled. But running it does basically nothing (it fails to start reading data for some reason), something I still need to look into. So for now qdda is Linux-only, although the manual describes a way to feed it data from other (UNIX-like) systems over a network pipe.
In the future I intend to write some blog posts on typical use cases (including but not limited to Oracle databases).