qdda is a tool that scans any Linux disk or file (or multiple disks) and predicts the potential thin, dedupe and compression savings if you were to move that data to an all-flash array such as DellEMC XtremIO or VMAX All Flash. In contrast to similar (usually vendor-provided) tools, qdda runs completely independently. It does NOT require registration or sending a binary dataset back to the mothership (which would be a security risk), and anyone can inspect the source code and run it, so there are no hidden secrets.
The reason the follow-up took a while is that I spent a lot of evening hours almost completely rewriting the tool. A summary of changes:
- Runs completely as a non-privileged user (i.e. 'nobody'), making it safe to run on production systems
- Increased the hash size to 60 bits so it scales to at least 80 terabytes without compromising accuracy
- Decreased the database space consumption by 50%
- Multithreading: separate readers, workers and a single database updater allow qdda to use multiple CPU cores
- Many other huge performance improvements (qdda has been demonstrated to scan data at about 7 GB/s on a fast server; the bottleneck was I/O, and it could theoretically handle double that bandwidth before maxing out on database updates)
- Very detailed embedded man page (manual). The qdda executable itself can show its own man page (on Linux with ‘man’ installed)
- Improved standard reports and detailed reports with compression and dedupe histograms
- Option to define your own custom array definitions
- Removed system dependencies (SQLite, LZ4 and other libraries) so qdda runs on almost any Linux system and can be downloaded as a single executable (no more need to install RPM packages)
- Many other small improvements and additions
- Completely moved to GitHub, where you can also download the binary
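The reader/worker/updater split mentioned in the changes above can be illustrated with a minimal C++ sketch. This is not qdda's actual code (all names and the queue design are made up for illustration): a reader pushes block numbers into a shared queue, worker threads pull and process them, and all results funnel through a single serialization point, standing in for the database updater.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal single-reader / multi-worker pipeline sketch.
class BlockQueue {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
        cv_.notify_one();
    }
    bool pop(int& v) {   // returns false once drained and closed
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this]{ return !q_.empty() || done_; });
        if (q_.empty()) return false;
        v = q_.front(); q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
};

long run_pipeline(int blocks, int workers) {
    BlockQueue queue;
    long processed = 0;
    std::mutex update_mutex;  // single point of serialization (the "updater")

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&]{
            int blk;
            while (queue.pop(blk)) {
                // real code would hash/compress the block here
                std::lock_guard<std::mutex> lk(update_mutex);
                ++processed;  // stand-in for the database update
            }
        });

    for (int b = 0; b < blocks; ++b) queue.push(b);  // the reader
    queue.close();
    for (auto& t : pool) t.join();
    return processed;
}
```

The single updater is what keeps the database consistent without locking in every worker; the trade-off is that it can become the bottleneck, as described further down.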
Read the overview and animated demo on the project homepage here: https://github.com/outrunnl/qdda
HTML version of the detailed manual page: https://github.com/outrunnl/qdda/blob/master/doc/qdda.md
As qdda is licensed under the GPL, it comes with no guarantees. My recommendation is to use it for learning purposes or for a first what-if analysis; if you're interested in data reduction numbers from a vendor, ask them for a formal analysis using their own tools. That said, I did a few comparison tests and the data reduction numbers were within 1% of the results from vendor-supported tools. The man page has a section on accuracy explaining the differences.
A standard run with a generated test dataset:
qdda 2.0.4 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

Database info (/home/bart/qdda.db):
database size       =    1.13 MiB
array id            = XtremIO X2
blocksize           =   16 KiB

Overview:
total               = 2048.00 MiB (    131072 blocks)
free (zero)         =  512.00 MiB (     32768 blocks)
used                = 1536.00 MiB (     98304 blocks)
dedupe savings      =  640.00 MiB (     40960 blocks)
deduped             =  896.00 MiB (     57344 blocks)
compressed          =  451.93 MiB (     49.56 %)
allocated           =  483.25 MiB (     30928 blocks)

Details:
used                = 1536.00 MiB (     98304 blocks)
compressed raw      =  774.37 MiB (     49.59 %)
unique data         =  512.00 MiB (     32768 blocks)
non-unique data     = 1024.00 MiB (     65536 blocks)

Summary:
percentage used     =   75.00 %
percentage free     =   25.00 %
deduplication ratio =    1.71
compression ratio   =    1.85
thin ratio          =    1.33
combined            =    4.24
raw capacity        = 2048.00 MiB
net capacity        =  483.25 MiB
You can see the total data reduction (1:4.24) for an XtremIO X2.
A detailed report with histograms shows more insights:
qdda 2.0.4 - The Quick & Dirty Dedupe Analyzer
Use for educational purposes only - actual array reduction results may vary

File list:
file    blksz   blocks     MiB  date           url
   1    16384     8192     128  20180514_0659  workstation:/dev/urandom
   2    16384    16384     256  20180514_0659  workstation:/dev/urandom
   3    16384    32768     512  20180514_0659  workstation:/dev/zero
   4    16384    32768     512  20180514_0659  workstation:/dev/urandom

Dedupe histogram:
dup     blocks    perc      MiB
  0      32768   25.00   512.00
  1      32768   25.00   512.00
  2      32768   25.00   512.00
  4      32768   25.00   512.00
Total:  131072  100.00  2048.00

Compression Histogram (XtremIO X2):
size   buckets    perc   blocks      MiB
   1      3360    5.86      210     3.28
   2      3670    6.40      459     7.17
   3      3526    6.15      662    10.34
   4      3601    6.28      901    14.08
   5      3629    6.33     1135    17.73
   6      3621    6.31     1358    21.22
   7      3498    6.10     1531    23.92
   8      3474    6.06     1737    27.14
   9      3530    6.16     1986    31.03
  10      3582    6.25     2239    34.98
  11      3582    6.25     2463    38.48
  12      3533    6.16     2650    41.41
  13      3651    6.37     2967    46.36
  15      7319   12.76     6862   107.22
  16      3768    6.57     3768    58.88
Total:   57344  100.00    30928   483.25
The detailed report gives deeper insight into the distribution of dedupe and compression ratios. This example uses a synthetic reference test set; future posts will show actual reports using databases. For a real example, see my previous blog post.
Things I have learned
As qdda is written in C++, making it multithreaded was quite a learning experience. Having multiple threads read from and write to a chunk of shared memory is tricky, and the devil is in the details. As a programmer you need to be very strict in making sure multiple threads don't change data at the same time, avoid deadlocks and race conditions, and much more. I now have the deepest respect for system programmers who write operating systems or database engines that have to deal with all this and at the same time provide the best possible performance.
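A classic illustration of how subtle this is (a generic example, not from qdda's code): incrementing a plain shared counter from several threads is a data race and can silently lose updates, while std::atomic makes the same increment safe.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Increment a shared counter from 'threads' threads, 'iters' times each.
// With atomic_version=false this is a data race: the plain ++ is a
// read-modify-write and concurrent updates can be lost (undefined
// behavior in C++). With atomic_version=true the result is exact.
long count_with(int threads, int iters, bool atomic_version) {
    long unsafe_count = 0;
    std::atomic<long> safe_count{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&]{
            for (int i = 0; i < iters; ++i) {
                if (atomic_version) ++safe_count;
                else ++unsafe_count;   // racy: not synchronized!
            }
        });
    for (auto& th : pool) th.join();
    return atomic_version ? safe_count.load() : unsafe_count;
}
```

The racy variant may happen to produce the right answer on a quiet machine, which is exactly what makes these bugs so hard to find.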
On SQLite: I developed the first version of qdda with SQLite because it's free, widely used and relatively simple to embed. My initial opinion was that it was only good for small, simple stuff; I have changed my mind. SQLite scales to very large amounts of data, is completely ACID-compliant and can be VERY fast. There are some limitations of course, but I think SQLite can sometimes replace or enhance (parts of) an enterprise RDBMS. For example, using SQLite for ETL or similar big data analytics processing can be more efficient than using a full-fledged RDBMS, which requires a client/server connection and careful storage management (such as creating tablespaces and setting parameters).
Tuning a C++ application is yet another thing I got my hands dirty on. Dealing with debuggers, inserting timestamps to find out where most time is spent, learning a few things about compilers and profilers, and a lot of trial and error were a completely new experience as well. An example: on my home lab server with an NVMe flash drive, qdda could easily achieve about 600-700 MB/s throughput, and I/O bandwidth was clearly the bottleneck. The multithreaded version, running against a VMAX all-flash array in one of our labs, achieved about 1.5 GB/s and was still I/O limited. Then I got my hands on a Dell 940 completely filled with Intel NVMe SSDs (over 50 TB of flash!). qdda maxed out at about 2 GB/s, but now the bottleneck shifted to CPU. Careful inspection revealed that the way I updated the staging table in SQLite was inefficient; after improving it I achieved 3 GB/s. Still maxed out on CPU (with 72 cores available, yet only one thread was running at 100%). It turned out to be the DB updater thread (again). Careful testing showed that clearing the data buffers after use with memset() was stalling the updater and driving it to 100%. Moving the memset() to the worker threads solved the issue (later I eliminated the need to clear the buffers at all).
Now qdda could scan data at 7 GB/s and was again I/O limited (the updater thread was running at less than 50%, so theoretically qdda could scan at 15 GB/s given enough bandwidth and CPU power).
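The "inserting timestamps" approach is crude but effective. A minimal sketch of the kind of instrumentation that exposes a stall like the memset() one (all names and buffer sizes here are illustrative, not qdda's code):

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Time an arbitrary code section and return elapsed microseconds.
// Sprinkling calls like this around suspect sections quickly shows
// where the wall-clock time actually goes.
template <typename F>
long long time_us(F&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}

// Example suspect: how long does clearing a 'mib'-MiB buffer take?
long long time_buffer_clear(std::size_t mib) {
    std::vector<char> buf(mib * 1024 * 1024, 1);
    return time_us([&] { std::memset(buf.data(), 0, buf.size()); });
}
```

Run inside a single hot thread, a few milliseconds per buffer clear adds up fast, which is why moving that work to the (many) worker threads relieved the single updater.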
Note that this server could achieve much higher bandwidth, but that would require multiple reader threads. I did Oracle testing on this machine and achieved over 70 GB/s total bandwidth (!) with a multi-threaded Oracle Create Table As Select script using SLOB data. It is interesting to see that a single Dell server can achieve such huge bandwidth numbers (FYI, average CPU utilization was less than 35%, so there was plenty of CPU left to also do CPU-intensive processing on the data at the same time).
Real All-flash arrays
One of the things I have found is that modern CPUs can compress and hash data at an incredible rate. My modest i5-4440 can compress and hash at about 500 MB/s per CPU core. Consider a fast dual-socket server CPU with multiple cores (22 cores per socket is pretty standard these days). Such a machine has 44 cores, where each core can likely compress and hash faster than my workstation; say it can do 1 GB/s per core. Then on 44 cores it could compress and hash data at 22 GB/s (each core needs to compress AND hash the same data blocks, so the effective rate is halved). Although the example is not accurate (for example, XtremIO uses different compression and hash algorithms and also needs to maintain hash tables, etc.), it gives an impression of why inline compression and dedupe have negligible overhead. There is no reason to delay such operations to a post-processing stage anymore; at least, CPU performance is no valid reason.
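The back-of-envelope math above can be made explicit (the numbers are the same illustrative assumptions as in the paragraph: 44 cores, 1 GB/s per core, two passes over each block for compress plus hash):

```cpp
// Aggregate throughput estimate: each block is touched 'passes' times
// (here: once for compression, once for hashing), so the effective
// per-core rate is divided by the number of passes.
double aggregate_gbps(int cores, double per_core_gbps, int passes) {
    return cores * per_core_gbps / passes;
}
// aggregate_gbps(44, 1.0, 2) gives 22 GB/s, matching the estimate above.
```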
I downloaded qdda from GitHub onto a Raspberry Pi (just for the fun of it; yes, I'm a geek). Not for any real purpose, just to see if it would compile and run. I got a few warnings because the de facto Raspberry Pi OS (Raspbian Linux) is 32-bit and I use some 64-bit integer arithmetic, but it worked out of the box.
I also tried to compile it on Solaris x86 due to a customer request. I had to comment out a few Linux-specific syscalls, and then it compiled. But running it does basically nothing (it fails to start reading data for some reason), something I still need to look into. So for now qdda is Linux-only, although the manual describes a way to read data from other (UNIX-like) systems over a network pipe.
In the future I intend to write some blog posts on typical use cases (including but not limited to Oracle databases).
This post first appeared on Dirty Cache by Bart Sjerps. Copyright © 2011 – 2018. All rights reserved. Not to be reproduced for commercial purposes without written permission.