Dirty Cache

As announced in my last blogpost, qdda is a tool that analyzes potential storage savings by scanning data and giving a deduplication, compression and thin provisioning estimate. The results are an indication of whether a modern All-Flash Array (AFA) like Dell EMC XtremIO would be worth considering.

In this (lengthy) post I will go over the basics of qdda and run a few synthetic test scenarios to show what’s possible. The next posts will cover more advanced scenarios such as running against Oracle database data, multiple nodes and other exotic ones such as running against ZFS storage pools.

[ Warning: Lengthy technical content, Rated T, parental advisory required ]

Since the introduction of qdda I have made lots of improvements, including:

Solved performance issues when processing large datasets (more than 500GB) by using a non-indexed staging table which is merged into the destination table later. Scanning more than a few hundred GB still requires some patience 🙂 I tested with a 4 TB dataset, which was manageable. Expect it to scale reasonably well up to 10-20TB, maybe more
Added an import function to allow merging data collected on various runs (or hosts), possibly Exadata storage cells
Changed the hash from CRC32 to MD5, as CRC32 turned out to create too many hash collisions on large datasets (see Wikipedia – Birthday problem). QDDA now uses a truncated MD5 hash of 6 bytes. MD5 is actually 16 bytes (128 bits) but SQLite integers are limited to 8 bytes and are signed. Truncating to 6 bytes offers a good trade-off between performance, database size and accuracy on large datasets. It’s possible to recompile qdda with 7 bytes hashing if you really need it. Why not another hash algorithm, given that MD5 is known to be vulnerable to hash collision attacks? MD5 outperforms the others, and for my purpose hash attacks are not relevant. A few hash collisions on very large datasets are manageable for this purpose, so I think more than 6 bytes (48 bits) is not needed (let me know if you disagree)
Added histogram reports that show the distribution of compress ratios and dedup counts
Various minor bugfixes and changes in user interface
Source dumped on GitHub. Now everyone can complain about my bad C++ coding style (or better, improve the code and submit a Pull Request)
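
The hash-size trade-off in the list above can be made concrete with the birthday approximation: for n blocks and a b-bit hash, the expected number of colliding block pairs is roughly n(n-1)/2 divided by 2^b. The sketch below is only that estimate, not code from qdda. The function name expected_collisions is made up for illustration, and it assumes the hash values are uniformly distributed.

```shell
#!/bin/bash
# Birthday-problem estimate: expected number of colliding block pairs
# for a given block count and hash width. Illustrative helper, not part of qdda.
expected_collisions() {
    # $1 = number of blocks, $2 = hash size in bits
    awk -v n="$1" -v bits="$2" 'BEGIN { printf "%.0f\n", n * (n - 1) / 2 / 2^bits }'
}

# A 4 TiB dataset at 8 KiB per block is 2^29 = 536870912 blocks
blocks=$(( 4 * 1024 * 1024 * 1024 * 1024 / 8192 ))
echo "CRC32 (32 bits):         $(expected_collisions "$blocks" 32)"
echo "truncated MD5 (48 bits): $(expected_collisions "$blocks" 48)"
```

With these inputs the estimate is about 33.5 million colliding pairs for a 32-bit hash against roughly 512 for a 48-bit hash, which is in line with the "few collisions" reasoning above.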

Preparation

I created a VMware Workstation VM running a CentOS 6 minimal install, with a 16GiB boot disk and 3 data disks (4, 1 and 1 GiB). The disks are on a flash drive to speed up I/O. After the first login, I installed some tools and set up storage space.

# Install EPEL repository and 7zip archiver (needed later)
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
yum install p7zip

# Install Outrun Extras repository (the YUM RPM repo where 
# I publish my tools and provide updates)
# and install qdda and asmdisks
yum install http://yum.outrun.nl/outrun-extras.rpm
yum install qdda asmdisks

# An overview of SCSI disks on the system
asm disks
Device                SCSI       Size Type     Target
/dev/sda         [0:0:0:0]    16.0 GB dos      -
/dev/sdb         [0:0:1:0]     4.0 GB blank    -
/dev/sdc         [0:0:2:0]     1.0 GB blank    -
/dev/sdd         [0:0:3:0]     1.0 GB blank    -

# Create a few scripts in /usr/local/bin: 
ls -ad /usr/local/bin/*
/usr/local/bin/cacheflush  
/usr/local/bin/fsreclaim
# cacheflush: syncs file system buffers and drops write cache 
# (flush dirty cache buffers to disk)
# fsreclaim: simple script that writes a zeroed file into a 
# file system and then deletes it
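
The two helper scripts are only described here, not listed. A minimal sketch of what they might look like, written as shell functions, follows. It is based purely on the descriptions in this post (sync plus drop_caches, and a zeroed file called "reclaim"); the real scripts may differ, and the final cacheflush call inside fsreclaim is my assumption. Both need root, and fsreclaim fills the file system completely for a moment.

```shell
#!/bin/bash
# Hypothetical sketches of the two helpers; the real scripts may differ.

# cacheflush: write dirty buffers to disk, then drop the kernel caches
cacheflush() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# fsreclaim: overwrite the free space of a mounted file system with zeroes
fsreclaim() {
    local dir=${1:?usage: fsreclaim <mountpoint>}
    # dd stops with "No space left on device"; that is expected here
    dd if=/dev/zero of="$dir/reclaim" bs=1M 2>/dev/null
    sync
    rm -f "$dir/reclaim"
    cacheflush    # assumption: flush again so a following scan sees the zeroes
}
```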

*** WARNING ***

Use of “dd” can potentially destroy data. Use only within a disposable test environment (VM) or if you know what you are doing!

# Blanking the available devices just to be sure they don't contain any data
dd if=/dev/zero bs=1M status=none of=/dev/sdb
dd if=/dev/zero bs=1M status=none of=/dev/sdc
dd if=/dev/zero bs=1M status=none of=/dev/sdd

# Creating LVM volume groups, lvols and disk partitions
# For the main tests we will use an LVM LV, but we also create 2 file systems
# on deliberately mis-aligned disk partitions for one of the tests.
# We use XFS as file system but place the journal on separate LVs to keep
# the data as clean as possible

parted /dev/sdb mklabel gpt
parted /dev/sdb mkpart primary 1M 100%
printf "%s\n" o n p 1 " " " " w | fdisk /dev/sdc
printf "%s\n" o n p 1 " " " " w | fdisk /dev/sdd

vgcreate -Ay data /dev/sdb1 
lvcreate -Ay -nqdda -L1G data
lvcreate -Ay -nlog0 -L16M data
lvcreate -Ay -nlog1 -L16M data
lvcreate -Ay -nlog2 -L16M data

First run

Let’s run qdda against the blank test logical volume to see what a completely blank disk looks like:

qdda /dev/data/qdda 
qdda 1.7.2 - The Quick & Dirty Dedupe Analyzer
File 01, 131072 blocks (1024 MiB) processed, 186/200 MB/s, 189 MB/s avg
Merging 131072 blocks (1024 MiB) with 0 blocks (0 MiB)
Indexing in 0.19 sec (687540 blocks/s, 5371 MiB/s), 
Joining in 0.05 sec (2787935 blocks/s, 21780 MiB/s)
                      *** Details ***
blocksize           =           8 KiB
total               =     1024.00 MiB (    131072 blocks)
free                =     1024.00 MiB (    131072 blocks)
used                =        0.00 MiB (         0 blocks)
unique              =        0.00 MiB (         0 blocks)
dupcount = 2        =        0.00 MiB (         0 blocks)
dupcount > 2        =        0.00 MiB (         0 blocks)
deduped             =        0.00 MiB (         0 blocks)
compressed (full)   =        0.00 MiB (    100.00 %)
compressed (bucket) =        0.00 MiB (         0 blocks)
                      *** Summary ***
percentage used     =            0.00 %
percentage free     =          100.00 %
deduplication ratio =            0.00
compression ratio   =            0.00
thin ratio          =            0.00
combined            =            0.00
raw capacity        =     1024.00 MiB
net capacity        =        0.00 MiB

Because all blocks are blank, the output is not very interesting: it only shows 1GiB of unallocated data.

Creating filesystems

# Creating 3 filesystems with the journal on separate LV.
mkfs.xfs -l logdev=/dev/data/log0 /dev/data/qdda
mkfs.xfs -l logdev=/dev/data/log1 /dev/sdc1
mkfs.xfs -l logdev=/dev/data/log2 /dev/sdd1

# Adding the filesystems to fstab (for convenience), making
# mountpoints and mount the stuff
mkdir /dedup /d1 /d2
cat <<EOF >> /etc/fstab
/dev/mapper/data-qdda   /dedup                  xfs     noatime,logdev=/dev/data/log0 0 0
/dev/sdd1               /d1                     xfs     noatime,logdev=/dev/data/log2 0 0
/dev/sdc1               /d2                     xfs     noatime,logdev=/dev/data/log1 0 0
EOF
mount -a;df -P
...
/dev/mapper/data-qdda     1048576  32928   1015648       4% /dedup
/dev/sdd1                 1044192  32928   1011264       4% /d1
/dev/sdc1                 1044192  32928   1011264       4% /d2

Synthetic tests

These synthetic tests are intended to show how qdda works and are run against artificially created data. Later we will run against actual OS and app devices. Let’s run qdda again on the data device, which now holds only the structures of the freshly created XFS file system:

qdda /dev/data/qdda 
qdda 1.7.2 - The Quick & Dirty Dedupe Analyzer
File 01, 131072 blocks (1024 MiB) processed, 187/200 MB/s, 189 MB/s avg
Merging 131072 blocks (1024 MiB) with 0 blocks (0 MiB)
Indexing in 0.19 sec (693825 blocks/s, 5420 MiB/s), 
Joining in 0.05 sec (2799427 blocks/s, 21870 MiB/s)
                      *** Details ***
blocksize           =           8 KiB
total               =     1024.00 MiB (    131072 blocks)
free                =     1023.92 MiB (    131062 blocks)
used                =        0.08 MiB (        10 blocks)
unique              =        0.05 MiB (         7 blocks)
dupcount = 2        =        0.00 MiB (         0 blocks)
dupcount > 2        =        0.01 MiB (         1 blocks)
deduped             =        0.06 MiB (         8 blocks)
buckets  2k         =        0.02 MiB (         8 buckets)
compressed (full)   =        0.00 MiB (     97.85 %)
compressed (bucket) =        0.02 MiB (         2 blocks)
                      *** Summary ***
percentage used     =            0.01 %
percentage free     =           99.99 %
deduplication ratio =            1.25
compression ratio   =            4.00
thin ratio          =        13107.20
combined            =        65536.00
raw capacity        =     1024.00 MiB
net capacity        =        0.02 MiB

You can see that out of 1GiB (131072 8K blocks), 10 blocks have been written to hold XFS data structures. 7 blocks are unique and 1 block appears more than twice. A very small dedupe effect kicks in already: XFS has 10 blocks in use, but after deduplication it would only consume 8 blocks (the 7 unique ones plus one copy of the duplicated block).

XFS metadata seems to use only small parts of 8K blocks, because each of the 8 deduped blocks can be compressed into less than 2K and we can fit 4 2K buckets in an 8K block, bringing the compression ratio to 4:1 or 4.00 (the best you can get on XtremIO with 8K blocks, more on this later). If we just ran LZ4 compression against the blocks, the compression ratio would be very high: 8 blocks (65536 bytes) get compressed into about 1.4 KB. qdda rounds down the MB number so it reports 0.00, but it shows compression of 97.85%, so the compressed bytes would be roughly 65536*(1-0.9785) = 1409 bytes (inspecting the SQLite database reveals that the sum of compressed bytes is 1406; the small difference comes from the rounded percentage).
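
The bucket arithmetic in the paragraph above can be sketched as follows. It assumes, as described, that compressed blocks land in fixed-size buckets which are then packed into 8K blocks. The function name bucket_blocks and the round-up rule are my own illustration and not taken from the qdda source.

```shell
#!/bin/bash
# Number of 8 KiB blocks needed to hold a set of compression buckets.
# Arguments are pairs: <bucket size in KiB> <bucket count> ...
bucket_blocks() {
    local kib=0
    while [ "$#" -ge 2 ]; do
        kib=$(( kib + $1 * $2 ))
        shift 2
    done
    echo $(( (kib + 7) / 8 ))    # round up to whole 8 KiB blocks
}

bucket_blocks 2 8             # fresh XFS: eight 2k buckets -> 2
bucket_blocks 2 8 8 12800     # later run with 100 MiB random data -> 12802
```

Both results agree with the "compressed (bucket)" lines that qdda prints for the respective runs.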

The optimization ratios for dedupe, compression and thin are unrealistic here because so few blocks are in use. But we need to keep in mind that XFS uses roughly 10 blocks for metadata (it will start to consume more if you write files to the file system).
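
To see where the Summary numbers come from, the ratios can be recomputed from the block counts in the Details section. The formulas below are inferred from the sample output in this post (they reproduce it to two decimals), not read from the qdda source, so treat them as an assumption. The function name qdda_ratios is hypothetical.

```shell
#!/bin/bash
# Summary ratios recomputed from the block counts in the Details section.
# Formulas are inferred from the sample output, not from the qdda source.
qdda_ratios() {
    # $1 = total, $2 = used, $3 = deduped, $4 = compressed (bucket) blocks
    awk -v total="$1" -v used="$2" -v deduped="$3" -v comp="$4" 'BEGIN {
        printf "%.2f %.2f %.2f %.2f\n",
            used / deduped,     # deduplication ratio
            deduped / comp,     # compression ratio
            total / used,       # thin ratio
            total / comp        # combined
    }'
}

qdda_ratios 131072 10 8 2     # prints: 1.25 4.00 13107.20 65536.00
```

The sketch does not handle the all-blank case (zero used blocks), where qdda simply prints 0.00 for every ratio.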

Let’s write some random data to the file system. We will create a file with random data first as we need it again later on.

dd if=/dev/urandom bs=1M count=100 of=/var/tmp/random100
cp /var/tmp/random100 /dedup/

If you run qdda now against the logical volume, you would see that nothing has yet been written to the file system (a small bit is written to the log). I tried running “sync” but it did not help. You can unmount/mount the file system but that’s not practical. The cause is that the Linux kernel keeps written data in write buffers. The solution is using the drop_caches special file as listed in my cacheflush script (familiar name BTW 😉).

cacheflush
qdda /dev/data/qdda
qdda 1.7.2 - The Quick & Dirty Dedupe Analyzer
File 01, 131072 blocks (1024 MiB) processed, 186/200 MB/s, 188 MB/s avg
Merging 131072 blocks (1024 MiB) with 0 blocks (0 MiB)
Indexing in 0.24 sec (548526 blocks/s, 4285 MiB/s), 
Joining in 0.08 sec (1678882 blocks/s, 13116 MiB/s)
                      *** Details ***
blocksize           =           8 KiB
total               =     1024.00 MiB (    131072 blocks)
free                =      923.92 MiB (    118262 blocks)
used                =      100.08 MiB (     12810 blocks)
unique              =      100.05 MiB (     12807 blocks)
dupcount = 2        =        0.00 MiB (         0 blocks)
dupcount > 2        =        0.01 MiB (         1 blocks)
deduped             =      100.06 MiB (     12808 blocks)
buckets  2k         =        0.02 MiB (         8 buckets)
buckets  8k         =      100.00 MiB (     12800 buckets)
compressed (full)   =      100.00 MiB (      0.06 %)
compressed (bucket) =      100.02 MiB (     12802 blocks)
                      *** Summary ***
percentage used     =            9.77 %
percentage free     =           90.23 %
deduplication ratio =            1.00
compression ratio   =            1.00
thin ratio          =           10.23
combined            =           10.24
raw capacity        =     1024.00 MiB
net capacity        =      100.02 MiB

Now we see the random data, which is 100% unique (as it’s random), consuming 100MiB or 12800 blocks. The additional 10 blocks (8 deduped) are still the ones for XFS metadata. We get more realistic values now for the ratios: random data does not compress so every 8K block needs a real 8K block to be stored. Percentage used is nearly 10% (100MiB/1024MiB).

Let’s write a zeroed file to the file system:

dd if=/dev/zero bs=1M count=100 of=/dedup/zero100
cacheflush

Running the qdda tool again gives the EXACT same result as shown above (therefore not shown again). Good to know: Zero blocks in a file actually get reported by qdda as “free” (a real AFA like XtremIO would not consume real space either for zeroed file content – the effect of thin aka virtual provisioning).
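
The idea that an all-zero block counts as "free" can be illustrated with a small shell function. This is not how qdda is implemented. It is a slow sketch (one dd per block) meant for small test files, and the name count_zero_blocks is made up.

```shell
#!/bin/bash
# Classify the 8 KiB blocks of a file as zero ("free") or non-zero ("used").
# Slow (one dd per block), so only suitable for small test files.
count_zero_blocks() {
    local file=$1 bs=${2:-8192} size blocks nonzero zero=0 i=0
    size=$(wc -c < "$file")
    blocks=$(( (size + bs - 1) / bs ))
    while [ "$i" -lt "$blocks" ]; do
        # Strip all NUL bytes from the block; anything left means real data
        nonzero=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null | tr -d '\0' | wc -c)
        [ "$nonzero" -eq 0 ] && zero=$(( zero + 1 ))
        i=$(( i + 1 ))
    done
    echo "total=$blocks free=$zero used=$(( blocks - zero ))"
}
```

Run against a file such as zero100 it should report every block as free, which is the same effect qdda shows for the zeroed file.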

Note: From here I will cut irrelevant output and commands, ignore file system metadata bias and only focus on the stuff that matters.

What if we write blocks that have already been written once? Let’s write half of the random file again to a new file (so that this new 50MiB is identical to data already written):

dd if=/var/tmp/random100 bs=1M count=50 of=/dedup/random50
qdda /dev/data/qdda
free                =      873.92 MiB (    111862 blocks)
used                =      150.08 MiB (     19210 blocks)
unique              =       50.05 MiB (      6407 blocks)
dupcount = 2        =       50.00 MiB (      6400 blocks)
dupcount > 2        =        0.01 MiB (         1 blocks)
deduped             =      100.06 MiB (     12808 blocks)

Now 150MiB in the file system is non-zero. After deduplication we see that 50MiB is unique (the second half of the first random file) and 50MiB appears twice (the first half of the random file, which was written twice). Deduped capacity is 100MiB.
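
The unique / dupcount breakdown above can be imitated on plain files with standard tools. The sketch below hashes every 8 KiB block and counts how often each hash occurs. It uses the first 12 hex digits (6 bytes) of MD5 to mirror the truncated hash mentioned earlier. It is an illustration with a made-up function name (dedupe_report), not qdda's actual code path, and it does not filter out zero blocks.

```shell
#!/bin/bash
# Hash every 8 KiB block of the given files and count how often each hash occurs.
# Uses the first 12 hex digits (6 bytes) of MD5; zero blocks are NOT filtered out.
dedupe_report() {
    local bs=8192 f size blocks i
    for f in "$@"; do
        size=$(wc -c < "$f")
        blocks=$(( (size + bs - 1) / bs ))
        i=0
        while [ "$i" -lt "$blocks" ]; do
            dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null | md5sum | cut -c1-12
            i=$(( i + 1 ))
        done
    done | sort | uniq -c | awk '
        $1 == 1 { unique++ }    # hash seen once
        $1 == 2 { dup2++ }      # hash seen exactly twice
        $1 >  2 { dupx++ }      # hash seen more than twice
        { used += $1; deduped++ }
        END { printf "used=%d unique=%d dup2=%d dupx=%d deduped=%d\n",
                     used, unique, dup2, dupx, deduped }'
}
```

Calling it as dedupe_report /dedup/random100 /dedup/random50 gives the same kind of breakdown as the qdda output, minus the XFS metadata blocks. Expect it to be slow, since it starts one dd and one md5sum per block.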

If we remove the first 100MiB random file (and the zeroed file), you would expect the deduped capacity to drop to 50MiB:

rm /dedup/random100 /dedup/zero100
qdda /dev/data/qdda
free                =      873.92 MiB (    111862 blocks)
used                =      150.08 MiB (     19210 blocks)
unique              =       50.05 MiB (      6407 blocks)
dupcount = 2        =       50.00 MiB (      6400 blocks)
dupcount > 2        =        0.01 MiB (         1 blocks)
deduped             =      100.06 MiB (     12808 blocks)

Nothing changed, even though we deleted 100MiB of random data. This is the result of the file system just marking the data as deleted and available for overwriting. An All-Flash Array (AFA) has no way of knowing that blocks with randomish data are not used anymore, so they are listed as allocated. If you repeat such a process over and over again then, due to fragmentation, the required capacity will eventually increase until no free space is listed anymore. How could we reclaim capacity? There are two ways:

1. Overwriting free blocks with zeroes, or
2. Somehow telling the AFA that the blocks are not needed anymore (usually called something like trim, fstrim, unmap etc., such as implemented by VMware VAAI).

We don’t have a real AFA so we need to use method 1. Let’s use the “fsreclaim” script I wrote before. It writes a zeroed file called “reclaim” that fills up the filesystem and then deletes it again (note this is for test purposes only – in production this could lead to file system full notices and other problems).

fsreclaim /dedup
qdda /dev/data/qdda
free                =      973.92 MiB (    124662 blocks)
used                =       50.08 MiB (      6410 blocks)
deduped             =       50.06 MiB (      6408 blocks)

Reclaimed 50MiB by writing zeroes. Let’s see if we can reclaim all capacity after deleting all files:

rm /dedup/random50
fsreclaim /dedup
qdda /dev/data/qdda
free                =     1023.92 MiB (    131062 blocks)
used                =        0.08 MiB (        10 blocks)
deduped             =        0.06 MiB (         8 blocks)

Back to the original values. Nice!

In the next post I will cover more advanced scenarios such as looking at data compression, different block sizes, the effect of wrong alignment and looking deeper into the qdda inner workings.

The Quick and Dirty Dedupe Analyzer – Part 1 – Hands on

2 thoughts on “The Quick and Dirty Dedupe Analyzer – Part 1 – Hands on”

Yaniv Kaul says:

2017-05-21 at 07:56

1. oflag=direct is missing from your ‘dd’ commands, which would have prevented the need to flush.
2. Use of ‘ddpt’ would probably have made things faster.
3. Test ‘dd’ with bigger block sizes. 8MB for example.
4. I’m sure there are faster hash functions than md5. Murmur for example.

1. Bart Sjerps says:
  
  2017-05-22 at 13:32
  
  Hi Yaniv,
  
  Good points. I wasn’t aware of ddpt but will certainly look into it. That said, I only use dd for synthetic tests as in the next few posts I will analyze real OS and DB data. For small datasets (1GB or less) dd just works fine and as a bonus many Linux admins understand it.
  
  I was looking for a mount option in XFS that prevents write buffering (or another OS setting to achieve it) but couldn’t find any. Ideas welcome 🙂
  
  When using dd with larger volumes I will surely use larger block sizes. The bottleneck however is the hashing. I can do roughly 600MB/s on my i5-4440. The scan process also has to do LZ4 compress and insert a row in the staging table (serial process) so I can achieve about 400MB/s scanning – about 1.4 TB per hour, which is reasonable for now. Going forward, I might make the hashing and compression process separate threads to speed up things if needed.
  
  Investigating other hash algorithms was on my todo list. I started with CRC32 because of speed and low space requirements but bumped into too many collisions, so I switched to MD5 as it was quick to implement and relatively fast. But I will keep it in mind (probably need murmur3 as murmur2 is 32bit only).
  
  Thanks for commenting!
