If you need to generate lots of I/O on an Oracle database, the de facto I/O generator tool is SLOB (the Silly Little Oracle Benchmark).
In recent years, however, PostgreSQL has gained a lot of traction and many customers are starting to use it, even for mission-critical workloads.
Last year, one of our customers wanted to run a proof of concept for Postgres on all-flash storage platforms to compare different vendors. A major KPI in the POC was the result of running I/O performance tests on Postgres.
So, how do you stress-test the I/O capabilities of a Postgres database?
PyPGIO
In late 2019, Kevin Closson published pgio, an I/O generator for Postgres using the same principles as SLOB for Oracle. The tool works great, as intended, but I found it a little awkward to work with, as it requires manual editing of config files and a lot of CLI magic. Having done a lot of work with Python (and Postgres) in recent years, I decided to write a Python-based version of PGIO using the same principles, but packaged in a nice(r) command-line tool with a more intuitive way of using it. For lack of a more creative name, the Python version is called PyPGIO…
Requirements
I wanted the new tool to be as easy to install, configure and use as possible (still being a CLI, no graphical interface). In particular:
- Distributed as a single, standalone file or package
- No dependency on root or sysadmin rights (runs as any Linux user)
- Uses the well-known Postgres database adapter for Python (Psycopg)
- Command completion and help for all CLI options
- Built-in tracking and decent reporting of performance results
- Re-uses code from PGIO where possible
- Easy cleanup after usage (no mess of temp files and logs lying around; removal of related database structures)
- Option to use Postgres tablespaces to test different storage locations from the same database
ZipApp
There are several ways to distribute Python code. Most of them require Python knowledge to set up and configure, and distribution as a standalone package is not very common. However, not many people know that Python can execute code bundled in standard ZIP files. In order for that to work, there must be a file called
__main__.py
in the zip file. If this file exists, you can run
python <zipfile>
and it will run the contents of the zip file from the main function. It is also possible to bundle non-Python files as content in the zipfile (such as SQL scripts or data files). To make things even simpler, the ZIP file format allows prepending the data in the file with a shebang, allowing the zipfile to be executed directly without typing the 'python' command first. So if we add '#!/usr/bin/env python3' before the rest of the ZIP data (and make the zip file executable), then 'python3' will be used to execute the file:
# xxd -c 32 zipfile | head -1
00000000: 2321 2f75 7372 2f62 696e 2f65 6e76 2070 7974 686f 6e33 0a50 4b03 0414 0000 0000  #!/usr/bin/env python3.PK.......
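To see the whole mechanism in isolation, here is a minimal sketch that builds such an executable ZIP by hand (the file names are hypothetical):

import os
import stat
import zipfile

# build a plain ZIP containing only a __main__.py
with zipfile.ZipFile("payload.zip", "w") as zf:
    zf.writestr("__main__.py", "print('hello from inside the zip')\n")

# prepend the shebang and make the result executable
with open("payload.zip", "rb") as f:
    data = f.read()
with open("hello", "wb") as f:
    f.write(b"#!/usr/bin/env python3\n" + data)
os.chmod("hello", os.stat("hello").st_mode | stat.S_IXUSR)

# now both 'python3 payload.zip' and './hello' print the message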
Even better, Python provides a standard library module named zipapp to create such special ZIP packages for us.
FYI – this is also how self-extracting ZIP files work 🙂
We can now do something like this, where 'src' is the directory containing our Python code:
python3 -m zipapp src --python "/usr/bin/env python3" --main app:main --output fancyzipapp
Note that the resulting ZIP file does not need to have the .zip extension, making it even cleaner: now you can run the tool simply by typing 'fancyzipapp' on the CLI (provided the zipapp is executable and in the PATH).
So we have achieved the goal of putting all Python and SQL files into one package consisting of a single file. The file itself is simply a normal ZIP file, except that it starts with the Python shebang. It can be unzipped and inspected using normal ZIP tools, which also makes the code more transparent compared to binary executables.
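For completeness: the same archive can also be built from Python itself using the zipapp module. A sketch, where 'src' is a hypothetical directory holding app.py plus the SQL and data files to bundle:

import zipapp

zipapp.create_archive(
    "src",                               # directory to package
    target="fancyzipapp",                # output file, no .zip needed
    interpreter="/usr/bin/env python3",  # written as the shebang line
    main="app:main",                     # entry point, module:function
)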
Additional Python modules
One problem remains: we need the psycopg package to access our Postgres database. Psycopg is not bundled with Python by default on most distros, so it needs to be bundled or installed separately. The problem is that we don't know upfront which exact Python version the user has installed, and Psycopg contains some binary parts that may be version dependent.
Therefore we cannot bundle Psycopg (or many other modules) with our ZipApp, defeating the goal of being able to simply download and run our PyPGIO ZipApp file – it will complain about missing modules and refuse to work.
The standard solution (providing install instructions for the user to set up a Python virtual environment) violates the simplicity rule and requires Postgres DBAs to have advanced Python skills.
I considered bundling PyPGIO with a separate installer script, but abandoned the idea as it would require two files to be distributed instead of one. Other solutions:
- Use PyInstaller or cx_Freeze to build a binary from the app (creates a very large executable, depends on fusermount; OK for larger projects but not for a simple CLI tool)
- Pex or Shiv, tools that automate building the virtual environment; I experimented with those but found them complex and hard to work with. I may give them another try in the future
In the first release I settled on another method: pypgio would attempt to load the required modules and, upon failure, automatically call a bootstrap module that would install the virtual environment and requirements. The next time you ran pypgio, it would add the path of the modules to Python so it worked without problems.
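The logic looked roughly like this (a simplified sketch with hypothetical names, not the actual PyPGIO code):

try:
    import psycopg  # the module we actually need
except ModuleNotFoundError:
    # hand over to a bootstrap module bundled in the ZipApp
    import bootstrap
    bootstrap.setup_venv()  # create the venv, pip install requirements
    raise SystemExit("Environment installed, please run pypgio again")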
It kind of worked, but required some ugly code in the main program. In search of a better solution, I found that you can actually run a different main module in a ZipApp than the default one, and the non-default one can be a built-in installer. Nice! How does it work? You add an additional __main__.py in a subdirectory (such as /install). Now call the ZipApp like so:
python3 fancyzipapp/install # or python3 fancyzipapp/install -u
Python will run fancyzipapp, but call the main function from the /install subdirectory within the ZipApp instead. It is like having multiple Python tools in one!
So, in the case of PyPGIO, the installer will (a minimal sketch follows after this list):
- Create a bash wrapper script $HOME/bin/pgio which calls pypgio after loading the virtual environment. All the user now has to do is run "pgio".
- Create the virtual environment under $HOME/pgio_venv and install the required modules in it using the system’s version of Python 3.
- Create bash completion files such that after re-login, the user can enter 'pgio <TAB><TAB>' and see all possible CLI options.
When called with -u or --uninstall, the installer will remove the wrapper script, the virtual environment and the bash completion file.
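To make this concrete, below is a minimal sketch of what such an install/__main__.py could look like. It is not the actual PyPGIO installer (it omits the wrapper script and completion files described above); paths and package names follow the examples in this post:

# install/__main__.py
import os
import shutil
import subprocess
import sys

VENV = os.path.expanduser("~/pgio_venv")
PACKAGES = ["prettytable", "psycopg", "psycopg-binary"]

def main():
    if "-u" in sys.argv[1:] or "--uninstall" in sys.argv[1:]:
        shutil.rmtree(VENV, ignore_errors=True)  # drop the virtualenv
        return
    # create the venv using the system's Python 3, then install modules
    subprocess.run([sys.executable, "-m", "venv", VENV], check=True)
    pip = os.path.join(VENV, "bin", "pip")
    subprocess.run([pip, "install", *PACKAGES], check=True)

if __name__ == "__main__":
    main()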
Introducing PyPGIO
PyPGIO is available from GitHub: PyPGIO
Installing PyPGIO
GitHub is nice for developers, as you can clone any project and get the entire codebase, change history and the kitchen sink with it. For end users this is overkill and problematic: one needs versioning, and some kind of release mechanism to distribute only the finished app, not the entire codebase with history. For this, GitHub has the "releases" feature: one can tag a specific commit with a version tag, create a release from that, and add distribution files. That is what PyPGIO uses; the latest release is always available using the releases link. A one-line quick install is also available: PGIO quick install – this will download pypgio, put it in $HOME/bin and make it executable.
Setup the environment
At this point you can run pypgio, but there is no virtual environment yet:
[bart ~]$ pypgio
ERROR : No module named 'psycopg'
INFO : Try running python3 /home/bart/bin/pypgio/install
[bart ~]$
So, just follow the instructions and see the magic happen:
[bart ~]$ python3 /home/bart/bin/pypgio/install
Saving /home/bart/bin/pgio
Saving /home/bart/.local/share/bash-completion/completions/complete_pgio
/home/bart/.bash_completion already exists, skipping
Creating virtualenv on /home/bart/pgio_venv
Installing prettytable
Looking in indexes: https://pypi.org/simple, http://pip.lan
Collecting prettytable
  Using cached prettytable-3.10.2-py3-none-any.whl (28 kB)
Collecting wcwidth
  Using cached wcwidth-0.2.13-py2.py3-none-any.whl (34 kB)
Installing collected packages: wcwidth, prettytable
Successfully installed prettytable-3.10.2 wcwidth-0.2.13
Installing psycopg
Looking in indexes: https://pypi.org/simple, http://pip.lan
Collecting psycopg
  Using cached psycopg-3.2.1-py3-none-any.whl (197 kB)
Collecting typing-extensions>=4.4
  Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, psycopg
Successfully installed psycopg-3.2.1 typing-extensions-4.12.2
Installing psycopg-binary
Looking in indexes: https://pypi.org/simple, http://pip.lan
Collecting psycopg-binary
  Using cached psycopg_binary-3.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Installing collected packages: psycopg-binary
Successfully installed psycopg-binary-3.2.1
Installing wcwidth
Looking in indexes: https://pypi.org/simple, http://pip.lan
Requirement already satisfied: wcwidth in ./pgio_venv/lib/python3.10/site-packages (0.2.13)
Virtual environment setup finished on /home/bart/pgio_venv
Package           Version
----------------- -------
pip               22.0.2
prettytable       3.10.2
psycopg           3.2.1
psycopg-binary    3.2.1
setuptools        59.6.0
typing_extensions 4.12.2
wcwidth           0.2.13
Now logout and login again to activate pgio bash completion, or run: source ~/.bash_completion
So after re-login (or manually loading the completions) we can simply run pgio. It will complain that the connection failed, so we need to create a database and set up the connection parameters.
Creating the database schema
A database create script is provided in the scripts folder on GitHub. Nothing fancy: it just creates a pgio user with createdb privileges, and uses this user to create a database. In my case, I also created a tablespace 'bulk' that uses storage on a large spinning SAS disk, and gave 'pgio' privileges to use that tablespace. (My Postgres server 'pg01' is just a simple VM with 2 vCPUs and 8 GB memory, nothing fancy.)
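For illustration, this is roughly what the create script boils down to, expressed here with psycopg (the actual script in the repo is plain SQL, and the tablespace location below is a made-up path):

import psycopg

# connect as a superuser; CREATE DATABASE/TABLESPACE need autocommit
with psycopg.connect("host=pg01.lan dbname=postgres user=postgres",
                     autocommit=True) as conn:
    conn.execute("CREATE USER pgio WITH PASSWORD 'pgio' CREATEDB")
    conn.execute("CREATE DATABASE pgio OWNER pgio")
    # optional: a tablespace on separate storage, usable by pgio
    conn.execute("CREATE TABLESPACE bulk LOCATION '/data/bulk'")
    conn.execute("GRANT CREATE ON TABLESPACE bulk TO pgio")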
[bart ~]$ pgio
ERROR : Connecting to database failed: connection failed: connection to server at "127.0.0.1", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
[bart ~]$ pgio configure
--db      --dbhost  --dbname  --dbuser  --dbpass  --dbport
[bart ~]$ pgio configure --dbhost pg01.lan --dbname pgio --dbuser pgio
dbhost       pg01.lan
dbname       pgio
dbpass       pgio
dbuser       pgio
rows         131072
schema_size  1.00G
schemas      4
update_pct   0
update_unit  8
work_unit    255
Note that the default settings are configured (schema_size of 1 GB per user, 4 schemas/users, update percentage 0, etc.). Run 'pgio configure -h' to see the configurable options.
A side note about the word "schema": there is actually only one database schema, but the terminology is a leftover from the original SLOB for Oracle, where each user had its own schema and table, i.e. user1.cf1, user2.cf1, and so on. So in PyPGIO we just kept the same naming…
At this point we can also set a tablespace and then run the schema creation:
[bart ~]$ pgio configure --tablespace bulk
[bart ~]$ pgio setup
INFO : Purging excess tables...
INFO : Creating seed table with 131072 rows
INFO : Seed table created in 1.22 seconds
INFO : Started 1 threads...
INFO : Worker 0 - Creating table pgio0
INFO : Worker 0 - Creating table pgio1
INFO : Worker 0 - Creating table pgio2
INFO : Worker 0 - Creating table pgio3
INFO : Data tables created in 43.92 seconds
[bart ~]$ psql -h pg01.lan pgio -U pgio
Password for user pgio:
psql (16.3 (Ubuntu 16.3-1.pgdg22.04+1))
Type "help" for help.

pgio=> \d
              List of relations
 Schema |        Name         |   Type   | Owner
--------+---------------------+----------+-------
 public | pgio0               | table    | pgio
 public | pgio1               | table    | pgio
 public | pgio2               | table    | pgio
 public | pgio3               | table    | pgio
 public | pgio_dbstats        | table    | pgio
 public | pgio_dbstats_id_seq | sequence | pgio
 public | pgio_seed           | table    | pgio
 public | pgio_table_stats    | table    | pgio
(8 rows)

pgio=>
At this point, the 4 user schemas have been created and filled with test data. A few special tables are also created: pgio_seed (used for seeding the schemas), pgio_dbstats and pgio_table_stats.
Running the I/O stresstest
Now we are ready to run the I/O test:
[bart ~]$ pgio run -h
usage: pypgio run [-h] [-v] [-f] <runtime> <threads>

positional arguments:
  <runtime>      Runtime in seconds
  <threads>      Number of workers per schema

options:
  -h, --help     show this help message and exit
  -v, --verbose  Extra details
  -f, --fts      Full Table Scans

[bart ~]$ pgio run 60 4
INFO : PyPGIO 1.1.0
INFO : Date: 2024-07-30 14:07:34
INFO : Server: pg01.lan
INFO : Database: pgio
INFO : Shared Buffers: 128 (MiB)
INFO : Runtime: 60
INFO : Workers: 4
INFO : Update %: 0
INFO : Work_unit: 255
INFO : Update_unit: 8
INFO : Testing 4 thread(s) accessing 1.00G (131072 blocks) each.
INFO : Started 4 threads...
+-------------------------------------------------------------------------------------------------------------------------+
|                                                    Database I/O stats                                                    |
+-----------+---------+--------+----------+----------+----------+---------+--------+---------+---------+----------+-------+
| timestamp | runtime |  hits  |  reads   | returned | fetched  | updated | read % | fetch/s | reads/s | writes/s | MiB/s |
+-----------+---------+--------+----------+----------+----------+---------+--------+---------+---------+----------+-------+
|  14:08:34 |   60.05 | 584298 | 14100161 | 14451529 | 14446893 |    6    |   97   |  240578 |  234804 |    0     |  1834 |
+-----------+---------+--------+----------+----------+----------+---------+--------+---------+---------+----------+-------+
At this point you can change the schema (user table) size, the update percentage (to generate writes instead of just reads), the number of concurrent users, and so on. There is also an option to run a full table scan (SELECT * FROM the entire table), but currently this is single-threaded (one concurrent process only).
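To give an idea of what the workers do under the hood, the SLOB principle boils down to a loop like the following sketch (conceptual only; the real PyPGIO queries and column names may differ):

import random
import psycopg

ROWS, WORK_UNIT = 131072, 255  # matches the defaults shown earlier

with psycopg.connect("host=pg01.lan dbname=pgio user=pgio password=pgio") as conn:
    cur = conn.cursor()
    for _ in range(10000):
        # pick a random slice of the table and read work_unit rows,
        # forcing (mostly) random block I/O on the data files
        start = random.randint(1, ROWS - WORK_UNIT)
        cur.execute(
            "SELECT sum(length(filler)) FROM pgio0 WHERE id BETWEEN %s AND %s",
            (start, start + WORK_UNIT - 1),
        )
        cur.fetchone()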
A reminder that Postgres reports reads and writes as provided by the underlying file system. So in this case pgio reports 235K read IOPS, but at least some of this will not be real physical I/O due to host caching. In order to avoid this, reduce the host cache size (a tool for that is available in the GitHub contrib directory) and/or increase the size of the data set (more users, larger tables).
To monitor the real I/O on the system, use tools from the sysstat package (sar, iostat, and the like); I will discuss this in a future post.
Cleaning up
After you are done with testing, either drop the database or clean up the pgio tables, and remove the pgio helper files:
[bart ~]$ pgio destroy
INFO : Dropping pgio0
INFO : Dropping pgio1
INFO : Dropping pgio2
INFO : Dropping pgio3
INFO : Dropping master data
[bart ~]$ python3 $HOME/bin/pypgio/install -u
Removing virtualenv on /home/bart/pgio_venv
Deleting /home/bart/bin/pgio
Deleting /home/bart/.local/share/bash-completion/completions/complete_pgio
The pypgio package remains in $HOME/bin in case you want to use it again sometime :)
Summary
With PyPGIO, you have another tool in the toolkit for I/O performance testing of a PostgreSQL database.
More details to be published in the future on the Dirty Cache Wiki: PyPGIO on Dirty Cache Wiki
If you find it useful and/or have ideas for improvement, feel free to drop me a message!
This post first appeared on Dirty Cache by Bart Sjerps. Copyright © 2011 – 2024. All rights reserved. Not to be reproduced for commercial purposes without written permission.