If you need to generate lots of I/O on an Oracle database, the de facto I/O generator tool is SLOB (the Silly Little Oracle Benchmark).
In recent years, however, PostgreSQL has gained a lot of traction and many customers are starting to use it, even for mission-critical workloads.
Last year, one of our customers wanted to run a proof of concept for Postgres on all-flash storage platforms to compare different vendors. A major KPI in the POC was the result of running I/O performance tests on Postgres.
So, how do you stress-test the I/O capabilities of a Postgres database?
PyPGIO
In late 2019, Kevin Closson published pgio, an I/O generator for Postgres using the same principles as SLOB for Oracle. The tool works great, as intended, but I found it a little awkward to work with, as it requires manual editing of config files and a lot of CLI magic. Having done a lot of work with Python (and Postgres) in recent years, I decided to write a Python-based version of PGIO using the same principles, but packaged in a nice(r) command-line tool with a more intuitive way of using it. For lack of a more creative name, the Python version is called PyPGIO…
Requirements
I wanted the new tool to be as easy to install, configure and use as possible (still being a CLI, no graphical interface). In particular:
- Distributed as a single, standalone file or package
- No dependency on root or sysadmin rights (runs as any Linux user)
- Uses the well-known Postgres database adapter for Python (Psycopg)
- Command completion and help for all CLI options
- Built-in tracking and decent reporting of performance results
- Re-uses code from PGIO where possible
- Easy cleanup after usage (no mess of temp files and logs lying around; removal of related database structures)
- Option to use Postgres tablespaces to test different storage locations from the same database
ZipApp
There are several ways to distribute Python code. Most of them require Python knowledge to set up and configure, and distribution as a standalone package is not very common. However, not many people know that Python can execute code bundled in standard ZIP files. In order for that to work, there must be a file called
__main__.py
in the zip file. If this file exists, you can run
python <zipfile>
and it will run the contents of the zip file from the main function. It is also possible to bundle non-Python files as content in the zipfile (such as SQL scripts or data files). To make things even simpler, the ZIP file format allows prepending the data in the file with a shebang, allowing the zipfile to be executed directly without typing the 'python' command first. So if we add '#!/usr/bin/env python3' before the rest of the ZIP data (and make the zip file executable), then 'python3' will be used to execute the file:
# xxd -c 32 zipfile | head -1
00000000: 2321 2f75 7372 2f62 696e 2f65 6e76 2070 7974 686f 6e33 0a50 4b03 0414 0000 0000  #!/usr/bin/env python3.PK.......
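To see the whole mechanism in isolation, here is a minimal sketch that builds such an executable ZIP by hand (the file names are hypothetical):

import os
import stat
import zipfile

# build a plain ZIP containing only a __main__.py
with zipfile.ZipFile("payload.zip", "w") as zf:
    zf.writestr("__main__.py", "print('hello from inside the zip')\n")

# prepend the shebang and make the result executable
with open("payload.zip", "rb") as f:
    data = f.read()
with open("hello", "wb") as f:
    f.write(b"#!/usr/bin/env python3\n" + data)
os.chmod("hello", os.stat("hello").st_mode | stat.S_IXUSR)

# now both 'python3 payload.zip' and './hello' print the message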
Even better, Python provides a standard library module named zipapp to create such special ZIP packages for us.
FYI – this is also how self-extracting ZIP files work 🙂
We can now do something like this, where 'src' is the directory containing our Python code:
python3 -m zipapp src --python "/usr/bin/env python3" --main app:main --output fancyzipapp
Note that the resulting ZIP file does not need to have the .zip extension, making it even cleaner: now you can run the tool simply by typing 'fancyzipapp' on the CLI (provided the zipapp is executable and in the PATH).
So we have achieved the goal of putting all Python and SQL files into one package consisting of a single file. The file itself is simply a normal ZIP file, except that it starts with the Python shebang. It can be unzipped and inspected using normal ZIP tools, which also makes the code more transparent compared to binary executables.
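For completeness: the same archive can also be built from Python itself using the zipapp module. A sketch, where 'src' is a hypothetical directory holding app.py plus the SQL and data files to bundle:

import zipapp

zipapp.create_archive(
    "src",                               # directory to package
    target="fancyzipapp",                # output file, no .zip needed
    interpreter="/usr/bin/env python3",  # written as the shebang line
    main="app:main",                     # entry point, module:function
)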
Additional Python modules
One problem remains: we need the psycopg package to access our Postgres database. Psycopg is not bundled with Python by default on most distros, so it needs to be bundled or installed separately. The problem is that we don't know upfront which exact Python version the user has installed, and Psycopg contains some binary parts that may be version dependent.
Therefore we cannot bundle Psycopg (or many other modules) with our ZipApp, defeating the goal of being able to simply download and run our PyPGIO ZipApp file – it will complain about missing modules and refuse to work.
The standard solution (providing install instructions for the user to set up a Python virtual environment) violates the simplicity rule and requires Postgres DBAs to have advanced Python skills.
I considered bundling PyPGIO with a separate installer script, but abandoned the idea as it would require two files to be distributed instead of one. Other solutions:
- Use PyInstaller or cx_Freeze to build a binary from the app (creates a very large executable, depends on fusermount; OK for larger projects but not for a simple CLI tool)
- Pex or Shiv, tools that automate building the virtual environment; I experimented with those but found them complex and hard to work with. I may give them another try in the future
In the first release I settled on another method: pypgio would attempt to load the required modules and, upon failure, automatically call a bootstrap module that would install the virtual environment and requirements. The next time you ran pypgio, it would add the path of the modules to Python so it worked without problems.
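The logic looked roughly like this (a simplified sketch with hypothetical names, not the actual PyPGIO code):

try:
    import psycopg  # the module we actually need
except ModuleNotFoundError:
    # hand over to a bootstrap module bundled in the ZipApp
    import bootstrap
    bootstrap.setup_venv()  # create the venv, pip install requirements
    raise SystemExit("Environment installed, please run pypgio again")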
It kind of worked, but required some ugly code in the main program. In search of a better solution, I found that you can actually run a different main module in a ZipApp than the default one, and the non-default one can be a built-in installer. Nice! How does it work? You add an additional __main__.py in a subdirectory (such as /install). Now call the ZipApp like so:
python3 fancyzipapp/install # or python3 fancyzipapp/install -u
Python will run fancyzipapp, but call the main function from the /install subdirectory within the ZipApp instead. It is like having multiple Python tools in one!
So, in the case of PyPGIO, the installer will (a minimal sketch follows after this list):
- Create a bash wrapper script $HOME/bin/pgio which calls pypgio after loading the virtual environment. All the user now has to do is run "pgio".
- Create the virtual environment under $HOME/pgio_venv and install the required modules in it using the system’s version of Python 3.
- Create bash completion files such that after re-login, the user can enter 'pgio <TAB><TAB>' and see all possible CLI options.
When called with -u or --uninstall, the installer will remove the wrapper script, the virtual environment and the bash completion file.
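To make this concrete, below is a minimal sketch of what such an install/__main__.py could look like. It is not the actual PyPGIO installer (it omits the wrapper script and completion files described above); paths and package names follow the examples in this post:

# install/__main__.py
import os
import shutil
import subprocess
import sys

VENV = os.path.expanduser("~/pgio_venv")
PACKAGES = ["prettytable", "psycopg", "psycopg-binary"]

def main():
    if "-u" in sys.argv[1:] or "--uninstall" in sys.argv[1:]:
        shutil.rmtree(VENV, ignore_errors=True)  # drop the virtualenv
        return
    # create the venv using the system's Python 3, then install modules
    subprocess.run([sys.executable, "-m", "venv", VENV], check=True)
    pip = os.path.join(VENV, "bin", "pip")
    subprocess.run([pip, "install", *PACKAGES], check=True)

if __name__ == "__main__":
    main()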
Introducing PyPGIO
PyPGIO is available from GitHub: PyPGIO
Installing PyPGIO
GitHub is nice for developers, as you can clone any project and get the entire codebase, change history and the kitchen sink with it. For end users this is overkill and problematic: one needs versioning, and some kind of release mechanism to distribute only the finished app, not the entire codebase with history. For this, GitHub has the "releases" feature: one can tag a specific commit with a version tag, create a release from that, and add distribution files. That is what PyPGIO uses; the latest release is always available using the releases link. A one-line quick install is also available: PGIO quick install – this will download pypgio, put it in $HOME/bin and make it executable.
Setup the environment
At this point you can run pypgio, but there is no virtual environment yet:
[bart ~]$ pypgio
ERROR : No module named 'psycopg'
INFO : Try running python3 /home/bart/bin/pypgio/install
[bart ~]$
So, just follow the instructions and see the magic happen:
[bart ~]$ python3 /home/bart/bin/pypgio/install
Saving /home/bart/bin/pgio
Saving /home/bart/.local/share/bash-completion/completions/complete_pgio
/home/bart/.bash_completion already exists, skipping
Creating virtualenv on /home/bart/pgio_venv
Installing prettytable
Looking in indexes: https://pypi.org/simple, http://pip.lan
Collecting prettytable
  Using cached prettytable-3.10.2-py3-none-any.whl (28 kB)
Collecting wcwidth
  Using cached wcwidth-0.2.13-py2.py3-none-any.whl (34 kB)
Installing collected packages: wcwidth, prettytable
Successfully installed prettytable-3.10.2 wcwidth-0.2.13
Installing psycopg
Looking in indexes: https://pypi.org/simple, http://pip.lan
Collecting psycopg
  Using cached psycopg-3.2.1-py3-none-any.whl (197 kB)
Collecting typing-extensions>=4.4
  Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, psycopg
Successfully installed psycopg-3.2.1 typing-extensions-4.12.2
Installing psycopg-binary
Looking in indexes: https://pypi.org/simple, http://pip.lan
Collecting psycopg-binary
  Using cached psycopg_binary-3.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Installing collected packages: psycopg-binary
Successfully installed psycopg-binary-3.2.1
Installing wcwidth
Looking in indexes: https://pypi.org/simple, http://pip.lan
Requirement already satisfied: wcwidth in ./pgio_venv/lib/python3.10/site-packages (0.2.13)
Virtual environment setup finished on /home/bart/pgio_venv
Package           Version
----------------- -------
pip               22.0.2
prettytable       3.10.2
psycopg           3.2.1
psycopg-binary    3.2.1
setuptools        59.6.0
typing_extensions 4.12.2
wcwidth           0.2.13
Now logout and login again to activate pgio bash completion, or run: source ~/.bash_completion
So after re-login (or manually loading the completions) we can simply run pgio. It will complain that the connection failed, so we need to create a database and set up the connection parameters.
Creating the database schema
A database create script is provided in the scripts folder on GitHub. Nothing fancy: it just creates a pgio user with createdb privileges, and uses this user to create a database. In my case, I also created a tablespace 'bulk' that uses storage on a large spinning SAS disk, and gave 'pgio' privileges to use that tablespace. (My Postgres server 'pg01' is just a simple VM with 2 vCPUs and 8 GB memory, nothing fancy.)
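For illustration, this is roughly what the create script boils down to, expressed here with psycopg (the actual script in the repo is plain SQL, and the tablespace location below is a made-up path):

import psycopg

# connect as a superuser; CREATE DATABASE/TABLESPACE need autocommit
with psycopg.connect("host=pg01.lan dbname=postgres user=postgres",
                     autocommit=True) as conn:
    conn.execute("CREATE USER pgio WITH PASSWORD 'pgio' CREATEDB")
    conn.execute("CREATE DATABASE pgio OWNER pgio")
    # optional: a tablespace on separate storage, usable by pgio
    conn.execute("CREATE TABLESPACE bulk LOCATION '/data/bulk'")
    conn.execute("GRANT CREATE ON TABLESPACE bulk TO pgio")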
[bart ~]$ pgio
ERROR : Connecting to database failed: connection failed: connection to server at "127.0.0.1", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
[bart ~]$ pgio configure
--db      --dbhost  --dbname  --dbuser  --dbpass  --dbport
[bart ~]$ pgio configure --dbhost pg01.lan --dbname pgio --dbuser pgio
dbhost       pg01.lan
dbname       pgio
dbpass       pgio
dbuser       pgio
rows         131072
schema_size  1.00G
schemas      4
update_pct   0
update_unit  8
work_unit    255
Note that the default settings are configured (schema_size of 1 GB per user, 4 schemas/users, update percentage 0, etc.). Run 'pgio configure -h' to see the configurable options.
A side note about the word "schema": there is actually only one database schema, but the terminology is a leftover from the original SLOB for Oracle, where each user had its own schema and table, i.e. user1.cf1, user2.cf1, and so on. So in PyPGIO we just kept the same naming…
At this point we can also set a tablespace and then run the schema creation:
[bart ~]$ pgio configure --tablespace bulk
[bart ~]$ pgio setup
INFO : Purging excess tables...
INFO : Creating seed table with 131072 rows
INFO : Seed table created in 1.22 seconds
INFO : Started 1 threads...
INFO : Worker 0 - Creating table pgio0
INFO : Worker 0 - Creating table pgio1
INFO : Worker 0 - Creating table pgio2
INFO : Worker 0 - Creating table pgio3
INFO : Data tables created in 43.92 seconds
[bart ~]$ psql -h pg01.lan pgio -U pgio
Password for user pgio:
psql (16.3 (Ubuntu 16.3-1.pgdg22.04+1))
Type "help" for help.

pgio=> \d
              List of relations
 Schema |        Name         |   Type   | Owner
--------+---------------------+----------+-------
 public | pgio0               | table    | pgio
 public | pgio1               | table    | pgio
 public | pgio2               | table    | pgio
 public | pgio3               | table    | pgio
 public | pgio_dbstats        | table    | pgio
 public | pgio_dbstats_id_seq | sequence | pgio
 public | pgio_seed           | table    | pgio
 public | pgio_table_stats    | table    | pgio
(8 rows)

pgio=>
At this point, the 4 user schemas have been created and filled with test data. A few special tables are also created: pgio_seed (used for seeding the schemas), pgio_dbstats and pgio_table_stats.
Running the I/O stresstest
Now we are ready to run the I/O test:
[bart ~]$ pgio run -h
usage: pypgio run [-h] [-v] [-f] <runtime> <threads>

positional arguments:
  <runtime>      Runtime in seconds
  <threads>      Number of workers per schema

options:
  -h, --help     show this help message and exit
  -v, --verbose  Extra details
  -f, --fts      Full Table Scans

[bart ~]$ pgio run 60 4
INFO : PyPGIO 1.1.0
INFO : Date: 2024-07-30 14:07:34
INFO : Server: pg01.lan
INFO : Database: pgio
INFO : Shared Buffers: 128 (MiB)
INFO : Runtime: 60
INFO : Workers: 4
INFO : Update %: 0
INFO : Work_unit: 255
INFO : Update_unit: 8
INFO : Testing 4 thread(s) accessing 1.00G (131072 blocks) each.
INFO : Started 4 threads...
+-------------------------------------------------------------------------------------------------------------------------+
|                                                    Database I/O stats                                                    |
+-----------+---------+--------+----------+----------+----------+---------+--------+---------+---------+----------+-------+
| timestamp | runtime |  hits  |  reads   | returned | fetched  | updated | read % | fetch/s | reads/s | writes/s | MiB/s |
+-----------+---------+--------+----------+----------+----------+---------+--------+---------+---------+----------+-------+
|  14:08:34 |   60.05 | 584298 | 14100161 | 14451529 | 14446893 |    6    |   97   |  240578 |  234804 |    0     |  1834 |
+-----------+---------+--------+----------+----------+----------+---------+--------+---------+---------+----------+-------+
At this point you can change the schema (user table) size, the update percentage (to generate writes instead of just reads), the number of concurrent users, and so on. There is also an option to run a full table scan (SELECT * FROM the entire table), but currently this is single-threaded (one concurrent process only).
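To give an idea of what the workers do under the hood, the SLOB principle boils down to a loop like the following sketch (conceptual only; the real PyPGIO queries and column names may differ):

import random
import psycopg

ROWS, WORK_UNIT = 131072, 255  # matches the defaults shown earlier

with psycopg.connect("host=pg01.lan dbname=pgio user=pgio password=pgio") as conn:
    cur = conn.cursor()
    for _ in range(10000):
        # pick a random slice of the table and read work_unit rows,
        # forcing (mostly) random block I/O on the data files
        start = random.randint(1, ROWS - WORK_UNIT)
        cur.execute(
            "SELECT sum(length(filler)) FROM pgio0 WHERE id BETWEEN %s AND %s",
            (start, start + WORK_UNIT - 1),
        )
        cur.fetchone()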
A reminder that Postgres reports reads and writes as provided by the underlying file system. So in this case pgio reports 235K read IOPS, but at least some of this will not be real physical I/O due to host caching. In order to avoid this, reduce the host cache size (a tool for that is available in the GitHub contrib directory) and/or increase the size of the data set (more users, larger tables).
To monitor the real I/O on the system, use tools from the sysstat package (sar, iostat, and the like); I will discuss this in a future post.
Cleaning up
After you are done with testing, either drop the database or clean up the pgio tables, and remove the pgio helper files:
[bart ~]$ pgio destroy
INFO : Dropping pgio0
INFO : Dropping pgio1
INFO : Dropping pgio2
INFO : Dropping pgio3
INFO : Dropping master data
[bart ~]$ python3 $HOME/bin/pypgio/install -u
Removing virtualenv on /home/bart/pgio_venv
Deleting /home/bart/bin/pgio
Deleting /home/bart/.local/share/bash-completion/completions/complete_pgio
The pypgio package remains in $HOME/bin in case you want to use it again sometime :)
Summary
With PyPGIO, you have another tool in the toolkit for I/O performance testing of a PostgreSQL database.
More details to be published in the future on the Dirty Cache Wiki: PyPGIO on Dirty Cache Wiki
If you find it useful and/or have ideas for improvement, feel free to drop me a message!
This post first appeared on Dirty Cache by Bart Sjerps. Copyright © 2011 – 2024. All rights reserved. Not to be reproduced for commercial purposes without written permission.