I will preface this with, I am not a DBA. I've run PostgreSQL databases off and on for the past 8 years, but I'm not a full-time DBA. I don't follow all of the ins and outs and daily updates with PGSQL. I'm a simple bit pusher who runs a few dozen DB's. And with that out of the way..... I got an alarm last week about high disk usage on one of our PostgreSQL database instances. This was really odd as we've never seen a high disk alarm on them before. Also, it was just the master that was alarming; none of the replicas were alarming even though they have identically-sized drives. If the DB data were truly growing quickly, every machine would alarm. So I checked the DB size graphs in OpenNMS and sure enough, they were OK: These are 1.6TB drives so nothing wild going on here. Running du in the pg_data directory showed that pg_xlogs were responsible for all of the new disk usage. But why? None of the WAL and checkpoint settings had been changed in months. Nothing new in the output logs. After searching high and low about the actual calculations for retaining WAL files (most of them are wrong including the official docs), I came to the conclusion that our WAL and checkpoint settings were not responsible for the high disk usage. But what was? In short, it was PostgreSQL's new replication feature called replication slots. In short, PGSQL now has the ability to track replication process on a per replica (per slot really) basis, If you have 10 replicas and 1 of them is behind, the master will keep enough WAL files to catch up the one laggard. This sounds great in theory, but I think it's a case of premature nerd optimization. No one willingly lags their replicas that much. Even if a machine is down for more than a few days due to hardware failure, it's really easy to take a new copy of the master with the features in PGSQL 9.4+. There's no longer a need to take everything down and copy all the WAL's. A single command will setup streaming replication from master to client. We don't typically use replication slots, so it was a matter of identifying and removing the entry. I guess someone was playing around with the feature in our DR site and didn't quite clean up after themselves: Finding the unused replication slot
The active column is a boolean; so here you can see that the slot isn't active and is safe to delete. According to the docs, PGSQL won't let you delete an active replication slot, but it's a good idea to verify and check its use. To delete it: Dropping the unused replication slot
After deleting the unused replication slot, PGSQL happily goes about its business and purges/rotates the old, unnecessary WAL files. If you need to release the disk space immediately, you should look into the pg_archivecleanup command, of which tutorials abound.
Happy computing! Comments are closed.
|
AuthorA NOLA native just trying to get by. I live in San Francisco and work as a digital plumber for the joint that runs this thing. (Square/Weebly) Thoughts are mine, not my company's. Archives
May 2021
Categories
All
|