Description of problem:
When C&U realtime records are not purged consistently, the number of unpurged records grows so large that the hourly metrics_## tables only grow in size, eventually filling the entire VMDB filesystem.

Version-Release number of selected component (if applicable):
5.7.1.3

How reproducible:
Needs a test environment simulating the behavior of a multi-thousand-VM environment where C&U realtime data is captured over several days. In this specific customer case, there are about 5k VM instances, each collecting C&U data. Purging is failing because the purge message exceeds its 600-second timeout. This is a VMware environment with 5k VMs, so about 5x10^3 * 1.8x10^2 => 900,000 realtime rows are expected to be captured per hour.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The original database maintenance scripts had a provision to change from REINDEX of the hourly metrics_## tables to TRUNCATE, but the comments from the original script are not preserved in the current scripts. I think it might be a good idea to change from REINDEX to always TRUNCATE the tables after 23 hours, to avoid the problems we know will surface if the VMDB filesystem is allowed to fill.
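The expected ingest rate above can be sanity-checked with a trivial calculation (a sketch; the 20-second realtime capture interval, i.e. 180 samples per VM per hour, is an assumption consistent with the 1.8x10^2 figure):

```shell
#!/bin/bash
# Rough estimate of realtime rows captured per hour in this environment.
VM_COUNT=5000          # ~5k VM instances collecting C&U in this case
SAMPLES_PER_HOUR=180   # assumed 3600s / 20s realtime capture interval
ROWS_PER_HOUR=$((VM_COUNT * SAMPLES_PER_HOUR))
echo "$ROWS_PER_HOUR"  # 900000 rows/hour landing in the metrics_## tables
```

At ~900,000 rows per hour, a purge worker that misses even a few cycles falls behind quickly, which is why the 600-second purge message timeout is being hit.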
The proposed modified script should look like:
++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/bash
source /etc/default/evm

LOGFILE=/var/www/miq/vmdb/log/hourly_continuous_pg_maint_stdout.log
TABLE_NAME=metrics_$(date -u +"%H" --date='+1 hours')

echo "current time is $(date) -> target for TRUNCATE TABLE is '$TABLE_NAME' table" >> $LOGFILE
psql -U postgres vmdb_production -a -e -c "TRUNCATE TABLE $TABLE_NAME" >> $LOGFILE 2>&1
echo "TRUNCATE TABLE $TABLE_NAME completed at $(date)" >> $LOGFILE
echo "=================" >> $LOGFILE
++++++++++++++++++++++++++++++++++++++++++++++
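Since the script computes its target as the metrics_## table for the *next* UTC hour, it needs to run once per hour to keep every table truncated before it is reused. A hypothetical cron entry is sketched below (the install path and the minute chosen are assumptions, not part of the proposal above):

```shell
# Hypothetical /etc/cron.d entry for the workaround script.
# Runs at minute 50 of every hour, so the TRUNCATE lands shortly before
# the upcoming hour's metrics_## table starts receiving new realtime rows.
# 50 * * * * root /usr/local/bin/hourly_continuous_pg_maint.sh
```

Running it late in the hour matters: if it ran at minute 0, `date --date='+1 hours'` would still target the following hour's table, but the current hour's table would have gone a full cycle without being emptied.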
The BZ remains open because it represents a failure in the product to consistently remove realtime C&U tuples from the realtime tables. This is not the only customer who has reported this problem with the latest CFME 4.2 code, so the bug still exists. The case is closed because I provided the customer with a workaround that removes his exposure to this failure (his filesystem filling because we fail to remove realtime tuples), so *his* problem is addressed while *the product problem* persists.
I'm dropping the priority here because it looks like there's a KB-style workaround. That can give the GTs team time to look over the possible options here.
BZ: https://github.com/ManageIQ/manageiq/pull/15312
Verified on 5.9.0.1.