During a Newton to Ocata upgrade we encountered the following issues. They all take their roots in a wsrep problem in the Galera configuration that caused all of the purging cron jobs to "silently" fail.

First, Gnocchi: gnocchi-upgrade failed to run because the maximum number of rows per writeset was exceeded. The values that _should_ have been in galera.cnf (wsrep_max_ws_rows = 0 and wsrep_max_ws_size = 2G) had not been set correctly and were showing wsrep_max_ws_rows = 131072 (128k) and wsrep_max_ws_size = 1073741824 (1G). We set them to the expected values and restarted Galera:

    wsrep_max_ws_rows = 0          (before: 131072)
    wsrep_max_ws_size = 2147483648 (before: 1073741824)

We also had to truncate the keystone token table, as it had grown too large. We then re-ran gnocchi-upgrade successfully.

Second, Nova: the Nova cell migration timed out during the upgrade (controller step 5) because the number of VM records was ~325,000; the estimated time to run to completion was 10 hours. It turned out that the daily cron job

    nova-manage db archive_deleted_rows --max_rows 100 >>/dev/null 2>&1

had been failing because of the wsrep issue, with a hard-to-track error message. The wsrep issue itself was solved by raising the wsrep parameters in the Gnocchi step above. Running

    while true; do date; nova-manage db archive_deleted_rows --max_rows 100000 --verbose | tee -a /tmp/nova_db_archive; done

purged the database in about one hour, after which the cell migration was instantaneous.

Then we had an issue with heat-dbsync during the post-upgrade step (overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployment_Step3.0):

    Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout
    Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout

That command timed out for the same reason: the cron job

    heat-manage purge_deleted -g days 30 >>/dev/null 2>&1

had been silently failing as well, and the number of stacks had grown to ~200,000. We ran

    heat-manage purge_deleted --batch_size 5000 -g days 10

purging a bit more than the cron job's 30-day window to reduce the number of records, with a batch size chosen to reach a comfortable deletion rate.

So those failing cron jobs led to timeouts, which in turn were very hard to debug because Heat does not provide any error output at that point; looking at /var/lib/heat-config/* was the only way to know where we were.

So we should find a way to fail early in the upgrade process, based on:
- either an "acceptable" number of entries (but that may be impossible to calculate properly);
- checking that the cron jobs have run successfully recently (a possible wrapper is sketched below);
- running all the configured purges before the upgrade;
- a DB health check before the actual upgrade process starts, either verifying that the purge tasks have been running or at least notifying the operators of the DB table sizes (see the sketch below);
- something else?
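As a starting point for the DB health check idea, here is a minimal sketch, assuming it runs on a controller with local access to the MariaDB/Galera instance; the schema names, authentication handling and the row threshold are examples only, and table_rows is just an estimate for InnoDB tables:

    #!/bin/bash
    # Hypothetical pre-upgrade DB health check (not part of any existing tooling):
    # list the largest tables so operators can spot purge jobs that stopped working.
    # Assumes the mysql client can authenticate locally (e.g. via a defaults file);
    # schema names and the row threshold are examples.
    THRESHOLD=1000000

    mysql -N -B -e "
        SELECT table_schema, table_name, table_rows,
               ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
        FROM information_schema.tables
        WHERE table_schema IN ('nova', 'heat', 'keystone', 'gnocchi')
        ORDER BY table_rows DESC
        LIMIT 20;" |
    while read -r schema table rows size_mb; do
        printf '%-10s %-40s %12s rows %8s MB\n' "$schema" "$table" "$rows" "$size_mb"
        # table_rows is only an estimate for InnoDB, so treat it as an indicator
        # rather than an exact count.
        if [ "$rows" -gt "$THRESHOLD" ]; then
            echo "WARNING: $schema.$table has more than $THRESHOLD rows;" \
                 "check the purge cron jobs before upgrading."
        fi
    done

The upgrade tooling could run something like this before the controller steps and abort, or at least print a warning, instead of timing out hours later.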
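For the "check that the cron jobs ran successfully recently" idea, a minimal sketch of a cron wrapper, using the heat purge job from this report as the example; the log and stamp paths are illustrative, and note that on some releases nova-manage db archive_deleted_rows uses a non-zero exit code to report that rows were archived, so check the documented return codes before reusing the same wrapper for it:

    #!/bin/bash
    # Hypothetical cron wrapper: stop hiding purge failures behind ">>/dev/null 2>&1",
    # keep a log, and record the time of the last successful run so a pre-upgrade
    # check can refuse to start if the job has not succeeded recently.
    LOG=/var/log/heat/purge_deleted.log
    STAMP=/var/lib/heat/purge_deleted.last_success

    if heat-manage purge_deleted -g days 30 >>"$LOG" 2>&1; then
        date +%s > "$STAMP"
    else
        logger -t heat-purge "heat-manage purge_deleted failed, see $LOG"
    fi

A pre-upgrade check could then simply compare the stamp file's age against a threshold, for example refusing to proceed if it is missing or older than two days:

    if [ ! -f "$STAMP" ] || [ $(( $(date +%s) - $(cat "$STAMP") )) -gt 172800 ]; then
        echo "Purge job has not succeeded in the last 2 days; aborting upgrade." >&2
        exit 1
    fi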
OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828