Bug 1527607 - Purge cron jobs were silently failing, causing database-related commands to time out during upgrade.
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Sergii Golovatiuk
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks: 1526117
 
Reported: 2017-12-19 15:59 UTC by Sofer Athlan-Guyot
Modified: 2018-06-22 12:30 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-22 12:30:19 UTC
Target Upstream Version:
Embargoed:



Description Sofer Athlan-Guyot 2017-12-19 15:59:29 UTC
During a Newton to Ocata upgrade, the following issues were encountered.  They all stem from a wsrep problem in the Galera configuration that caused all of the purge cron jobs to "silently" fail.


First, Gnocchi:
gnocchi-upgrade failed to run due to the maximum number of rows being exceeded.
The values that _should_ have been in galera.cnf (wsrep_max_ws_rows = 0 and wsrep_max_ws_size = 2G) had not been set correctly; the live values were wsrep_max_ws_rows = 128k and wsrep_max_ws_size = 1G.

wsrep_max_ws_rows = 0 (before 131072)
wsrep_max_ws_size = 2147483648 (before 1073741824)

We set them to the correct values and restarted Galera.
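
For reference, a minimal sketch of how the live values can be checked and corrected (the config path and the pacemaker resource name are typical TripleO defaults, assumed rather than taken from this report):

  # Check the values Galera is actually running with:
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_%';"

  # Fix the config on each controller and restart the cluster resource:
  sudo sed -i -e 's/^wsrep_max_ws_rows.*/wsrep_max_ws_rows = 0/' \
              -e 's/^wsrep_max_ws_size.*/wsrep_max_ws_size = 2147483648/' \
              /etc/my.cnf.d/galera.cnf
  sudo pcs resource restart galera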

We also had to truncate the Keystone token DB, as it had grown too large.
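
A minimal sketch of that step (assuming the UUID token backend, whose table is named token; note this drops valid tokens too, forcing clients to re-authenticate):

  mysql keystone -e "TRUNCATE TABLE token;"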

Re-ran gnocchi-upgrade. Success.


Second, Nova:
The nova db cell migration timed out during the upgrade (controller step 5) because the number of VM records was ~325,000.  The estimated time to run to completion was 10 hours.

It appeared that:

  nova-manage db archive_deleted_rows --max_rows 100  >>/dev/null 2>&1

which is a daily cron job, was failing because of a wsrep issue, with a hard-to-track error message.  The wsrep issue was solved by raising the wsrep-related parameters in the Gnocchi step above.
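
The failure only became visible once the command was run by hand without the redirection, along these lines (a debugging sketch, not from the report; running as the nova user is an assumption):

  sudo -u nova nova-manage db archive_deleted_rows --max_rows 100 --verbose
  echo "exit status: $?"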


Running

  while true; do date; nova-manage db archive_deleted_rows --max_rows 100000 --verbose | tee -a /tmp/nova_db_archive; done

purged the DB in about 1 hour, and then the cell migration was instantaneous.
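
Note that the loop above never terminates on its own; it has to be stopped once the verbose output shows nothing left to archive.  Newer nova releases offer a flag that loops internally instead (its availability on this exact release is an assumption):

  # Archive in batches until no deleted rows remain:
  nova-manage db archive_deleted_rows --max_rows 100000 --verbose --until-complete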


Then we had an issue with heat-dbsync during the post-upgrade step (overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployment_Step3.0):

   Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout
   Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout

As we can see, that command timed out as well.

The daily

  heat-manage purge_deleted -g days 30 >>/dev/null 2>&1

cron job was silently failing as well, and the number of stacks was ~200,000.

  heat-manage purge_deleted --batch_size 5000 -g days 10

was used to reach a comfortable rate of deletion (purging a bit further back than the 30-day cron, to reduce the number of records).
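
Progress can be tracked from the database while the purge runs (a sketch; heat soft-deletes stacks by setting deleted_at):

  mysql heat -e "SELECT COUNT(*) FROM stack WHERE deleted_at IS NOT NULL;"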

So those failing cron jobs led to timeouts, which in turn were very hard to debug because heat doesn't provide error output at that point.  Looking at /var/lib/heat-config/* was the only way to know where we were.
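
A sketch of how to locate the failing step on a controller (the exact layout of the heat-config state directory varies by release, so treat the paths as assumptions):

  # Most recently applied deployments:
  ls -ltr /var/lib/heat-config/deployed/ | tail
  # Deployments that reported a non-zero status:
  grep -l '"deploy_status_code": [^0]' /var/lib/heat-config/deployed/*.notify.json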

So we should find a way to fail early in the upgrade process, based on:
 - either an "acceptable" number of entries (but that may be impossible to calculate properly);
 - a check that the cron jobs have run successfully recently;
 - running all the configured purges before the upgrade;
 - a DB health check before the actual upgrade process starts, either verifying that the purge tasks have been running or at least notifying the operators of the DB table sizes (see the sketch below);
 - something else?
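
For instance, a rough pre-upgrade check could be as simple as dumping the biggest tables for an operator to eyeball (a sketch; table_rows is only an estimate for InnoDB, and the schema list would need tuning per deployment):

  mysql -e "SELECT table_schema, table_name, table_rows
            FROM information_schema.tables
            WHERE table_schema IN ('nova', 'heat', 'keystone', 'gnocchi')
            ORDER BY table_rows DESC
            LIMIT 20;"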

Comment 1 Scott Lewis 2018-06-22 12:30:19 UTC
OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828

