Bug 1526117
| Summary: | Overcloud upgrade to Ocata failed | | |
|---|---|---|---|
| Product: | [Community] RDO | Reporter: | David Manchado <dmanchad> |
| Component: | openstack-tripleo | Assignee: | Michael Bayer <mbayer> |
| Status: | CLOSED UPSTREAM | QA Contact: | Shai Revivo <srevivo> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | Ocata | CC: | apevec, aschultz, kforde, owalsh, sathlang |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-01-12 18:03:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1517977, 1527607, 1533511 | | |
| Bug Blocks: | | | |
Description
David Manchado
2017-12-14 19:12:01 UTC
Diagnosis from analyzing the logs is that the database may have been under more load than expected. Load on the undercloud is 25-30 due to telemetry services; after stopping openstack-ceilometer-collector, load dropped to 6-8, and after stopping openstack-gnocchi-metricd, to 2-3.

The deploy failed after running for 90 minutes because openstack-glance-registry and openstack-cinder-api were not active. This time we ran the deploy skipping validations (`SkipUpgradeConfigTags: [validation]`).

status: CREATE_FAILED

```
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "openstack-cinder-api"], "delta": "0:00:00.004204", "end": "2017-12-14 23:00:42.733982", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-14 23:00:42.729778", "stderr": "", "stderr_lines": [], "stdout": "disabled", "stdout_lines": ["disabled"]}
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "openstack-glance-registry"], "delta": "0:00:00.004203", "end": "2017-12-14 23:00:51.448954", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-14 23:00:51.444751", "stderr": "", "stderr_lines": [], "stdout": "disabled", "stdout_lines": ["disabled"]}
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "async task did not complete within the requested time"}
```

openstack-glance-registry is inactive on all three controllers; openstack-cinder-api is unknown on all three controllers.

The Launchpad bug is about gnocchi, but the error is the same: wsrep_max_ws_rows exceeded. Per http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-max-ws-rows this defaults to 0, meaning no limit, and wsrep_max_ws_size defaults to 2G. The latter default is recent, though; it was formerly 1G in earlier mariadb-galera versions (like the 5.5 series we've been using for a long time). However, looking at an overcloud I have here, the settings in galera.cnf are:

```
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
```

So not only are we not allowing the defaults to take effect, we are also trying to limit the number of rows, which seems unnecessary given that the maximum size of the transaction already has a configured limit in any case.

Also, per https://www.percona.com/blog/2015/10/26/how-big-can-your-galera-transactions-be/, https://bugs.launchpad.net/codership-mysql/+bug/1373909 and https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_rows, there is evidence that the max_rows variable has no effect for most versions of MySQL / MariaDB-galera, and in MariaDB the claim is that it only works as of 10.0.27 / 10.1.17. Looking at a dockerized overcloud, we have finally bumped out of the mariadb 5.x series into 10.1.20, so we suddenly hit this problem because of a setting we never should have set. We already set max_ws_size, which is the top limit on the size of a transaction; it does not seem wise to also set a row limit within that.

I would like to grep through git commits and try to figure out who set this number and whether there was some rationale (though there likely was not, as the variable didn't even have an effect until recently). I would therefore lean towards setting wsrep_max_ws_rows to "0" / no effect, and we might also want to look into bringing wsrep_max_ws_size back to the default, e.g. don't set it, or set it to 2G assuming we're on mariadb-galera versions that support it. See the sketch below for checking the effective values on a running node.
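As a quick verification, a minimal check sketch, assuming the mysql client on a controller can reach the local galera node; the expected defaults are taken from the MariaDB documentation cited above:

```bash
# Show the write-set limits currently in effect on this galera node.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_%';"

# With the tripleo overrides removed, the server defaults should apply,
# roughly: wsrep_max_ws_rows = 0 (no row limit), wsrep_max_ws_size ~ 2G.
```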
OK, so far the source of the wsrep_max settings can be traced all the way back to when pacemaker / galera was first added to tripleo: https://review.openstack.org/#/c/177765/ . So far I see no rationale for these settings other than that they came from whatever the developer was working with before tripleo did this. Going to see if I can track this further, but since the max_ws_rows setting didn't even have an effect when this review was made, it looks like a good guess that it was someone's arbitrary choice.

Oh: that number is the old default value for mariadb-galera from before the variable had any effect; it's right there at https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_rows. So upstream both changed the default to "0" and allowed the variable to have an effect as of 10.0.27 / 10.1.17, so the change would not break anything... unless someone had the great idea of restating the old default :)

My recommendation is to remove both of these configuration parameters from tripleo entirely, at least for as long as they aren't externally configurable anyway. We should go with the database-provided defaults for both wsrep_max_ws_rows and wsrep_max_ws_size, which in MariaDB default to no limit and the maximum available size respectively. Added a gerrit at https://review.openstack.org/#/c/528765/ . Galera itself advertises wsrep_max_ws_rows being recognized at http://galeracluster.com/2016/08/announcing-galera-cluster-5-5-50-and-5-6-31-with-galera-3-17/ and, again, they have defaulted it to 0 / no limit.

We are going through the ansible playbook shown in the latest json file under /var/lib/heat-config/deployed, and it seems that `nova-manage cell_v2 map_instances $UUID` is taking a long time, leading the deploy to time out and fail. Using strace we can see it is actually doing something, and we also keep seeing what looks like a duplicate-entry error. Not sure if this is expected or whether some step is not really idempotent and has led to an unexpected situation.

The deploy was timing out while mapping instances to cells:

```
nova-manage cell_v2 map_instances --cell_uuid $CELL_uuid
```

We realized that nova.instances had ~325,000 instances because the (cron-based) purge process had not worked as expected, and the mapping process would have taken ~10h to complete. So we stopped the command above and focused on cleaning up the nova database to make it smaller, running the following command until it reported it had finished (a consolidated sketch of this cleanup follows at the end of this comment):

```
nova-manage db archive_deleted_rows --max_rows 100000 --verbose | tee -a /tmp/nova_db_archive
```

Then we deleted all previous mappings with:

```
mysql -e 'truncate nova_api.instance_mappings;'
```

After fixing nova, the next deploy failed due to heat.
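A rough consolidation of the nova cleanup above, as a sketch under stated assumptions: the loop condition assumes the nova-manage convention of exiting non-zero while rows are still being archived and 0 once nothing is left, which should be verified against the deployed release (otherwise simply rerun the command by hand until --verbose reports nothing archived, as was done here). The TRUNCATE is destructive and was only done here because the mappings were being rebuilt anyway.

```bash
#!/bin/bash
# Archive soft-deleted nova rows in batches until a pass archives nothing.
# Exit-status convention (0 = nothing archived, non-zero = rows archived) is an
# assumption -- check `nova-manage db archive_deleted_rows` for your release.
set -o pipefail
until nova-manage db archive_deleted_rows --max_rows 100000 --verbose \
      | tee -a /tmp/nova_db_archive; do
    sleep 1   # brief pause between batches to limit database load
done

# With nova.instances shrunk, drop the stale mappings so map_instances can be
# re-run from a clean slate (destructive).
mysql -e 'TRUNCATE nova_api.instance_mappings;'
```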
Same situation as with nova: the purge did not work and we had ~300,000 stacks. They were reduced by running the command below; after ~1.5h only ~18,000 stacks were left.

```
heat-manage purge_deleted -g days 9 --batch_size 1000
```

We had to set pacemaker back to managed, as it was left unmanaged when the deploy failed:

```
pcs property set maintenance-mode=false
```

We hit an "already existing column" error when trying to run `heat-manage db_sync`, so we had to temporarily replace the content of /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py with the following one --> http://paste.openstack.org/show/629396/

```
cp -a /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py /root/
vim /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py
heat-manage db_sync
cp -a /root/079_resource_properties_data.py /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/
```

After fixing heat, the next deploy failed due to aodh:

```
Error: /Stage[main]/Aodh::Db::Sync/Exec[aodh-db-sync]: Failed to call refresh: aodh-dbsync --config-file /etc/aodh/aodh.conf returned 1 instead of one of [0]
Error: /Stage[main]/Aodh::Db::Sync/Exec[aodh-db-sync]: aodh-dbsync --config-file /etc/aodh/aodh.conf returned 1 instead of one of [0]
```

It complained about an already existing column, timestamp_tx; since our table had no data, we just dropped the column, set pacemaker back to managed and ran aodh-dbsync (a more defensive variant is sketched at the end of this comment):

```
mysql -e 'alter table aodh.alarm drop column timestamp_tx;'
pcs property set maintenance-mode=false
aodh-dbsync --config-file /etc/aodh/aodh.conf
```

This time the deploy finally reached UPDATE_COMPLETE status after 57 minutes.

The upgrade was completed using workarounds, and a joint retrospective with the Upgrades team was held; preventive actions will be tracked in the Upgrades backlog. The next checkpoint will be the rdocloud Ocata->Pike upgrade, planned soon, before Ocata upstream EOL (Feb 26, 2018).
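A slightly more defensive version of the aodh workaround above, as a sketch; the table and column names are taken from this report, so verify them against your own schema before dropping anything:

```bash
# Confirm the leftover column is present and the table really holds no data.
mysql -e "SHOW COLUMNS FROM aodh.alarm LIKE 'timestamp_tx';"
mysql -e "SELECT COUNT(*) FROM aodh.alarm;"

# Only if the column exists and the table contains nothing we care about:
mysql -e "ALTER TABLE aodh.alarm DROP COLUMN timestamp_tx;"

# Put pacemaker back in charge and re-run the aodh schema sync.
pcs property set maintenance-mode=false
aodh-dbsync --config-file /etc/aodh/aodh.conf
```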