Description of problem:
While doing an overcloud upgrade from Newton to Ocata, the upgrade fails in the initial validation process. Note we did not add the skip-validation-upgrade template to the deploy script.

Version-Release number of selected component (if applicable):
Ocata RDO trunk current-passed-ci

How reproducible:
Not sure

Steps to Reproduce:
1. Update Newton to latest bits
2. Upgrade undercloud to Ocata
3. Upgrade overcloud to Ocata

Actual results:
UPDATE_FAILED

In the upgrade log we can see several different errors:

UPDATE_FAILED resources[24]: StackValidationFailed: resources.ManagementPort: Property error: ManagementPort.Properties.ControlPlaneIP: The server has either erred or is incapable of performing the requested operation. (HTTP 500)
UPDATE_FAILED resources[2]: StackValidationFailed: resources.TenantPort: Property error: TenantPort.Properties.ControlPlaneIP: Unexpected API Error.
UPDATE_FAILED resources.Controller: resources[2]: StackValidationFailed: resources.TenantPort: Property error: TenantPort.Properties.ControlPlaneIP: Unexpected API Error.
UPDATE_FAILED resources.TripleOCICompute: resources[8]: StackValidationFailed: resources.TripleOCICompute: Property error: TripleOCICompute.Properties.networks[0].network: Error validating value 'f11564aa-0905-419d-a40b-8d5a9c1d5ed5'

Expected results:
Update complete

Additional info:
TripleO related installed RPMs:
openstack-tripleo-0.0.8-0.3.4de13b3git.el7.noarch
openstack-tripleo-common-6.1.4-1.el7.noarch
openstack-tripleo-heat-templates-6.2.7-0.20171204202021.62256b2.el7.centos.noarch
openstack-tripleo-image-elements-6.1.2-1.el7.noarch
openstack-tripleo-puppet-elements-6.2.4-1.el7.noarch
openstack-tripleo-ui-3.2.2-1.el7.noarch
openstack-tripleo-validations-5.6.2-1.el7.noarch
puppet-tripleo-6.5.6-0.20171201053321.4c1a677.el7.centos.noarch
python-tripleoclient-6.2.3-0.20171129065402.62ab203.el7.centos.noarch
The diagnosis from analyzing the logs is that the database might have been under more load than expected. The load on the undercloud is 25-30 due to telemetry services. After stopping openstack-ceilometer-collector the load was reduced to 6-8, and after stopping openstack-gnocchi-metricd it was further reduced to 2-3.
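For reference, a minimal sketch of how the above was checked, assuming a shell on the undercloud (the service names are the ones mentioned above; the load figures are the ones observed here):

uptime                                          # load average ~25-30 before
sudo systemctl stop openstack-ceilometer-collector
uptime                                          # load average dropped to ~6-8
sudo systemctl stop openstack-gnocchi-metricd
uptime                                          # load average dropped to ~2-3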
The deploy failed after running for 90 minutes because openstack-glance-registry and openstack-cinder-api were not active. This time we ran the deploy skipping validations (SkipUpgradeConfigTags: [validation]).

status: CREATE_FAILED

fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "openstack-cinder-api"], "delta": "0:00:00.004204", "end": "2017-12-14 23:00:42.733982", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-14 23:00:42.729778", "stderr": "", "stderr_lines": [], "stdout": "disabled", "stdout_lines": ["disabled"]}
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "openstack-glance-registry"], "delta": "0:00:00.004203", "end": "2017-12-14 23:00:51.448954", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-14 23:00:51.444751", "stderr": "", "stderr_lines": [], "stdout": "disabled", "stdout_lines": ["disabled"]}
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "async task did not complete within the requested time"}
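For reference, the validation skip mentioned above was done with an environment file along these lines (the file name is illustrative; the parameter value is the one quoted in this comment):

# skip-upgrade-validations.yaml (file name is illustrative)
parameter_defaults:
  SkipUpgradeConfigTags: [validation]

It was then added to the existing deploy script with an extra -e skip-upgrade-validations.yaml argument.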
openstack-glance-registry is inactive on the three controllers; openstack-cinder-api reports an unknown state on the three controllers.
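A quick sketch of how the service state can be confirmed on each controller (the hostnames and the heat-admin user are assumptions based on a default TripleO setup, adjust as needed):

for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    ssh heat-admin@$node "hostname; systemctl is-active openstack-glance-registry openstack-cinder-api"
done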
The Launchpad bug is about gnocchi, but the error is the same: wsrep_max_ws_rows exceeded.
Per http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-max-ws-rows this defaults to 0, which means no limit, and wsrep_max_ws_size defaults to 2G; however the latter default is recent, it was formerly 1G in earlier mariadb-galera versions (like the 5.5 series we've been using for a long time).

However, looking at an overcloud I have here, the settings in galera.cnf are:

wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824

So not only are we not allowing the defaults to take effect, we're also trying to limit the number of rows, which seems unnecessary given that the max size of the transaction has a configured limit in any case.

Also, if we look at https://www.percona.com/blog/2015/10/26/how-big-can-your-galera-transactions-be/, https://bugs.launchpad.net/codership-mysql/+bug/1373909 and https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_rows, there is evidence that the max_rows variable has no effect for most versions of MySQL / MariaDB-galera, and in MariaDB the claim is that it only works as of 10.0.27 / 10.1.17. Which means that, as I'm looking at a dockerized overcloud and seeing that we've finally bumped out of the MariaDB 5.x series into 10.1.20, we suddenly have this problem based on a setting that we never should have set.

We already have max_ws_size set, so that would be the top limit of the size of a transaction; it doesn't seem wise to also set a row limit within that. I would like to grep through git commits and try to figure out who set this number and whether there was some rationale (though it's likely there was not, as the variable didn't even have an effect until recently).

I would therefore lean towards setting this to "0" / no effect, and we might also want to look into bringing max_ws_size back up to the default, e.g. don't set it, or set it to 2G assuming we're on mariadb-galera versions that support it.
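A quick way to confirm what the running cluster is actually applying (as opposed to what galera.cnf says) is to check the live variables on any galera node:

mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_%';"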
OK, so far the source of the wsrep_max settings can be traced all the way back to when pacemaker / galera was first added to tripleo: https://review.openstack.org/#/c/177765/

So far I see no rationale for these settings except that they came from whatever the developer was working with before tripleo did this. Going to see if I can track down more, but as the max_ws_rows setting didn't even have an effect when this review was made, it looks like a good guess that it was an arbitrary value someone decided to set.
Oh, that number is the default value for mariadb-galera from back before the variable had any effect; it's right there at https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_rows. So they changed the default to "0" at the same time they allowed the variable to have an effect, as of 10.0.27 / 10.1.17, so that the change wouldn't break anything... unless someone had the great idea to restate the old default :)

My recommendation is to remove both of these configuration parameters from tripleo entirely, at least as long as they aren't externally configurable in any case. We should go with the database-provided defaults for both wsrep_max_ws_rows and wsrep_max_ws_size, which in MariaDB both default to the maximum available.
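In galera.cnf terms the recommendation amounts to the following (a sketch; the values are the ones quoted earlier in this bug):

# before (tripleo-generated):
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824

# after: remove both lines entirely so the server defaults apply
# (wsrep_max_ws_rows=0 / no limit; wsrep_max_ws_size at its built-in default)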
Added a gerrit review at https://review.openstack.org/#/c/528765/
Galera itself announces support for wsrep_max_ws_rows at http://galeracluster.com/2016/08/announcing-galera-cluster-5-5-50-and-5-6-31-with-galera-3-17/ and, again, they have defaulted it to 0 / no limit.
We are going through the ansible playbook shown in the latest json file under /var/lib/heat-config/deployed, and it seems that "nova-manage cell_v2 map_instances $UUID" is taking a long time, leading the deploy to time out and fail. Using strace we can see it is actually doing work, and we also continuously see what look like duplicate entries. Not sure if this is expected or if some step is not really idempotent and might have led to an unexpected situation.
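For reference, a sketch of the kind of checks involved here: strace is what was used above, while the row-count comparison is just an illustrative way to gauge mapping progress (the pgrep pattern is an assumption; the table names are the ones that come up later in this bug):

# confirm the nova-manage process is still issuing queries
strace -f -p $(pgrep -f 'cell_v2 map_instances')

# rough progress estimate: instances already mapped vs. total instances
mysql -e "select count(*) from nova_api.instance_mappings;"
mysql -e "select count(*) from nova.instances;"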
The deploy was timing out when mapping instances to cells:

nova-manage cell_v2 map_instances --cell_uuid $CELL_uuid

We realized that nova.instances had ~325,000 instances because the purge process (cron-based) did not work as expected, and the mapping process would have taken ~10h to complete. So we stopped the above command and focused on cleaning up the nova database to make it smaller, running the following command until it reported it had finished:

nova-manage db archive_deleted_rows --max_rows 100000 --verbose | tee -a /tmp/nova_db_archive

Then we deleted all previous mappings with:

mysql -e 'truncate nova_api.instance_mappings;'

After fixing nova, the next deploy failed due to heat. Same situation as nova: the purge did not work and we had 300,000 stacks, which were reduced by running the command below; after ~1.5h only 18,000 stacks remained:

heat-manage purge_deleted -g days 9 --batch_size 1000

We had to set pacemaker back to managed, as it was left unmanaged when the deploy failed:

pcs property set maintenance-mode=false

We hit an issue about an already existing column when trying to run 'heat-manage db_sync', so we had to temporarily replace the content of /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py with the following one --> http://paste.openstack.org/show/629396/

cp -a /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py /root/
vim /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py
heat-manage db_sync
cp -a /root/079_resource_properties_data.py /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/

After fixing heat, the next deploy failed due to aodh:

Error: /Stage[main]/Aodh::Db::Sync/Exec[aodh-db-sync]: Failed to call refresh: aodh-dbsync --config-file /etc/aodh/aodh.conf returned 1 instead of one of [0]
Error: /Stage[main]/Aodh::Db::Sync/Exec[aodh-db-sync]: aodh-dbsync --config-file /etc/aodh/aodh.conf returned 1 instead of one of [0]

It complained about an already existing column, timestamp_tx; as our table had no data we just dropped it, set pacemaker back to managed and ran aodh-dbsync:

mysql -e 'alter table aodh.alarm drop column timestamp_tx;'
pcs property set maintenance-mode=false
aodh-dbsync --config-file /etc/aodh/aodh.conf

This time the deploy finally got to UPDATE_COMPLETE status after 57 minutes.
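One note on the archive step earlier in this comment: "running the command until it reported it had finished" can be scripted along these lines (a sketch only; the completion message grepped for is an assumption and may differ per nova release):

while true; do
    out=$(nova-manage db archive_deleted_rows --max_rows 100000 --verbose | tee -a /tmp/nova_db_archive)
    # stop once nova-manage reports there is nothing left to archive
    echo "$out" | grep -qi 'nothing was archived' && break
done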
The upgrade was completed using workarounds, and a joint retrospective with the Upgrades team was held; preventive actions will be tracked in the Upgrades backlog. The next checkpoint will be the rdocloud Ocata->Pike upgrade, planned soon, before the Ocata upstream EOL (Feb 26, 2018).