Bug 1526117
| Summary: | Overcloud upgrade to Ocata failed | | |
|---|---|---|---|
| Product: | [Community] RDO | Reporter: | David Manchado <dmanchad> |
| Component: | openstack-tripleo | Assignee: | Michael Bayer <mbayer> |
| Status: | CLOSED UPSTREAM | QA Contact: | Shai Revivo <srevivo> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | Ocata | CC: | apevec, aschultz, kforde, owalsh, sathlang |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-01-12 18:03:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1517977, 1527607, 1533511 | | |
| Bug Blocks: | | | |
Description
David Manchado
2017-12-14 19:12:01 UTC
Diagnosis from analyzing the logs is that the database may have been under more load than expected. Load on the undercloud is 25-30 due to telemetry services; after stopping openstack-ceilometer-collector, load dropped to 6-8, and after stopping openstack-gnocchi-metricd, to 2-3.

The deploy failed after running for 90 minutes because openstack-glance-registry and openstack-cinder-api were not active. This time we ran the deploy skipping validations (`SkipUpgradeConfigTags: [validation]`).

status: CREATE_FAILED

```
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "openstack-cinder-api"], "delta": "0:00:00.004204", "end": "2017-12-14 23:00:42.733982", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-14 23:00:42.729778", "stderr": "", "stderr_lines": [], "stdout": "disabled", "stdout_lines": ["disabled"]}
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["systemctl", "is-enabled", "openstack-glance-registry"], "delta": "0:00:00.004203", "end": "2017-12-14 23:00:51.448954", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-14 23:00:51.444751", "stderr": "", "stderr_lines": [], "stdout": "disabled", "stdout_lines": ["disabled"]}
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "async task did not complete within the requested time"}
```

openstack-glance-registry is inactive on all three controllers; openstack-cinder-api is unknown on all three controllers.

The Launchpad bug is about gnocchi, but the error is the same: wsrep_max_ws_rows exceeded. Per http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-max-ws-rows this defaults to 0, meaning no limit, and wsrep_max_ws_size defaults to 2G. The latter default is recent, though; it was formerly 1G in earlier mariadb-galera versions (like the 5.5 series we've been using for a long time). However, looking at an overcloud I have here, the settings in galera.cnf are:

```
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
```

So not only are we not allowing the defaults to take effect, we are also trying to limit the number of rows, which seems unnecessary given that the maximum size of the transaction already has a configured limit in any case.

Also, per https://www.percona.com/blog/2015/10/26/how-big-can-your-galera-transactions-be/, https://bugs.launchpad.net/codership-mysql/+bug/1373909 and https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_rows, there is evidence that the max_rows variable has no effect for most versions of MySQL / MariaDB-galera, and in MariaDB the claim is that it only works as of 10.0.27 / 10.1.17. Looking at a dockerized overcloud, we have finally bumped out of the mariadb 5.x series into 10.1.20, so we suddenly hit this problem because of a setting we never should have set. We already set max_ws_size, which is the top limit on the size of a transaction; it does not seem wise to also set a row limit within that.

I would like to grep through git commits and try to figure out who set this number and whether there was some rationale (though there likely was not, as the variable didn't even have an effect until recently). I would therefore lean towards setting wsrep_max_ws_rows to "0" / no effect, and we might also want to look into bringing wsrep_max_ws_size back to the default, e.g. don't set it, or set it to 2G assuming we're on mariadb-galera versions that support it. See the sketch below for checking the effective values on a running node.
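As a quick verification, a minimal check sketch, assuming the mysql client on a controller can reach the local galera node; the expected defaults are taken from the MariaDB documentation cited above:

```bash
# Show the write-set limits currently in effect on this galera node.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_%';"

# With the tripleo overrides removed, the server defaults should apply,
# roughly: wsrep_max_ws_rows = 0 (no row limit), wsrep_max_ws_size ~ 2G.
```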
OK, so far the source of the wsrep_max settings can be traced all the way back to when pacemaker / galera was first added to tripleo: https://review.openstack.org/#/c/177765/ . So far I see no rationale for these settings other than that they came from whatever the developer was working with before tripleo did this. Going to see if I can track this further, but since the max_ws_rows setting didn't even have an effect when this review was made, it looks like a good guess that it was someone's arbitrary choice.

Oh: that number is the old default value for mariadb-galera from before the variable had any effect; it's right there at https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_rows. So upstream both changed the default to "0" and allowed the variable to have an effect as of 10.0.27 / 10.1.17, so the change would not break anything... unless someone had the great idea of restating the old default :)

My recommendation is to remove both of these configuration parameters from tripleo entirely, at least for as long as they aren't externally configurable anyway. We should go with the database-provided defaults for both wsrep_max_ws_rows and wsrep_max_ws_size, which in MariaDB default to no limit and the maximum available size respectively. Added a gerrit at https://review.openstack.org/#/c/528765/ . Galera itself advertises wsrep_max_ws_rows being recognized at http://galeracluster.com/2016/08/announcing-galera-cluster-5-5-50-and-5-6-31-with-galera-3-17/ and, again, they have defaulted it to 0 / no limit.

We are going through the ansible playbook shown in the latest json file under /var/lib/heat-config/deployed, and it seems that `nova-manage cell_v2 map_instances $UUID` is taking a long time, leading the deploy to time out and fail. Using strace we can see it is actually doing something, and we also keep seeing what looks like a duplicate-entry error. Not sure if this is expected or whether some step is not really idempotent and has led to an unexpected situation.

The deploy was timing out while mapping instances to cells:

```
nova-manage cell_v2 map_instances --cell_uuid $CELL_uuid
```

We realized that nova.instances had ~325,000 instances because the (cron-based) purge process had not worked as expected, and the mapping process would have taken ~10h to complete. So we stopped the command above and focused on cleaning up the nova database to make it smaller, running the following command until it reported it had finished (a consolidated sketch of this cleanup follows at the end of this comment):

```
nova-manage db archive_deleted_rows --max_rows 100000 --verbose | tee -a /tmp/nova_db_archive
```

Then we deleted all previous mappings with:

```
mysql -e 'truncate nova_api.instance_mappings;'
```

After fixing nova, the next deploy failed due to heat.
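A rough consolidation of the nova cleanup above, as a sketch under stated assumptions: the loop condition assumes the nova-manage convention of exiting non-zero while rows are still being archived and 0 once nothing is left, which should be verified against the deployed release (otherwise simply rerun the command by hand until --verbose reports nothing archived, as was done here). The TRUNCATE is destructive and was only done here because the mappings were being rebuilt anyway.

```bash
#!/bin/bash
# Archive soft-deleted nova rows in batches until a pass archives nothing.
# Exit-status convention (0 = nothing archived, non-zero = rows archived) is an
# assumption -- check `nova-manage db archive_deleted_rows` for your release.
set -o pipefail
until nova-manage db archive_deleted_rows --max_rows 100000 --verbose \
      | tee -a /tmp/nova_db_archive; do
    sleep 1   # brief pause between batches to limit database load
done

# With nova.instances shrunk, drop the stale mappings so map_instances can be
# re-run from a clean slate (destructive).
mysql -e 'TRUNCATE nova_api.instance_mappings;'
```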
Same situation as with nova: the purge did not work and we had ~300,000 stacks. They were reduced by running the command below; after ~1.5h only ~18,000 stacks were left.

```
heat-manage purge_deleted -g days 9 --batch_size 1000
```

We had to set pacemaker back to managed, as it was left unmanaged when the deploy failed:

```
pcs property set maintenance-mode=false
```

We hit an "already existing column" error when trying to run `heat-manage db_sync`, so we had to temporarily replace the content of /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py with the following one --> http://paste.openstack.org/show/629396/

```
cp -a /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py /root/
vim /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/079_resource_properties_data.py
heat-manage db_sync
cp -a /root/079_resource_properties_data.py /usr/lib/python2.7/site-packages/heat/db/sqlalchemy/migrate_repo/versions/
```

After fixing heat, the next deploy failed due to aodh:

```
Error: /Stage[main]/Aodh::Db::Sync/Exec[aodh-db-sync]: Failed to call refresh: aodh-dbsync --config-file /etc/aodh/aodh.conf returned 1 instead of one of [0]
Error: /Stage[main]/Aodh::Db::Sync/Exec[aodh-db-sync]: aodh-dbsync --config-file /etc/aodh/aodh.conf returned 1 instead of one of [0]
```

It complained about an already existing column, timestamp_tx; since our table had no data, we just dropped the column, set pacemaker back to managed and ran aodh-dbsync (a more defensive variant is sketched at the end of this comment):

```
mysql -e 'alter table aodh.alarm drop column timestamp_tx;'
pcs property set maintenance-mode=false
aodh-dbsync --config-file /etc/aodh/aodh.conf
```

This time the deploy finally reached UPDATE_COMPLETE status after 57 minutes.

The upgrade was completed using workarounds, and a joint retrospective with the Upgrades team was held; preventive actions will be tracked in the Upgrades backlog. The next checkpoint will be the rdocloud Ocata->Pike upgrade, planned soon, before Ocata upstream EOL (Feb 26, 2018).
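A slightly more defensive version of the aodh workaround above, as a sketch; the table and column names are taken from this report, so verify them against your own schema before dropping anything:

```bash
# Confirm the leftover column is present and the table really holds no data.
mysql -e "SHOW COLUMNS FROM aodh.alarm LIKE 'timestamp_tx';"
mysql -e "SELECT COUNT(*) FROM aodh.alarm;"

# Only if the column exists and the table contains nothing we care about:
mysql -e "ALTER TABLE aodh.alarm DROP COLUMN timestamp_tx;"

# Put pacemaker back in charge and re-run the aodh schema sync.
pcs property set maintenance-mode=false
aodh-dbsync --config-file /etc/aodh/aodh.conf
```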