Description of problem:
-----------------------
Overcloud upgrade converge failed:

openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack overcloud \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 2 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/enable-tls.yaml \
  -e /home/stack/virt/inject-trust-anchor.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/debug.yaml \
  -e /home/stack/virt/public_vip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

2018-06-26 14:22:56Z [AllNodesDeploySteps.ControllerDeployment_Step4.1]: CREATE_FAILED Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-06-26 14:24:21Z [AllNodesDeploySteps.ControllerDeployment_Step4.0]: SIGNAL_IN_PROGRESS Signal: deployment 86543e7e-a7d7-4755-a76f-097132ffb089 succeeded
2018-06-26 14:24:22Z [AllNodesDeploySteps.ControllerDeployment_Step4.0]: CREATE_COMPLETE state changed
2018-06-26 14:24:22Z [AllNodesDeploySteps.ControllerDeployment_Step4]: CREATE_FAILED Resource CREATE failed: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-06-26 14:24:23Z [AllNodesDeploySteps.ControllerDeployment_Step4]: CREATE_FAILED Error: resources.ControllerDeployment_Step4.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2018-06-26 14:24:23Z [AllNodesDeploySteps]: CREATE_FAILED Resource CREATE failed: Error: resources.ControllerDeployment_Step4.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2018-06-26 14:24:24Z [AllNodesDeploySteps]: CREATE_FAILED Error: resources.AllNodesDeploySteps.resources.ControllerDeployment_Step4.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2018-06-26 14:24:24Z [overcloud]: UPDATE_FAILED Error: resources.AllNodesDeploySteps.resources.ControllerDeployment_Step4.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6

Stack overcloud UPDATE_FAILED

openstack stack resource list --filter status=FAILED -n5 overcloud -f yaml
- physical_resource_id: 8c105c72-c8e0-4ca2-bac5-591516502e95
  resource_name: AllNodesDeploySteps
  resource_status: CREATE_FAILED
  resource_type: OS::TripleO::PostDeploySteps
  stack_name: overcloud
  updated_time: '2018-06-26T14:03:56Z'
- physical_resource_id: 80d73748-0b64-441c-b0d0-9a51d80fc5bb
  resource_name: ControllerDeployment_Step4
  resource_status: CREATE_FAILED
  resource_type: OS::Heat::StructuredDeploymentGroup
  stack_name: overcloud-AllNodesDeploySteps-epmolx5a24x3
  updated_time: '2018-06-26T14:03:57Z'
- physical_resource_id: eb7ee291-8d46-4b34-bdab-97e18440107d
  resource_name: '1'
  resource_status: CREATE_FAILED
  resource_type: OS::Heat::StructuredDeployment
  stack_name: overcloud-AllNodesDeploySteps-epmolx5a24x3-ControllerDeployment_Step4-iksgenqifsmp
  updated_time: '2018-06-26T14:15:40Z'
- physical_resource_id: 1e87c4c6-64a5-4e49-b59f-83907f2e0cf4
  resource_name: '2'
  resource_status: CREATE_FAILED
  resource_type: OS::Heat::StructuredDeployment
  stack_name: overcloud-AllNodesDeploySteps-epmolx5a24x3-ControllerDeployment_Step4-iksgenqifsmp
  updated_time: '2018-06-26T14:15:40Z'

On controller-1 and controller-2 the following error is present:

...
Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: cinder-manage db sync returned 1 instead of one of [0]\u001b[0m\n", "deploy_status_code": 6}

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-5.3.10-7.el7ost.noarch
puppet-tripleo-5.6.8-7.el7ost.noarch
python-cinder-9.1.4-33.el7ost.noarch
puppet-cinder-9.5.0-6.el7ost.noarch
python-cinderclient-1.9.0-6.el7ost.noarch
openstack-cinder-9.1.4-33.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud to RHOS-10
2. Launch a VM with a floating IP on the overcloud
3. Set up RHOS-10 repos on the overcloud
4. Start a ping test to the VM's floating IP
5. Run the 9->10 upgrade procedure

Actual results:
---------------
Upgrade fails on the converge step

Additional info:
----------------
Virtual env: 3 controllers + 2 computes + 3 ceph
Created attachment 1454764 [details] openstack stack failures list overcloud --long
So, from the live environment we can see that the failure was[0] the following, on both controller-1 and controller-2:

Debug: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Sleeping for 5.0 seconds between tries
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/bin/cinder-manage", line 10, in <module>
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     sys.exit(main())
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/cinder/cmd/manage.py", line 697, in main
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     logging.setup(CONF, "cinder")
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 269, in setup
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     _setup_logging_from_conf(conf, product_name, version)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 366, in _setup_logging_from_conf
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     filelog = file_handler(logpath)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/handlers.py", line 392, in __init__
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     logging.FileHandler.__init__(self, filename, mode, encoding, delay)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     StreamHandler.__init__(self, self._open())
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     stream = open(self.baseFilename, self.mode)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: IOError: [Errno 13] Permission denied: '/var/log/cinder/cinder-manage.log'
Notice: /Stage[main]/Cinder::Deps/Anchor[cinder::service::begin]: Triggered 'refresh' from 1 events

On those nodes the permissions are:

[root@controller-1 ~]# ls -dlrthZ /var/log/cinder/
drwxr-x---. cinder root system_u:object_r:cinder_log_t:s0 /var/log/cinder/
[root@controller-1 ~]# ls -dlrthZ /var/log/cinder/*
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/api.log
-rw-r--r--. root root unconfined_u:object_r:cinder_log_t:s0 /var/log/cinder/cinder-manage.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/scheduler.log
[root@controller-1 ~]# ls -dlrth /var/log/cinder/
drwxr-x---. 2 cinder root 67 Jun 26 11:53 /var/log/cinder/
[root@controller-1 ~]# ls -dlrth /var/log/cinder/*
-rw-r--r--. 1 cinder cinder 948K Jun 26 14:20 /var/log/cinder/api.log
-rw-r--r--. 1 root root 41K Jun 26 14:56 /var/log/cinder/cinder-manage.log
-rw-r--r--. 1 cinder cinder 742K Jun 26 17:14 /var/log/cinder/scheduler.log

On controller-0 (which is the bootstrap node):

[root@controller-0 ~]# ls -lrthdZ /var/log/cinder/
drwxr-x---. cinder root system_u:object_r:cinder_log_t:s0 /var/log/cinder/
[root@controller-0 ~]# ls -lrthdZ /var/log/cinder/*
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/api.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/cinder-manage.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/volume.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/scheduler.log

But why we are running cinder-manage db_sync on all three controller nodes is unclear to me. This conditional[1], which has been there since forever, seems to indicate that if step >= 4 we run the cinder db_sync even when we are not on the bootstrap node. Is that really intended? I couldn't find such a run in the logs of a working convergence run, though, so maybe there is another mechanism I'm missing that prevents the db sync from happening.

[0] see the full log in the attachment above
[1] https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/cinder/api.pp#L42..L43
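The ownership mismatch shown in the listings above can be checked quickly before (re)running converge. A minimal diagnostic sketch: on a real controller the directory would be /var/log/cinder and the expected user 'cinder'; here a scratch directory and the current user stand in for both so the shape of the check can be exercised anywhere.

```shell
# Hedged diagnostic sketch: list log files NOT owned by the expected service
# user. Any output means a db_sync run as the wrong user created the file
# (exactly the controller-1/2 state shown above).
logdir=$(mktemp -d)                              # stand-in for /var/log/cinder
touch "$logdir/api.log" "$logdir/cinder-manage.log" "$logdir/scheduler.log"
expected_user=$(id -un)                          # stand-in for 'cinder'
find "$logdir" -maxdepth 1 -type f ! -user "$expected_user"
```

On a healthy node the find prints nothing; a root-owned cinder-manage.log would show up immediately.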
So, after reviewing a working installation (but one with no test VM on it), it appears that cinder-manage db_sync is run many times during the deployment of OSP10 (which happens during the converge step):

On the bootstrap node:
- ControllerDeployment_Step3
- ControllerDeployment_Step4
- ControllerDeployment_Step5

On non-bootstrap nodes:
- ControllerDeployment_Step4
- ControllerDeployment_Step5

for a grand total of 7 times. This is "as expected" from a cold reading of the code: as soon as we include cinder::api in [1], we trigger a db_sync. This has been there forever, and even if it's a waste of time it doesn't appear to cause problems. What we had here is that the permissions were wrong under the /var/log/cinder directory. Currently I have no explanation for that.

[1] https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/cinder/api.pp#L42..L43
I agree it's odd that the code triggers db_sync on all nodes, and this could be fixed with a simple patch in puppet-tripleo. But, as noted, this is unrelated to the upgrade failure. Something still to be determined appears to be changing the ownership of /var/log/cinder/cinder-manage.log on controller-1 and controller-2. I say that because the db_sync command explicitly wants to run as the 'cinder' user [1].

[1] https://github.com/openstack/puppet-cinder/blob/stable/newton/manifests/db/sync.pp#L22
I looked at the contents of the cinder-manage.log files on controller-1 and -2, and the logs are definitely associated with a db_sync operation. This makes me think _something_ is running the db_sync as root. And if that was the first time the cinder-manage command was invoked on that host, it would explain why the log file is owned by root (it was the user that created the file).

I found one other place that runs cinder's db_sync, and that's [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

The code block should only execute on the bootstrap node, but if something went wrong and it _did_ execute, then it would explain how cinder-manage.log got created with root as the owner. And that would be the precondition that causes the failure when puppet-cinder attempts another db_sync, this time as the 'cinder' user.
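The "first writer owns the file" behaviour behind this theory is easy to demonstrate. A sketch using a scratch directory, with the current user standing in for root, since Unix simply assigns ownership of a new file to whoever creates it:

```shell
# A log file is owned by whichever user first creates it. If an upgrade
# script's db_sync running as root is the first thing to touch
# cinder-manage.log, the file stays root-owned, and a later db_sync running
# as 'cinder' then fails to open it for append (IOError: [Errno 13]).
tmpdir=$(mktemp -d)
: > "$tmpdir/cinder-manage.log"            # stand-in for the first db_sync run
stat -c '%U' "$tmpdir/cinder-manage.log"   # prints the creating user's name
```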
(In reply to Alan Bishop from comment #6)
> I looked at the contents of the cinder-manage.log files on controller-1 and
> -2 and the logs are definitely associated with a db_sync operation. This
> makes me think _something_ is running the db_sync as root. And if it's the
> first time the cinder-manage command has been invoked on that host, then
> this would explain why the log file is owned by root (it was the user that
> created file).
>
> I found one other place that runs cinder's db_sync, and that's [1].
>
> [1]
> https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/
> extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

As you said, it should only run on the bootstrap node, and the log confirms that it is not run on ctl-1 (there is a similar log on ctl-2) [1]. The log[2] on controller-0 adds confidence to that statement: when we enter this if statement we get additional output where the db syncs are run. Furthermore, it doesn't cause any change in the permissions on ctl-0 (certainly because the file was created beforehand).

But this is definitely a bug that should be addressed: a 'su - cinder' (or equivalent) should be done before that db sync. As a matter of fact, we should check whether a user change is needed for the other sync commands as well. Unfortunately I don't think it's the root cause here, as this part of the code is not run on ctl-1 and ctl-2. But I may have missed something. Maybe a brand new fresh reproducer is in order here.

Note to yprokule: when you re-ran the controller upgrade, cinder::db::sync was added to the mix, but as nothing was changed in cinder it wasn't triggered.
If you change something in the cinder configuration file (without restarting cinder), I'm pretty confident the error will pop up again, as cinder::db::sync will then be triggered (it's a refreshonly puppet resource).

[1] One can see that output in sosreport-failed-cinder-controller-1-20180626145541/sos_commands/pacemaker/crm_report/controller-1/journal.log, where we see that the node isn't considered the bootstrap node at that time:

Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:47 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Node isn't bootstrap, skipping check for galera to be started here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Starting or enabling redis
Jun 26 11:12:49 controller-1 os-collect-config[3048]: controller-1 pacemaker not active, skipping enable redis here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Node isn't bootstrap, skipping check for redis to be started here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Going to check_resource_systemd for mongod to be started
Jun 26 11:12:49 controller-1 os-collect-config[3048]: mongod is started

[2] In sosreport-failed-cinder-controller-0-20180626145350/sos_commands/pacemaker/crm_report/controller-0/journal.log:

Jun 26 11:15:06 controller-0 os-collect-config[3024]: redis has started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Tue Jun 26 11:14:45 UTC 2018 7ed2320f-ca02-48c4-a2ec-21b79a4bbceb tripleo-upgrade controller-0 Going to check_resource_systemd for mongod to be started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: mongod is started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: HTTP/1.1 200 OK
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Content-Type: text/plain
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Connection: close
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Content-Length: 32
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Galera cluster node is synced.
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Running upgrade for neutron ...
Jun 26 11:15:06 controller-0 os-collect-config[3024]: OK
Re-running the converge command succeeded, but the cinder-manage.log files on controller-1,2 are still owned by root.
(In reply to Yurii Prokulevych from comment #9)
> Re-running converge command succeeded but cinder-manage.log files on
> controller-1,2 are owned by root

As said in the previous comment, you need to change something in cinder.conf to trigger the puppet run of the sync, since this resource is refreshonly. The error will then show up again.
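One manual way out on controller-1/2 would be to restore the expected owner before anything re-triggers the sync; this is my assumption, not something tried in this report. Demonstrated on a scratch file with the current user as owner; on a real node the target would be /var/log/cinder/cinder-manage.log and the owner cinder:cinder.

```shell
# Hypothetical remediation sketch: give the log file back to the service
# user so a refresh-triggered 'cinder-manage db sync' can append to it.
f=$(mktemp)                    # stand-in for /var/log/cinder/cinder-manage.log
chown "$(id -un)" "$f"         # on a controller: chown cinder:cinder <logfile>
stat -c '%U' "$f"              # now owned by the expected user
```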
Further analysis reveals this problem only occurs in an artificial environment in which the upgrade to OSP-10 happens immediately after the initial deployment of OSP-9. If the OSP-9 deployment runs for a day, cinder's db_purge cron job runs, and that job logs to cinder-manage.log. This has the side effect of creating cinder-manage.log with the correct 'cinder' ownership prior to the OSP-10 upgrade, so the log file already has the proper ownership and the OSP-10 db_sync operation succeeds.

With this in mind, I'm lowering the BZ priority and severity. To fully address the problem, puppet-tripleo needs to be fixed to stop running db_sync on non-bootstrap nodes. I've filed a bug upstream, and will soon submit a patch to fix it.
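For reference, this is the shape I'd expect such a puppet-tripleo fix to take; a sketch under my assumptions, not the actual submitted patch. It assumes puppet-cinder's cinder::api class exposes a sync_db parameter (defaulting to true), so the profile can keep including the API service on every node while only asking the bootstrap node to run the db sync:

```puppet
# Sketch only (hypothetical, not the merged change) for
# tripleo::profile::base::cinder::api: gate the db sync on the bootstrap
# node instead of pulling in cinder::db::sync everywhere via the default.
class tripleo::profile::base::cinder::api (
  $bootstrap_node = hiera('bootstrap_nodeid', undef),
  $step           = hiera('step'),
) {
  # true only on the bootstrap node
  $sync_db = (downcase($::hostname) == downcase($bootstrap_node))

  if $step >= 4 or ($step >= 3 and $sync_db) {
    class { '::cinder::api':
      sync_db => $sync_db,  # non-bootstrap nodes skip cinder::db::sync
    }
  }
}
```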
Thanks for the OSP9 background tip/info mentioned in comment #11. Verification-wise, I'm guessing:

1. Deploy OSP9 + ceph
2. Immediately after (same day, to beat the cron job), upgrade to OSP10
3. A successfully completed upgrade => fix is verified

Correct?
Forgot to set needinfo on #20 ^
Canceling the needinfo; I'm confident about the steps, but feel free to correct me if wrong. Reference-wise, I'm following this upgrade guide:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment
Verified on:
puppet-tripleo-5.6.8-14.el7ost.noarch

Deployed OSP9 (3*controllers, 1*compute, 1*ceph).
Upgraded to OSP10 on the same day (to avoid cinder's built-in db_purge cron job).
No errors along the way.

2018-08-12 15:58:13Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: http://10.0.0.101:5000/v2.0
Overcloud Deployed
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2670