Bug 1595315

Summary:          [UPGRADES][10] Upgrade converge failed: cinder-manage db sync returned 1 instead of one of
Product:          Red Hat OpenStack
Component:        puppet-tripleo
Version:          10.0 (Newton)
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Target Milestone: z9
Target Release:   10.0 (Newton)
Hardware:         Unspecified
OS:               Unspecified
Reporter:         Yurii Prokulevych <yprokule>
Assignee:         Alan Bishop <abishop>
QA Contact:       Avi Avraham <aavraham>
Docs Contact:     Kim Nylander <knylande>
CC:               abishop, augol, ccamacho, geguileo, jamsmith, jjoyce, jschluet, morazi, sathlang, slinaber, srevivo, tshefi, tvignaud
Keywords:         ReleaseNotes, Triaged, ZStream
Fixed In Version: puppet-tripleo-5.6.8-13.el7ost
Doc Type:         Bug Fix
Doc Text:         During a version upgrade, Cinder's database synchronization is now executed only on the
                  bootstrap node. This prevents database synchronization and upgrade failures that occurred
                  when database synchronization was executed on all Controller nodes.
Clones:           1599409, 1640371 (view as bug list)
Last Closed:      2018-09-17 16:56:14 UTC
Type:             Bug
Bug Depends On:   1599409, 1599410
Description
Yurii Prokulevych
2018-06-26 15:13:24 UTC
Created attachment 1454764 [details]
openstack stack failures list overcloud --long
So, from the live environment we can see that the failure was [0]:

Debug: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Sleeping for 5.0 seconds between tries
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/bin/cinder-manage", line 10, in <module>
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     sys.exit(main())
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/cinder/cmd/manage.py", line 697, in main
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     logging.setup(CONF, "cinder")
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 269, in setup
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     _setup_logging_from_conf(conf, product_name, version)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 366, in _setup_logging_from_conf
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     filelog = file_handler(logpath)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/handlers.py", line 392, in __init__
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     logging.FileHandler.__init__(self, filename, mode, encoding, delay)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     StreamHandler.__init__(self, self._open())
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     stream = open(self.baseFilename, self.mode)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: IOError: [Errno 13] Permission denied: '/var/log/cinder/cinder-manage.log'
Notice: /Stage[main]/Cinder::Deps/Anchor[cinder::service::begin]: Triggered 'refresh' from 1 events

on both controller-1 and controller-2. On those nodes the permissions are:

[root@controller-1 ~]# ls -dlrthZ /var/log/cinder/
drwxr-x---. cinder root system_u:object_r:cinder_log_t:s0 /var/log/cinder/
[root@controller-1 ~]# ls -dlrthZ /var/log/cinder/*
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/api.log
-rw-r--r--. root root unconfined_u:object_r:cinder_log_t:s0 /var/log/cinder/cinder-manage.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/scheduler.log
[root@controller-1 ~]# ls -dlrth /var/log/cinder/
drwxr-x---. 2 cinder root 67 Jun 26 11:53 /var/log/cinder/
[root@controller-1 ~]# ls -dlrth /var/log/cinder/*
-rw-r--r--. 1 cinder cinder 948K Jun 26 14:20 /var/log/cinder/api.log
-rw-r--r--. 1 root root 41K Jun 26 14:56 /var/log/cinder/cinder-manage.log
-rw-r--r--. 1 cinder cinder 742K Jun 26 17:14 /var/log/cinder/scheduler.log
On controller-0 (which is the bootstrap node):

[root@controller-0 ~]# ls -lrthdZ /var/log/cinder/
drwxr-x---. cinder root system_u:object_r:cinder_log_t:s0 /var/log/cinder/
[root@controller-0 ~]# ls -lrthdZ /var/log/cinder/*
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/api.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/cinder-manage.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/volume.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/scheduler.log

Why we are running cinder-manage db_sync on all three controller nodes is unclear to me. This conditional [1], which has been there forever, seems to indicate that if step >= 4 we run the cinder db_sync even when we are not on the bootstrap node. Is that really intended? I couldn't find such a run in the logs of a working convergence run, though, so maybe there is another mechanism I'm missing that prevents the db sync from happening.

[0] see the full log in the attachment above
[1] https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/cinder/api.pp#L42..L43

After reviewing a working installation (though with no test VM on it), it appears that cinder-manage db_sync is run many times during the deployment of OSP 10 (which happens during the converge step):

On the bootstrap node:
- ControllerDeployment_Step3
- ControllerDeployment_Step4
- ControllerDeployment_Step5

On non-bootstrap nodes:
- ControllerDeployment_Step4
- ControllerDeployment_Step5

for a grand total of 7 times. This is "as expected" from a cold reading of the code: as soon as we include cinder::api in [1], we trigger a db_sync. This has been there forever, and even if it is a waste of time it doesn't appear to cause problems. What we had here is that the permissions were wrong on the /var/log/cinder directory, and currently I have no explanation for that.

[1] https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/cinder/api.pp#L42..L43

I agree it's odd that the code is triggering db_sync on all nodes, and this could be fixed with a simple patch in puppet-tripleo. But, as noted, this is unrelated to the upgrade failure. Something TBD appears to be changing the ownership of /var/log/cinder/cinder-manage.log on controller-1 and controller-2. I say that because the db_sync command explicitly wants to run as the 'cinder' user [1].

[1] https://github.com/openstack/puppet-cinder/blob/stable/newton/manifests/db/sync.pp#L22
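For reference, the resource behind that link is, roughly, the following. This is a simplified sketch, not the verbatim manifest: the command string and the retry values are assumptions, while the 'cinder' user and the refresh-only behaviour are confirmed by the reference above and by the log excerpt earlier in this bug.

```puppet
# Simplified sketch of puppet-cinder's cinder::db::sync exec (stable/newton).
exec { 'cinder-manage db_sync':
  command     => 'cinder-manage db sync',  # assumed command string
  path        => '/usr/bin',
  user        => 'cinder',      # runs as 'cinder' (sync.pp#L22), so a root-owned
                                # /var/log/cinder/cinder-manage.log gives EACCES
  refreshonly => true,          # only fires when notified, e.g. by a cinder config change
  try_sleep   => 5,             # matches "Sleeping for 5.0 seconds between tries" in the log
  tries       => 10,            # assumed retry count
  logoutput   => on_failure,
}
```

That covers both halves of the symptom: the sync only fires when something notifies the resource (the refresh-only behaviour discussed further down), and when it does fire as the 'cinder' user it cannot open a cinder-manage.log that root already created.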
I looked at the contents of the cinder-manage.log files on controller-1 and -2, and the logs are definitely associated with a db_sync operation. This makes me think _something_ is running the db_sync as root. And if it is the first time the cinder-manage command has been invoked on that host, that would explain why the log file is owned by root (it was the user that created the file).

I found one other place that runs cinder's db_sync, and that's [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

That code block should only execute on the bootstrap node, but if something went wrong and it _did_ execute, it would explain how cinder-manage.log got created with root as the owner. And that would be the precondition that causes the failure when puppet-cinder attempts another db_sync, this time as the 'cinder' user.

(In reply to Alan Bishop from comment #6)
> I looked at the contents of the cinder-manage.log files on controller-1 and
> -2 and the logs are definitely associated with a db_sync operation. This
> makes me think _something_ is running the db_sync as root. And if it's the
> first time the cinder-manage command has been invoked on that host, then
> this would explain why the log file is owned by root (it was the user that
> created file).
>
> I found one other place that runs cinder's db_sync, and that's [1].
>
> [1]
> https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/
> extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

As you said, it should only run on the bootstrap node, and the log confirms it is not run on ctl-1 (there is a similar log on ctl-2) [1]. The log [2] on controller-0 adds more confidence to that statement, since when we enter that if statement we get the additional output where the db syncs are run. Furthermore, it doesn't cause any change in the permissions on ctl-0 (most likely because the file was created earlier).

Still, this is definitely a bug and should be addressed: the script should switch to the cinder user (su - cinder) before running the db sync. As a matter of fact, we should check whether a user change is needed for the other sync commands as well. Unfortunately, I don't think it is the root cause here, as this part of the code is not run on ctl-1 and ctl-2. But I may have missed something; maybe a brand new, fresh reproducer is in order.

Note to yprokule: when you re-ran the controller upgrade, cinder::db::sync was added to the mix, but since nothing was changed in cinder it wasn't triggered. If you change something in the cinder configuration file (without restarting cinder), I'm pretty confident the error will pop up again, because cinder::db::sync will then be triggered (it is a refresh-only puppet resource).

[1] One can see that output in sosreport-failed-cinder-controller-1-20180626145541/sos_commands/pacemaker/crm_report/controller-1/journal.log, where the node isn't considered the bootstrap node at that time:
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:47 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Node isn't bootstrap, skipping check for galera to be started here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Starting or enabling redis
Jun 26 11:12:49 controller-1 os-collect-config[3048]: controller-1 pacemaker not active, skipping enable redis here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Node isn't bootstrap, skipping check for redis to be started here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Going to check_resource_systemd for mongod to be started
Jun 26 11:12:49 controller-1 os-collect-config[3048]: mongod is started

[2] In sosreport-failed-cinder-controller-0-20180626145350/sos_commands/pacemaker/crm_report/controller-0/journal.log:

Jun 26 11:15:06 controller-0 os-collect-config[3024]: redis has started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Tue Jun 26 11:14:45 UTC 2018 7ed2320f-ca02-48c4-a2ec-21b79a4bbceb tripleo-upgrade controller-0 Going to check_resource_systemd for mongod to be started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: mongod is started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: HTTP/1.1 200 OK
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Content-Type: text/plain
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Connection: close
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Content-Length: 32
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Galera cluster node is synced.
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Running upgrade for neutron ...
Jun 26 11:15:06 controller-0 os-collect-config[3024]: OK

Re-running the converge command succeeded, but the cinder-manage.log files on controller-1,2 are owned by root.

(In reply to Yurii Prokulevych from comment #9)
> Re-running converge command succeeded but cinder-manage.log files on
> controller-1,2 are owned by root

As said in the previous comment, you need to change something in cinder.conf to trigger the puppet run of the sync, as this resource is refreshonly. The error will then show up again.

Further analysis reveals this problem will only occur in an artificial environment in which the upgrade to OSP-10 happens immediately after the initial deployment of OSP-9. If the OSP-9 deployment runs for a day, cinder's db_purge cron job will run, and it logs to cinder-manage.log. This has the side effect of creating the cinder-manage.log file with the correct 'cinder' ownership prior to the OSP-10 upgrade. So, when upgrading to OSP-10, the log file will already exist with the proper ownership and the OSP-10 db_sync operation will succeed.

With this in mind, I'm lowering the BZ priority and severity. To fully address the problem, puppet-tripleo needs to be fixed to stop running db_sync on non-bootstrap nodes. I've filed a bug upstream, and will soon submit a patch to fix it.
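To illustrate the direction of that fix, here is a hedged sketch of the idea, not the merged puppet-tripleo patch: the 'bootstrap_nodeid' hiera key and the surrounding class shape are assumptions, and the point is simply that cinder::api's sync_db parameter decides whether cinder::db::sync (and thus Exec['cinder-manage db_sync']) is included on a given node.

```puppet
# Sketch only: gate cinder's db sync on the bootstrap node instead of running it
# at step >= 4 on every controller.
class tripleo::profile::base::cinder::api (
  $bootstrap_node = hiera('bootstrap_nodeid', undef),  # assumed hiera key
  $step           = hiera('step'),
) {
  # Only the bootstrap node should perform the database sync.
  if $bootstrap_node and $::hostname == downcase($bootstrap_node) {
    $sync_db = true
  } else {
    $sync_db = false
  }

  if $step >= 4 or ($step >= 3 and $sync_db) {
    class { '::cinder::api':
      sync_db => $sync_db,  # non-bootstrap nodes no longer pull in cinder::db::sync
      # other cinder::api parameters omitted for brevity
    }
  }
}
```

With a change along these lines, only controller-0 should ever create /var/log/cinder/cinder-manage.log during the converge step, which is what the verification below exercises.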
Thanks for the OSP 9 background tip/info mentioned in comment #11. Guessing, verification-wise:

1. Deploy OSP 9 + Ceph.
2. Immediately after (same day, to beat the cron job), upgrade to OSP 10.
3. A successfully completed upgrade => fix is verified, correct?

Forgot to set needinfo on #20 ^

Canceling needinfo; I'm confident about the steps, feel free to correct them if wrong. Reference-wise, following this upgrade guide:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment

Verified on: puppet-tripleo-5.6.8-14.el7ost.noarch

Deployed OSP 9 (3*controllers, 1*compute, 1*ceph) and upgraded to OSP 10 on the same day (to avoid cinder's built-in db_purge cron job). No errors along the way.

2018-08-12 15:58:13Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.0.0.101:5000/v2.0
Overcloud Deployed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2670