Bug 1595315

Summary:          [UPGRADES][10] Upgrade converge failed: cinder-manage db sync returned 1 instead of one of
Product:          Red Hat OpenStack
Component:        puppet-tripleo
Version:          10.0 (Newton)
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Target Milestone: z9
Target Release:   10.0 (Newton)
Hardware:         Unspecified
OS:               Unspecified
Reporter:         Yurii Prokulevych <yprokule>
Assignee:         Alan Bishop <abishop>
QA Contact:       Avi Avraham <aavraham>
Docs Contact:     Kim Nylander <knylande>
CC:               abishop, augol, ccamacho, geguileo, jamsmith, jjoyce, jschluet, morazi, sathlang, slinaber, srevivo, tshefi, tvignaud
Keywords:         ReleaseNotes, Triaged, ZStream
Fixed In Version: puppet-tripleo-5.6.8-13.el7ost
Doc Type:         Bug Fix
Doc Text:         During a version upgrade, Cinder's database synchronization is now executed only on the
                  bootstrap node. This prevents database synchronization and upgrade failures that occurred
                  when database synchronization was executed on all Controller nodes.
Clones:           1599409, 1640371 (view as bug list)
Last Closed:      2018-09-17 16:56:14 UTC
Type:             Bug
Bug Depends On:   1599409, 1599410
Description
Yurii Prokulevych
2018-06-26 15:13:24 UTC
Created attachment 1454764 [details]
openstack stack failures list overcloud --long
So, from the live environment we can see that the failure was [0]:

Debug: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Sleeping for 5.0 seconds between tries
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/bin/cinder-manage", line 10, in <module>
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     sys.exit(main())
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/cinder/cmd/manage.py", line 697, in main
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     logging.setup(CONF, "cinder")
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 269, in setup
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     _setup_logging_from_conf(conf, product_name, version)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 366, in _setup_logging_from_conf
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     filelog = file_handler(logpath)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/handlers.py", line 392, in __init__
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     logging.FileHandler.__init__(self, filename, mode, encoding, delay)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     StreamHandler.__init__(self, self._open())
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:   File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns:     stream = open(self.baseFilename, self.mode)
Notice: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]/returns: IOError: [Errno 13] Permission denied: '/var/log/cinder/cinder-manage.log'
Notice: /Stage[main]/Cinder::Deps/Anchor[cinder::service::begin]: Triggered 'refresh' from 1 events

on both controller-1 and controller-2. On those nodes the permissions are:

[root@controller-1 ~]# ls -dlrthZ /var/log/cinder/
drwxr-x---. cinder root system_u:object_r:cinder_log_t:s0 /var/log/cinder/
[root@controller-1 ~]# ls -dlrthZ /var/log/cinder/*
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/api.log
-rw-r--r--. root root unconfined_u:object_r:cinder_log_t:s0 /var/log/cinder/cinder-manage.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/scheduler.log
[root@controller-1 ~]# ls -dlrth /var/log/cinder/
drwxr-x---. 2 cinder root 67 Jun 26 11:53 /var/log/cinder/
[root@controller-1 ~]# ls -dlrth /var/log/cinder/*
-rw-r--r--. 1 cinder cinder 948K Jun 26 14:20 /var/log/cinder/api.log
-rw-r--r--. 1 root root 41K Jun 26 14:56 /var/log/cinder/cinder-manage.log
-rw-r--r--. 1 cinder cinder 742K Jun 26 17:14 /var/log/cinder/scheduler.log
On controller-0 (which is the bootstrap node):

[root@controller-0 ~]# ls -lrthdZ /var/log/cinder/
drwxr-x---. cinder root system_u:object_r:cinder_log_t:s0 /var/log/cinder/
[root@controller-0 ~]# ls -lrthdZ /var/log/cinder/*
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/api.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/cinder-manage.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/volume.log
-rw-r--r--. cinder cinder system_u:object_r:cinder_log_t:s0 /var/log/cinder/scheduler.log

Why we are running cinder-manage db_sync on all three controller nodes is unclear to me. This conditional [1], which has been there forever, seems to indicate that if step >= 4 we run the cinder db_sync even when we are not on the bootstrap node. Is that really intended? I couldn't find such a run in the logs of a working convergence run, though, so maybe there is another mechanism I'm missing that prevents the db sync from happening.

[0] see the full log in the attachment above
[1] https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/cinder/api.pp#L42..L43

After reviewing a working installation (though with no test VM on it), it appears that cinder-manage db_sync is run many times during the deployment of OSP 10 (which happens during the converge step):

On the bootstrap node:
- ControllerDeployment_Step3
- ControllerDeployment_Step4
- ControllerDeployment_Step5

On non-bootstrap nodes:
- ControllerDeployment_Step4
- ControllerDeployment_Step5

for a grand total of 7 times. This is "as expected" from a cold reading of the code: as soon as we include cinder::api in [1], we trigger a db_sync. This has been there forever, and even if it is a waste of time it doesn't appear to cause problems. What we had here is that the permissions were wrong on the /var/log/cinder directory, and currently I have no explanation for that.

[1] https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/cinder/api.pp#L42..L43

I agree it's odd that the code is triggering db_sync on all nodes, and this could be fixed with a simple patch in puppet-tripleo. But, as noted, this is unrelated to the upgrade failure. Something TBD appears to be changing the ownership of /var/log/cinder/cinder-manage.log on controller-1 and controller-2. I say that because the db_sync command explicitly wants to run as the 'cinder' user [1].

[1] https://github.com/openstack/puppet-cinder/blob/stable/newton/manifests/db/sync.pp#L22
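For reference, the resource behind that link is, roughly, the following. This is a simplified sketch, not the verbatim manifest: the command string and the retry values are assumptions, while the 'cinder' user and the refresh-only behaviour are confirmed by the reference above and by the log excerpt earlier in this bug.

```puppet
# Simplified sketch of puppet-cinder's cinder::db::sync exec (stable/newton).
exec { 'cinder-manage db_sync':
  command     => 'cinder-manage db sync',  # assumed command string
  path        => '/usr/bin',
  user        => 'cinder',      # runs as 'cinder' (sync.pp#L22), so a root-owned
                                # /var/log/cinder/cinder-manage.log gives EACCES
  refreshonly => true,          # only fires when notified, e.g. by a cinder config change
  try_sleep   => 5,             # matches "Sleeping for 5.0 seconds between tries" in the log
  tries       => 10,            # assumed retry count
  logoutput   => on_failure,
}
```

That covers both halves of the symptom: the sync only fires when something notifies the resource (the refresh-only behaviour discussed further down), and when it does fire as the 'cinder' user it cannot open a cinder-manage.log that root already created.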
I looked at the contents of the cinder-manage.log files on controller-1 and -2, and the logs are definitely associated with a db_sync operation. This makes me think _something_ is running the db_sync as root. And if it is the first time the cinder-manage command has been invoked on that host, that would explain why the log file is owned by root (it was the user that created the file).

I found one other place that runs cinder's db_sync, and that's [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

That code block should only execute on the bootstrap node, but if something went wrong and it _did_ execute, it would explain how cinder-manage.log got created with root as the owner. And that would be the precondition that causes the failure when puppet-cinder attempts another db_sync, this time as the 'cinder' user.

(In reply to Alan Bishop from comment #6)
> I looked at the contents of the cinder-manage.log files on controller-1 and
> -2 and the logs are definitely associated with a db_sync operation. This
> makes me think _something_ is running the db_sync as root. And if it's the
> first time the cinder-manage command has been invoked on that host, then
> this would explain why the log file is owned by root (it was the user that
> created file).
>
> I found one other place that runs cinder's db_sync, and that's [1].
>
> [1]
> https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/
> extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

As you said, it should only run on the bootstrap node, and the log confirms it is not run on ctl-1 (there is a similar log on ctl-2) [1]. The log [2] on controller-0 adds more confidence to that statement, since when we enter that if statement we get the additional output where the db syncs are run. Furthermore, it doesn't cause any change in the permissions on ctl-0 (most likely because the file was created earlier).

Still, this is definitely a bug and should be addressed: the script should switch to the cinder user (su - cinder) before running the db sync. As a matter of fact, we should check whether a user change is needed for the other sync commands as well. Unfortunately, I don't think it is the root cause here, as this part of the code is not run on ctl-1 and ctl-2. But I may have missed something; maybe a brand new, fresh reproducer is in order.

Note to yprokule: when you re-ran the controller upgrade, cinder::db::sync was added to the mix, but since nothing was changed in cinder it wasn't triggered. If you change something in the cinder configuration file (without restarting cinder), I'm pretty confident the error will pop up again, because cinder::db::sync will then be triggered (it is a refresh-only puppet resource).

[1] One can see that output in sosreport-failed-cinder-controller-1-20180626145541/sos_commands/pacemaker/crm_report/controller-1/journal.log, where the node isn't considered the bootstrap node at that time:
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:47 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Node isn't bootstrap, skipping check for galera to be started here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Starting or enabling redis
Jun 26 11:12:49 controller-1 os-collect-config[3048]: controller-1 pacemaker not active, skipping enable redis here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Node isn't bootstrap, skipping check for redis to be started here
Jun 26 11:12:49 controller-1 os-collect-config[3048]: Tue Jun 26 11:12:48 UTC 2018 e4ddf8fa-e171-4473-8712-d392f739e6d5 tripleo-upgrade controller-1 Going to check_resource_systemd for mongod to be started
Jun 26 11:12:49 controller-1 os-collect-config[3048]: mongod is started

[2] In sosreport-failed-cinder-controller-0-20180626145350/sos_commands/pacemaker/crm_report/controller-0/journal.log:

Jun 26 11:15:06 controller-0 os-collect-config[3024]: redis has started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Tue Jun 26 11:14:45 UTC 2018 7ed2320f-ca02-48c4-a2ec-21b79a4bbceb tripleo-upgrade controller-0 Going to check_resource_systemd for mongod to be started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: mongod is started
Jun 26 11:15:06 controller-0 os-collect-config[3024]: HTTP/1.1 200 OK
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Content-Type: text/plain
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Connection: close
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Content-Length: 32
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Galera cluster node is synced.
Jun 26 11:15:06 controller-0 os-collect-config[3024]: Running upgrade for neutron ...
Jun 26 11:15:06 controller-0 os-collect-config[3024]: OK

Re-running the converge command succeeded, but the cinder-manage.log files on controller-1,2 are owned by root.

(In reply to Yurii Prokulevych from comment #9)
> Re-running converge command succeeded but cinder-manage.log files on
> controller-1,2 are owned by root

As said in the previous comment, you need to change something in cinder.conf to trigger the puppet run of the sync, as this resource is refreshonly. The error will then show up again.

Further analysis reveals this problem will only occur in an artificial environment in which the upgrade to OSP-10 happens immediately after the initial deployment of OSP-9. If the OSP-9 deployment runs for a day, cinder's db_purge cron job will run, and it logs to cinder-manage.log. This has the side effect of creating the cinder-manage.log file with the correct 'cinder' ownership prior to the OSP-10 upgrade. So, when upgrading to OSP-10, the log file will already exist with the proper ownership and the OSP-10 db_sync operation will succeed.

With this in mind, I'm lowering the BZ priority and severity. To fully address the problem, puppet-tripleo needs to be fixed to stop running db_sync on non-bootstrap nodes. I've filed a bug upstream, and will soon submit a patch to fix it.
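To illustrate the direction of that fix, here is a hedged sketch of the idea, not the merged puppet-tripleo patch: the 'bootstrap_nodeid' hiera key and the surrounding class shape are assumptions, and the point is simply that cinder::api's sync_db parameter decides whether cinder::db::sync (and thus Exec['cinder-manage db_sync']) is included on a given node.

```puppet
# Sketch only: gate cinder's db sync on the bootstrap node instead of running it
# at step >= 4 on every controller.
class tripleo::profile::base::cinder::api (
  $bootstrap_node = hiera('bootstrap_nodeid', undef),  # assumed hiera key
  $step           = hiera('step'),
) {
  # Only the bootstrap node should perform the database sync.
  if $bootstrap_node and $::hostname == downcase($bootstrap_node) {
    $sync_db = true
  } else {
    $sync_db = false
  }

  if $step >= 4 or ($step >= 3 and $sync_db) {
    class { '::cinder::api':
      sync_db => $sync_db,  # non-bootstrap nodes no longer pull in cinder::db::sync
      # other cinder::api parameters omitted for brevity
    }
  }
}
```

With a change along these lines, only controller-0 should ever create /var/log/cinder/cinder-manage.log during the converge step, which is what the verification below exercises.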
Thanks for the OSP 9 background tip/info mentioned in comment #11. Guessing, verification-wise:

1. Deploy OSP 9 + Ceph.
2. Immediately after (same day, to beat the cron job), upgrade to OSP 10.
3. A successfully completed upgrade => fix is verified, correct?

Forgot to set needinfo on #20 ^

Canceling needinfo; I'm confident about the steps, feel free to correct them if wrong. Reference-wise, following this upgrade guide:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment

Verified on: puppet-tripleo-5.6.8-14.el7ost.noarch

Deployed OSP 9 (3*controllers, 1*compute, 1*ceph) and upgraded to OSP 10 on the same day (to avoid cinder's built-in db_purge cron job). No errors along the way.

2018-08-12 15:58:13Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.0.0.101:5000/v2.0
Overcloud Deployed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2670