Bug 1911620

Summary: [RHOSP13] Regression blocks minor upgrades for overclouds with TripleO-managed Ceph clusters
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: openstack-tripleo-heat-templates
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED DUPLICATE
QA Contact: Joe H. Rahme <jhakimra>
Severity: high
Docs Contact:
Priority: unspecified
Version: 13.0 (Queens)
CC: aschultz, fpantano, gfidente, johfulto, mburns, sathlang
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
Last Closed: 2021-01-07 14:01:09 UTC
Type: Bug

Description Alex Stupnikov 2020-12-30 12:33:52 UTC
Description of problem:

The patch for bug #1877815 blocks the minor upgrade procedure for RHOSP 13 clusters with TripleO-managed Ceph when a new docker RPM is available.

A customer reported that the minor upgrade procedure fails with error [1] on one of the controller nodes. The workaround is simple: manually start the appropriate services and repeat the minor upgrade procedure. The next run then fails in the same way on the next controller node.

As a result, this problem is not a complete blocker, but it requires 4 executions of the "openstack overcloud update run" command.

The customer provided sosreports from the affected controller node and the director node, along with a complete set of mistral logs. From the logs, it looks like the ceph-mon container (and the other docker containers) was originally stopped by the "Stop docker" play. The "Double check the mon systemd unit is not consistent with the current mon" play then finds that the ceph-mon unit is not running, so ansible runs the "Stop mons to make them consistent with systemd" play, which fails because there is no such container.
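
To illustrate why the "Stop mons to make them consistent with systemd" step fails, here is a minimal Python sketch of a stop operation that treats an already-missing container as success. It is not the fix that was shipped and not code from the playbooks; the helper names `stop_result_is_benign` and `stop_container` are hypothetical, and the only inputs taken from this report are the `docker stop` command and the "No such container" error text.

```python
import subprocess

# Error text reported by the docker daemon in the failing task (see [1]).
MISSING_MARKER = "No such container"


def stop_result_is_benign(rc, stderr):
    """Return True when the outcome of `docker stop` leaves the container stopped.

    rc == 0 means docker stopped the container. A non-zero rc whose stderr
    contains "No such container" means there was nothing left to stop, which
    is the situation after the "Stop docker" play already removed it.
    """
    if rc == 0:
        return True
    return MISSING_MARKER in stderr


def stop_container(name):
    """Hypothetical wrapper: run `docker stop NAME` and tolerate a missing container."""
    proc = subprocess.run(
        ["docker", "stop", name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return stop_result_is_benign(proc.returncode, proc.stderr)
```

With this logic, the result shown in [1] (rc 1, "No such container: ceph-mon-controller01") would count as an already-stopped mon rather than a fatal error, while any other docker failure would still be reported.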

Again, the full set of logs is available in the attachments of the provided case.

[1]
  2020-12-29 16:10:52,071 p=12272 u=mistral |  fatal: [controller01]: FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller01", "delta": "0:00:00.032422", "end": "2020-12-29 16:10:52.048778", "msg": "non-zero return code", "rc": 1, "start": "2020-12-29 16:10:52.016356", "stderr": "Error response from daemon: No such container: ceph-mon-controller01", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller01"], "stdout": "", "stdout_lines": []}
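
When searching the mistral logs for the same failure on other controllers, it can help to extract the host, command, and stderr from lines like [1]. The following Python sketch assumes the log keeps the format shown above, with a JSON result after "FAILED! =>". The names `FATAL_RE`, `parse_fatal`, and `SAMPLE` are made up for this example; `SAMPLE` is the line from [1].

```python
import json
import re

# Matches the host in brackets and the JSON result that follows "FAILED! =>".
FATAL_RE = re.compile(
    r"fatal: \[(?P<host>[^\]]+)\]: FAILED! => (?P<payload>\{.*\})\s*$"
)

# The log line quoted in [1].
SAMPLE = (
    '2020-12-29 16:10:52,071 p=12272 u=mistral |  fatal: [controller01]: '
    'FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller01", '
    '"delta": "0:00:00.032422", "end": "2020-12-29 16:10:52.048778", '
    '"msg": "non-zero return code", "rc": 1, '
    '"start": "2020-12-29 16:10:52.016356", '
    '"stderr": "Error response from daemon: No such container: ceph-mon-controller01", '
    '"stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller01"], '
    '"stdout": "", "stdout_lines": []}'
)


def parse_fatal(line):
    """Return host, cmd, rc and stderr for a fatal task line, or None if it does not match."""
    match = FATAL_RE.search(line)
    if match is None:
        return None
    result = json.loads(match.group("payload"))
    return {
        "host": match.group("host"),
        "cmd": result.get("cmd"),
        "rc": result.get("rc"),
        "stderr": result.get("stderr"),
    }


if __name__ == "__main__":
    print(parse_fatal(SAMPLE))
```

Applying `parse_fatal` to every line of a log file and filtering on "No such container" in the `stderr` field should list each controller on which the stop step failed.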

Comment 4 John Fulton 2021-01-07 13:24:43 UTC
Please use https://access.redhat.com/solutions/5679791 to work around this issue.
This looks like a duplicate of bug 1910842.