Bug 1883499

Summary: [FFU OSP13 TO 16.1] upgrade run fails at "Error response from daemon: No such container: ceph-mon-overcloud-controller-0"
Product: Red Hat OpenStack Reporter: Ravi Singh <ravsingh>
Component: tripleo-ansible Assignee: Francesco Pantano <fpantano>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Sasha Smolyak <ssmolyak>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train) CC: fpantano, gfidente, jpretori, jstransk, ramishra, sgolovat, yrabl
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-21 17:26:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
package_update.log none

Description Ravi Singh 2020-09-29 12:11:50 UTC
Created attachment 1717522 [details]
package_update.log

Description of problem:

During FFU from 13 to 16.1, we got stuck as described below.

1. To run the leapp upgrade on the overcloud nodes, we executed:

~~~
openstack overcloud upgrade run --tags system_upgrade --limit overcloud-controller-0
~~~

2. The deployment failed with:

~~~
2020-09-21 15:04:43,623 p=5132 u=mistral n=ansible | fatal: [overcloud-controller-0]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\n Run \"yum repolist all\" to see the repos you have.\n To enable Red Hat Subscription Management repositories:\n     subscription-manager repos --enable <repo>\n To enable custom repositories:\n     yum-config-manager --enable <repo>\n", "rc": 1, "results": []}
~~~

3. We enabled the repos, but now the deployment fails much earlier, while trying to set certain flags on the OSDs:

~~~
2020-09-21 15:25:58,604 p=5945 u=mistral n=ansible | failed: [overcloud-controller-0 -> 172.16.0.21] (item=nodeep-scrub) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec 
-u root ceph-mon-${HOSTNAME} ceph osd set nodeep-scrub", "delta": "0:00:00.023928", "end": "2020-09-29 11:12:28.027507", "item": "nodeep-scrub", "msg": "non-zero return code", "rc": 1, "start": "2020-09-29 11:12:28.003579", "stderr": "Error response from daemon: No such container: ceph-mon-overcloud-controller-0", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-overcloud-controller-0"], "stdout": "", "stdout_lines": []}
~~~
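
The failing task runs `docker exec` against a container that no longer exists. A minimal guard sketch (this is an illustration, not the actual tripleo-ansible fix) would check for the container first and skip the OSD-flag command instead of failing the play; the container name follows the log above:

~~~shell
# Hedged sketch: only set the OSD flag when the ceph-mon container is
# actually running, so a re-run after the containers were stopped does
# not abort the play.
container="ceph-mon-${HOSTNAME}"
if docker ps --format '{{.Names}}' 2>/dev/null | grep -qx "${container}"; then
    docker exec -u root "${container}" ceph osd set nodeep-scrub
    result="flag-set"
else
    # Container already stopped/removed: record a skip instead of failing.
    result="skipped"
fi
echo "${result}"
~~~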

4. During the execution of the first command we had already stopped all docker containers, which is why the re-run now fails much earlier:

~~~
2020-09-21 15:01:49,161 p=5132 u=mistral n=ansible | TASK [Stop all services by stopping all docker containers] *********************
2020-09-21 15:01:49,162 p=5132 u=mistral n=ansible | Monday 21 September 2020  15:01:49 -0400 (0:00:03.066)       0:00:50.624 ****** 
2020-09-21 15:01:49,355 p=5132 u=mistral n=ansible | TASK [tripleo-podman : Check if docker is enabled in the system] ***************
2020-09-21 15:01:49,355 p=5132 u=mistral n=ansible | Monday 21 September 2020  15:01:49 -0400 (0:00:00.193)       0:00:50.818 ****** 
2020-09-21 15:01:49,632 p=5132 u=mistral n=ansible | ok: [overcloud-controller-0] => {"changed": false, "failed_when_result": false, "stat": {"atime": 1601376605.4744918, "attr_flags": "", "attributes": [], "block_size": 4096, "blocks": 0, "charset": "binary", "ctime": 1600358610.215621, "dev": 19, "device_type": 0, "executable": false, "exists": true, "gid": 1002, "gr_name": "docker", "inode": 65479, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": false, "issock": true, "isuid": false, "mimetype": "inode/socket", "mode": "0660", "mtime": 1600358610.215621, "nlink": 1, "path": "/var/run/docker.sock", "pw_name": "root", "readable": true, "rgrp": true, "roth": false, "rusr": true, "size": 0, "uid": 0, "version": null, "wgrp": true, "woth": false, "writeable": true, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}}
2020-09-21 15:01:49,673 p=5132 u=mistral n=ansible | TASK [tripleo-podman : Stop all services by stopping all Docker containers] ****
~~~

We need some mechanism to get out of this situation: either skip the already-completed tasks, or allow restarting the playbook at the step where it actually failed.
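
One possible shape for such a mechanism (purely illustrative; the marker file and step names below are not part of tripleo-ansible) is to record each completed step and consult that record on re-runs, so repeated executions become idempotent:

~~~shell
# Hedged sketch of a "skip already-executed steps" mechanism using a
# marker file. Paths and step names are hypothetical.
donefile="$(mktemp)"

step_done() { grep -qx "$1" "$donefile"; }
mark_done() { echo "$1" >> "$donefile"; }
run_step() {
    if step_done "$1"; then
        echo "skip $1"
    else
        echo "run $1"      # the real task would execute here
        mark_done "$1"
    fi
}

first=$(run_step "stop_containers")   # first pass: runs the step
second=$(run_step "stop_containers")  # re-run: step is skipped
echo "$first / $second"               # → run stop_containers / skip stop_containers
~~~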

Upgrade is completely blocked due to this.

I will attach the package_update.log file.

Version-Release number of selected component (if applicable):

OSP 13 to 16.1
How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:

The upgrade fails much earlier on re-run.

Expected results:

It should skip the already-executed steps.
Additional info: