Bug 1883499

Summary: [FFU OSP13 TO 16.1] upgrade run fails at "Error response from daemon: No such container: ceph-mon-overcloud-controller-0"
Product: Red Hat OpenStack Reporter: Ravi Singh <ravsingh>
Component: tripleo-ansible Assignee: Francesco Pantano <fpantano>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Sasha Smolyak <ssmolyak>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train) CC: fpantano, gfidente, jpretori, jstransk, ramishra, sgolovat, yrabl
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-21 17:26:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
package_update.log none

Description Ravi Singh 2020-09-29 12:11:50 UTC
Created attachment 1717522 [details]
package_update.log

Description of problem:

During FFU from 13 to 16.1, we got stuck as described below.

1. To run the leapp upgrade on the overcloud nodes, we executed:

~~~
openstack overcloud upgrade run --tags system_upgrade --limit overcloud-controller-0
~~~

2. The deployment failed with:

~~~
2020-09-21 15:04:43,623 p=5132 u=mistral n=ansible | fatal: [overcloud-controller-0]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\n Run \"yum repolist all\" to see the repos you have.\n To enable Red Hat Subscription Management repositories:\n     subscription-manager repos --enable <repo>\n To enable custom repositories:\n     yum-config-manager --enable <repo>\n", "rc": 1, "results": []}
~~~

3. We enabled the repos, but now the deployment fails much earlier, while trying to set certain flags on the OSDs:

~~~
2020-09-21 15:25:58,604 p=5945 u=mistral n=ansible | failed: [overcloud-controller-0 -> 172.16.0.21] (item=nodeep-scrub) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec 
-u root ceph-mon-${HOSTNAME} ceph osd set nodeep-scrub", "delta": "0:00:00.023928", "end": "2020-09-29 11:12:28.027507", "item": "nodeep-scrub", "msg": "non-zero return code", "rc": 1, "start": "2020-09-29 11:12:28.003579", "stderr": "Error response from daemon: No such container: ceph-mon-overcloud-controller-0", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-overcloud-controller-0"], "stdout": "", "stdout_lines": []}
~~~
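
The failing task runs `docker exec` against a container that no longer exists. A minimal guard sketch (this is an illustration, not the actual tripleo-ansible fix) would check for the container first and skip the OSD-flag command instead of failing the play; the container name follows the log above:

~~~shell
# Hedged sketch: only set the OSD flag when the ceph-mon container is
# actually running, so a re-run after the containers were stopped does
# not abort the play.
container="ceph-mon-${HOSTNAME}"
if docker ps --format '{{.Names}}' 2>/dev/null | grep -qx "${container}"; then
    docker exec -u root "${container}" ceph osd set nodeep-scrub
    result="flag-set"
else
    # Container already stopped/removed: record a skip instead of failing.
    result="skipped"
fi
echo "${result}"
~~~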

4. During the execution of the first command we had already stopped all docker containers, which is why the re-run now fails much earlier:

~~~
2020-09-21 15:01:49,161 p=5132 u=mistral n=ansible | TASK [Stop all services by stopping all docker containers] *********************
2020-09-21 15:01:49,162 p=5132 u=mistral n=ansible | Monday 21 September 2020  15:01:49 -0400 (0:00:03.066)       0:00:50.624 ****** 
2020-09-21 15:01:49,355 p=5132 u=mistral n=ansible | TASK [tripleo-podman : Check if docker is enabled in the system] ***************
2020-09-21 15:01:49,355 p=5132 u=mistral n=ansible | Monday 21 September 2020  15:01:49 -0400 (0:00:00.193)       0:00:50.818 ****** 
2020-09-21 15:01:49,632 p=5132 u=mistral n=ansible | ok: [overcloud-controller-0] => {"changed": false, "failed_when_result": false, "stat": {"atime": 1601376605.4744918, "attr_flags": "", "attributes": [], "block_size": 4096, "blocks": 0, "charset": "binary", "ctime": 1600358610.215621, "dev": 19, "device_type": 0, "executable": false, "exists": true, "gid": 1002, "gr_name": "docker", "inode": 65479, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": false, "issock": true, "isuid": false, "mimetype": "inode/socket", "mode": "0660", "mtime": 1600358610.215621, "nlink": 1, "path": "/var/run/docker.sock", "pw_name": "root", "readable": true, "rgrp": true, "roth": false, "rusr": true, "size": 0, "uid": 0, "version": null, "wgrp": true, "woth": false, "writeable": true, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}}
2020-09-21 15:01:49,673 p=5132 u=mistral n=ansible | TASK [tripleo-podman : Stop all services by stopping all Docker containers] ****
~~~

We need some mechanism to get out of this situation: either skip the already-completed tasks, or allow restarting the playbook at the step where it actually failed.
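
One possible shape for such a mechanism (purely illustrative; the marker file and step names below are not part of tripleo-ansible) is to record each completed step and consult that record on re-runs, so repeated executions become idempotent:

~~~shell
# Hedged sketch of a "skip already-executed steps" mechanism using a
# marker file. Paths and step names are hypothetical.
donefile="$(mktemp)"

step_done() { grep -qx "$1" "$donefile"; }
mark_done() { echo "$1" >> "$donefile"; }
run_step() {
    if step_done "$1"; then
        echo "skip $1"
    else
        echo "run $1"      # the real task would execute here
        mark_done "$1"
    fi
}

first=$(run_step "stop_containers")   # first pass: runs the step
second=$(run_step "stop_containers")  # re-run: step is skipped
echo "$first / $second"               # → run stop_containers / skip stop_containers
~~~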

Upgrade is completely blocked due to this.

I will attach the package_update.log file.

Version-Release number of selected component (if applicable):

OSP 13 to 16.1
How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:

The upgrade fails much earlier on re-run.

Expected results:

It should skip the already-executed steps.
Additional info: