Bug 1883499 - [FFU OSP13 TO 16.1] upgrade run fails at "Error response from daemon: No such container: ceph-mon-overcloud-controller-0"
Summary: [FFU OSP13 TO 16.1] upgrade run fails at "Error response from daemon: No such container: ceph-mon-overcloud-controller-0"
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Francesco Pantano
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-29 12:11 UTC by Ravi Singh
Modified: 2020-12-21 17:26 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-21 17:26:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
package_update.log (1.98 MB, text/plain)
2020-09-29 12:11 UTC, Ravi Singh

Description Ravi Singh 2020-09-29 12:11:50 UTC
Created attachment 1717522 [details]
package_update.log

Description of problem:

During an FFU from OSP 13 to 16.1 we got stuck as follows:

1. To run the leapp upgrade on the overcloud nodes, we executed:

~~~
openstack overcloud upgrade run --tags system_upgrade --limit overcloud-controller-0
~~~

2. The deployment failed with:

~~~
2020-09-21 15:04:43,623 p=5132 u=mistral n=ansible | fatal: [overcloud-controller-0]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\n Run \"yum repolist all\" to see the repos you have.\n To enable Red Hat Subscription Management repositories:\n     subscription-manager repos --enable <repo>\n To enable custom repositories:\n     yum-config-manager --enable <repo>\n", "rc": 1, "results": []}
~~~
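The error message itself suggests checking `yum repolist all`. A pre-flight check along these lines could catch the "no enabled repos" failure before re-running the upgrade. This is a hedged sketch: `check_required_repos` and `REPOLIST_CMD` are hypothetical helpers (not part of tripleo), and the repo IDs shown in the comment are assumptions for a RHEL 8 / OSP 16.1 node; adjust them to the actual deployment.

```shell
# Hedged pre-flight sketch: verify the repos the upgrade needs are
# enabled before re-running "openstack overcloud upgrade run".
# REPOLIST_CMD is a hypothetical hook (defaults to yum) so the check
# can be exercised without a subscribed host.
REPOLIST_CMD="${REPOLIST_CMD:-yum repolist enabled}"

check_required_repos() {
  local enabled rc=0
  enabled=$($REPOLIST_CMD 2>/dev/null)
  for repo in "$@"; do
    case "$enabled" in
      *"$repo"*) ;;                                 # repo id found
      *) echo "repo not enabled: $repo" >&2; rc=1 ;;
    esac
  done
  return $rc
}

# Example repo IDs for a RHEL 8 / OSP 16.1 node -- assumptions, adjust
# to the deployment:
# check_required_repos rhel-8-for-x86_64-baseos-eus-rpms \
#                      openstack-16.1-for-rhel-8-x86_64-rpms
```

Run on each overcloud node before retrying; a non-zero exit with the listed repo IDs points at the missing subscription-manager enablement.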

3. We enabled the repos, but the deployment now fails much earlier, while trying to set flags on the Ceph OSDs:

~~~
2020-09-21 15:25:58,604 p=5945 u=mistral n=ansible | failed: [overcloud-controller-0 -> 172.16.0.21] (item=nodeep-scrub) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec 
-u root ceph-mon-${HOSTNAME} ceph osd set nodeep-scrub", "delta": "0:00:00.023928", "end": "2020-09-29 11:12:28.027507", "item": "nodeep-scrub", "msg": "non-zero return code", "rc": 1, "start": "2020-09-29 11:12:28.003579", "stderr": "Error response from daemon: No such container: ceph-mon-overcloud-controller-0", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-overcloud-controller-0"], "stdout": "", "stdout_lines": []}
~~~
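The failing task boils down to `docker exec -u root ceph-mon-${HOSTNAME} ceph osd set <flag>` against a container that no longer exists. A minimal sketch of the guard this failure implies, assuming one were to patch the task logic; `DOCKER`, `container_exists`, and `set_osd_flag` are hypothetical names, not part of tripleo-ansible:

```shell
# Hedged sketch: check that the ceph-mon container actually exists
# before running "ceph osd set", instead of letting a stopped/removed
# container fail the whole play. DOCKER is a hypothetical hook
# (defaults to the real docker CLI) so the logic can be exercised
# without a container runtime.
DOCKER="${DOCKER:-docker}"

container_exists() {
  # "docker inspect" exits non-zero when the container is absent
  "$DOCKER" inspect "$1" >/dev/null 2>&1
}

set_osd_flag() {
  local flag=$1
  local mon="ceph-mon-${HOSTNAME}"
  if container_exists "$mon"; then
    "$DOCKER" exec -u root "$mon" ceph osd set "$flag"
  else
    echo "skipping 'ceph osd set $flag': container $mon not found" >&2
  fi
}
```

With a guard like this, the `nodeep-scrub`-style tasks would be skipped with a warning on a node whose containers were already stopped, rather than aborting the upgrade.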

4. During the execution of the first command, all docker containers (including ceph-mon) had already been stopped, which is why the retry now fails at this earlier step:

~~~
2020-09-21 15:01:49,161 p=5132 u=mistral n=ansible | TASK [Stop all services by stopping all docker containers] *********************
2020-09-21 15:01:49,162 p=5132 u=mistral n=ansible | Monday 21 September 2020  15:01:49 -0400 (0:00:03.066)       0:00:50.624 ****** 
2020-09-21 15:01:49,355 p=5132 u=mistral n=ansible | TASK [tripleo-podman : Check if docker is enabled in the system] ***************
2020-09-21 15:01:49,355 p=5132 u=mistral n=ansible | Monday 21 September 2020  15:01:49 -0400 (0:00:00.193)       0:00:50.818 ****** 
2020-09-21 15:01:49,632 p=5132 u=mistral n=ansible | ok: [overcloud-controller-0] => {"changed": false, "failed_when_result": false, "stat": {"atime": 1601376605.4744918, "attr_flags": "", "attributes": [], "block_size": 4096, "blocks": 0, "charset": "binary", "ctime": 1600358610.215621, "dev": 19, "device_type": 0, "executable": false, "exists": true, "gid": 1002, "gr_name": "docker", "inode": 65479, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": false, "issock": true, "isuid": false, "mimetype": "inode/socket", "mode": "0660", "mtime": 1600358610.215621, "nlink": 1, "path": "/var/run/docker.sock", "pw_name": "root", "readable": true, "rgrp": true, "roth": false, "rusr": true, "size": 0, "uid": 0, "version": null, "wgrp": true, "woth": false, "writeable": true, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}}
2020-09-21 15:01:49,673 p=5132 u=mistral n=ansible | TASK [tripleo-podman : Stop all services by stopping all Docker containers] ****
~~~

We need some mechanism to get out of this situation: either skip the already-completed tasks, or allow the playbook to restart at the step where it actually failed.

Upgrade is completely blocked due to this.
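One shape the requested skip mechanism could take is re-running the upgrade with `--skip-tags`. This is only a sketch: it assumes the tripleo client passes `--skip-tags` through to ansible-playbook, `build_retry_cmd` is a hypothetical helper, and the tag names in a real retry would have to come from the failing playbook itself.

```shell
# Hedged sketch of the retry the report asks for: rebuild the
# upgrade-run command with --skip-tags so already-completed task
# groups are not replayed. Tag names are illustrative placeholders.
build_retry_cmd() {
  local node=$1; shift
  local skip
  skip=$(IFS=,; echo "$*")   # join remaining args with commas
  echo "openstack overcloud upgrade run --tags system_upgrade" \
       "--limit $node${skip:+ --skip-tags $skip}"
}
```

For example, `build_retry_cmd overcloud-controller-0 tag_a tag_b` prints the original `upgrade run` command with `--skip-tags tag_a,tag_b` appended.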

I will attach the package_update.log file.

Version-Release number of selected component (if applicable):

OSP 13 to 16.1
How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:

The upgrade now fails much earlier than the original failure point.

Expected results:

It should skip the already-executed steps and resume from the point of failure.
Additional info:

