Bug 1795792 - Overcloud minor update fails 'host looking for a container name it would never have'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 4.0
Assignee: Dimitri Savineau
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 1796492 1642481
Reported: 2020-01-28 21:27 UTC by Alistair Tonner
Modified: 2020-01-31 12:49 UTC (History)

Fixed In Version: ceph-ansible-4.0.14-1.el8cp, ceph-ansible-4.0.14-1.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1796492
Environment:
Last Closed: 2020-01-31 12:48:52 UTC
Target Upstream Version:


Attachments
ansible log output from update runs- includes listed error (13.85 MB, text/plain)
2020-01-28 21:27 UTC, Alistair Tonner
ansible log from ceph-update-run (327.25 KB, application/gzip)
2020-01-28 21:46 UTC, John Fulton


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 5003 0 None closed ceph-facts: fix _container_exec_cmd fact value 2020-12-18 15:27:38 UTC
Red Hat Product Errata RHBA-2020:0312 0 None None None 2020-01-31 12:49:04 UTC

Description Alistair Tonner 2020-01-28 21:27:22 UTC
Created attachment 1656120 [details]
ansible log output from update runs- includes listed error

Description of problem:
  The composable-minor_update-RHELOSP-48861 job fails in the overcloud update, during ceph-update-run.sh.


Version-Release number of selected component (if applicable):

RHOS_TRUNK-16.0-RHEL-8-20200124.n.1

puppet-ceph-3.0.1-0.20191002213425.55a0f94.el8ost.noarch
ansible-role-redhat-subscription-1.0.5-0.20191022053336.6c67a40.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.1-0.20191108131052.2ad3189.el8ost.noarch
python3-tripleoclient-12.3.1-0.20191230195937.585fb28.el8ost.noarch
ansible-pacemaker-1.0.4-0.20191022042340.0e4d7c0.el8ost.noarch
ansible-role-atos-hsm-0.1.1-0.20191024165047.866e075.el8ost.noarch
python3-tripleo-common-11.3.3-0.20200121231250.3c68b48.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200121231250.3c68b48.el8ost.noarch
puppet-tripleo-11.4.1-0.20200118215809.6f9bf6c.el8ost.noarch
ansible-2.8.8-1.el8ae.noarch
ansible-config_template-1.0.1-0.20191122040234.ff61269.el8ost.noarch
openstack-tripleo-image-elements-10.6.1-0.20191022065313.7338463.el8ost.noarch
openstack-tripleo-validations-11.3.1-0.20191126041901.2bba53a.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200114185851.813f68b.el8ost.noarch
ansible-tripleo-ipsec-9.2.0-0.20191022054642.ffe104c.el8ost.noarch
ansible-role-thales-hsm-0.2.1-0.20191024165911.2803c6c.el8ost.noarch
ansible-role-openstack-operations-0.0.1-0.20191022044056.29cc537.el8ost.noarch
ceph-ansible-4.0.13-1.el8cp.noarch
python3-heat-agent-ansible-1.10.1-0.20191022061131.96b819c.el8ost.noarch
tripleo-ansible-0.4.2-0.20200110023759.ee731ba.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.1-0.20191230195937.585fb28.el8ost.noarch
ansible-role-chrony-1.0.2-0.20191022052427.03e7fbe.el8ost.noarch
ansible-role-tripleo-modify-image-1.1.1-0.20200122200932.58d7a5b.el8ost.noarch
ansible-role-container-registry-1.1.1-0.20191025041237.bf2e310.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200121231250.3c68b48.el8ost.noarch


How reproducible:

 Consistent.

Steps to Reproduce:
1. Deploy OSP16 with 3 controller, 3 Ceph, 2 compute, 2 Ironic, and composable nodes, then execute a minor update.

Actual results:

overcloud_update_run_CephStorage.sh fails, pointing to this Ansible error:

<192.168.24.24> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o 'IdentityFile=\"/var/lib/mistral/0ed610f4-1262-4635-a01c-9cdba029ce0b/ssh_private_key\"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User=\"tripleo-admin\"' -o ConnectTimeout=60 -o ControlPath=/root/.ansible/cp/%h-%r-%p 192.168.24.24 '/bin/sh -c '\"'\"'sudo -H -S -n  -u root /bin/sh -c '\"'\"'\"'\"'\"'\"'\"'\"'echo BECOME-SUCCESS-jhmetxwsdgoftvpgvpogrwqknvupotmj ; /usr/bin/python3'\"'\"'\"'\"'\"'\"'\"'\"' && sleep 0'\"'\"''",
        "<192.168.24.24> (1, b'\\n{\"msg\": \"non-zero return code\", \"cmd\": [\"podman\", \"exec\", \"ceph-mon-overcloud-controller-2\", \"ceph\", \"--cluster\", \"ceph\", \"auth\", \"get\", \"mgr.controller-0\"], \"stdout\": \"\", \"stderr\": \"Error: no container with name or ID ceph-mon-overcloud-controller-2 found: no such container\", \"rc\": 125, \"start\": \"2020-01-28 10:50:22.611230\", \"end\": \"2020-01-28 10:50:22.704025\", \"delta\": \"0:00:00.092795\", \"changed\": true, \"failed\": true, \"invocation\": {\"module_args\": {\"_raw_params\": \"podman exec ceph-mon-overcloud-controller-2 ceph --cluster ceph auth get mgr.controller-0\", \"warn\": true, \"_uses_shell\": false, \"stdin_add_newline\": true, \"strip_empty_ends\": true, \"argv\": null, \"chdir\": null, \"executable\": null, \"creates\": null, \"removes\": null, \"stdin\": null}}}\\n', b'')",
        "failed: [overcloud-controller-0 -> 192.168.24.24] (item={'name': 'mgr.controller-0', 'path': '/var/lib/ceph/mgr/ceph-controller-0/keyring', 'copy_key': True}) => changed=true ",
        "  - mgr.controller-0",
        "  delta: '0:00:00.092795'",
        "  end: '2020-01-28 10:50:22.704025'",
        "      _raw_params: podman exec ceph-mon-overcloud-controller-2 ceph --cluster ceph auth get mgr.controller-0",
        "    copy_key: true",
        "    name: mgr.controller-0",
        "    path: /var/lib/ceph/mgr/ceph-controller-0/keyring",
        "  start: '2020-01-28 10:50:22.611230'",
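
The mismatch in the log above can be checked mechanically. The following is a minimal sketch, not ceph-ansible code: it assumes that monitor containers are named `ceph-mon-<short hostname>` (inferred only from the container name in the error), and the helper names `exec_target`, `expected_mon_container`, and `targets_local_mon` are hypothetical. The hostname passed in the example is also an assumption, since the log does not say which host 192.168.24.24 is.

```python
import shlex
from typing import Optional


def exec_target(cmd: str) -> Optional[str]:
    """Return the container name a 'podman exec' command targets, else None.

    Simplification: options that take a separate value (e.g. '--user root')
    are not handled; the command in the log has no options.
    """
    argv = shlex.split(cmd)
    if len(argv) < 3 or argv[0] != "podman" or argv[1] != "exec":
        return None
    for arg in argv[2:]:
        if not arg.startswith("-"):
            return arg
    return None


def expected_mon_container(short_hostname: str) -> str:
    # Assumption: naming convention inferred from the error message only.
    return f"ceph-mon-{short_hostname}"


def targets_local_mon(cmd: str, short_hostname: str) -> bool:
    """True if the command execs into the mon container of the given host."""
    return exec_target(cmd) == expected_mon_container(short_hostname)


# Command copied from the "_raw_params" field of the failed task.
RAW = ("podman exec ceph-mon-overcloud-controller-2 "
       "ceph --cluster ceph auth get mgr.controller-0")

# Hypothetical: suppose the task ran on overcloud-controller-0.
print(exec_target(RAW))
print(targets_local_mon(RAW, "overcloud-controller-0"))
```

Under that assumption the check returns False, which matches the "no such container" error: the command names a container that only exists on overcloud-controller-2.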


Expected results:

Overcloud update should succeed and complete.


Additional info:

Comment 2 John Fulton 2020-01-28 21:35:33 UTC
I see a bug like bz1792320, but in rolling_update.yml in ceph-ansible-4.0.13-1.el8cp.noarch:

"Error: no container with name or ID ceph-mon-overcloud-controller-2 found: no such container"

        "failed: [overcloud-controller-0 -> 192.168.24.24] (item={'name': 'mgr.controller-0', 'path': '/var/lib/ceph/mgr/ceph-controller-0/keyring', 'copy_key': True}) => changed=true ",
        "  - mgr.controller-0",
        "  delta: '0:00:00.092795'",
        "  end: '2020-01-28 10:50:22.704025'",
        "      _raw_params: podman exec ceph-mon-overcloud-controller-2 ceph --cluster ceph auth get mgr.controller-0",
        "    copy_key: true",
        "    name: mgr.controller-0",
        "    path: /var/lib/ceph/mgr/ceph-controller-0/keyring",
        "  start: '2020-01-28 10:50:22.611230'",

Comment 3 John Fulton 2020-01-28 21:43:30 UTC
The task "waiting for the containerized monitor to join the quorum"

https://github.com/ceph/ceph-ansible/blob/v4.0.13/infrastructure-playbooks/rolling_update.yml#L275-L285

was the last to run [1]

this should only affect minor updates of ceph.


[1] 

[fultonj@runcible stack]$ grep TASK ceph-update-run.log  | tail -5
2020-01-28 10:50:27 |         "TASK [start ceph mon] **********************************************************",
2020-01-28 10:50:27 |         "TASK [start ceph mgr] **********************************************************",
2020-01-28 10:50:27 |         "TASK [restart containerized ceph mon] ******************************************",
2020-01-28 10:50:27 |         "TASK [non container | waiting for the monitor to join the quorum...] ***********",
2020-01-28 10:50:27 |         "TASK [container | waiting for the containerized monitor to join the quorum...] ***",
[fultonj@runcible stack]$
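
The same "last tasks" view can be produced without grep. This is a rough sketch, assuming the log keeps the `TASK [name] ****` header format shown above; the function name `last_tasks` and the regular expression are illustrative and not part of any tool mentioned in this bug.

```python
import re
from collections import deque

# Matches an Ansible task header such as: TASK [start ceph mon] ******
TASK_RE = re.compile(r"TASK \[(.+?)\] \*+")


def last_tasks(lines, n=5):
    """Return the names of the last n TASK headers seen in the given lines."""
    recent = deque(maxlen=n)  # keeps only the n most recent matches
    for line in lines:
        match = TASK_RE.search(line)
        if match:
            recent.append(match.group(1))
    return list(recent)


# Sample lines copied from the grep output above.
SAMPLE = [
    '2020-01-28 10:50:27 |         "TASK [start ceph mon] **********************************************************",',
    '2020-01-28 10:50:27 |         "TASK [start ceph mgr] **********************************************************",',
    '2020-01-28 10:50:27 |         "TASK [restart containerized ceph mon] ******************************************",',
    '2020-01-28 10:50:27 |         "TASK [non container | waiting for the monitor to join the quorum...] ***********",',
    '2020-01-28 10:50:27 |         "TASK [container | waiting for the containerized monitor to join the quorum...] ***",',
]

for name in last_tasks(SAMPLE):
    print(name)
```

For a real run, an open file handle can be passed instead of the sample list, for example `with open("ceph-update-run.log") as f: last_tasks(f)`.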

Comment 4 John Fulton 2020-01-28 21:46:22 UTC
Created attachment 1656122 [details]
ansible log from ceph-update-run

Comment 5 Dimitri Savineau 2020-01-29 03:06:39 UTC
> this should only affect minor updates of ceph.

IMHO, if it impacts minor updates, it should impact major updates too, because they use the same playbook.

Comment 7 Federico Lucifredi 2020-01-29 03:44:02 UTC
Hi Tejas, 
  I think you should start RC testing anyway. We will not have confirmation on Blocker/not blocker until Dimitri wakes. 

  John, you also get to call it a blocker (or not) for OSP. Please advise.

Comment 8 Dimitri Savineau 2020-01-29 03:51:10 UTC
From my point of view, it is a blocker.

Comment 9 Giulio Fidente 2020-01-29 08:34:38 UTC
(In reply to Federico Lucifredi from comment #7)
> Hi Tejas, 
>   I think you should start RC testing anyway. We will not have confirmation
> on Blocker/not blocker until Dimitri wakes. 
> 
>   John, you also get to call it a blocker (or not) for OSP. Please advise.

Unfortunately, I think it is. Without the fix, people won't be able to update their overcloud from 16 GA to the next 16 z-stream release (16.0.1).

Comment 23 errata-xmlrpc 2020-01-31 12:48:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0312

