Bug 1792320

Summary: "ceph-handler : unset noup flag attempts to use container not on host
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: John Fulton <johfulto>
Component: Ceph-AnsibleAssignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA QA Contact: Vasishta <vashastr>
Severity: urgent Docs Contact: Bara Ancincova <bancinco>
Priority: medium    
Version: 4.1CC: agunn, amsyedha, aschoen, bancinco, ceph-eng-bugs, edonnell, flucifre, fpantano, gabrioux, gcharot, gfidente, gmeno, jbrier, jvisser, kdreyer, knortema, nthomas, nweinber, pasik, pgrist, sunnagar, tchandra, tserlin, vashastr, ykaul, yobshans, yrabl
Target Milestone: rcKeywords: Regression
Target Release: 4.1Flags: knortema: needinfo+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-ansible-4.0.13-1.el8cp, ceph-ansible-4.0.13-1.el7cp Doc Type: Bug Fix
Doc Text:
.{storage-product} installation on Red Hat OpenStack Platform no longer fails

Previously, the `ceph-ansible` utility became unresponsive when attempting to install {product} with Red Hat OpenStack Platform 16, and it returned an error similar to the following:

----
'Error: unable to exec into ceph-mon-dcn1-computehci1-2: no container with name or ID ceph-mon-dcn1-computehci1-2 found: no such container'
----

This occurred because `ceph-ansible` read the value of the fact `container_exec_cmd` from the wrong node in `handler_osds.yml`. With this update, `ceph-ansible` reads the value of `container_exec_cmd` from the correct node, and the installation proceeds successfully.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-19 17:32:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1760354, 1730176, 1816167    
Attachments:
Description: ceph-ansible playbook log, inventory and vars - from failedqa run
Flags: none

Description John Fulton 2020-01-17 14:15:04 UTC
ceph-ansible 4.0.10 encounters the following error [1]

'Error: unable to exec into ceph-mon-dcn1-computehci1-2: no container with name or ID ceph-mon-dcn1-computehci1-2 found: no such container'

on this task:

https://github.com/ceph/ceph-ansible/blob/v4.0.10/roles/ceph-handler/tasks/handler_osds.yml#L8

This can be avoided by updating the task from OLD to NEW as follows.

OLD:

- name: unset noup flag
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd unset noup"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  changed_when: False

NEW:

- name: unset noup flag
  command: "{{ hostvars[groups[mon_group_name][0]]['container_exec_cmd'] | default('') }} ceph --cluster {{ cluster }} osd unset noup"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  changed_when: False
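
The hostvars lookup is the important part: delegate_to only changes where the command executes, while the Jinja template is still rendered with the current inventory host's variables. The handler therefore builds a "podman exec ceph-mon-<current hostname>" command and runs it on the first monitor, where no container with that name exists. Qualifying the fact with hostvars[groups[mon_group_name][0]] makes the command match the node it is delegated to. As a rough illustration of why the value differs per host (the exact task lives elsewhere in ceph-ansible, e.g. the ceph-facts role; treat the details below as an assumption), the fact embeds the local hostname:

# Illustrative sketch (assumption), not the verbatim ceph-ansible task:
# every node ends up with a different "podman exec ceph-mon-<hostname>" prefix.
- name: set_fact container_exec_cmd
  set_fact:
    container_exec_cmd: "{{ container_binary }} exec ceph-mon-{{ ansible_hostname }}"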

[1]
2020-01-16 20:07:41,091 p=566779 u=root |  RUNNING HANDLER [ceph-handler : unset noup flag] *******************************
2020-01-16 20:07:41,092 p=566779 u=root |  task path: /usr/share/ceph-ansible/roles/ceph-handler/tasks/handler_osds.yml:6
2020-01-16 20:07:41,092 p=566779 u=root |  Thursday 16 January 2020  20:07:41 +0000 (0:00:00.204)       0:04:04.329 ****** 
2020-01-16 20:07:42,114 p=566779 u=root |  fatal: [dcn1-computehci1-2 -> 192.168.34.79]: FAILED! => changed=false 
  cmd:
  - podman
  - exec
  - ceph-mon-dcn1-computehci1-2
  - ceph
  - --cluster
  - ceph
  - osd
  - unset
  - noup
  delta: '0:00:00.076633'
  end: '2020-01-16 20:07:42.036222'
  msg: non-zero return code
  rc: 125
  start: '2020-01-16 20:07:41.959589'
  stderr: 'Error: unable to exec into ceph-mon-dcn1-computehci1-2: no container with name or ID ceph-mon-dcn1-computehci1-2 found: no such container'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-01-16 20:07:42,351 p=566779 u=root |  fatal: [dcn1-computehci1-1 -> 192.168.34.79]: FAILED! => changed=false 
  cmd:
  - podman
  - exec
  - ceph-mon-dcn1-computehci1-1
  - ceph
  - --cluster
  - ceph
  - osd
  - unset
  - noup
  delta: '0:00:00.072984'
  end: '2020-01-16 20:07:42.274307'
  msg: non-zero return code
  rc: 125
  start: '2020-01-16 20:07:42.201323'
  stderr: 'Error: unable to exec into ceph-mon-dcn1-computehci1-1: no container with name or ID ceph-mon-dcn1-computehci1-1 found: no such container'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

Comment 1 RHEL Program Management 2020-01-17 14:15:12 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 3 Guillaume Abrioux 2020-01-22 16:59:56 UTC
*** Bug 1794111 has been marked as a duplicate of this bug. ***

Comment 15 Yuri Obshansky 2020-01-24 23:59:40 UTC
Still failed with error:
        "fatal: [dcn1-computehci1-2 -> 192.168.34.63]: FAILED! => changed=false ",
        "  delta: '0:00:00.103441'",
        "  end: '2020-01-24 23:36:41.139930'",
        "      _raw_params: podman exec ceph-mon-dcn1-computehci1-2 ceph --cluster ceph osd unset noup",
        "  start: '2020-01-24 23:36:41.036489'",
        "  stderr: 'Error: no container with name or ID ceph-mon-dcn1-computehci1-2 found: no such container'",

while the container is running on that host:
[root@dcn1-computehci1-2 ~]# podman ps |grep ceph-mon
da807e827ac6  site-undercloud-0.ctlplane.localdomain:8787/ceph/rhceph-4.0-rhel8:latest           13 minutes ago  Up 13 minutes ago         ceph-mon-dcn1-computehci1-2

The local ceph-ansible contains the fix:
(undercloud) [stack@site-undercloud-0 ~]$ sudo less /usr/share/ceph-ansible/roles/ceph-handler/tasks/handler_osds.yml

- name: unset noup flag
  command: "{{ hostvars[groups[mon_group_name][0]]['container_exec_cmd'] | default('') }} ceph --cluster {{ cluster }} osd unset noup"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  changed_when: False

(undercloud) [stack@site-undercloud-0 ~]$ rpm -qa | grep ceph
puppet-ceph-3.0.1-0.20191002213425.55a0f94.el8ost.noarch
ceph-ansible-4.0.12-1.el8cp.noarch
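
A hedged debugging aid for reproductions like this (not part of ceph-ansible): printing the fact as it resolves from the first monitor's hostvars shows which exec prefix the fixed handler will actually use.

# Debug-only sketch (assumption: run inside a ceph-ansible play where
# mon_group_name is defined and facts have already been set).
- name: show container_exec_cmd resolved from the first mon
  debug:
    msg: "{{ hostvars[groups[mon_group_name][0]]['container_exec_cmd'] | default('UNDEFINED') }}"
  run_once: true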

Comment 18 Giulio Fidente 2020-01-27 13:28:42 UTC
Created attachment 1655676 [details]
ceph-ansible playbook log, inventory and vars - from failedqa run

Comment 20 Federico Lucifredi 2020-01-27 20:17:14 UTC
Paul or John: is this a blocker for OSP integration?

Comment 43 errata-xmlrpc 2020-05-19 17:32:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:2231