Description of problem:

During an update from OSP16 to passed_phased2, the ceph update step failed:

openstack overcloud external-update run \
    --stack qe-Cloud-0 \
    --tags ceph 2>&1

The last part of the error was:

2020-03-10 00:54:07 | "ok: [controller-0] => (item={'cmd': ['podman', 'ps', '-q', '--filter', 'name=ceph-mon-controller-0'], 'stdout': '195d00c31463', 'stderr': '', 'rc': 0, 'start': '2020-03-10 00:53:42.546279', 'end': '2020-03-10 00:53:42.691936', 'delta': '0:00:00.145657', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'podman ps -q --filter name=ceph-mon-controller-0', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['195d00c31463'], 'stderr_lines': [], 'failed': False, 'failed_when_result': False, 'item': 'controller-0', 'ansible_loop_var': 'item'}) => changed=false ",
2020-03-10 00:54:07 | " delta: '0:00:00.145657'",
2020-03-10 00:54:07 | " end: '2020-03-10 00:53:42.691936'",
2020-03-10 00:54:07 | " start: '2020-03-10 00:53:42.546279'",
2020-03-10 00:54:07 | "ok: [controller-0] => (item={'cmd': ['podman', 'ps', '-q', '--filter', 'name=ceph-mon-controller-1'], 'stdout': '129c7eb5764a', 'stderr': '', 'rc': 0, 'start': '2020-03-10 00:53:43.075757', 'end': '2020-03-10 00:53:43.204303', 'delta': '0:00:00.128546', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'podman ps -q --filter name=ceph-mon-controller-1', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['129c7eb5764a'], 'stderr_lines': [], 'failed': False, 'failed_when_result': False, 'item': 'controller-1', 'ansible_loop_var': 'item'}) => changed=false ",
2020-03-10 00:54:07 | " delta: '0:00:00.128546'",
2020-03-10 00:54:07 | " end: '2020-03-10 00:53:43.204303'",
2020-03-10 00:54:07 | " start: '2020-03-10 00:53:43.075757'",
2020-03-10 00:54:07 | "skipping: [controller-0] => (item={'cmd': ['podman', 'ps', '-q', '--filter', 'name=ceph-mon-controller-2'], 'stdout': '', 'stderr': '', 'rc': 0, 'start': '2020-03-10 00:53:43.634250', 'end': '2020-03-10 00:53:43.773763', 'delta': '0:00:00.139513', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'podman ps -q --filter name=ceph-mon-controller-2', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [], 'stderr_lines': [], 'failed': False, 'failed_when_result': False, 'item': 'controller-2', 'ansible_loop_var': 'item'}) => changed=false ",
2020-03-10 00:54:07 | " delta: '0:00:00.139513'",
2020-03-10 00:54:07 | " end: '2020-03-10 00:53:43.773763'",
2020-03-10 00:54:07 | " start: '2020-03-10 00:53:43.634250'",
2020-03-10 00:54:07 | "Tuesday 10 March 2020 00:53:44 +0000 (0:00:00.188) 0:02:34.048 ********* ",
2020-03-10 00:54:07 | "Tuesday 10 March 2020 00:53:45 +0000 (0:00:00.303) 0:02:34.352 ********* ",
2020-03-10 00:54:07 | "Tuesday 10 March 2020 00:53:45 +0000 (0:00:00.118) 0:02:34.470 ********* ",
2020-03-10 00:54:07 | " rc: 1",
2020-03-10 00:54:07 | "Tuesday 10 March 2020 00:53:45 +0000 (0:00:00.181) 0:02:34.651 ********* ",
2020-03-10 00:54:07 | "Tuesday 10 March 2020 00:53:45 +0000 (0:00:00.451) 0:02:35.103 ********* ",
2020-03-10 00:54:07 | "FAILED - RETRYING: get current fsid (3 retries left).",
2020-03-10 00:54:07 | "FAILED - RETRYING: get current fsid (2 retries left).",
2020-03-10 00:54:07 | "FAILED - RETRYING: get current fsid (1 retries left).",
2020-03-10 00:54:07 | "fatal: [controller-0 -> 192.168.24.47]: FAILED! => changed=true ",
2020-03-10 00:54:07 | " attempts: 3",
2020-03-10 00:54:07 | " - ceph-mon-controller-2",
2020-03-10 00:54:07 | " - --admin-daemon",
2020-03-10 00:54:07 | " - /var/run/ceph/ceph-mon.controller-2.asok",
2020-03-10 00:54:07 | " - config",
2020-03-10 00:54:07 | " - get",
2020-03-10 00:54:07 | " - fsid",
2020-03-10 00:54:07 | " delta: '0:00:00.101879'",
2020-03-10 00:54:07 | " end: '2020-03-10 00:54:02.860223'",
2020-03-10 00:54:07 | " rc: 125",
2020-03-10 00:54:07 | " start: '2020-03-10 00:54:02.758344'",
2020-03-10 00:54:07 | " stderr: 'Error: no container with name or ID ceph-mon-controller-2 found: no such container'",

The failure occurred during: TASK [select a running monitor]

Version-Release number of selected component (if applicable):

ceph-ansible.noarch 4.0.14-1.el8cp @rhelosp-ceph-4-tools
puddle: GA (RHOS_TRUNK-16.0-RHEL-8-20200204.n.1) to RHOS_TRUNK-16.0-RHEL-8-20200226.n.1

How reproducible:

Seen only once. A previous run with the same ceph-ansible and the same puddle was successful, and another test got past the ceph update step. It may still be worth investigating.
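For readers following the log: the loop output above reports "ok" for the two items whose `podman ps -q` call returned a container ID and "skipping" for controller-2, whose stdout was empty. A minimal sketch of that kind of selection is shown below. It is an illustration only, not the ceph-ansible implementation; the function names and the trimmed-down result dictionaries are assumptions made for this example.

```python
from __future__ import annotations

from typing import Optional


def select_running_monitor(results: list[dict]) -> Optional[str]:
    """Return the first host whose `podman ps -q` check printed a container ID.

    Hypothetical helper: each entry mimics one loop result from the log,
    reduced to the 'item', 'stdout' and 'rc' keys.
    """
    for result in results:
        if result.get("rc") == 0 and result.get("stdout", "").strip():
            return result["item"]
    return None


def monitors_without_container(results: list[dict]) -> list[str]:
    """List hosts for which no running monitor container was found."""
    return [r["item"] for r in results if not r.get("stdout", "").strip()]


# Values copied from the log excerpt in this report.
results = [
    {"item": "controller-0", "stdout": "195d00c31463", "rc": 0},
    {"item": "controller-1", "stdout": "129c7eb5764a", "rc": 0},
    {"item": "controller-2", "stdout": "", "rc": 0},
]

print(select_running_monitor(results))
print(monitors_without_container(results))
```

With the values from this report, the first call picks controller-0 and the second lists controller-2 as the only host without a monitor container, which matches the host named in the final "no such container" error.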
Created attachment 1669060
ceph-ansible.tar.xz

Command, vars and logs from the ceph-ansible run.
According to the logs provided by Giulio, the controller-2 node isn't able to join the quorum after the RHCS 4 update.

2020-03-09 16:30:30,496 p=422404 u=root | TASK [container | waiting for the containerized monitor to join the quorum...] ***
2020-03-09 16:30:30,497 p=422404 u=root | task path: /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml:275
2020-03-09 16:30:30,497 p=422404 u=root | Monday 09 March 2020 16:30:30 +0000 (0:00:00.131) 0:06:08.088 **********
2020-03-09 16:30:31,021 p=422404 u=root | FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (5 retries left).
2020-03-09 16:30:46,406 p=422404 u=root | FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (4 retries left).
2020-03-09 16:31:01,829 p=422404 u=root | FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (3 retries left).
2020-03-09 16:31:17,190 p=422404 u=root | FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (2 retries left).
2020-03-09 16:31:32,624 p=422404 u=root | FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (1 retries left).
2020-03-09 16:31:48,029 p=422404 u=root | fatal: [controller-2]: FAILED! => changed=true
  attempts: 5
  cmd:
  - podman
  - exec
  - ceph-mon-controller-2
  - ceph
  - --cluster
  - ceph
  - -m
  - 172.17.3.20
  - -s
  - --format
  - json
  delta: '0:00:00.089607'
  end: '2020-03-09 16:31:47.996623'
  msg: non-zero return code
  rc: 125
  start: '2020-03-09 16:31:47.907016'
  stderr: 'Error: no container with name or ID ceph-mon-controller-2 found: no such container'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

Would it be possible to get the ceph-mon-controller-2 container logs (or the logs of the ceph-mon@controller-2 systemd service)?
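One way to gather the requested information is sketched below, assuming shell access to controller-2. The helper name is hypothetical; the container and unit names are the ones quoted in this comment. Since the error says the container no longer exists, `podman logs` may fail in the same way, in which case the systemd journal is the more likely source.

```python
import shlex


def mon_diagnostic_commands(host: str) -> list:
    """Build the commands to run on the monitor node (hypothetical helper).

    Nothing is executed here; the function only returns argument lists.
    """
    container = f"ceph-mon-{host}"
    unit = f"ceph-mon@{host}"
    return [
        # Show the container even if it has exited (-a includes stopped ones).
        ["podman", "ps", "-a", "--filter", f"name={container}"],
        # Container logs; fails if the container has been removed.
        ["podman", "logs", container],
        # Journal of the systemd unit that starts the monitor container.
        ["journalctl", "-u", unit, "--no-pager"],
        # Current state of the unit.
        ["systemctl", "status", unit, "--no-pager"],
    ]


commands = mon_diagnostic_commands("controller-2")
for cmd in commands:
    print(shlex.join(cmd))
```

The printed lines can be pasted into a shell on controller-2 and their output attached to this bug.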