Bug 1812238 - OSP16 ceph update failed: ID ceph-mon-controller-2 found: no such container
Summary: OSP16 ceph update failed: ID ceph-mon-controller-2 found: no such container
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 4.2
Assignee: Guillaume Abrioux
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks: 1760354
Reported: 2020-03-10 19:51 UTC by Sofer Athlan-Guyot
Modified: 2020-04-06 17:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-06 17:03:30 UTC
Embargoed:


Attachments
ceph-ansible.tar.xz (64.03 KB, application/x-xz)
2020-03-10 20:30 UTC, Giulio Fidente

Description Sofer Athlan-Guyot 2020-03-10 19:51:15 UTC
Description of problem: While doing an update from OSP16 to passed_phased2, we hit an error during the ceph update:

openstack overcloud external-update run \
--stack qe-Cloud-0 \
--tags ceph 2>&1

The last part of the error was:

2020-03-10 00:54:07 |         "ok: [controller-0] => (item={'cmd': ['podman', 'ps', '-q', '--filter', 'name=ceph-mon-controller-0'], 'stdout': '195d00c31463', 'stderr': '', 'rc': 0, 'start': '2020-03-10 00:53:42.546279', 'end': '2020-03-10 00:53:42.691936', 'delta': '0:00:00.145657', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'podman ps -q --filter name=ceph-mon-controller-0', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['195d00c31463'], 'stderr_lines': [], 'failed': False, 'failed_when_result': False, 'item': 'controller-0', 'ansible_loop_var': 'item'}) => changed=false ",
2020-03-10 00:54:07 |         "    delta: '0:00:00.145657'",
2020-03-10 00:54:07 |         "    end: '2020-03-10 00:53:42.691936'",
2020-03-10 00:54:07 |         "    start: '2020-03-10 00:53:42.546279'",
2020-03-10 00:54:07 |         "ok: [controller-0] => (item={'cmd': ['podman', 'ps', '-q', '--filter', 'name=ceph-mon-controller-1'], 'stdout': '129c7eb5764a', 'stderr': '', 'rc': 0, 'start': '2020-03-10 00:53:43.075757', 'end': '2020-03-10 00:53:43.204303', 'delta': '0:00:00.128546', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'podman ps -q --filter name=ceph-mon-controller-1', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['129c7eb5764a'], 'stderr_lines': [], 'failed': False, 'failed_when_result': False, 'item': 'controller-1', 'ansible_loop_var': 'item'}) => changed=false ",
2020-03-10 00:54:07 |         "    delta: '0:00:00.128546'",
2020-03-10 00:54:07 |         "    end: '2020-03-10 00:53:43.204303'",
2020-03-10 00:54:07 |         "    start: '2020-03-10 00:53:43.075757'",
2020-03-10 00:54:07 |         "skipping: [controller-0] => (item={'cmd': ['podman', 'ps', '-q', '--filter', 'name=ceph-mon-controller-2'], 'stdout': '', 'stderr': '', 'rc': 0, 'start': '2020-03-10 00:53:43.634250', 'end': '2020-03-10 00:53:43.773763', 'delta': '0:00:00.139513', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'podman ps -q --filter name=ceph-mon-controller-2', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [], 'stderr_lines': [], 'failed': False, 'failed_when_result': False, 'item': 'controller-2', 'ansible_loop_var': 'item'})  => changed=false ",
2020-03-10 00:54:07 |         "    delta: '0:00:00.139513'",
2020-03-10 00:54:07 |         "    end: '2020-03-10 00:53:43.773763'",
2020-03-10 00:54:07 |         "    start: '2020-03-10 00:53:43.634250'",
2020-03-10 00:54:07 |         "Tuesday 10 March 2020  00:53:44 +0000 (0:00:00.188)       0:02:34.048 ********* ",
2020-03-10 00:54:07 |         "Tuesday 10 March 2020  00:53:45 +0000 (0:00:00.303)       0:02:34.352 ********* ",
2020-03-10 00:54:07 |         "Tuesday 10 March 2020  00:53:45 +0000 (0:00:00.118)       0:02:34.470 ********* ",
2020-03-10 00:54:07 |         "      rc: 1",
2020-03-10 00:54:07 |         "Tuesday 10 March 2020  00:53:45 +0000 (0:00:00.181)       0:02:34.651 ********* ",
2020-03-10 00:54:07 |         "Tuesday 10 March 2020  00:53:45 +0000 (0:00:00.451)       0:02:35.103 ********* ",
2020-03-10 00:54:07 |         "FAILED - RETRYING: get current fsid (3 retries left).",
2020-03-10 00:54:07 |         "FAILED - RETRYING: get current fsid (2 retries left).",
2020-03-10 00:54:07 |         "FAILED - RETRYING: get current fsid (1 retries left).",
2020-03-10 00:54:07 |         "fatal: [controller-0 -> 192.168.24.47]: FAILED! => changed=true ",
2020-03-10 00:54:07 |         "  attempts: 3",
2020-03-10 00:54:07 |         "  - ceph-mon-controller-2",
2020-03-10 00:54:07 |         "  - --admin-daemon",
2020-03-10 00:54:07 |         "  - /var/run/ceph/ceph-mon.controller-2.asok",
2020-03-10 00:54:07 |         "  - config",
2020-03-10 00:54:07 |         "  - get",
2020-03-10 00:54:07 |         "  - fsid",
2020-03-10 00:54:07 |         "  delta: '0:00:00.101879'",
2020-03-10 00:54:07 |         "  end: '2020-03-10 00:54:02.860223'",
2020-03-10 00:54:07 |         "  rc: 125",
2020-03-10 00:54:07 |         "  start: '2020-03-10 00:54:02.758344'",
2020-03-10 00:54:07 |         "  stderr: 'Error: no container with name or ID ceph-mon-controller-2 found: no such container'",

The failure happened during:

TASK [select a running monitor] 
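
For readers following the log: the loop output above suggests that a monitor counts as running when `podman ps -q --filter name=ceph-mon-<host>` prints a container ID, and controller-2 is skipped because its stdout is empty. The following is a minimal Python sketch of that filter, fed with the three results copied from the log. It is not the ceph-ansible task itself; the function names `running_monitors` and `select_running_monitor` are hypothetical and exist only for illustration.

```python
# Results of `podman ps -q --filter name=ceph-mon-<host>`, copied from the log above.
podman_ps_results = [
    {"item": "controller-0", "rc": 0, "stdout": "195d00c31463"},
    {"item": "controller-1", "rc": 0, "stdout": "129c7eb5764a"},
    {"item": "controller-2", "rc": 0, "stdout": ""},
]


def running_monitors(results):
    """Return hosts whose monitor container is listed by `podman ps -q`."""
    return [r["item"] for r in results if r["rc"] == 0 and r["stdout"].strip()]


def select_running_monitor(results):
    """Pick the first host with a running monitor container, or None."""
    running = running_monitors(results)
    return running[0] if running else None


print(running_monitors(podman_ps_results))        # ['controller-0', 'controller-1']
print(select_running_monitor(podman_ps_results))  # controller-0
```

Under this reading, controller-2 should never be a candidate, yet the "get current fsid" command above still targets the ceph-mon-controller-2 container.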

Version-Release number of selected component (if applicable):

ceph-ansible.noarch                           4.0.14-1.el8cp                                  @rhelosp-ceph-4-tools

puddle: GA (RHOS_TRUNK-16.0-RHEL-8-20200204.n.1) to RHOS_TRUNK-16.0-RHEL-8-20200226.n.1

How reproducible: Only once. A previous run with the same ceph-ansible and the same puddle was successful, and another test went past ceph-update.

But it may be worth investigating.

Comment 2 Giulio Fidente 2020-03-10 20:30:41 UTC
Created attachment 1669060
ceph-ansible.tar.xz

command, vars and logs from ceph-ansible run

Comment 3 Dimitri Savineau 2020-03-13 16:39:19 UTC
According to the logs provided by Giulio, the controller-2 node isn't able to join the quorum after the RHCS 4 update.

2020-03-09 16:30:30,496 p=422404 u=root |  TASK [container | waiting for the containerized monitor to join the quorum...] ***
2020-03-09 16:30:30,497 p=422404 u=root |  task path: /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml:275
2020-03-09 16:30:30,497 p=422404 u=root |  Monday 09 March 2020  16:30:30 +0000 (0:00:00.131)       0:06:08.088 ********** 
2020-03-09 16:30:31,021 p=422404 u=root |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (5 retries left).
2020-03-09 16:30:46,406 p=422404 u=root |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (4 retries left).
2020-03-09 16:31:01,829 p=422404 u=root |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (3 retries left).
2020-03-09 16:31:17,190 p=422404 u=root |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (2 retries left).
2020-03-09 16:31:32,624 p=422404 u=root |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (1 retries left).
2020-03-09 16:31:48,029 p=422404 u=root |  fatal: [controller-2]: FAILED! => changed=true 
  attempts: 5
  cmd:
  - podman
  - exec
  - ceph-mon-controller-2
  - ceph
  - --cluster
  - ceph
  - -m
  - 172.17.3.20
  - -s
  - --format
  - json
  delta: '0:00:00.089607'
  end: '2020-03-09 16:31:47.996623'
  msg: non-zero return code
  rc: 125
  start: '2020-03-09 16:31:47.907016'
  stderr: 'Error: no container with name or ID ceph-mon-controller-2 found: no such container'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
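
Both failures in this report show rc: 125 with the same stderr. Per podman's documented exit-status convention, 125 indicates an error raised by podman itself rather than by the command run inside the container, so retrying the ceph command cannot succeed while the container is absent. The sketch below, assuming only the rc and stderr fields shown in the output above, separates that case from other failures; the helper name `missing_container` is hypothetical and is not part of ceph-ansible.

```python
import re

# Exit code podman uses when the error comes from podman itself.
PODMAN_ERROR_RC = 125
NO_SUCH_CONTAINER = re.compile(r"no container with name or ID (\S+) found")


def missing_container(rc, stderr):
    """Return the container name if podman reports it missing, else None."""
    if rc != PODMAN_ERROR_RC:
        return None
    match = NO_SUCH_CONTAINER.search(stderr)
    return match.group(1) if match else None


# stderr copied from the failed task output above.
stderr = "Error: no container with name or ID ceph-mon-controller-2 found: no such container"
print(missing_container(125, stderr))  # ceph-mon-controller-2
```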

Would it be possible to get the ceph-mon-controller-2 container logs (or the logs of the ceph-mon@controller-2 systemd service)?

