
Bug 1826067

Summary: ceph attempting to use non-working key while "waiting for the monitor(s) to form the quorum..."
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: John Fulton <johfulto>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED NOTABUG
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0
CC: aschoen, ceph-eng-bugs, gmeno, nthomas, ykaul
Target Milestone: rc
Target Release: 5.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-21 15:06:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1760354

Description John Fulton 2020-04-20 19:37:10 UTC
After a good deployment with ceph-ansible 4.0.18, ceph-ansible was re-run after changes were made in group_vars to add a new key [1]. Before the new key could be added, however, the task that checks whether the cluster is in quorum [2] failed [3] because it could not authenticate to the cluster.

The task was trying to use a key that does exist on disk [4], but running the same command directly on the host reproduces the exact same error with that key:

[root@central-controller0-0 ~]# podman exec ceph-mon-central-controller0-0 ceph --cluster central -n mon. -k /var/lib/ceph/mon/central-central-controller0-0/keyring mon_status
[errno 13] error connecting to the cluster
Error: non zero exit code: 13: OCI runtime error
[root@central-controller0-0 ~]# 

However, if I use the client.admin keyring, it works fine:

[root@central-controller0-0 ~]# podman exec ceph-mon-central-controller0-0 ceph --cluster central -n client.admin -k /etc/ceph/central.client.admin.keyring mon_status | jq . | head -2
{
  "name": "central-controller0-0",
[root@central-controller0-0 ~]# 

So why does ceph-ansible 4.0.18 use a key that doesn't work? I do not have this issue with ceph-ansible 4.0.14 on a very similar system [5].
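
Since client.admin authenticates fine, one way to narrow this down is to ask the cluster which mon. secret it actually holds and compare it with the on-disk monitor keyring shown in footnote [4]. A minimal diagnostic sketch (container and cluster names taken from this report; a mismatch between the two secrets would explain the errno 13 above):

# print the mon. key the cluster holds, authenticating with the working admin key
podman exec ceph-mon-central-controller0-0 \
  ceph --cluster central -n client.admin \
  -k /etc/ceph/central.client.admin.keyring auth get mon.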


Footnotes:

[1] Basically, the keys list was changed to include an additional key (technically it would be faster to just run a ceph auth command, but tripleo is triggering it this way and we have always relied on ceph-ansible's idempotence; a sketch of the direct command follows the second-run listing below).

First run:

keys:
-   caps:
        mgr: allow *
        mon: profile rbd
        osd: profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images
    key: AQBG2pleAAAAABAAall65F/X9osE02z4wmopHA==
    mode: '0600'
    name: client.openstack

Second run:

keys:
-   caps:
        mgr: allow *
        mon: profile rbd
        osd: profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images
    key: AQBG2pleAAAAABAAall65F/X9osE02z4wmopHA==
    mode: '0600'
    name: client.openstack
-   caps:
        mgr: allow *
        mon: profile rbd
        osd: profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images
    key: AQCFIppeAAAAABAAIh5cLiUj+eA6WCGyIA/DFQ==
    mode: '0600'
    name: client.external
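
For reference, the "just run a ceph auth command" alternative mentioned above would look roughly like the sketch below. This is a hedged example, not what tripleo actually runs; also note that get-or-create generates a new secret, whereas the group_vars entry pins an exact key value, which would instead require ceph auth import with a keyring file.

# create client.external with the same caps as the group_vars entry above
podman exec ceph-mon-central-controller0-0 \
  ceph --cluster central auth get-or-create client.external \
    mgr 'allow *' mon 'profile rbd' \
    osd 'profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images'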

[2] https://github.com/ceph/ceph-ansible/blob/v4.0.18/roles/ceph-mon/tasks/ceph_keys.yml#L2-L17


[3] 
2020-04-17 22:20:40,724 p=709422 u=root |  TASK [ceph-mon : include_tasks ceph_keys.yml] **********************************
2020-04-17 22:20:40,724 p=709422 u=root |  task path: /usr/share/ceph-ansible/roles/ceph-mon/tasks/main.yml:17
2020-04-17 22:20:40,724 p=709422 u=root |  Friday 17 April 2020  22:20:40 +0000 (0:00:01.166)       0:03:27.807 ********** 
2020-04-17 22:20:41,179 p=709422 u=root |  included: /usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml for central-controller0-0, central-controller0-1, central-controller0-2
2020-04-17 22:20:41,299 p=709422 u=root |  TASK [ceph-mon : waiting for the monitor(s) to form the quorum...] *************
2020-04-17 22:20:41,300 p=709422 u=root |  task path: /usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml:2
2020-04-17 22:20:41,300 p=709422 u=root |  Friday 17 April 2020  22:20:41 +0000 (0:00:00.575)       0:03:28.383 ********** 
2020-04-17 22:20:42,242 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (10 retries left).
2020-04-17 22:21:03,092 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (9 retries left).
2020-04-17 22:21:23,965 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (8 retries left).
2020-04-17 22:21:45,390 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (7 retries left).
2020-04-17 22:22:06,355 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (6 retries left).
2020-04-17 22:22:27,237 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (5 retries left).
2020-04-17 22:22:48,226 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (4 retries left).
2020-04-17 22:23:09,264 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (3 retries left).
2020-04-17 22:23:30,232 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (2 retries left).
2020-04-17 22:23:52,599 p=709422 u=root |  FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (1 retries left).
2020-04-17 22:24:13,535 p=709422 u=root |  fatal: [central-controller0-0]: FAILED! => changed=false 
  attempts: 10
  cmd:
  - podman
  - exec
  - ceph-mon-central-controller0-0
  - ceph
  - --cluster
  - central
  - -n
  - mon.
  - -k
  - /var/lib/ceph/mon/central-central-controller0-0/keyring
  - mon_status
  - --format
  - json
  delta: '0:00:00.601977'
  end: '2020-04-17 22:24:13.503790'
  msg: non-zero return code
  rc: 13
  start: '2020-04-17 22:24:12.901813'
  stderr: |-
    [errno 13] error connecting to the cluster
    Error: non zero exit code: 13: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-04-17 22:24:13,536 p=709422 u=root |  NO MORE HOSTS LEFT *************************************************************
2020-04-17 22:24:13,538 p=709422 u=root |  PLAY RECAP *********************************************************************
2020-04-17 22:24:13,538 p=709422 u=root |  central-computehci0-0      : ok=64   changed=2    unreachable=0    failed=0    skipped=140  rescued=0    ignored=0   
2020-04-17 22:24:13,539 p=709422 u=root |  central-computehci0-1      : ok=64   changed=2    unreachable=0    failed=0    skipped=140  rescued=0    ignored=0   
2020-04-17 22:24:13,539 p=709422 u=root |  central-computehci0-2      : ok=64   changed=2    unreachable=0    failed=0    skipped=140  rescued=0    ignored=0   
2020-04-17 22:24:13,539 p=709422 u=root |  central-controller0-0      : ok=117  changed=12   unreachable=0    failed=1    skipped=210  rescued=0    ignored=0   
2020-04-17 22:24:13,539 p=709422 u=root |  central-controller0-1      : ok=107  changed=9    unreachable=0    failed=0    skipped=198  rescued=0    ignored=0   
2020-04-17 22:24:13,539 p=709422 u=root |  central-controller0-2      : ok=107  changed=9    unreachable=0    failed=0    skipped=198  rescued=0    ignored=0   
2020-04-17 22:24:13,539 p=709422 u=root |  INSTALLER STATUS ***************************************************************
2020-04-17 22:24:13,543 p=709422 u=root |  Install Ceph Monitor           : In Progress (0:04:27)
2020-04-17 22:24:13,543 p=709422 u=root |  	This phase can be restarted by running: roles/ceph-mon/tasks/main.yml
2020-04-17 22:24:13,544 p=709422 u=root |  Friday 17 April 2020  22:24:13 +0000 (0:03:32.243)       0:07:00.627 ********** 
2020-04-17 22:24:13,544 p=709422 u=root |  =============================================================================== 
2020-04-17 22:24:13,544 p=709422 u=root |  ceph-mon : waiting for the monitor(s) to form the quorum... ----------- 212.24s
/usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml:2 ------------------
2020-04-17 22:24:13,544 p=709422 u=root |  gather and delegate facts ---------------------------------------------- 10.92s
/usr/share/ceph-ansible/site-container.yml.sample:34 --------------------------
2020-04-17 22:24:13,544 p=709422 u=root |  ceph-config : create ceph initial directories --------------------------- 4.70s
/usr/share/ceph-ansible/roles/ceph-config/tasks/create_ceph_initial_dirs.yml:2 
2020-04-17 22:24:13,544 p=709422 u=root |  gather facts ------------------------------------------------------------ 3.29s
/usr/share/ceph-ansible/site-container.yml.sample:29 --------------------------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-facts : set_fact _monitor_address to monitor_address_block ipv4 ---- 2.17s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/set_monitor_address.yml:2 ------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-facts : find a running mon container ------------------------------- 2.14s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/facts.yml:50 -------------------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-validate : read information about the devices ---------------------- 2.05s
/usr/share/ceph-ansible/roles/ceph-validate/tasks/check_devices.yml:59 --------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-facts : resolve device link(s) ------------------------------------- 1.99s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/facts.yml:208 ------------------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-validate : get devices information --------------------------------- 1.81s
/usr/share/ceph-ansible/roles/ceph-validate/tasks/check_devices.yml:80 --------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-infra : install chrony --------------------------------------------- 1.81s
/usr/share/ceph-ansible/roles/ceph-infra/tasks/setup_ntp.yml:15 ---------------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-infra : enable chronyd --------------------------------------------- 1.77s
/usr/share/ceph-ansible/roles/ceph-infra/tasks/setup_ntp.yml:46 ---------------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-config : generate central.conf configuration file ------------------ 1.76s
/usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:159 ------------------
2020-04-17 22:24:13,545 p=709422 u=root |  check for python -------------------------------------------------------- 1.72s
/usr/share/ceph-ansible/raw_install_python.yml:2 ------------------------------
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-container-common : pulling site-undercloud-0.ctlplane.localdomain:8787/ceph/ceph-4.0-rhel-8:latest image --- 1.70s
/usr/share/ceph-ansible/roles/ceph-container-common/tasks/fetch_image.yml:179 -
2020-04-17 22:24:13,545 p=709422 u=root |  ceph-facts : find a running mon container ------------------------------- 1.69s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/facts.yml:50 -------------------
2020-04-17 22:24:13,546 p=709422 u=root |  ceph-container-common : inspect ceph osd container ---------------------- 1.69s
/usr/share/ceph-ansible/roles/ceph-container-common/tasks/fetch_image.yml:12 --
2020-04-17 22:24:13,546 p=709422 u=root |  ceph-validate : resolve devices in devices ------------------------------ 1.66s
/usr/share/ceph-ansible/roles/ceph-validate/tasks/check_devices.yml:41 --------
2020-04-17 22:24:13,546 p=709422 u=root |  ceph-container-common : get ceph version -------------------------------- 1.63s
/usr/share/ceph-ansible/roles/ceph-container-common/tasks/main.yml:26 ---------
2020-04-17 22:24:13,546 p=709422 u=root |  ceph-mon : ceph monitor mkfs with keyring ------------------------------- 1.34s
/usr/share/ceph-ansible/roles/ceph-mon/tasks/deploy_monitors.yml:98 -----------
2020-04-17 22:24:13,546 p=709422 u=root |  ceph-mon : create monitor initial keyring ------------------------------- 1.33s
/usr/share/ceph-ansible/roles/ceph-mon/tasks/deploy_monitors.yml:29 -----------


[4]
[root@central-controller0-0 ~]# ls -l /var/lib/ceph/mon/central-central-controller0-0/keyring 
-rw-------. 1 167 167 77 Apr 17 22:20 /var/lib/ceph/mon/central-central-controller0-0/keyring
[root@central-controller0-0 ~]# cat /var/lib/ceph/mon/central-central-controller0-0/keyring 
[mon.]
        key = AQCtK5peAAAAABAALKgWUBknBPEd1q1XIxqxMQ==
        caps mon = "allow *"
[root@central-controller0-0 ~]# 

[5]
[root@control-plane-controller-0 ~]# podman exec ceph-mon-$HOSTNAME ceph --cluster central -n mon. -k /var/lib/ceph/mon/central-control-plane-controller-0/keyring mon_status 
{"name":"control-plane-controller-0","rank":0,"state":"leader","election_epoch":5,"quorum":[0],"quorum_age":434765,"features":{"required_con":"2449958747315912708","required_mon":["kraken","luminous","mimic","osdmap-prune","nautilus"],"quorum_con":"4611087854031667199","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus"]},"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"03c80ec0-43d6-4fc3-9579-01c3e280b56e","modified":"2020-04-13 21:32:41.999674","created":"2020-04-13 21:32:41.999674","min_mon_release":14,"min_mon_release_name":"nautilus","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus"],"optional":[]},"mons":[{"rank":0,"name":"control-plane-controller-0","public_addrs":{"addrvec":[{"type":"v2","addr":"172.16.11.5:3300","nonce":0},{"type":"v1","addr":"172.16.11.5:6789","nonce":0}]},"addr":"172.16.11.5:6789/0","public_addr":"172.16.11.5:6789/0"}]},"feature_map":{"mon":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":1}],"osd":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":1}],"client":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":2}],"mgr":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":1}]}}
[root@control-plane-controller-0 ~]# 

[root@control-plane-controller-0 ~]# cat /var/lib/ceph/mon/central-control-plane-controller-0/keyring
[mon.]
        key = AQB02pReAAAAABAAT6qrHsOAu3n3W+q6OzhBBA==
        caps mon = "allow *"
[root@control-plane-controller-0 ~]#

Comment 1 RHEL Program Management 2020-04-20 19:37:16 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 John Fulton 2020-04-20 20:05:30 UTC
On further inspection, I see evidence that the person I'm working with from OpenStack QE had done an earlier deployment on the same system and appears to have changed the cluster name since then. This might have caused ceph-ansible to behave differently (as if it were deploying a new cluster). The /etc/ceph listing below shows files for both cluster names.

I'm going to ask him to do a fresh install and ensure the name doesn't change. Setting needinfo for those results before we go further on this bug.

[root@central-controller0-0 ~]# ls -l /etc/ceph/
total 40
-r--------. 1  167  167  151 Apr 17 22:20 central.client.admin.keyring
-rw-r--r--. 1 root root 1149 Apr 17 22:20 central.conf
-r--------. 1 root root   77 Apr 17 22:20 central.mon.keyring
-r--------. 1  167  167  151 Apr 17 17:09 ceph.client.admin.keyring
-rw-------. 1  167  167  201 Apr 17 17:12 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 1149 Apr 17 17:09 ceph.conf
-r--------. 1  167  167  152 Apr 17 17:10 ceph.mgr.central-controller0-0.keyring
-r--------. 1  167  167  152 Apr 17 17:10 ceph.mgr.central-controller0-1.keyring
-r--------. 1  167  167  152 Apr 17 17:10 ceph.mgr.central-controller0-2.keyring
-r--------. 1 root root   77 Apr 17 17:09 ceph.mon.keyring
[root@central-controller0-0 ~]#
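
The listing shows configuration and keyrings for two cluster names side by side: ceph.* from the earlier deployment and central.* from the current one. A quick way to confirm they describe different clusters is to compare the fsid recorded in each conf file (a plain grep; both paths are taken from the listing above):

grep fsid /etc/ceph/ceph.conf /etc/ceph/central.conf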

Comment 3 John Fulton 2020-04-21 15:06:45 UTC
I confirm that an unsupported ceph cluster name change in the middle of the deployment is the root cause. User error, not a bug. Closing.