Bug 1826067
| Summary: | ceph attempting to use non-working key while "waiting for the monitor(s) to form the quorum..." | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | John Fulton <johfulto> |
| Component: | Ceph-Ansible | Assignee: | Guillaume Abrioux <gabrioux> |
| Status: | CLOSED NOTABUG | QA Contact: | Vasishta <vashastr> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.0 | CC: | aschoen, ceph-eng-bugs, gmeno, nthomas, ykaul |
| Target Milestone: | rc | | |
| Target Release: | 5.* | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-21 15:06:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1760354 | | |
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

On further inspection I see evidence that the person I'm working with from OpenStack QE had done an earlier deployment on the same system and that it looks like he's changed the cluster name. This might have caused ceph-ansible to behave differently (as if it were deploying a new cluster). I'm going to ask him to do a fresh install and ensure the name doesn't change. Setting needinfo for those results before we go further on this bug.

```
[root@central-controller0-0 ~]# ls -l /etc/ceph/
total 40
-r--------. 1 167 167 151 Apr 17 22:20 central.client.admin.keyring
-rw-r--r--. 1 root root 1149 Apr 17 22:20 central.conf
-r--------. 1 root root 77 Apr 17 22:20 central.mon.keyring
-r--------. 1 167 167 151 Apr 17 17:09 ceph.client.admin.keyring
-rw-------. 1 167 167 201 Apr 17 17:12 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 1149 Apr 17 17:09 ceph.conf
-r--------. 1 167 167 152 Apr 17 17:10 ceph.mgr.central-controller0-0.keyring
-r--------. 1 167 167 152 Apr 17 17:10 ceph.mgr.central-controller0-1.keyring
-r--------. 1 167 167 152 Apr 17 17:10 ceph.mgr.central-controller0-2.keyring
-r--------. 1 root root 77 Apr 17 17:09 ceph.mon.keyring
[root@central-controller0-0 ~]#
```

I confirm that an unsupported ceph cluster name change in the middle of the deployment is the root cause. User error. Not a bug. Closing.
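The mixed `ceph.*` and `central.*` files in the listing above are the tell-tale symptom. For anyone who wants to catch this before re-running ceph-ansible, here is a minimal pre-flight sketch. It is not part of ceph-ansible or TripleO; the `CLUSTER` variable and both checks are hypothetical and only encode the observation above (configs and mon directories named for a different cluster):

```bash
#!/bin/bash
# Hypothetical pre-flight check: warn if this node already holds Ceph state for
# a cluster name other than the one about to be (re)deployed.
# CLUSTER defaults to "central" only because that is the name used in this bug.
CLUSTER="${1:-central}"

shopt -s nullglob

# /etc/ceph/<cluster>.conf files left over from another cluster name are the
# symptom seen in the comment above (ceph.* sitting next to central.*).
for conf in /etc/ceph/*.conf; do
    name="$(basename "$conf" .conf)"
    if [ "$name" != "$CLUSTER" ]; then
        echo "WARNING: $conf belongs to cluster '$name', not '$CLUSTER'"
    fi
done

# Monitor data directories are named <cluster>-<hostname>.
for mondir in /var/lib/ceph/mon/*/; do
    case "$(basename "$mondir")" in
        "$CLUSTER"-*) ;;   # expected cluster name
        *) echo "WARNING: stale mon dir from another cluster: $mondir" ;;
    esac
done
```

If either warning fires, do a fresh install or keep the original cluster name, as noted above; changing the name mid-deployment is unsupported.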
After a good deployment with ceph-ansible 4.0.18, ceph-ansible was re-run after some changes were made in group_vars to add a new key [1]. Before the new key could be added, however, the task that checks whether the cluster is in quorum [2] failed [3] because it couldn't authenticate to the cluster. It was trying to use a key that exists [4], yet running the same command directly on the host I get the exact same error with that key:

```
[root@central-controller0-0 ~]# podman exec ceph-mon-central-controller0-0 ceph --cluster central -n mon. -k /var/lib/ceph/mon/central-central-controller0-0/keyring mon_status
[errno 13] error connecting to the cluster
Error: non zero exit code: 13: OCI runtime error
[root@central-controller0-0 ~]#
```

However, if I use the client.admin.keyring it works fine:

```
[root@central-controller0-0 ~]# podman exec ceph-mon-central-controller0-0 ceph --cluster central -n client.admin -k /etc/ceph/central.client.admin.keyring mon_status | jq . | head -2
{
  "name": "central-controller0-0",
[root@central-controller0-0 ~]#
```

So why does ceph-ansible 4.0.18 use a key which doesn't work? I do not have this issue if I use ceph-ansible 4.0.14 on a very similar system [5]. (A diagnostic sketch that compares the on-disk mon keyring with the key the cluster accepts is included after footnote [5].)

Footnotes:

[1] Basically the keys list was changed to have an additional key (technically it would be faster to just run ceph auth commands, but tripleo triggers it this way and we have always relied on ceph-ansible's idempotence).

First run:

```yaml
keys:
  - caps:
      mgr: allow *
      mon: profile rbd
      osd: profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images
    key: AQBG2pleAAAAABAAall65F/X9osE02z4wmopHA==
    mode: '0600'
    name: client.openstack
```

Second run:

```yaml
keys:
  - caps:
      mgr: allow *
      mon: profile rbd
      osd: profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images
    key: AQBG2pleAAAAABAAall65F/X9osE02z4wmopHA==
    mode: '0600'
    name: client.openstack
  - caps:
      mgr: allow *
      mon: profile rbd
      osd: profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images
    key: AQCFIppeAAAAABAAIh5cLiUj+eA6WCGyIA/DFQ==
    mode: '0600'
    name: client.external
```

[2] https://github.com/ceph/ceph-ansible/blob/v4.0.18/roles/ceph-mon/tasks/ceph_keys.yml#L2-L17

[3]

```
2020-04-17 22:20:40,724 p=709422 u=root | TASK [ceph-mon : include_tasks ceph_keys.yml] **********************************
2020-04-17 22:20:40,724 p=709422 u=root | task path: /usr/share/ceph-ansible/roles/ceph-mon/tasks/main.yml:17
2020-04-17 22:20:40,724 p=709422 u=root | Friday 17 April 2020 22:20:40 +0000 (0:00:01.166) 0:03:27.807 **********
2020-04-17 22:20:41,179 p=709422 u=root | included: /usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml for central-controller0-0, central-controller0-1, central-controller0-2
2020-04-17 22:20:41,299 p=709422 u=root | TASK [ceph-mon : waiting for the monitor(s) to form the quorum...] *************
2020-04-17 22:20:41,300 p=709422 u=root | task path: /usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml:2
2020-04-17 22:20:41,300 p=709422 u=root | Friday 17 April 2020 22:20:41 +0000 (0:00:00.575) 0:03:28.383 **********
2020-04-17 22:20:42,242 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (10 retries left).
2020-04-17 22:21:03,092 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (9 retries left).
2020-04-17 22:21:23,965 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (8 retries left).
2020-04-17 22:21:45,390 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (7 retries left).
2020-04-17 22:22:06,355 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (6 retries left).
2020-04-17 22:22:27,237 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (5 retries left).
2020-04-17 22:22:48,226 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (4 retries left).
2020-04-17 22:23:09,264 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (3 retries left).
2020-04-17 22:23:30,232 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (2 retries left).
2020-04-17 22:23:52,599 p=709422 u=root | FAILED - RETRYING: waiting for the monitor(s) to form the quorum... (1 retries left).
2020-04-17 22:24:13,535 p=709422 u=root | fatal: [central-controller0-0]: FAILED! => changed=false
  attempts: 10
  cmd:
  - podman
  - exec
  - ceph-mon-central-controller0-0
  - ceph
  - --cluster
  - central
  - -n
  - mon.
  - -k
  - /var/lib/ceph/mon/central-central-controller0-0/keyring
  - mon_status
  - --format
  - json
  delta: '0:00:00.601977'
  end: '2020-04-17 22:24:13.503790'
  msg: non-zero return code
  rc: 13
  start: '2020-04-17 22:24:12.901813'
  stderr: |-
    [errno 13] error connecting to the cluster
    Error: non zero exit code: 13: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-04-17 22:24:13,536 p=709422 u=root | NO MORE HOSTS LEFT *************************************************************
2020-04-17 22:24:13,538 p=709422 u=root | PLAY RECAP *********************************************************************
2020-04-17 22:24:13,538 p=709422 u=root | central-computehci0-0 : ok=64 changed=2 unreachable=0 failed=0 skipped=140 rescued=0 ignored=0
2020-04-17 22:24:13,539 p=709422 u=root | central-computehci0-1 : ok=64 changed=2 unreachable=0 failed=0 skipped=140 rescued=0 ignored=0
2020-04-17 22:24:13,539 p=709422 u=root | central-computehci0-2 : ok=64 changed=2 unreachable=0 failed=0 skipped=140 rescued=0 ignored=0
2020-04-17 22:24:13,539 p=709422 u=root | central-controller0-0 : ok=117 changed=12 unreachable=0 failed=1 skipped=210 rescued=0 ignored=0
2020-04-17 22:24:13,539 p=709422 u=root | central-controller0-1 : ok=107 changed=9 unreachable=0 failed=0 skipped=198 rescued=0 ignored=0
2020-04-17 22:24:13,539 p=709422 u=root | central-controller0-2 : ok=107 changed=9 unreachable=0 failed=0 skipped=198 rescued=0 ignored=0
2020-04-17 22:24:13,539 p=709422 u=root | INSTALLER STATUS ***************************************************************
2020-04-17 22:24:13,543 p=709422 u=root | Install Ceph Monitor : In Progress (0:04:27)
2020-04-17 22:24:13,543 p=709422 u=root | This phase can be restarted by running: roles/ceph-mon/tasks/main.yml
2020-04-17 22:24:13,544 p=709422 u=root | Friday 17 April 2020 22:24:13 +0000 (0:03:32.243) 0:07:00.627 **********
2020-04-17 22:24:13,544 p=709422 u=root | ===============================================================================
2020-04-17 22:24:13,544 p=709422 u=root | ceph-mon : waiting for the monitor(s) to form the quorum... ----------- 212.24s
/usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml:2 ------------------
2020-04-17 22:24:13,544 p=709422 u=root | gather and delegate facts ---------------------------------------------- 10.92s
/usr/share/ceph-ansible/site-container.yml.sample:34 --------------------------
2020-04-17 22:24:13,544 p=709422 u=root | ceph-config : create ceph initial directories --------------------------- 4.70s
/usr/share/ceph-ansible/roles/ceph-config/tasks/create_ceph_initial_dirs.yml:2
2020-04-17 22:24:13,544 p=709422 u=root | gather facts ------------------------------------------------------------ 3.29s
/usr/share/ceph-ansible/site-container.yml.sample:29 --------------------------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-facts : set_fact _monitor_address to monitor_address_block ipv4 ---- 2.17s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/set_monitor_address.yml:2 ------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-facts : find a running mon container ------------------------------- 2.14s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/facts.yml:50 -------------------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-validate : read information about the devices ---------------------- 2.05s
/usr/share/ceph-ansible/roles/ceph-validate/tasks/check_devices.yml:59 --------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-facts : resolve device link(s) ------------------------------------- 1.99s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/facts.yml:208 ------------------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-validate : get devices information --------------------------------- 1.81s
/usr/share/ceph-ansible/roles/ceph-validate/tasks/check_devices.yml:80 --------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-infra : install chrony --------------------------------------------- 1.81s
/usr/share/ceph-ansible/roles/ceph-infra/tasks/setup_ntp.yml:15 ---------------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-infra : enable chronyd --------------------------------------------- 1.77s
/usr/share/ceph-ansible/roles/ceph-infra/tasks/setup_ntp.yml:46 ---------------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-config : generate central.conf configuration file ------------------ 1.76s
/usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:159 ------------------
2020-04-17 22:24:13,545 p=709422 u=root | check for python -------------------------------------------------------- 1.72s
/usr/share/ceph-ansible/raw_install_python.yml:2 ------------------------------
2020-04-17 22:24:13,545 p=709422 u=root | ceph-container-common : pulling site-undercloud-0.ctlplane.localdomain:8787/ceph/ceph-4.0-rhel-8:latest image --- 1.70s
/usr/share/ceph-ansible/roles/ceph-container-common/tasks/fetch_image.yml:179 -
2020-04-17 22:24:13,545 p=709422 u=root | ceph-facts : find a running mon container ------------------------------- 1.69s
/usr/share/ceph-ansible/roles/ceph-facts/tasks/facts.yml:50 -------------------
2020-04-17 22:24:13,546 p=709422 u=root | ceph-container-common : inspect ceph osd container ---------------------- 1.69s
/usr/share/ceph-ansible/roles/ceph-container-common/tasks/fetch_image.yml:12 --
2020-04-17 22:24:13,546 p=709422 u=root | ceph-validate : resolve devices in devices ------------------------------ 1.66s
/usr/share/ceph-ansible/roles/ceph-validate/tasks/check_devices.yml:41 --------
2020-04-17 22:24:13,546 p=709422 u=root | ceph-container-common : get ceph version -------------------------------- 1.63s
/usr/share/ceph-ansible/roles/ceph-container-common/tasks/main.yml:26 ---------
2020-04-17 22:24:13,546 p=709422 u=root | ceph-mon : ceph monitor mkfs with keyring ------------------------------- 1.34s
/usr/share/ceph-ansible/roles/ceph-mon/tasks/deploy_monitors.yml:98 -----------
2020-04-17 22:24:13,546 p=709422 u=root | ceph-mon : create monitor initial keyring ------------------------------- 1.33s
/usr/share/ceph-ansible/roles/ceph-mon/tasks/deploy_monitors.yml:29 -----------
```

[4]

```
[root@central-controller0-0 ~]# ls -l /var/lib/ceph/mon/central-central-controller0-0/keyring
-rw-------. 1 167 167 77 Apr 17 22:20 /var/lib/ceph/mon/central-central-controller0-0/keyring
[root@central-controller0-0 ~]# cat /var/lib/ceph/mon/central-central-controller0-0/keyring
[mon.]
    key = AQCtK5peAAAAABAALKgWUBknBPEd1q1XIxqxMQ==
    caps mon = "allow *"
[root@central-controller0-0 ~]#
```

[5]

```
[root@control-plane-controller-0 ~]# podman exec ceph-mon-$HOSTNAME ceph --cluster central -n mon. -k /var/lib/ceph/mon/central-control-plane-controller-0/keyring mon_status
{"name":"control-plane-controller-0","rank":0,"state":"leader","election_epoch":5,"quorum":[0],"quorum_age":434765,"features":{"required_con":"2449958747315912708","required_mon":["kraken","luminous","mimic","osdmap-prune","nautilus"],"quorum_con":"4611087854031667199","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus"]},"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"03c80ec0-43d6-4fc3-9579-01c3e280b56e","modified":"2020-04-13 21:32:41.999674","created":"2020-04-13 21:32:41.999674","min_mon_release":14,"min_mon_release_name":"nautilus","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus"],"optional":[]},"mons":[{"rank":0,"name":"control-plane-controller-0","public_addrs":{"addrvec":[{"type":"v2","addr":"172.16.11.5:3300","nonce":0},{"type":"v1","addr":"172.16.11.5:6789","nonce":0}]},"addr":"172.16.11.5:6789/0","public_addr":"172.16.11.5:6789/0"}]},"feature_map":{"mon":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":1}],"osd":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":1}],"client":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":2}],"mgr":[{"features":"0x3ffddff8ffacffff","release":"luminous","num":1}]}}
[root@control-plane-controller-0 ~]#
[root@control-plane-controller-0 ~]# cat /var/lib/ceph/mon/central-control-plane-controller-0/keyring
[mon.]
    key = AQB02pReAAAAABAAT6qrHsOAu3n3W+q6OzhBBA==
    caps mon = "allow *"
[root@control-plane-controller-0 ~]#
```
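Diagnostic sketch referenced above: the failing task authenticates as `mon.` with the keyring under /var/lib/ceph/mon/<cluster>-<hostname>/keyring, so a quick way to tell a stale on-disk keyring (e.g. one left from a deployment under a different cluster name) apart from other causes of the errno 13 failure is to compare that key with the `mon.` key the cluster actually accepts, fetched through the client.admin identity that still works. This is only a sketch, not part of ceph-ansible; it assumes `ceph-authtool` is present in the mon container and that `ceph auth get-key mon.` returns the monitor key as it does on a healthy Nautilus cluster:

```bash
#!/bin/bash
# Hypothetical diagnostic (not part of ceph-ansible): compare the mon. key that
# the quorum check reads from disk with the mon. key registered in the cluster,
# fetched through the client.admin keyring that is known to work.
CLUSTER="central"
HOST="$(hostname -s)"
MON_CONTAINER="ceph-mon-${HOST}"
MON_KEYRING="/var/lib/ceph/mon/${CLUSTER}-${HOST}/keyring"

# Key the failing task passes via "-n mon. -k <keyring>".
disk_key="$(podman exec "$MON_CONTAINER" \
    ceph-authtool "$MON_KEYRING" -n mon. --print-key)"

# Key the monitors actually accept, retrieved with the admin keyring.
cluster_key="$(podman exec "$MON_CONTAINER" \
    ceph --cluster "$CLUSTER" -n client.admin \
         -k "/etc/ceph/${CLUSTER}.client.admin.keyring" \
         auth get-key mon.)"

if [ "$disk_key" != "$cluster_key" ]; then
    echo "mon. keyring on disk does NOT match the cluster: likely stale state from an earlier deployment"
else
    echo "mon. keyring matches the cluster: the auth failure has another cause"
fi
```

In this bug the comparison became moot once the mid-deployment cluster name change was confirmed, but it cleanly separates "wrong key on disk" from other reasons the quorum check cannot connect.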