Bug 1423459
Field | Value
---|---
Summary | [ceph-ansible]: "wait for ceph.client.admin.keyring exists" step fails during cluster creation using RHEL ISO
Product | [Red Hat Storage] Red Hat Ceph Storage
Component | Ceph-Ansible
Version | 3.0
Target Release | 3.0
Status | CLOSED WONTFIX
Severity | high
Priority | unspecified
Reporter | Tejas <tchandra>
Assignee | Andrew Schoen <aschoen>
QA Contact | ceph-qe-bugs <ceph-qe-bugs>
CC | adeza, anharris, aschoen, ceph-eng-bugs, gmeno, hnallurv, icolle, kdreyer, nthomas, sankarshan, seb, shmohan, tchandra, v.hoebel, vimishra
Target Milestone | rc
Keywords | Reopened
Hardware | Unspecified
OS | Linux
Type | Bug
Last Closed | 2017-07-06 13:23:46 UTC
Bug Blocks | 1412867
I'm looking into this and have found a couple of curious things in the systemd logs. The mon daemon is not running on any of the mon nodes. All three mon nodes have entries like:

    systemd[1]: [/usr/lib/systemd/system/ceph-mon@.service:24] Unknown lvalue 'TasksMax' in section 'Service'

Also, I see lots of messages from ceph-create-keys, which I suspect are there because the monitors did not come up:

    ceph-create-keys[27599]: INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'

Attempting to start the mons manually also fails. Looking in the systemd logs after the failure I see entries like:

    ceph-mon[7272]: monitor data directory at '/var/lib/ceph/mon/ceph-magna046' does not exist: have you run 'mkfs'?

Looking in /var/lib/ceph/mon shows that the mon data directory was created with the FQDN:

    [andrewschoen@magna046 ~]$ ls /var/lib/ceph/mon
    ceph-magna046.ceph.redhat.com

Looking at your Ansible node I see you've set 'mon_use_fqdn: True' in /usr/share/ceph-ansible/group_vars/all.yml. Did you have /usr/share/ceph-ansible/group_vars symlinked to /etc/ansible/group_vars when you did the initial deploy? I think this would work if we set 'mon_use_fqdn: False'. Could you try running this again with that change made?

Thanks,
Andrew

FWIW, the "TasksMax" systemd warning is just cosmetic, and might go away in RHEL 7.4 (bug 1337244).
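The mismatch Andrew describes can be spotted quickly on each mon node; a minimal sketch (the magna hostnames are the ones from this report):

    # Compare the mon data directory name with the --id the daemon uses.
    # ceph-ansible names the directory ceph-<id>: short hostname by default,
    # FQDN when mon_use_fqdn is true. A mismatch leaves ceph-mon unable to
    # find its data directory ("have you run 'mkfs'?").
    ls /var/lib/ceph/mon        # e.g. ceph-magna046.ceph.redhat.com
    hostname -s                 # short name: magna046
    hostname -f                 # FQDN: magna046.ceph.redhat.com
    ps -ef | grep '[c]eph-mon'  # check the --id the running daemon was given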
Hi Andrew,

I have now created the symlink to the group_vars folder. I did disable the mon_use_fqdn option and tried again, but I am still seeing the same error.

    [root@magna028 ~]# ll /var/lib/ceph/mon/ceph-magna028/
    keyring  store.db/

Also, I have created a cluster from scratch with repos, and with mon_use_fqdn set to true. Not sure how using the ISO would make a difference.

Thanks,
Tejas

Tejas, what's the error in the mon logs this time?

With FQDN disabled (#mon_use_fqdn: false), I see that the mons are up and running on all the nodes. ceph-create-keys is stuck on all mons and it is using the FQDN of the MONs for some reason.

    [root@magna028 ~]# systemctl status ceph-mon
    ● ceph-mon - Ceph cluster monitor daemon
       Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
       Active: active (running) since Sun 2017-02-19 12:21:05 UTC; 15h ago
     Main PID: 28914 (ceph-mon)
       CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon
               └─28914 /usr/bin/ceph-mon -f --cluster ceph --id magna028 --setuser ceph --setgroup ceph

    Feb 19 12:21:05 magna028 systemd[1]: Started Ceph cluster monitor daemon.
    Feb 19 12:21:05 magna028 systemd[1]: Starting Ceph cluster monitor daemon...
    Feb 19 12:21:07 magna028 ceph-mon[28914]: starting mon.magna028 rank 0 at 10.8.128.28:6789/0 mon_data /var/lib/ceph/mon/ceph-magna028 fsid 1f522450-2d00-4aeb-a...93cfe275d96
    Hint: Some lines were ellipsized, use -l to show in full.

    [root@magna028 ~]# ps -ef | grep ceph
    root  5141  3624 0 04:06 pts/0 00:00:00 grep --color=auto ceph
    root 25824     1 0 Feb18 ?     00:01:40 python /usr/sbin/ceph-create-keys --cluster ceph --id magna028.ceph.redhat.com   <--------- using fqdn
    ceph 28914     1 4 Feb19 ?     00:43:49 /usr/bin/ceph-mon -f --cluster ceph --id magna028 --setuser ceph --setgroup ceph

    [root@magna028 ~]# ceph -s
    2017-02-20 04:06:25.566853 7f62a80c8700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
    2017-02-20 04:06:28.567847 7f629eefc700  0 -- 10.8.128.28:0/975065679 >> 10.8.128.31:6789/0 pipe(0x7f628c000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f628c001f90).fault
    2017-02-20 04:06:37.568510 7f629effd700  0 -- 10.8.128.28:0/975065679 >> 10.8.128.31:6789/0 pipe(0x7f628c005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f628c002ba0).fault
    2017-02-20 04:06:52.569869 7f629eefc700  0 -- 10.8.128.28:0/975065679 >> 10.8.128.31:6789/0 pipe(0x7f628c000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f628c008e10).fault

    [root@magna031 ~]# ps -ef | grep ceph
    root  5410  2950  0 04:07 pts/0 00:00:00 grep --color=auto ceph
    root  5749     1  0 Feb18 ?     00:01:40 python /usr/sbin/ceph-create-keys --cluster ceph --id magna031.ceph.redhat.com
    root  5751     1  0 Feb18 ?     00:01:47 python /usr/sbin/ceph-create-keys --cluster ceph --id magna031
    ceph  8674     1 20 00:18 ?     00:47:42 /usr/bin/ceph-mon -f --cluster ceph --id magna031 --setuser ceph --setgroup ceph

    [root@magna046 ~]# ps -ef | grep ceph
    root 12022     1 0 Feb18 ?     00:01:39 python /usr/sbin/ceph-create-keys --cluster ceph --id magna046.ceph.redhat.com
    root 12208 12134 0 04:08 pts/0 00:00:00 grep --color=auto ceph
    ceph 26487     1 3 Feb19 ?     00:43:13 /usr/bin/ceph-mon -f --cluster ceph --id magna046 --setuser ceph --setgroup ceph

    [root@magna046 ~]# ceph -s
    2017-02-20 04:08:47.641372 7fa2463c8700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
    2017-02-20 04:08:50.642048 7fa244251700  0 -- 10.8.128.46:0/1547200244 >> 10.8.128.31:6789/0 pipe(0x7fa22c000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fa22c001f90).fault
    ^CTraceback (most recent call last):

Seeing logs like this in the leader mon:

    2017-02-20 05:17:44.077348 7f3da4362700  0 log_channel(cluster) log [INF] : mon.magna028@0 won leader election with quorum 0,2
    2017-02-20 05:17:54.084496 7f3da4362700  1 mon.magna028@0(leader).paxos(paxos recovering c 0..0) collect timeout, calling fresh election
    2017-02-20 05:17:54.440945 7f3da3b61700  0 log_channel(cluster) log [INF] : mon.magna028 calling new monitor election
    2017-02-20 05:17:54.441128 7f3da3b61700  1 mon.magna028@0(electing).elector(16346) init, last seen epoch 16346

magna031:

    2017-02-20 05:20:15.962260 7f336575a700  0 log_channel(cluster) log [INF] : mon.magna031 calling new monitor election
    2017-02-20 05:20:15.962366 7f336575a700  1 mon.magna031@0(electing).elector(16363) init, last seen epoch 16363
    2017-02-20 05:20:15.989957 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
    2017-02-20 05:20:15.990062 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
    2017-02-20 05:20:17.069571 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
    2017-02-20 05:20:17.069683 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
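The repeating "collect timeout, calling fresh election" entries are the signature of monitors that never settle into quorum, which is exactly what keeps ceph-create-keys waiting. A quick way to watch this from a mon node (a sketch using the admin socket, as already visible in the audit log above; substitute the real mon id):

    # Ask the local monitor for its view of the election/quorum state.
    # "state" should end up as leader or peon, and the quorum list should
    # contain every mon rank; a mon stuck in "electing" never gets there.
    ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok mon_status
    ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status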
The issue seen in this BZ seems to be an absolute one-off case. I have installed three other clusters on other machines with the ISO and fqdn set to true, and no issues were seen in any of those installations. Is there a possibility of a race condition?

Thanks,
Tejas

Hi Andrew, sorry about that, we had overlooked the interface. Also, the public network subnet was wrongly specified. We corrected those and ran again, but are still seeing the same. Could you please take a look now?

Thanks,
Tejas

Yes, we are trying it now. I am seeing the same behaviour when I tried using the Installer APIs too. It's the third setup where we have seen this intermittently. I will try the same on a fresh setup from scratch to try and narrow it down.

Thanks,
Tejas

Tejas, what's the latest status of this? Please try from scratch and give us access to the env if something goes wrong.

Tejas, any update here? Are we still seeing the same issue on a fresh environment?

Thanks,
Andrew

I am not able to repro this on a fresh setup. If we are not able to repro this today, I will close the bug.

Thanks,
Tejas

If QE cannot reproduce this, we expect this to be CLOSED/NOTABUG.

I am not able to repro this issue anymore, so closing this.

I hope it is ok to still comment here although this bug was already closed a couple of months ago. I am currently extensively testing Red Hat Ceph Storage 2 and am running into this issue. An installation through Red Hat Storage Console and through Ansible fails. In both cases, I can see the same log entries as described here.

When running the deployment directly with Ansible, the installation throws errors:

    fatal: [rh-ceph-mon01]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
    fatal: [rh-ceph-mon03]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
    fatal: [rh-ceph-mon02]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}

When running "ps aux | grep ceph" on those nodes, I can see that there is a process running for a couple of minutes:

    python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01

I modified group_vars/all.yml the following way:

    mon_use_fqdn: true

After re-running the deployment directly via Ansible, the installation passes without errors. Interesting fact, though: during the installation process I again ran "ps aux | grep ceph" on the MON nodes:

    [root@rh-ceph-mon01 ~]# ps aux | grep ceph
    root 3409 0.1 0.3 158956 7388 ? Ss 14:58 0:00 python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01
    root 3411 0.1 0.3 158956 7384 ? Ss 14:58 0:00 python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01.rhceph.test

This time, ceph-create-keys seems to have been started twice. After the Ansible deployment was completed, the process

    python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01

still ran. So this one "never" ends :(

Package versions on the MON nodes:

    [root@rh-ceph-mon01 ~]# rpm -qa |grep rpm
    rpm-libs-4.11.3-21.el7.x86_64
    rpm-python-4.11.3-21.el7.x86_64
    rpm-4.11.3-21.el7.x86_64
    rpm-build-libs-4.11.3-21.el7.x86_64

Package versions on the Ansible master:

    [root@rh-ceph-console ceph-ansible]# rpm -qa |grep ceph
    ceph-installer-1.2.2-1.el7scon.noarch
    ceph-ansible-2.1.9-1.el7scon.noarch
    rhscon-ceph-0.0.43-1.el7scon.x86_64

All nodes run RHEL 7.3.

Disclaimer: I am working for a Red Hat partner. I encountered other issues during installation and found a couple of errors within the official docs, but I guess this is not the right place to "report" them.
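For anyone hitting the same symptom, the hung process can be inspected directly; a minimal sketch (the hostnames are from Valentin's environment, substitute your own):

    # List any ceph-create-keys processes and the --id each one was given;
    # an instance whose --id does not match a reachable, in-quorum monitor
    # waits forever for /etc/ceph/ceph.client.admin.keyring to appear.
    ps aux | grep '[c]eph-create-keys'
    # Running it in the foreground shows what it is waiting on, e.g.
    # "INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'"
    python /usr/sbin/ceph-create-keys --cluster ceph --id $(hostname -s)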
Sorry, I can't edit my last comment. These are the correct packages from the MON nodes:

    [root@rh-ceph-mon01 ~]# rpm -qa |grep ceph
    python-cephfs-10.2.5-37.el7cp.x86_64
    ceph-base-10.2.5-37.el7cp.x86_64
    ceph-common-10.2.5-37.el7cp.x86_64
    ceph-selinux-10.2.5-37.el7cp.x86_64
    libcephfs1-10.2.5-37.el7cp.x86_64
    ceph-mon-10.2.5-37.el7cp.x86_64

Andrew, would you please work with Valentin to see if we can determine the cause?

Valentin,

The original issue you saw there is typically related to monitor communication issues. Could you please check on these few things for me?

- Can you include the contents of your configuration in group_vars? What do you have set in all.yml, mons.yml, etc.?
- Can you reach every monitor from all monitors? Can they ping each other and resolve each other's domain names?
- Do you have a firewall up? Are the ports of the monitors open/closed?
- Was the value you set for monitor_interface the same interface used for public_network?

There is also this troubleshooting guide which might be helpful: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

I'm not entirely sure why you'd see two ceph-create-keys processes going, but I do know that in earlier versions of ceph (including the one you're using) ceph-create-keys can hang indefinitely if the monitors can't be reached. I see a difference in the two commands as well: the --id value is different. The one that hung was 'rh-ceph-mon01' and the one that succeeded was 'rh-ceph-mon01.rhceph.test'. My guess is that the hostname 'rh-ceph-mon01' was not resolvable, which caused ceph-create-keys to run indefinitely trying to connect to the monitor.

Thanks,
Andrew
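The reachability part of that checklist can be run through compactly from each mon node; a sketch with hypothetical hostnames (6789 is the default mon port):

    # Check name resolution, ICMP reachability, and the mon TCP port for
    # every other monitor. Uses bash's built-in /dev/tcp so it needs no
    # extra tools on a minimal RHEL install.
    for h in rh-ceph-mon01 rh-ceph-mon02 rh-ceph-mon03; do
        getent hosts "$h" || echo "cannot resolve $h"
        ping -c1 -W2 "$h" >/dev/null && echo "$h pingable"
        timeout 2 bash -c "cat < /dev/null > /dev/tcp/$h/6789" \
            && echo "$h:6789 open" || echo "$h:6789 unreachable"
    done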
Hi Andrew,

thanks for your very fast reply, appreciate it. I changed the following within group_vars (I copied the sample yml files and changed the values mentioned below):

- all.yml:

    fetch_directory: ~/ceph-ansible-keys
    ceph_stable_rh_storage: true
    ceph_stable_rh_storage_cdn_install: true
    cephx: true
    monitor_interface: eth0
    journal_size: 5120    # OSD journal size in MB
    public_network: 172.28.160.0/24

- mons.yml:

    calamari: true

- osds.yml:

    journal_collocation: true

With these values, the installation throws an error at the keyring step. I then configured within all.yml:

    mon_use_fqdn: true

Then the installation passes without errors.

> Can you reach every monitor from all monitors? Can they ping each other and resolve each other's domain names?

Yes, no issues here.

> Do you have a firewall up? Are the ports of the monitors open/closed?

I have firewalld running and configured all ports according to the official documentation. However, even when I flush iptables there is no change at all. I think the firewall is not the issue here...

> Was the value you set for monitor_interface the same interface used for public_network?

All my VMs only have eth0. monitor_interface is set to eth0. So yes, public_network and monitor_interface are both running through eth0. Should I change that?

> There is also this troubleshooting guide which might be helpful:

Thanks! I checked everything from this guide. mon_status looks OK (all have quorum) and "ceph health detail" gives "HEALTH_OK". I think the cluster itself is fine.

Regarding the resolving: I think there was never a resolving issue. I regularly checked through salt (e.g. salt '*' cmd.run "ping <all other nodes>"). But if you say that "normally this issue should never occur and it is most likely some issue in your lab environment", I am completely fine with that. I just wanted to raise awareness that the issue from this bug report might still exist. If it is my lab setup doing bad things, though, one should consider making the deployment part more robust (e.g. let Ansible check DNS resolving or write to /etc/hosts, etc.). Again, I am quite sure that there is no resolving issue, but of course I know that one can never be 100% sure.

Please open a support case if you are still experiencing an issue.
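Valentin's robustness suggestion could look something like this as a pre-flight check run before the playbook (a hypothetical sketch; the rh-ceph-mon names stand in for real inventory hosts):

    # Fail fast if any monitor name (short and FQDN) does not resolve,
    # instead of letting ceph-create-keys hang for the 300-second timeout.
    for h in rh-ceph-mon01 rh-ceph-mon01.rhceph.test \
             rh-ceph-mon02 rh-ceph-mon02.rhceph.test \
             rh-ceph-mon03 rh-ceph-mon03.rhceph.test; do
        getent hosts "$h" >/dev/null || { echo "unresolvable: $h" >&2; exit 1; }
    done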
Created attachment 1251873 [details]
ansible playbook log

Description of problem:
I am trying to create a cluster on RHEL using the ISO. The client.admin keyring fails to generate.

    TASK [ceph-mon : wait for ceph.client.admin.keyring exists] ********************
    fatal: [magna031]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
    fatal: [magna046]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
    fatal: [magna028]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}

I have tried the same using repos on RHEL and it works.

Version-Release number of selected component (if applicable):
ceph-ansible-2.1.9-1.el7scon.noarch

How reproducible:
Always

Steps to Reproduce:
1. Try to create a ceph cluster using RHEL ISOs with ceph-ansible.

Additional info:
A temp keyring is generated:

    ~]# ll /etc/ceph/
    total 8
    -rw-------. 1 ceph ceph    0 Feb 17 11:20 ceph.client.admin.keyring.23511.tmp
    -rw-r--r--. 1 ceph ceph 1047 Feb 17 11:20 ceph.conf
    -rwxr-xr-x. 1 root root   92 Feb 13 23:09 rbdmap

Attaching the playbook log here. Please take a look at the setup, it's in the same state.
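For context on what the failing task waits for: ceph-ansible simply polls for /etc/ceph/ceph.client.admin.keyring, which ceph-create-keys writes once the monitors reach quorum. A rough manual equivalent, sketched from the jewel-era behaviour seen in the logs above (the capability set shown is an assumption based on that era's defaults):

    # Authenticate to the local monitor with its own mon. keyring and create
    # (or fetch) the admin key -- roughly what ceph-create-keys does once the
    # mons have quorum. Run on a mon node; the id must match the mon's --id.
    MON_ID=$(hostname -s)
    ceph --cluster ceph --name mon. \
         --keyring /var/lib/ceph/mon/ceph-${MON_ID}/keyring \
         auth get-or-create client.admin \
         mon 'allow *' osd 'allow *' mds 'allow'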