Bug 1423459 - [ceph-ansible]: "wait for ceph.client.admin.keyring exists" step fails during cluster creation using RHEL ISO
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: Andrew Schoen
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1412867
 
Reported: 2017-02-17 11:37 UTC by Tejas
Modified: 2020-04-15 15:18 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-06 13:23:46 UTC
Embargoed:


Attachments
ansible playbook log (28.00 KB, text/plain), 2017-02-17 11:37 UTC, Tejas

Description Tejas 2017-02-17 11:37:14 UTC
Created attachment 1251873
ansible playbook log

Description of problem:
     I am trying to create a cluster on RHEL using the ISO. The ceph.client.admin.keyring fails to be generated.

TASK [ceph-mon : wait for ceph.client.admin.keyring exists] ********************
fatal: [magna031]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [magna046]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [magna028]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}

I have tried the same using repos on RHEL and it works.
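
For reference, a minimal shell sketch of the checks that can be run on an affected mon node while this task is timing out (unit names and paths here assume the defaults seen on this setup):

# Is only the temporary keyring present?
ls -l /etc/ceph/

# Is the mon daemon actually up, and what do its logs say?
systemctl status ceph-mon@$(hostname -s)
journalctl -u ceph-mon@$(hostname -s) --since today

# Is ceph-create-keys still waiting for the monitors to reach quorum?
ps -ef | grep '[c]eph-create-keys'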


Version-Release number of selected component (if applicable):
ceph-ansible-2.1.9-1.el7scon.noarch

How reproducible:
Always

Steps to Reproduce:
1. Try to create a Ceph cluster on RHEL from the ISO using ceph-ansible.


Additional info:

A temp keyring is generated:

 ~]# ll /etc/ceph/
total 8
-rw-------. 1 ceph ceph    0 Feb 17 11:20 ceph.client.admin.keyring.23511.tmp
-rw-r--r--. 1 ceph ceph 1047 Feb 17 11:20 ceph.conf
-rwxr-xr-x. 1 root root   92 Feb 13 23:09 rbdmap

Attaching the playbook log here.
Please take a look at the setup; it's still in the same state.

Comment 4 Andrew Schoen 2017-02-17 15:40:54 UTC
I'm looking into this and have found a couple of curious things in the systemd logs. The mon daemon is not running on any of the mon nodes.

All three mon nodes have entries like: "systemd[1]: [/usr/lib/systemd/system/ceph-mon@.service:24] Unknown lvalue 'TasksMax' in section 'Service'"

Also, I see lots of messages from ceph-create-keys, which I suspect are there because the monitors did not come up: "ceph-create-keys[27599]: INFO:ceph-create-keys:ceph-mon is not in quorum: u'electing'"

Comment 5 Andrew Schoen 2017-02-17 15:58:38 UTC
Attempting to start the mons manually also fails. Looking in the systemd logs after the failure I see logs like: "ceph-mon[7272]: monitor data directory at '/var/lib/ceph/mon/ceph-magna046' does not exist: have you run 'mkfs'?"

Looking in /var/lib/ceph/mon shows that the mon data directory was created with the fqdn.

[andrewschoen@magna046 ~]$ ls /var/lib/ceph/mon
ceph-magna046.ceph.redhat.com

Looking at your ansible node I see you've set 'mon_use_fqdn: True' in /usr/share/ceph-ansible/group_vars/all.yml. Did you have /usr/share/ceph-ansible/group_vars symlinked to /etc/ansible/group_vars when you did the initial deploy?

I think this would work if we set 'mon_use_fqdn: False'. Could you try running this again with that change made?
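
(For illustration only, a small sketch of how to confirm the mismatch and flip the setting, using the paths from this setup; the sed line is just one way to make the edit.)

# Compare the on-disk mon data directory with the short and full hostnames
ls /var/lib/ceph/mon/
hostname -s; hostname -f

# Check and change the variable in the group_vars used for the deploy
grep -n '^mon_use_fqdn' /usr/share/ceph-ansible/group_vars/all.yml
sed -i 's/^mon_use_fqdn:.*/mon_use_fqdn: false/' /usr/share/ceph-ansible/group_vars/all.yml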

Thanks,
Andrew

Comment 6 Ken Dreyer (Red Hat) 2017-02-17 16:03:37 UTC
FWIW, the "TasksMax" systemd warning is just cosmetic, and might go away in RHEL 7.4 (bug 1337244)

Comment 7 Tejas 2017-02-18 17:20:23 UTC
Hi Andrew,

    I have now created the symlink to the group_vars folder.
I disabled the mon_use_fqdn option and tried again, but I am still seeing the same error.

@magna028 ~]# ll /var/lib/ceph/mon/ceph-magna028/
keyring   store.db/ 

Also, I have created a cluster from scratch with repos and with mon_use_fqdn set to true. Not sure how using the ISO would make a difference.

Thanks,
Tejas

Comment 8 seb 2017-02-18 20:18:22 UTC
Tejas, what's the error in the mon logs this time?

Comment 9 Tejas 2017-02-20 04:13:23 UTC
With FQDN disabled, I see that the mons are up and running on all the nodes.
#mon_use_fqdn: false

ceph-create-keys is stuck on all mons, and for some reason it is using the FQDN of the mons.

[root@magna028 ~]# systemctl status ceph-mon
● ceph-mon - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2017-02-19 12:21:05 UTC; 15h ago
 Main PID: 28914 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon
           └─28914 /usr/bin/ceph-mon -f --cluster ceph --id magna028 --setuser ceph --setgroup ceph

Feb 19 12:21:05 magna028 systemd[1]: Started Ceph cluster monitor daemon.
Feb 19 12:21:05 magna028 systemd[1]: Starting Ceph cluster monitor daemon...
Feb 19 12:21:07 magna028 ceph-mon[28914]: starting mon.magna028 rank 0 at 10.8.128.28:6789/0 mon_data /var/lib/ceph/mon/ceph-magna028 fsid 1f522450-2d00-4aeb-a...93cfe275d96
Hint: Some lines were ellipsized, use -l to show in full.
[root@magna028 ~]# 
[root@magna028 ~]# ps -ef | grep ceph
root      5141  3624  0 04:06 pts/0    00:00:00 grep --color=auto ceph
root     25824     1  0 Feb18 ?        00:01:40 python /usr/sbin/ceph-create-keys --cluster ceph --id magna028.ceph.redhat.com    <--------- using fqdn
ceph     28914     1  4 Feb19 ?        00:43:49 /usr/bin/ceph-mon -f --cluster ceph --id magna028 --setuser ceph --setgroup ceph
[root@magna028 ~]# 
[root@magna028 ~]# 
[root@magna028 ~]# ceph -s
2017-02-20 04:06:25.566853 7f62a80c8700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2017-02-20 04:06:28.567847 7f629eefc700  0 -- 10.8.128.28:0/975065679 >> 10.8.128.31:6789/0 pipe(0x7f628c000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f628c001f90).fault
2017-02-20 04:06:37.568510 7f629effd700  0 -- 10.8.128.28:0/975065679 >> 10.8.128.31:6789/0 pipe(0x7f628c005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f628c002ba0).fault
2017-02-20 04:06:52.569869 7f629eefc700  0 -- 10.8.128.28:0/975065679 >> 10.8.128.31:6789/0 pipe(0x7f628c000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f628c008e10).fault



[root@magna031 ~]# ps -ef | grep ceph
root      5410  2950  0 04:07 pts/0    00:00:00 grep --color=auto ceph
root      5749     1  0 Feb18 ?        00:01:40 python /usr/sbin/ceph-create-keys --cluster ceph --id magna031.ceph.redhat.com
root      5751     1  0 Feb18 ?        00:01:47 python /usr/sbin/ceph-create-keys --cluster ceph --id magna031
ceph      8674     1 20 00:18 ?        00:47:42 /usr/bin/ceph-mon -f --cluster ceph --id magna031 --setuser ceph --setgroup ceph


[root@magna046 ~]# ps -ef | grep ceph
root     12022     1  0 Feb18 ?        00:01:39 python /usr/sbin/ceph-create-keys --cluster ceph --id magna046.ceph.redhat.com
root     12208 12134  0 04:08 pts/0    00:00:00 grep --color=auto ceph
ceph     26487     1  3 Feb19 ?        00:43:13 /usr/bin/ceph-mon -f --cluster ceph --id magna046 --setuser ceph --setgroup ceph
[root@magna046 ~]# 
[root@magna046 ~]# 
[root@magna046 ~]# ceph -s
2017-02-20 04:08:47.641372 7fa2463c8700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2017-02-20 04:08:50.642048 7fa244251700  0 -- 10.8.128.46:0/1547200244 >> 10.8.128.31:6789/0 pipe(0x7fa22c000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fa22c001f90).fault
^CTraceback (most recent call last):

Comment 10 Tejas 2017-02-20 05:20:41 UTC
Seeing logs like this in the leader mon:

2017-02-20 05:17:44.077348 7f3da4362700  0 log_channel(cluster) log [INF] : mon.magna028@0 won leader election with quorum 0,2
2017-02-20 05:17:54.084496 7f3da4362700  1 mon.magna028@0(leader).paxos(paxos recovering c 0..0) collect timeout, calling fresh election
2017-02-20 05:17:54.440945 7f3da3b61700  0 log_channel(cluster) log [INF] : mon.magna028 calling new monitor election
2017-02-20 05:17:54.441128 7f3da3b61700  1 mon.magna028@0(electing).elector(16346) init, last seen epoch 16346


magna031:
2017-02-20 05:20:15.962260 7f336575a700  0 log_channel(cluster) log [INF] : mon.magna031 calling new monitor election
2017-02-20 05:20:15.962366 7f336575a700  1 mon.magna031@0(electing).elector(16363) init, last seen epoch 16363
2017-02-20 05:20:15.989957 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2017-02-20 05:20:15.990062 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2017-02-20 05:20:17.069571 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2017-02-20 05:20:17.069683 7f33682ca700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
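
Side note: since "ceph -s" needs the admin keyring that is missing here, the monitor admin socket is a keyring-free way to check the election/quorum state; a sketch assuming the default socket path and the short mon id shown above:

# Query the local monitor directly over its admin socket (no keyring needed)
ceph --admin-daemon /var/run/ceph/ceph-mon.magna028.asok mon_status
# or equivalently:
ceph daemon mon.magna028 mon_status
ceph daemon mon.magna028 quorum_status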

Comment 11 Tejas 2017-02-20 05:23:09 UTC
The issue seen in this BZ seems to be a one-off case.
I have installed three other clusters on other machines with the ISO and fqdn set to true, and no issues were seen in any of those installations.

Is there a possibility of any race condition?

Thanks,
Tejas

Comment 16 Tejas 2017-02-20 15:25:28 UTC
Hi Andrew,

    Sorry about that, we had overlooked the interface. Also, the public network subnet was wrongly specified.
We corrected those and ran again, but are still seeing the same failure. Could you please take a look now?

Thanks,
Tejas

Comment 18 Tejas 2017-02-20 17:11:17 UTC
Yes, we are trying it now.
I am seeing the same behaviour when using the Installer APIs too.
This is the third setup where we have seen this intermittently.
I will try the same on a fresh setup from scratch to narrow it down.

Thanks,
Tejas

Comment 20 seb 2017-02-20 20:00:24 UTC
Tejas, what's the latest status of this? Please try from scratch and give us access to the env if something goes wrong.

Comment 22 Andrew Schoen 2017-02-21 13:08:01 UTC
Tejas,

Any update here? Are we still seeing the same issue on a fresh environment?

Thanks,
Andrew

Comment 23 Tejas 2017-02-21 13:19:39 UTC
I am not able to reproduce this on a fresh setup.
If we are not able to reproduce it today, I will close the bug today.

Thanks,
Tejas

Comment 24 Ken Dreyer (Red Hat) 2017-02-21 16:36:52 UTC
If QE cannot reproduce this, we expect this to be CLOSED/NOTABUG.

Comment 25 Tejas 2017-02-22 09:35:22 UTC
I am not able to reproduce this issue anymore, so I am closing this.

Comment 26 Valentin Höbel 2017-06-01 13:11:37 UTC
I hope it is OK to comment here even though this bug was closed a couple of months ago.

I am currently extensively testing Red Hat Storage 2 (Ceph) and am running into this issue.

Installation through Red Hat Storage Console and through Ansible both fail. In both cases, I see the same log entries as described here.

When running the deployment directly with Ansible, the installation throws errors:

fatal: [rh-ceph-mon01]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [rh-ceph-mon03]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [rh-ceph-mon02]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}

When running "ps aux |grep ceph" on those nodes, I can see that there is a process running for a couple of minutes:

python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01


I modified group_vars/all.yml as follows:
mon_use_fqdn: true

After re-running the deployment directly via Ansible, the installation passes without errors.

Interesting fact, though: during the installation process I again ran "ps aux | grep ceph" on the MON nodes:
[root@rh-ceph-mon01 ~]# ps aux |grep ceph
root      3409  0.1  0.3 158956  7388 ?        Ss   14:58   0:00 python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01
root      3411  0.1  0.3 158956  7384 ?        Ss   14:58   0:00 python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01.rhceph.test


This time, the "ceph-create-keys" process appears to have been started twice.

After the Ansible deployment completed, the process
python /usr/sbin/ceph-create-keys --cluster ceph --id rh-ceph-mon01
was still running. So this one "never" ends :(
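
A quick way to see which identity each hanging ceph-create-keys instance got, and whether that name resolves locally (the host names below just mirror the ones above):

# Which --id did each instance get?
ps -eo pid,args | grep '[c]eph-create-keys'

# Do both the short name and the FQDN resolve on this node?
getent hosts rh-ceph-mon01
getent hosts rh-ceph-mon01.rhceph.test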

Package versions on the MON nodes:
[root@rh-ceph-mon01 ~]# rpm -qa |grep rpm
rpm-libs-4.11.3-21.el7.x86_64
rpm-python-4.11.3-21.el7.x86_64
rpm-4.11.3-21.el7.x86_64
rpm-build-libs-4.11.3-21.el7.x86_64


Package versions on the Ansible master:
[root@rh-ceph-console ceph-ansible]# rpm -qa |grep ceph
ceph-installer-1.2.2-1.el7scon.noarch
ceph-ansible-2.1.9-1.el7scon.noarch
rhscon-ceph-0.0.43-1.el7scon.x86_64


All nodes run RHEL 7.3.

Disclaimer: I am working for a Red Hat partner.



I encountered other issues during installation and found a couple of errors within the official docs, but I guess this is not the right place to "report" them.

Comment 27 Valentin Höbel 2017-06-01 13:13:20 UTC
Sorry, I can't edit my last comment. Those are the correct packages from the MON nodes:

[root@rh-ceph-mon01 ~]# rpm -qa |grep ceph
python-cephfs-10.2.5-37.el7cp.x86_64
ceph-base-10.2.5-37.el7cp.x86_64
ceph-common-10.2.5-37.el7cp.x86_64
ceph-selinux-10.2.5-37.el7cp.x86_64
libcephfs1-10.2.5-37.el7cp.x86_64
ceph-mon-10.2.5-37.el7cp.x86_64

Comment 28 Christina Meno 2017-06-01 16:35:32 UTC
Andrew, would you please work with Valentin to see if we can determine the cause?

Comment 31 Andrew Schoen 2017-06-01 18:34:47 UTC
Valentin,

The original issue you saw is typically related to monitor communication problems. Could you please check a few things for me?

- Can you include the contents of your configuration in group_vars? What do you have set in all.yml, mons.yml, etc.

- Can you reach every monitor from all the other monitors? Can they ping each other and resolve each other's domain names?

- Do you have a firewall up? Are the monitor ports open or closed?

- Is the value you set for monitor_interface the same interface used for public_network?

There is also this troubleshooting guide which might be helpful: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

I'm not entirely sure why you'd see two ceph-create-keys processes running, but I do know that in earlier versions of Ceph (including the one you're using) ceph-create-keys can hang indefinitely if the monitors can't be reached. I also see a difference in the two commands: the --id value differs. The one that hung was 'rh-ceph-mon01' and the one that succeeded was 'rh-ceph-mon01.rhceph.test'. My guess is that the hostname 'rh-ceph-mon01' was not resolvable, which caused ceph-create-keys to run indefinitely trying to connect to the monitor.
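
A rough sketch of how to test that hypothesis from each mon node; the port probe uses bash's built-in /dev/tcp so no extra tools are needed (host names are the examples from the comments above):

for h in rh-ceph-mon01 rh-ceph-mon02 rh-ceph-mon03; do
    getent hosts "$h" || echo "$h does not resolve"
    timeout 3 bash -c "</dev/tcp/$h/6789" \
        && echo "$h:6789 reachable" \
        || echo "$h:6789 NOT reachable"
done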

Thanks,
Andrew

Comment 32 Valentin Höbel 2017-06-02 15:00:59 UTC
Hi Andrew,

Thanks for your very fast reply; I appreciate it.

I changed the following within group_vars:
- all.yml:
fetch_directory: ~/ceph-ansible-keys
ceph_stable_rh_storage: true
ceph_stable_rh_storage_cdn_install: true
cephx: true
monitor_interface: eth0
journal_size: 5120 # OSD journal size in MB
public_network: 172.28.160.0/24

- mons.yml:
calamari: true

- osds.yml:
journal_collocation: true

(I copied the sample yml files and changed the values I mentioned above.)


With these values, the installation throws an error at the keyring step.
I then configured the following within all.yml:
mon_use_fqdn: true

Then the installation passes without errors.



> - can you reach every monitor from all monitors?
> Can they ping each other and resolve each others domain names?
Yes, no issues here.


> - do you have a firewall up? are the ports of the monitors open/closed?
I have firewalld running and configured all ports according to the official documentation.
However, even when I flush the iptables rules there is no change at all. I do not think the firewall is the issue here...
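
For completeness, a quick way to double-check what firewalld actually exposes for the monitor port (6789/tcp) on each mon node; this is only a sanity check, not a fix:

firewall-cmd --state
firewall-cmd --list-all
firewall-cmd --query-port=6789/tcp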


> - Was that value you set for monitor_interface the same interface used for
> public_network?
All my VMs only have eth0. monitor_interface is set to eth0.
So yes, public_network and monitor_interface are both running through eth0.
Should I change that?


> There is also this troubleshooting guide which might be helpful: 
Thanks! I checked everything from this guide. mon_status looks OK (all mons have quorum) and "ceph health detail" gives "HEALTH_OK".

I think the cluster itself is fine.


Regarding the resolving question: I do not think there was ever a resolving issue. I regularly checked via Salt (e.g. salt '*' cmd.run "ping <all other nodes>").

But if you say "normally this issue should never occur and it is most likely some issue in your lab environment", I am completely fine with that. I just wanted to raise awareness that the issue from this bug report might still exist. Even if it turns out to be my lab setup doing bad things, though, one should consider making the deployment more robust (e.g. have Ansible check DNS resolution or write to /etc/hosts, etc.).

Again, I am quite sure that there is no resolving issue, but of course one can never be 100% sure.

Comment 34 Drew Harris 2017-07-06 13:23:46 UTC
Please open a support case if you are still experiencing an issue.

