Bug 1544808
Summary: [ceph-container] - client.admin authentication error after rebooting all nodes together

Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Container
Assignee: Sébastien Han <shan>
Status: CLOSED CURRENTRELEASE
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact: Aron Gunn <agunn>
Priority: high
Version: 3.0
CC: agunn, dang, gmeno, hchen, hnallurv, jim.curtis, kdreyer, pprakash, seb, vashastr
Target Milestone: z1
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Rebooting all Ceph nodes simultaneously will cause an authentication error
When performing a simultaneous reboot of all the Ceph nodes in the storage cluster, a `client.admin` authentication error occurs when issuing any Ceph-related command from the command-line interface. To work around this issue, avoid rebooting all Ceph nodes simultaneously (an illustrative rolling-reboot sketch follows the metadata fields below).

Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-04 15:35:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1494421, 1544643
Attachments: Docker Monitor log (attachment 1395412)
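The Doc Text above recommends avoiding a simultaneous reboot; the sketch below illustrates one way to reboot the cluster node by node instead. It is a minimal, untested sketch: the node list, the cluster name `test`, passwordless SSH as root, and the `ceph-mon-<hostname>` container naming are all assumptions, not details from this bug.

```
#!/usr/bin/env bash
# Rolling-reboot sketch: reboot containerized Ceph nodes one at a time instead of all
# at once. Node names, cluster name, and container naming below are placeholders.
set -euo pipefail

NODES=(mon0 mon1 mon2 osd0 osd1)   # hypothetical node list
MON=mon0                           # any monitor node, used for health checks
CLUSTER=test                       # hypothetical cluster name

for node in "${NODES[@]}"; do
    echo "Rebooting ${node}"
    ssh "root@${node}" reboot || true          # SSH drops as the node goes down

    # Wait for the node to answer SSH again.
    until ssh -o ConnectTimeout=5 "root@${node}" true 2>/dev/null; do
        sleep 10
    done

    # Wait for the monitors to report a healthy cluster before touching the next node.
    # (A stricter check could parse quorum_status instead of grepping for HEALTH_OK.)
    until ssh "root@${MON}" \
        "docker exec ceph-mon-${MON} ceph --cluster ${CLUSTER} health" 2>/dev/null \
        | grep -q HEALTH_OK; do
        sleep 10
    done
done
```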

Already release noted, this needs to move to z3.

Looking at the code, this seems to be expected; however, if I can get an env with the error then I'll be able to validate my thoughts and potentially come up with a fix. Thanks in advance.

I think the real concern here is not someone rebooting the whole platform, but more a platform suffering a complete outage.

I can't reproduce it, see:

[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant status
Current machine states:

mon0    running (libvirt)
mon1    running (libvirt)
mon2    running (libvirt)
osd0    running (libvirt)
osd1    running (libvirt)

This environment represents multiple VMs. The VMs are all listed above with their current state. For more information about a specific VM, run `vagrant status NAME`.

[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant halt mon0 mon1 mon2
==> mon2: Halting domain...
==> mon1: Halting domain...
==> mon0: Halting domain...

[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant up mon0 mon1 mon2
Bringing machine 'mon0' up with 'libvirt' provider...
Bringing machine 'mon1' up with 'libvirt' provider...
Bringing machine 'mon2' up with 'libvirt' provider...
==> mon2: Starting domain.
==> mon2: Waiting for domain to get an IP address...
==> mon1: Starting domain.
==> mon2: Waiting for SSH to become available...
==> mon0: Starting domain.
==> mon1: Waiting for domain to get an IP address...
==> mon1: Waiting for SSH to become available...
==> mon0: Waiting for domain to get an IP address...
==> mon0: Waiting for SSH to become available...
==> mon1: Creating shared folders metadata...
==> mon2: Creating shared folders metadata...
==> mon0: Creating shared folders metadata...
==> mon0: Rsyncing folder: /home/leseb/ceph-ansible/tests/functional/centos/7/docker/ => /home/vagrant/sync
==> mon2: Rsyncing folder: /home/leseb/ceph-ansible/tests/functional/centos/7/docker/ => /home/vagrant/sync
==> mon1: Rsyncing folder: /home/leseb/ceph-ansible/tests/functional/centos/7/docker/ => /home/vagrant/sync

[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant ssh mon0
[vagrant@mon0 ~]$ sudo -i
-bash-4.2# docker ps
CONTAINER ID   IMAGE                                   COMMAND            CREATED              STATUS              PORTS   NAMES
e9e2ccfcfcda   docker.io/ceph/daemon:latest-luminous   "/entrypoint.sh"   About a minute ago   Up About a minute           ceph-mgr-mon0
96371e64ed3f   docker.io/ceph/daemon:latest-luminous   "/entrypoint.sh"   About a minute ago   Up About a minute           ceph-mon-mon0
-bash-4.2# docker exec ceph-mon-mon0 ceph --cluster test -s
  cluster:
    id:     141ca8ea-ed9c-4ad7-99f3-8ad26d069508
    health: HEALTH_WARN
            too few PGs per OSD (4 < min 30)

  services:
    mon: 3 daemons, quorum mon0,mon1,mon2
    mgr: mon0(active)
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   2 pools, 16 pgs
    objects: 0 objects, 0B
    usage:   4.01GiB used, 189GiB / 193GiB avail
    pgs:     16 active+clean

No issue, all the machines went down and started at the same time, so this simulated a complete outage. I'm closing this as CURRENTRELEASE, feel free to re-open if you have any concerns. Thanks.
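A minimal sketch of scripting the same check, in case the halt/up cycle needs to be repeated while hunting for the error. The machine names, the cluster name `test`, and the container name `ceph-mon-mon0` come from the session above; the 60-second wait is an arbitrary assumption.

```
#!/usr/bin/env bash
# Sketch of the reproduction attempt above: halt all three monitor VMs at once,
# bring them back, and confirm the containerized mons re-form quorum and `ceph -s` works.
# Assumes the ceph-ansible tests/functional/centos/7/docker Vagrant environment.
set -euo pipefail

MONS="mon0 mon1 mon2"

vagrant halt ${MONS}        # simulate a complete outage of the monitor nodes
vagrant up ${MONS}

# Give the mon containers a moment to start, then check cluster status from mon0.
sleep 60
vagrant ssh mon0 -c "sudo docker exec ceph-mon-mon0 ceph --cluster test -s"
```

If the bug reproduced, the final `ceph -s` would hang and then fail with the same `client.admin` authentication error reported in this bug.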
Created attachment 1395412 [details]
Docker Monitor log

Description of problem:
After initializing a containerized cluster and rebooting all nodes, the services are all up, but Ceph CLI operations fail with a client.admin authentication error after a connection timeout.

Version-Release number of selected component (if applicable):
ceph-3.0-rhel-7-docker-candidate-43629-20180208163916
ceph-ansible-3.0.24-1.el7cp.noarch

How reproducible:
Always

Steps to Reproduce:
1. Initialize a containerized cluster
2. Reboot all the nodes together

Actual results:
$ sudo docker exec ceph-mon-magna103 ceph -s --cluster 3OZ1
2018-02-13 13:58:58.315008 7f1d63bd3700  0 monclient(hunting): authenticate timed out after 300
2018-02-13 13:58:58.315038 7f1d63bd3700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster

Expected results:
No authentication error

Additional info:
Basic network configuration troubleshooting has been done, and the network looked good.
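When `ceph -s` times out like this, the monitor can still be queried through its local admin socket, which does not go through cephx authentication; that quickly shows whether the monitors ever re-formed quorum after the reboot. A minimal diagnostic sketch: the socket path, daemon name, and keyring path are inferred from the node and cluster names in this report and are not confirmed.

```
# Diagnostic sketch; container name, cluster name, and paths are inferred from this
# report (node magna103, cluster 3OZ1) and may differ on a real system.

# Ask the monitor for its own status via the admin socket (no cephx involved).
# "state" should be leader or peon and "quorum" should list all monitors; if quorum
# was never re-formed after the reboot, client commands time out as shown above.
sudo docker exec ceph-mon-magna103 \
    ceph --admin-daemon /var/run/ceph/3OZ1-mon.magna103.asok mon_status

# If quorum looks fine, verify that the admin keyring the CLI is using exists and
# matches the cluster; a missing or stale keyring also yields
# "client.admin authentication error".
sudo cat /etc/ceph/3OZ1.client.admin.keyring
```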