Bug 1544808 - [ceph-container] - client.admin authentication error after rebooting all nodes together
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Container
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 3.1
Assignee: Sébastien Han
QA Contact: Vasishta
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1494421 1544643
 
Reported: 2018-02-13 14:45 UTC by Vasishta
Modified: 2018-10-08 12:25 UTC
CC: 10 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Rebooting all Ceph nodes simultaneously will cause an authentication error
When performing a simultaneous reboot of all the Ceph nodes in the storage cluster, a `client.admin` authentication error will occur when issuing any Ceph-related commands from the command-line interface. To work around this issue, avoid rebooting all Ceph nodes simultaneously (an example node-by-node reboot sequence is sketched below).
Clone Of:
Environment:
Last Closed: 2018-10-04 15:35:42 UTC
Embargoed:
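
As an illustration of the workaround in the Doc Text above, a node-by-node reboot sequence might look like the following sketch. The cluster name 3OZ1 and the host magna103 are taken from this report; the second hostname and the "wait for health" step are assumptions, not a documented procedure.

$ ssh magna103 sudo reboot
# wait for the node to come back, then confirm the monitor container responds
$ sudo docker exec ceph-mon-magna103 ceph -s --cluster 3OZ1
# only reboot the next node once "ceph -s" answers and the cluster is healthy
$ ssh magna104 sudo reboot    # magna104 is a hypothetical second node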


Attachments (Terms of Use)
Docker Monitor log (31.26 KB, text/plain)
2018-02-13 14:45 UTC, Vasishta

Description Vasishta 2018-02-13 14:45:48 UTC
Created attachment 1395412 [details]
Docker Monitor log

Description of problem:
After initializing a containerized cluster and rebooting all nodes together, the services all come back up, but Ceph CLI operations fail with a client.admin authentication error after the connection times out.


Version-Release number of selected component (if applicable):
ceph-3.0-rhel-7-docker-candidate-43629-20180208163916
ceph-ansible-3.0.24-1.el7cp.noarch

How reproducible:
Always

Steps to Reproduce:
1. Initialize a containerized cluster 
2. Reboot all nodes simultaneously
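
A rough sketch of these steps, assuming a ceph-ansible containerized deployment; the playbook name site-docker.yml and the inventory file are assumptions based on the ceph-ansible version listed above, not confirmed from this report:

# 1. Deploy the containerized cluster with ceph-ansible
$ ansible-playbook -i hosts site-docker.yml
# 2. Reboot all nodes at (roughly) the same time
$ ansible all -i hosts -b -m shell -a "reboot"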

Actual results:
$ sudo docker exec ceph-mon-magna103 ceph -s --cluster 3OZ1
2018-02-13 13:58:58.315008 7f1d63bd3700  0 monclient(hunting): authenticate timed out after 300
2018-02-13 13:58:58.315038 7f1d63bd3700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster


Expected results:
No authentication error

Additional info:
Basic network configuration troubleshooting was performed and everything looked good.
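
For context, a sketch of the kind of checks "basic network configuration troubleshooting" might cover here; the exact commands used are not recorded in this report:

# confirm the monitor container is running on each monitor node
$ sudo docker ps
# inspect the monitor container log for auth/quorum messages
$ sudo docker logs ceph-mon-magna103
# confirm the monitor port is listening
$ sudo ss -tlnp | grep 6789
# confirm clocks are in sync (clock skew can cause cephx authentication failures)
$ chronyc sources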

Comment 5 Christina Meno 2018-03-13 19:54:28 UTC
This has already been release noted; it needs to move to z3.

Comment 6 Sébastien Han 2018-04-04 12:38:53 UTC
Looking at the code, this seems to be expected. However, if I can get an environment showing the error, I'll be able to validate my thoughts and potentially come up with a fix.

Thanks in advance.

Comment 8 seb 2018-07-25 13:18:31 UTC
I think the real concern here is not someone rebooting the whole platform, but rather a platform suffering a complete outage.

Comment 9 Sébastien Han 2018-10-04 15:35:42 UTC
I can't reproduce it, see:


[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant status
Current machine states:

mon0                      running (libvirt)
mon1                      running (libvirt)
mon2                      running (libvirt)
osd0                      running (libvirt)
osd1                      running (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant halt mon0 mon1 mon2
==> mon2: Halting domain...
==> mon1: Halting domain...
==> mon0: Halting domain...
[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant up mon0 mon1 mon2
Bringing machine 'mon0' up with 'libvirt' provider...
Bringing machine 'mon1' up with 'libvirt' provider...
Bringing machine 'mon2' up with 'libvirt' provider...
==> mon2: Starting domain.
==> mon2: Waiting for domain to get an IP address...
==> mon1: Starting domain.
==> mon2: Waiting for SSH to become available...
==> mon0: Starting domain.
==> mon1: Waiting for domain to get an IP address...
==> mon1: Waiting for SSH to become available...
==> mon0: Waiting for domain to get an IP address...
==> mon0: Waiting for SSH to become available...
==> mon1: Creating shared folders metadata...
==> mon2: Creating shared folders metadata...
==> mon0: Creating shared folders metadata...
==> mon0: Rsyncing folder: /home/leseb/ceph-ansible/tests/functional/centos/7/docker/ => /home/vagrant/sync
==> mon2: Rsyncing folder: /home/leseb/ceph-ansible/tests/functional/centos/7/docker/ => /home/vagrant/sync
==> mon1: Rsyncing folder: /home/leseb/ceph-ansible/tests/functional/centos/7/docker/ => /home/vagrant/sync

[leseb@tarox~/ceph-ansible/tests/functional/centos/7/docker][ceph-volume-container !] vagrant ssh mon0


[vagrant@mon0 ~]$
[vagrant@mon0 ~]$
[vagrant@mon0 ~]$ sudo -i
-bash-4.2# docker ps
CONTAINER ID        IMAGE                                   COMMAND             CREATED              STATUS              PORTS               NAMES
e9e2ccfcfcda        docker.io/ceph/daemon:latest-luminous   "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-mgr-mon0
96371e64ed3f        docker.io/ceph/daemon:latest-luminous   "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-mon-mon0
-bash-4.2# docker exec ceph-mon-mon0 ceph --cluster test -s
  cluster:
    id:     141ca8ea-ed9c-4ad7-99f3-8ad26d069508
    health: HEALTH_WARN
            too few PGs per OSD (4 < min 30)

  services:
    mon: 3 daemons, quorum mon0,mon1,mon2
    mgr: mon0(active)
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   2 pools, 16 pgs
    objects: 0 objects, 0B
    usage:   4.01GiB used, 189GiB / 193GiB avail
    pgs:     16 active+clean

-bash-4.2#


No issue; all the machines went down and came back up at the same time, so this effectively simulated a complete outage.

I'm closing this as CURRENTRELEASE; feel free to re-open if you have any concerns.
Thanks.

