Description of problem:

With the latest ceph-ansible, the OSDs are not coming up:

2016-11-09T16:17:16.086 INFO:teuthology.orchestra.run.pluto007:Running: 'sudo ceph -s'
2016-11-09T16:17:16.339 INFO:teuthology.orchestra.run.pluto007.stdout:     cluster f74830b0-c6f2-4a0f-8246-4a6e0d4362e2
2016-11-09T16:17:16.373 INFO:teuthology.orchestra.run.pluto007.stdout:      health HEALTH_ERR
2016-11-09T16:17:16.375 INFO:teuthology.orchestra.run.pluto007.stdout:             320 pgs are stuck inactive for more than 300 seconds
2016-11-09T16:17:16.415 INFO:teuthology.orchestra.run.pluto007.stdout:             320 pgs stuck inactive
2016-11-09T16:17:16.417 INFO:teuthology.orchestra.run.pluto007.stdout:             no osds
2016-11-09T16:17:16.457 INFO:teuthology.orchestra.run.pluto007.stdout:      monmap e1: 2 mons at {clara005=10.8.129.5:6789/0,pluto007=10.8.129.107:6789/0}
2016-11-09T16:17:16.459 INFO:teuthology.orchestra.run.pluto007.stdout:             election epoch 4, quorum 0,1 clara005,pluto007
2016-11-09T16:17:16.499 INFO:teuthology.orchestra.run.pluto007.stdout:       fsmap e3: 0/0/1 up
2016-11-09T16:17:16.501 INFO:teuthology.orchestra.run.pluto007.stdout:      osdmap e4: 0 osds: 0 up, 0 in
2016-11-09T16:17:16.541 INFO:teuthology.orchestra.run.pluto007.stdout:             flags sortbitwise
2016-11-09T16:17:16.543 INFO:teuthology.orchestra.run.pluto007.stdout:       pgmap v5: 320 pgs, 3 pools, 0 bytes data, 0 objects
2016-11-09T16:17:16.583 INFO:teuthology.orchestra.run.pluto007.stdout:             0 kB used, 0 kB / 0 kB avail
2016-11-09T16:17:16.584 INFO:teuthology.orchestra.run.pluto007.stdout:                  320 creating
2016-11-09T16:17:16.625 INFO:teuthology.task.ceph_ansible:Waiting for Ceph health to reach HEALTH_OK

Full logs: http://magna002.ceph.redhat.com/vasu-2016-11-09_16:04:17-smoke-jewel---basic-multi/259550/teuthology.log
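For anyone reproducing this outside teuthology, the check the ceph_ansible task performs can be approximated by polling the cluster health from a mon node. A minimal shell sketch (the retry count and sleep interval here are illustrative, not taken from the test):

    # Poll cluster health for up to ~5 minutes; in this failure it never
    # leaves HEALTH_ERR because no OSD daemon ever registers.
    for i in $(seq 1 60); do
        status=$(sudo ceph health)
        echo "attempt $i: $status"
        [ "$status" = "HEALTH_OK" ] && break
        sleep 5
    done
    # Should list the OSDs; empty output confirms none came up.
    sudo ceph osd tree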
Hi,

My observation is that this issue was not consistent across the rolling_update tests I ran yesterday.

Build used: 1.0.5-42

Thanks,
Tejas
Workaround that worked:

The cause here is that, in some cases, one of the OSD daemons on a node fails to come back up after the OSD restart. I manually rebooted the OSD node, the OSD daemon came back up, and only then did the cluster return to HEALTH_OK. It is a very destructive workaround, though.

Thanks,
Tejas
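Before falling back to a full node reboot, it may be enough to restart just the stuck OSD unit. A hedged sketch, assuming systemd-managed OSDs on RHEL 7 (OSD id 0 is only an example):

    # On the affected OSD node; OSD id 0 is only an example.
    sudo systemctl status ceph-osd@0                    # failed / inactive?
    sudo journalctl -u ceph-osd@0 --since "1 hour ago"  # why it did not start
    sudo systemctl restart ceph-osd@0                   # restart just this daemon
    # From a node that has the admin keyring, confirm the OSD is back up:
    sudo ceph osd tree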
With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using ceph-installer), my OSDs do not come up because they cannot find a keyring.
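For what it's worth, a quick way to tell whether a failure like this is really a missing keyring is to check the expected keyring paths on the OSD node. A rough sketch, assuming the default cluster name "ceph" and stock jewel paths:

    # On the OSD node: do the expected keyrings exist?
    ls -l /var/lib/ceph/bootstrap-osd/ceph.keyring   # used to register new OSDs
    ls -l /etc/ceph/ceph.client.admin.keyring        # admin key (not pushed to OSD nodes by ceph-installer)
    # From a node that does have the admin key, check the bootstrap-osd entity:
    sudo ceph auth get client.bootstrap-osd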
(In reply to Ken Dreyer (Red Hat) from comment #5)
> With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using
> ceph-installer), my OSDs do not come up because they cannot find a keyring.

This was a red-herring problem with my own test environment (I didn't realize ceph-installer does not actually put the client.admin keyring on the OSDs). My bad.

The real error I'm seeing with ceph-ansible-1.0.5-42 is the same one reported in Vasu's bug description above, and in bug 1393684 as well: "HEALTH_ERR" and "no osds".

ceph-ansible-1.0.5-43 reverts the latest dmcrypt changes, so this regression bug should now be fixed in that build. This is in today's compose, RHSCON-2.0-RHEL-7-20161110.t.0.
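To be sure a setup has picked up the fixed build before redeploying, checking the installed package is enough. A rough sketch; the playbook path and inventory below are the ceph-ansible RPM defaults and may differ per environment:

    # On the node driving the deployment:
    rpm -q ceph-ansible                       # expect 1.0.5-43 or later
    cd /usr/share/ceph-ansible
    ansible-playbook -i /etc/ansible/hosts site.yml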
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2817