Description of problem:
After bringing up a Ceph cluster, ceph health shows:

HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

On the OSD node, ps shows that no OSD daemons are running. To get out of this state, I had to run:

/etc/init.d/ceph stop
/etc/init.d/ceph start

Note: running just the start appeared to work but did not get me out of this state.

Version-Release number of selected component (if applicable):
1.2.3 for certain; I believe I have also seen this in 1.3.

How reproducible:
100% of the last few times that I tried this.

Steps to Reproduce:
1. Bring up a cluster using ceph-deploy.
2. On the monitor node, run: sudo ceph health
3. Note the HEALTH_ERR.
4. On the OSDs, run: /etc/init.d/ceph stop; /etc/init.d/ceph start
5. Watch the health of the Ceph cluster improve.

Actual results:
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

Expected results:
HEALTH_OK

Additional info:
Executing the ceph stop and start on the OSDs is a workaround for this problem.
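The stop-then-start workaround above can be sketched as a small dry-run script. This is a hypothetical helper, not part of the report: it only prints the commands so the sequence can be inspected, and the echo would be dropped to actually run them on an OSD node.

```shell
#!/bin/sh
# Dry-run sketch of the workaround: stop and then start the Ceph init
# script on an OSD node. A bare "start" did not clear the error state,
# so the stop must come first. Commands are printed, not executed.
osd_restart_cmds() {
    echo "/etc/init.d/ceph stop"
    echo "/etc/init.d/ceph start"
}

osd_restart_cmds
```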
I am elevating this to high priority. It is still happening on RHCS 2.3 on John Harrigan's BAGL cluster, and I have seen it in the scale lab at times too. I should have filed a bz for it sooner. Is this part of QE regression testing?

This problem causes massive data movement at best and data unavailability at worst, and would disrupt any application running on the cluster. My workaround is:

for d in /dev/sd[b-z] ; do ceph-disk activate ${d}1 ; done

What is weird is that some of the drives come up and some do not, almost like a race condition. We will try to get more data for you, including a sosreport and some system status. John, is this reproducible? The cluster was installed with ceph-ansible.
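The per-drive workaround in the comment above can be sketched the same way. activate_osds is a hypothetical wrapper around that loop, not part of the report: it prints (rather than runs) the ceph-disk activate command for partition 1 of each drive passed to it, and in real use the drive list would come from the /dev/sd[b-z] glob on the OSD host.

```shell
#!/bin/sh
# Dry-run sketch of the ceph-disk workaround: print (not execute) the
# "ceph-disk activate" command for partition 1 of each drive argument.
# Remove the echo to actually activate the OSDs.
activate_osds() {
    for d in "$@"; do
        echo "ceph-disk activate ${d}1"
    done
}

# In the real workaround the glob /dev/sd[b-z] supplies the drives.
activate_osds /dev/sdb /dev/sdc
```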
This is the wrong bz to put this in, though it is remotely related (OSDs failing to start). John Harrigan is going to create a new bz for that problem. Lowering the priority of this bz.

It was not clear to me whether there was a missing step in the installation procedure above that would have activated the OSDs, something like ceph-disk activate but done through ceph-deploy.