Description of problem:
When a disk consumed by an OSD is removed, the state of the OSD does not change. "ceph-disk list" no longer lists the removed disk's partition, and I/O errors are visible in dmesg, but the monitor does not receive this error.

dmesg output:
-----------
[96856.314171] ACPI: \_SB_.PCI0.S08_: ACPI_NOTIFY_EJECT_REQUEST event
[96856.314227] ACPI: \_SB_.PCI0.S08_: Eject request in hotplug_event()
[96857.245117] ACPI: \_SB_.PCI0.S09_: ACPI_NOTIFY_EJECT_REQUEST event
[96857.245172] ACPI: \_SB_.PCI0.S09_: Eject request in hotplug_event()
[96870.490975] XFS (vdc1): metadata I/O error: block 0x12c00364 ("xlog_iodone") error 5 numblks 64
[96870.491106] XFS (vdc1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa023db1e
[96870.491119] XFS (vdc1): Log I/O Error Detected.  Shutting down filesystem
[96870.491177] XFS (vdc1): Please umount the filesystem and rectify the problem(s)
[96900.569572] XFS (vdc1): xfs_log_force: error -5 returned.

NOTE: This flow was tried in a virtual machine managed by RHEVM.

Version-Release number of selected component (if applicable):
ceph-osd-10.2.2-5.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a Ceph cluster with OSDs.
2. Remove a disk used by any OSD.
3. Check the OSD status using the "ceph -s" command.

Actual results:
The OSD is still shown as up and in when the "ceph -s" command is invoked.

Expected results:
The OSD state should be updated accordingly.

Additional info:
We are relying on Calamari to report that the OSD state has changed when a disk is pulled out. Since Ceph itself does not recognize this, Calamari does not report it.
Was there only one osd? The process is that when the disk is removed:
1) the filestore threads will probably hang
2) after the suicide timeout, the osd process will die
3) the osd's peers will report it dead and the mons will mark it down

Did the osd process die? Did you see reports in ceph.log from the osd's peers indicating that it's missing heartbeats? Please reproduce this and leave the environment up so that I can examine the logs.
I emailed Darshan requesting access to the environment.
(In reply to Samuel Just from comment #2)
> Was there only one osd? The process is that when the disk is removed
> 1) the filestore threads will probably hang
> 2) after the suicide timeout, the osd process will die
> 3) the osd's peers will report it dead and the mons will mark it down
>
> Did the osd process die? Did you see reports in ceph.log from the osd's
> peers indicating that it's missing heartbeats? Please reproduce this and
> leave the environment up so that I can examine the logs.

There were two OSDs, and the OSD process did not die even after an hour. I have mailed you the details of the machine where this is reproducible.
There is a config option (mon_osd_min_down_reporters) which controls how many osds are required to complain about a down osd before the mons will mark it down. It defaults to 2. With only two OSDs, taking one down leaves a single reporter, which is below the threshold. Either retest with 3 osds (leaving 2 reporters), or change mon_osd_min_down_reporters to 1 on the mon.
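For reference, the reporter threshold can be lowered either in ceph.conf or at runtime. This is a sketch assuming a Jewel-era cluster; adjust section placement and mon targets to your deployment:

```ini
; ceph.conf on the monitor hosts (restart the mons to apply)
[mon]
mon_osd_min_down_reporters = 1
```

The same setting can be injected at runtime without a restart via `ceph tell mon.* injectargs '--mon-osd-min-down-reporters=1'`.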
Hmm, the osd process is still running. The mount is actually readable. I guess the process just hasn't tried to write anything.
Yeah, setting mon_osd_min_down_reporters to 1 and performing a single write causes the osd daemon to die and the system to detect the failure. Arguably, we should do periodic background write+fsyncs in the backing store to detect this kind of thing. I'd consider that an RFE, however.
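The RFE described above amounts to a periodic liveness probe against the backing store. A minimal sketch of the idea in Python (hypothetical, not Ceph code; the probe path and error handling are illustrative): write a small marker file under the OSD data directory, fsync it, and treat an OSError as a dead or read-only device.

```python
import os
import tempfile

def probe_backing_store(osd_data_dir):
    """Write and fsync a small marker file under the OSD data dir.

    Returns True if the device accepted the write, False if an error
    (e.g. EIO after the disk was pulled) was detected.
    """
    try:
        fd, path = tempfile.mkstemp(prefix=".health-probe-", dir=osd_data_dir)
        try:
            os.write(fd, b"ping")
            os.fsync(fd)  # force the write through to the device
        finally:
            os.close(fd)
            os.unlink(path)
        return True
    except OSError:
        # EIO / EROFS / ENODEV etc. indicate the backing device is gone
        return False
```

A background thread in the OSD could run this every few seconds and trigger the usual suicide path on failure, so the peers notice the missing heartbeats without waiting for a client write.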
(In reply to Samuel Just from comment #6)
> There is a config (mon_osd_min_down_reporters) which controls how many osds
> are required to complain about a down osd before the mons will mark it down.
> It defaults to 2. You only have 1 osd left when you mark one of the two
> down. That leaves only one reporter. Either retest with 3 osds (leaving 2
> reporters), or change the setting on the mon for mon_osd_min_down_reporters
> to 1.

As suggested, I tried this with 3 OSDs: removed the disk underlying one of the OSDs and issued a write after that. I was then able to see the OSD process go down, and the OSD state was marked as down.
It would be nice if ceph could detect this automatically, as mentioned in comment 8, rather than detecting it only after a write is invoked.
Ok, I'll change it to an RFE.
This can only happen if you are only reading data that is available in cache, and there are no cluster updates (otherwise the OSD will be doing writes to handle new OSDMaps, etc). BlueStore probably also notices this more quickly since it does direct disk IO. I don't see any way this can cause issues except in failure testing of the very least-realistic sort, so closing.