Bug 1349813 - RFE: osd should periodically perform canary writes to the backing disk in order to detect RO or broken backing storage without a client write
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Target Release: 3.*
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1341640
 
Reported: 2016-06-24 10:37 UTC by Darshan
Modified: 2019-03-08 22:51 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-08 22:51:30 UTC
Embargoed:



Description Darshan 2016-06-24 10:37:54 UTC
Description of problem:
When a disk consumed by an OSD is removed, the state of the OSD does not change. "ceph-disk list" no longer shows the removed disk partition and I/O errors appear in dmesg, but the monitor does not receive this error.

dmesg output:
-----------
[96856.314171] ACPI: \_SB_.PCI0.S08_: ACPI_NOTIFY_EJECT_REQUEST event
[96856.314227] ACPI: \_SB_.PCI0.S08_: Eject request in hotplug_event()
[96857.245117] ACPI: \_SB_.PCI0.S09_: ACPI_NOTIFY_EJECT_REQUEST event
[96857.245172] ACPI: \_SB_.PCI0.S09_: Eject request in hotplug_event()
[96870.490975] XFS (vdc1): metadata I/O error: block 0x12c00364 ("xlog_iodone") error 5 numblks 64
[96870.491106] XFS (vdc1): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa023db1e
[96870.491119] XFS (vdc1): Log I/O Error Detected.  Shutting down filesystem
[96870.491177] XFS (vdc1): Please umount the filesystem and rectify the problem(s)
[96900.569572] XFS (vdc1): xfs_log_force: error -5 returned.


NOTE: This flow was tried in a virtual machine managed by RHEVM

Version-Release number of selected component (if applicable):
ceph-osd-10.2.2-5.el7cp.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Create a ceph cluster with OSDs
2. Remove a disk used by any OSD
3. Check the OSD status using the "ceph -s" command (example commands below)
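
For reference, the usual ways to check this from an admin node are the standard Ceph CLI commands below (output omitted; nothing here is specific to this setup):

# Overall cluster health, including how many OSDs are up/in
ceph -s

# Per-OSD up/down and in/out state
ceph osd tree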

Actual results:
The OSD is still shown as up and in when the "ceph -s" command is invoked.


Expected results:
The osd state should be updated accordingly.


Additional info:
We are relying on Calamari to report that the OSD state has changed when a disk is pulled out. Since Ceph itself is not recognizing this, Calamari is not reporting it.

Comment 2 Samuel Just 2016-06-24 14:51:58 UTC
Was there only one osd?  The process is that when the disk is removed
1) the filestore threads will probably hang
2) after the suicide timeout, the osd process will die
3) the osd's peers will report it dead and the mons will mark it down

Did the osd process die?  Did you see reports in ceph.log from the osd's peers indicating that it's missing heartbeats?  Please reproduce this and leave the environment up so that I can examine the logs.
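
For anyone following the flow above: the timeouts in step 2 are ordinary config options, so they can be inspected on the affected host. A quick check, where osd.0 is only an example id and the grep patterns are just a convenience, looks something like:

# Show the suicide timeouts the running OSD daemon is using (via its admin socket)
ceph daemon osd.0 config show | grep suicide_timeout

# Look for missed-heartbeat reports from the OSD's peers in the cluster log
grep -i heartbeat /var/log/ceph/ceph.log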

Comment 3 Samuel Just 2016-06-24 14:53:20 UTC
I emailed Darshan requesting access to the environment.

Comment 5 Darshan 2016-06-27 13:35:49 UTC
(In reply to Samuel Just from comment #2)
> Was there only one osd?  The process is that when the disk is removed
> 1) the filestore threads will probably hang
> 2) after the suicide timeout, the osd process will die
> 3) the osd's peers will report it dead and the mons will mark it down
> 
> Did the osd process die?  Did you see reports in ceph.log from the osd's
> peers indicating that it's missing heartbeats?  Please reproduce this and
> leave the environment up so that I can examine the logs.

There were two OSDs, and the OSD process did not die even after an hour. I have mailed you the details of the machine where this is reproducible.

Comment 6 Samuel Just 2016-06-27 14:22:41 UTC
There is a config option (mon_osd_min_down_reporters) which controls how many OSDs are required to complain about a down OSD before the mons will mark it down.  It defaults to 2.  Once one of your two OSDs goes down, only one OSD is left, which leaves only one reporter.  Either retest with 3 OSDs (leaving 2 reporters), or change mon_osd_min_down_reporters to 1 on the mon.
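
For reference, a sketch of changing that setting at runtime with injectargs (Jewel-era syntax; mon.* targets all monitors):

# Lower the reporter threshold on the running monitors
ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'

The equivalent persistent change is mon osd min down reporters = 1 under [mon] in ceph.conf on the monitor hosts, followed by a monitor restart.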

Comment 7 Samuel Just 2016-06-27 14:41:48 UTC
Hmm, the osd process is still running.  The mount is actually readable.  I guess the process just hasn't tried to write anything.

Comment 8 Samuel Just 2016-06-27 14:45:39 UTC
Yeah, setting mon_osd_min_down_reporters to 1 and performing a single write causes the osd daemon to die and the system to detect the failure.  Arguably, we should do periodic background write+fsyncs in the backing store to detect this kind of thing.  I'd consider that an RFE, however.
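
Until something like that exists in the OSD itself, a rough admin-side approximation is a periodic direct write into each OSD data mount. The script below is only an illustrative sketch under that assumption; the canary file name, glob, and logger tag are made up here, not anything Ceph ships:

#!/bin/sh
# Illustrative canary check for FileStore OSD mounts; run it periodically (e.g. from cron).
# A failed O_DIRECT write plus fsync is a strong hint the backing disk is read-only or gone.
for dir in /var/lib/ceph/osd/ceph-*; do
    if ! dd if=/dev/zero of="$dir/canary" bs=4k count=1 \
            oflag=direct conv=fsync status=none 2>/dev/null; then
        echo "canary write failed for $dir" | logger -t osd-canary
    fi
done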

Comment 9 Darshan 2016-06-28 07:08:53 UTC
(In reply to Samuel Just from comment #6)
> There is a config (mon_osd_min_down_reporters) which controls how many osds
> are required to complain about a down osd before the mons will mark it down.
> It defaults to 2.  You only have 1 osd left when you mark one of the two
> down. That leaves only one reporter.  Either retest with 3 osds (leaving 2
> reporters), or change the setting on the mon for mon_osd_min_down_reporters
> to 1.

As suggested, I tried this with 3 OSDs. I removed the disk underlying one of the OSDs and issued a write after that. This time I was able to see the OSD process going down, and the OSD state was marked as down.
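
For anyone else reproducing this: any client write whose object maps to a PG on the affected OSD is enough to trigger the failure, for example (pool and object names here are arbitrary):

# Write a small object; once the affected OSD handles it, the daemon hits the I/O error and dies
rados -p rbd put canary-obj /etc/hosts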

Comment 10 Darshan 2016-06-28 08:59:11 UTC
It would be nice if Ceph could automatically detect this as mentioned in comment 8, rather than detecting it only after a write is invoked.

Comment 11 Samuel Just 2016-06-29 20:52:32 UTC
Ok, I'll change it to an RFE.

Comment 14 Greg Farnum 2019-03-08 22:51:30 UTC
This can only happen if you are reading only data that is available in cache and there are no cluster updates (otherwise the OSD will be doing writes to handle new OSDMaps, etc.). BlueStore probably also notices this more quickly since it does direct disk I/O. I don't see any way this can cause issues except in the least realistic sort of failure testing, so closing.

