| Summary: | Hard shutdown of a machine with fs operations on LVM on top of hardware RAID caused fs corruption | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Igor Zhang <yugzhang> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.1 | CC: | bstevens, coughlan, eguan, esandeen, rwheeler, tlavigne |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-05-09 03:58:53 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 846704, 961026 | ||
| Attachments: | | | |
Description
Igor Zhang
2011-04-01 02:39:57 UTC
Created attachment 489285 [details]
dmesg from the first time the fs corruption was found.
Created attachment 489286 [details]
Test log from the first time the fs corruption was found.
Created attachment 489287 [details]
Test log from the second time the fs corruption was found.
For the second failure of this case, I forgot to collect its dmesg. Sorry.

Hi Igor,

What hardware RAID card do we have in this test?

Thanks!

When you say local write caches are off - are they off on both the RAID card and the drive behind it?

The boot log in comment 1 shows:

    megaraid_sas 0000:07:00.0: irq 88 for MSI/MSI-X
    megaraid_sas: fw state:c0000000
    megasas: fwstate:c0000000, dis_OCR=0
    scsi0 : LSI SAS based MegaRAID driver
    scsi 0:0:10:0: Direct-Access SEAGATE ST9146802SS 0003 PQ: 0 ANSI: 5
    scsi 0:0:11:0: Direct-Access SEAGATE ST9146802SS 0003 PQ: 0 ANSI: 5
    scsi 0:0:12:0: Direct-Access SEAGATE ST9146802SS 0003 PQ: 0 ANSI: 5
    scsi 0:0:13:0: Direct-Access SEAGATE ST9146802SS 0003 PQ: 0 ANSI: 5
    scsi 0:2:0:0: Direct-Access INTEL RS2BL080 2.90 PQ: 0 ANSI: 5
    ...
    sd 0:2:0:0: [sda] 1140621312 512-byte logical blocks: (583 GB/543 GiB)
    sd 0:2:0:0: [sda] Write Protect is off
    sd 0:2:0:0: [sda] Mode Sense: 1f 00 10 08
    sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
    sda: sda1 sda2 sda3 sda4 < sda5 >
    sd 0:2:0:0: [sda] Attached SCSI disk

The tcms test description referenced shows commands like:

    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -EnDskCache -L0 -a0
    mkfs.ext4 /dev/mapper/vg1-vol1

and

    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -DisDskCache -L0 -a0
    mkfs.ext4 /dev/mapper/vg1-vol1

Igor: as Eric said, these commands control the state of the cache on the RAID card. It is also necessary to determine the state of the volatile cache on the back-end disk drives. Please take a look at MegaCli64 and see if there is a way to determine this.

Eric/Ric: Should we be doing some sort of a re-scan of the sd device, to update the state in the OS block layer, after the MegaCli64 utility is used to change the state of the device's cache? Or is it adequate to just use mount -o barrier, etc.?

Tom, well, extN and XFS will both give up on sending barriers once they fail; but the filesystems themselves don't care directly about write cache state, I >think<. So from the fs perspective, I don't think we need a rescan, but maybe lower levels care?

I think that we toggle barrier behavior correctly when the WCE bit changes for whatever reason. Christoph, is that correct?

(In reply to comment #10)
> I think that we toggle barrier behavior correctly when the WCE bit changes for
> whatever reason. Christoph, is that correct?

No. Christoph and Mike S. discussed this on IRC a bit. If the device's cache state changes (for example, if the battery dies, or someone uses an out-of-band utility to change it), the device should return a Unit Attention to the OS (which UA is an interesting question, since there is not one I know of for this specific event; probably just Parameters Changed...). Linux currently ignores these UAs. (I thought that UA handling was proposed for the LSF agenda, but I don't see it there now...)

I do not know to what extent it matters, if the FS is explicit about barrier on/off when it is mounted. I'm sure Christoph can help with that.

It would be nice to have us do something with those. Absent that support, I suppose that user space needs to monitor and remount (which is certainly not the best way to handle this).

Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
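For the "is it adequate to just use mount -o barrier" question above, here is a minimal sketch of how barriers could be requested explicitly on the volume used in this test and how to check that they are still being honoured. The mount point /mnt/test is only an illustration, and the exact wording of the kernel's "disabling barriers" warning differs between kernel versions, so the grep below is a starting point rather than a definitive check.

    # Mount the LVM volume with barriers explicitly enabled (barrier=1 is the
    # ext4 default, but spelling it out documents the intent of the test).
    # /mnt/test is a hypothetical mount point.
    mount -t ext4 -o barrier=1 /dev/mapper/vg1-vol1 /mnt/test

    # Confirm the option is in effect for this mount.
    grep vg1-vol1 /proc/mounts

    # If the device rejects cache flushes, ext4/jbd2 logs a warning and stops
    # issuing barriers; look for it after the shutdown/power-cycle test.
    dmesg | grep -i barrier

If the warning appears, the filesystem has silently fallen back to running without barriers, which matches the "give up on sending barriers once they fail" behaviour described above.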
(In reply to comment #6)
> Hi Igor,
>
> What hardware raid card do we have in this test?
>
> Thanks!

One RAID 1 volume with two SEAGATE ST9146802SS 145 GB drives, and it is an LSI SAS based MegaRAID card.

(In reply to comment #7)
> When you say local write caches are off - are they off on both the raid card
> and the drive behind it?

I only set up the RAID card cache and didn't touch the drives.

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
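Following up on the request to determine the volatile-cache state on the back-end drives (only the RAID card cache was changed above), here is a hedged sketch of commands that could report both the controller-side and drive-side settings on this kind of MegaRAID setup. The -LDGetProp property names mirror the -LDSetProp commands quoted earlier, but flags and output format can differ between MegaCli releases, and sdparm only shows what the firmware exports for the virtual drive (/dev/sda here), not the physical SAS disks behind it.

    # Controller-side cache policy of logical drive 0 on adapter 0
    # (write-back vs. write-through, behaviour when the BBU is bad, etc.).
    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -L0 -a0

    # Disk cache policy: whether the volatile write cache on the physical
    # drives behind logical drive 0 is enabled, disabled, or left at the
    # drive default.
    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -L0 -a0

    # What the exported virtual drive reports to the OS; this is the value
    # behind the "Write cache: disabled" line in the boot log.
    sdparm --get=WCE /dev/sda

Comparing the three answers would show whether "write caches are off" holds at every layer, or only on the RAID card.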