Bug 692746 - Hard shutdown of a machine with fs operations on LVM on top of hardware RAID caused fs corruption
Summary: Hard shutdown of a machine with fs operations on LVM on top of hardware RAID caused fs corruption
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Filesystem QE
URL:
Whiteboard:
Depends On:
Blocks: 846704 961026
 
Reported: 2011-04-01 02:39 UTC by Igor Zhang
Modified: 2013-05-09 03:58 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-09 03:58:53 UTC
Target Upstream Version:


Attachments
dmesg from the first time fs corruption was found (101.24 KB, application/octet-stream)
2011-04-01 02:41 UTC, Igor Zhang
Test log from the first time fs corruption was found (4.59 KB, application/octet-stream)
2011-04-01 02:42 UTC, Igor Zhang
Test log from the second time fs corruption was found (56.19 KB, application/octet-stream)
2011-04-01 02:43 UTC, Igor Zhang

Description Igor Zhang 2011-04-01 02:39:57 UTC
Description of problem:
A hard shutdown of a machine with fs operations running on LVM on top of hardware RAID caused fs corruption.
While doing power failure testing for RHEL 6.1 (https://tcms.engineering.redhat.com/run/18635/?from_plan=1232), I found:
For the test scenario "nobarriers and local write cache off" with the workload "fs_mark -d /media/vol1/dir -d /media/vol2/dir -s 51200 -n 4096 -L 10 -r 8 -D 128", the test failed twice. Logs are attached.
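For reference, the failing scenario roughly amounts to the sequence below. This is only a sketch: the second volume name, the mount points, and the MegaCli LD/adapter indices are inferred from the test plan and the fs_mark paths, and may differ on other setups.

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -DisDskCache -L0 -a0   # turn off the controller's disk cache policy
mkfs.ext4 /dev/mapper/vg1-vol1
mkfs.ext4 /dev/mapper/vg1-vol2
mount -o nobarrier /dev/mapper/vg1-vol1 /media/vol1
mount -o nobarrier /dev/mapper/vg1-vol2 /media/vol2
fs_mark -d /media/vol1/dir -d /media/vol2/dir -s 51200 -n 4096 -L 10 -r 8 -D 128
# cut power to the machine while fs_mark is running, then boot and check the filesystems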

Version-Release number of selected component (if applicable):
RHEL6.1-20110311.3
Kernel 2.6.32-122.el6.x86_64

How reproducible:
Occasionally

Steps to Reproduce:
1. See https://tcms.engineering.redhat.com/run/18635/?from_plan=1232, case "power testing: fs build on LVM on hard RAID".

  
Actual results:
Filesystem corruption was found.

Expected results:
Filesystems remain consistent.

Additional info:

Comment 1 Igor Zhang 2011-04-01 02:41:44 UTC
Created attachment 489285 [details]
dmesg from the first time fs corruption was found.

Comment 3 Igor Zhang 2011-04-01 02:42:45 UTC
Created attachment 489286 [details]
Test log from the first time fs corruption was found.

Comment 4 Igor Zhang 2011-04-01 02:43:59 UTC
Created attachment 489287 [details]
Test log from the second time fs corruption was found.

Comment 5 Igor Zhang 2011-04-01 02:45:18 UTC
For the second failure of this case, I forgot to collect its dmesg. Sorry.

Comment 6 Ric Wheeler 2011-04-01 12:38:01 UTC
Hi Igor,

What hardware raid card do we have in this test?

Thanks!

Comment 7 Eric Sandeen 2011-04-01 14:30:08 UTC
When you say local write caches are off - are they off on both the raid card and the drive behind it?

Comment 8 Tom Coughlan 2011-04-01 15:14:52 UTC
The boot log in comment 1 shows: 

megaraid_sas 0000:07:00.0: irq 88 for MSI/MSI-X
megaraid_sas: fw state:c0000000
megasas: fwstate:c0000000, dis_OCR=0
scsi0 : LSI SAS based MegaRAID driver
scsi 0:0:10:0: Direct-Access     SEAGATE  ST9146802SS      0003 PQ: 0 ANSI: 5
scsi 0:0:11:0: Direct-Access     SEAGATE  ST9146802SS      0003 PQ: 0 ANSI: 5
scsi 0:0:12:0: Direct-Access     SEAGATE  ST9146802SS      0003 PQ: 0 ANSI: 5
scsi 0:0:13:0: Direct-Access     SEAGATE  ST9146802SS      0003 PQ: 0 ANSI: 5
scsi 0:2:0:0: Direct-Access     INTEL    RS2BL080         2.90 PQ: 0 ANSI: 5
...
sd 0:2:0:0: [sda] 1140621312 512-byte logical blocks: (583 GB/543 GiB)
sd 0:2:0:0: [sda] Write Protect is off
sd 0:2:0:0: [sda] Mode Sense: 1f 00 10 08
sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
 sda: sda1 sda2 sda3 sda4 < sda5 >
sd 0:2:0:0: [sda] Attached SCSI disk

The tcms test description referenced shows commands like:

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp  -EnDskCache -L0 -a0
mkfs.ext4 /dev/mapper/vg1-vol1

and

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp  -DisDskCache -L0 -a0
mkfs.ext4 /dev/mapper/vg1-vol1

Igor: as Eric said, these commands control the state of the cache on the RAID card. It is also necessary to determine the state of the volatile cache on the back-end disk drives. Please take a look at MegaCli64 and see if there is a way to determine this. 

Eric/Ric: Should we be doing some sort of re-scan of the sd device, to update the state in the o.s. block layer, after the MegaCli64 utility is used to change the state of the device's cache? Or is it adequate to just use mount -o barrier etc.?
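For reference, a few ways to query the cache state. This is hedged: the exact MegaCli option spelling varies between versions, and the LD/adapter indices and the 0:2:0:0 sd address are taken from the boot log above, so adjust as needed.

# Controller-side: report the disk cache policy (back-end drives) for the logical drive(s)
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -LAll -aAll
# Controller-side: report the LD write-back/write-through cache policy
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -aAll
# OS-side view of the exported virtual drive's write cache, as sd reported it at attach time
sdparm --get=WCE /dev/sda
cat /sys/class/scsi_disk/0:2:0:0/cache_type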

Comment 9 Eric Sandeen 2011-04-01 15:33:01 UTC
Tom, well, extN and XFS will both give up on sending barriers once they fail; but the filesystems themselves don't care directly about write cache state, I >think<.

So from the fs perspective, I don't think we need a rescan, but maybe lower levels care?

Comment 10 Ric Wheeler 2011-04-01 15:58:18 UTC
I think that we toggle barrier behavior correctly when the WCE bit changes for whatever reason. Christoph, is that correct?

Comment 11 Tom Coughlan 2011-04-01 16:25:21 UTC
(In reply to comment #10)
> I think that we toggle barrier behavior correctly when the WCE bit changes for
> whatever reason. Christoph, is that correct?

No. Christoph and Mike S. discussed this on IRC a bit. If the device's cache state changes (like if the battery dies, or someone uses an out-of-band utility to change it), the device should return a Unit Attention to the o.s. (which UA is an interesting question, since there is not one I know of for this specific event. Probably just Parameters Changed...). Linux currently ignores these UAs. (I thought that UA handling was proposed for the LSF agenda, but I don't see it there now...)

I do not know to what extent it matters, if the FC is explicit about barrier on/off when it is mounted. I'm sure Christoph can help with that.

Comment 12 Tom Coughlan 2011-04-01 16:27:18 UTC
I mean "...if the FS is explicit..."

Comment 13 Ric Wheeler 2011-04-01 16:36:43 UTC
It would be nice for us to do something with those.

Absent that support, I suppose that user space needs to monitor and remount (which is certainly not the best way to handle this).
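A minimal sketch of that monitor-and-remount approach, purely illustrative and not shipped tooling: it assumes the sd address from the boot log in comment 8 and the mount points from the original test, and simply re-enables barriers whenever the write cache no longer looks disabled.

#!/bin/bash
# Sketch of the "monitor and remount" workaround described above (not part of any
# shipped tooling). Device path and mount points are assumptions from this test.
DISK=/sys/class/scsi_disk/0:2:0:0/cache_type
while sleep 60; do
    state=$(cat "$DISK")
    if [ "$state" != "write through" ]; then
        # Write cache appears to have been (re)enabled out of band: make sure the
        # filesystems are mounted with barriers so ext4 issues cache flushes.
        mount -o remount,barrier=1 /media/vol1
        mount -o remount,barrier=1 /media/vol2
    fi
done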

Comment 14 RHEL Program Management 2011-04-04 02:42:30 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 15 Igor Zhang 2011-04-07 01:48:39 UTC
(In reply to comment #6)
> Hi Igor,
> 
> What hardware raid card do we have in this test?
> 
> Thanks!

One RAID 1 volume built from two SEAGATE ST9146802SS 145 GB drives.
It's an LSI SAS based MegaRAID controller.

Comment 16 Igor Zhang 2011-04-07 01:52:18 UTC
(In reply to comment #7)
> When you say local write caches are off - are they off on both the raid card
> and the drive behind it?

I only set the RAID card cache and didn't touch the drives.

Comment 17 RHEL Program Management 2011-10-07 15:28:31 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 18 RHEL Program Management 2012-12-14 07:41:12 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

