Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 626417

Summary: [Intel 5.7 Bug] Kernel panic when rebuilding and reboot occurs on IMSM RAID5 volume.
Product: Red Hat Enterprise Linux 5 Reporter: Ignacy Kasperowicz <ignacy.kasperowicz>
Component: dmraid Assignee: Heinz Mauelshagen <heinzm>
Status: CLOSED ERRATA QA Contact: Gris Ge <fge>
Severity: high Docs Contact:
Priority: high    
Version: 5.4 CC: agk, artur.wojcik, bdonahue, chyang, dave.jiang, dwysocha, ed.ciechanowski, fge, fnadge, haipao.fan, heinzm, ignacy.kasperowicz, jane.lv, jbrassow, jvillalo, jwilleford, keve.a.gabbert, krzysztof.wojcik, luyu, marcin.labun, mbroz, prockai, rdoty, wojciech.neubauer, zkabelac
Target Milestone: rc   
Target Release: 5.7   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: dmraid-1.0.0.rc13-64.el5 Doc Type: Bug Fix
Doc Text:
cause Consequence Fix Result
Story Points: ---
Clone Of:
: 649788 (view as bug list) Environment:
Last Closed: 2011-07-21 07:49:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 629795, 649788    
Attachments:
Description Flags
dmraid -rD
none
console log
none
dm -rD output
none
dmsetup table output
none
dmsetup -n output
none
Allow to register to the events monitoring for volumes in degraded state
none
faulty message after unsuccessful vol registration
none
Allow registration to the event monitor for volume in inconsistent state
none
Allow registration to the event monitor v2
none
Correction of Allow to registr. degr. vol patch
none
Complete patch enabling registration of degraded volume
none
Cleanup some compilation warnings
none
Patch introduces option that postpones any metadata updates (-u). none

Description Ignacy Kasperowicz 2010-08-23 14:12:53 UTC
Description of problem:
OS does not boot from RAID volume after interrupted rebuild operation.

How reproducible:
Always

HW:
ICH9 - DQ35J0E
One CPU
4 HDDs

Steps to Reproduce:
1. Boot into the RAID ROM (9.5.0.1037)
2. Create a RAID5 volume with 4 HDDs
3. Boot from the RHEL 5.4 DVD
4. Perform a full installation with all packages
5. Boot into the OS after installation finishes
6. Remove one of the RAID volume's HDDs
7. Insert a new, empty HDD
8. Run the command "dmraid --rebuild {raid volume name} {new hdd location}"
9. Run the command "dmraid -n >> log.txt" to capture the dmraid log
10. Reboot the system; it cannot boot into the OS (see console.log)
11. Reboot (reset) the system and enter the RAID ROM (RAID volume status is Normal)
  
Actual results:
System is not bootable; a kernel panic occurs:
"Kernel panic - not syncing: Attempted to kill init!"

Expected results:
OS should be fully operational after rebuild.

Additional info attached:
 - log.txt (dmraid -n) from step 10.
 - metadata dump (dmraid -rD; tar+bzip2 the dmraid.osw directory created) saved during step 10 - rebuild process
 - the mapping table generated on the test system (dmsetup table --target=raid45) saved during step 10 - rebuild process
 - console log when kernel panic occurs.

Comment 1 Ignacy Kasperowicz 2010-08-23 14:13:32 UTC
Created attachment 440392 [details]
dmraid -rD

Comment 2 Ignacy Kasperowicz 2010-08-23 14:13:51 UTC
Created attachment 440393 [details]
console log

Comment 3 Ignacy Kasperowicz 2010-08-23 14:14:14 UTC
Created attachment 440394 [details]
dm -rD output

Comment 4 Ignacy Kasperowicz 2010-08-23 14:14:41 UTC
Created attachment 440395 [details]
dmsetup table output

Comment 5 Ignacy Kasperowicz 2010-08-23 14:15:00 UTC
Created attachment 440396 [details]
dmsetup -n output

Comment 6 Wojciech Neubauer 2010-08-24 16:19:47 UTC
The problem occurs because event monitoring for the isw volume is disabled when the reboot starts. Usually, event monitoring is enabled together with volume activation. However, for a boot volume, event monitoring cannot be enabled during activation because it is too early in the boot phase.

Consequently, when event monitoring is disabled while a rebuild starts, dmraid updates the metadata on the disks at the beginning of the rebuild process, which, in case of a reboot during the rebuild, corrupts the volume (normal volume state without rebuild completion). This is the behavior as designed, with the assumption that event monitoring is always enabled: if event monitoring is enabled, the metadata is only updated after the rebuild completes.

Right now dmraid always tries to enable event monitoring before a rebuild. This action, however, always fails when the volume is degraded. The fix will be to allow enabling event monitoring for degraded volumes.

However, I see that there is also a problem in the event daemon, which does not enable volume monitoring even though it returns success for the register function call.

Comment 7 Wojciech Neubauer 2010-08-25 15:38:28 UTC
Created attachment 440974 [details]
Allow to register to the events monitoring for volumes in degraded state

input to the solution discussion.

Comment 8 Wojciech Neubauer 2010-08-25 15:39:56 UTC
Created attachment 440975 [details]
faulty message after unsuccessful vol registration

If the registration command does not succeed, we get a faulty message:
"device "xyz" is now registered with dmeventd for monitoring"
In fact, the volume is not registered for event monitoring.
Reason: bad return value in the function _register_for_event.

Input for the discussion on the solution.

Comment 9 Krzysztof Wojcik 2010-08-26 13:34:41 UTC
Created attachment 441208 [details]
Allow registration to the event monitor for volume in inconsistent state

Originally, registration with the event monitor was not allowed for a
volume in an inconsistent state.
This causes an issue when a user has an inconsistent, unregistered
volume, triggers a rebuild on it, and the rebuild is then interrupted.
After this sequence, the user is told that the volume is in the OK state.
This information is wrong, because the volume rebuild has not finished
and the data may be corrupted.
This patch enables this feature in the libdmraid-events-isw.so library.

Heinz,

Could you look at the patches in comments 8 and 9?
They should resolve the issue described in this Bugzilla.
Please review them.

Comment 11 Krzysztof Wojcik 2010-09-02 07:02:53 UTC
More thorough tests show that the issue has not been fully fixed.
With different timings between the reproduction steps, the issue is still reproducible.
Investigation is in progress...

Comment 12 Krzysztof Wojcik 2010-09-02 16:29:24 UTC
Created attachment 442669 [details]
Allow registration to the event monitor v2

Another issue has been found...
When the user gives the full path to the DSO library during volume registration, dmraid updates the metadata to the OK state directly after the rebuild starts. That is not good...
This patch fixes the issue.

Comment 13 Krzysztof Wojcik 2010-09-02 19:32:36 UTC
The last patch fixes the issue described in this BZ, but only if the user manually registers the RAID volume for event monitoring before the rebuild command, so this is not a complete solution.
This happens because there is another issue in the event service.
After a volume is registered for event monitoring, a spurious "end_of_rebuild" event is triggered. This happens whenever a volume is registered: after manual registration, after auto-registration during volume activation, after the --rebuild command, etc.
This spurious event causes a metadata update to the OK state regardless of the actual state of the volume.

Comment 14 Krzysztof Wojcik 2010-09-03 13:11:42 UTC
Created attachment 442893 [details]
Correction of Allow to registr. degr. vol patch

Small correction of the last patch: replace dso_name with dso_lib_name.

BTW:
Heinz,
Do you happen to know why the "in-sync" event is triggered after volume registration regardless of the actual volume status?

Comment 15 RHEL Program Management 2010-09-20 15:27:41 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 16 Krzysztof Wojcik 2010-09-21 10:33:50 UTC
New bug connected with this issue has been discovered:
Bug 635995 - Data corruption during activation volume marked for rebuild

Comment 18 Heinz Mauelshagen 2010-09-22 08:40:43 UTC
Patch reviewed: looks ok.

Intel:
please confirm, in conjunction with the bz#635995 fix, that there are no regressions, thanks.

5.7 requested. 5.6.z optional.

Comment 19 Krzysztof Wojcik 2010-09-23 09:59:46 UTC
Created attachment 449160 [details]
Complete patch enabling registration of degraded volume

Comment 20 Krzysztof Wojcik 2010-09-23 10:00:39 UTC
Created attachment 449161 [details]
Cleanup some compilation warnings

Comment 21 Krzysztof Wojcik 2010-09-23 10:03:22 UTC
Created attachment 449162 [details]
Patch introduces option that postpones any metadata updates (-u).

At early stage of OS booting there is no possibility to register
volume to the events monitoring. It causes that metadata on disks
belong to the volume is updated to OK state just after rebuild is
triggered.
It may be confusing- if rebuild will be interupted, user get "OK"
state (should be "nosync" state)
This patch introduces option that postpone any metadata updates.

Comment 22 Krzysztof Wojcik 2010-09-23 10:13:26 UTC
To finally resolve this issue, the following patches have to be applied:
- comment 8
- comment 19
- comment 20
- comment 21
- patch resolving Bug 635995

Moreover, two changes are needed:
1. Volume registration with the event monitor has to be added to the startup scripts:
dmevent_tool -r <vol-name>
2. The '-u' option should be used during volume activation in the initrd image
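As a sketch only, the two changes above might look like the following startup-script fragment. The volume name is a placeholder, the run_or_show guard is my own illustrative helper (not part of dmraid), and the exact flags used in a real initrd may differ; the fragment only prints the commands when the tools are not installed:

```shell
#!/bin/sh
# Illustrative sketch of the two changes from this comment.
# VOL is a hypothetical placeholder -- substitute the real volume name.
VOL="isw_example_Volume0"

run_or_show() {
    # Run the command if its binary is available, otherwise just print it.
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Change 1: register the volume with the event monitor at startup.
run_or_show dmevent_tool -r "$VOL"

# Change 2: in the initrd, activate volumes with '-u' so metadata
# updates are postponed until event monitoring can take over.
run_or_show dmraid -ay -u
```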

Comment 26 Ignacy Kasperowicz 2010-10-06 09:31:46 UTC
I've verified that on DM with the modifications from #c22 the issue is not reproducible.
The fix works; the next step after fix verification is to perform full regression tests on DM with these changes.

Comment 27 Ignacy Kasperowicz 2010-10-13 10:51:01 UTC
Regression tests on DM with these patches have passed.

Thanks,
Ignacy

Comment 28 Heinz Mauelshagen 2010-10-14 14:32:27 UTC
Thanks Ignacy,

let us double-check the list of patches which were applied in the regression tests:

the ones attached to comments 8, 19, 20, 21 in this bz *and*
the one fixing bz#635995?

Comment 29 Krzysztof Wojcik 2010-10-27 10:53:52 UTC
I confirm. All patches were applied in tests.

Comment 34 RHEL Program Management 2011-01-11 20:44:10 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 35 RHEL Program Management 2011-01-11 22:30:02 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 36 Jane Lv 2011-01-12 03:13:21 UTC
*** Bug 649788 has been marked as a duplicate of this bug. ***

Comment 37 Zdenek Kabelac 2011-04-06 12:16:29 UTC
Fixed in dmraid-1.0.0.rc13-64.el5

Comment 39 Florian Nadge 2011-05-27 13:01:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
cause
Consequence
Fix
Result

Comment 40 Gris Ge 2011-06-03 06:52:21 UTC
Reproduced the panic on RHEL 5.5 with isw dmraid.

Problem gone on RHEL 5.7 Beta ( kernel -264, dmraid-1.0.0.rc13-65.el5).

Zdenek,

I found what might be a problem: if I replace a disk with a new one, the volume will always stay in nosync mode and the system keeps rebuilding that disk.
Check this:
==============
[root@wlan-5-122 ~]# dmraid -r
/dev/sdb: isw, "isw_gffbgcdad", GROUP, ok, 976773165 sectors, data@ 0
/dev/sdc: isw, "isw_gffbgcdad", GROUP, ok, 976773165 sectors, data@ 0
/dev/sdd: isw, "isw_gffbgcdad", GROUP, ok, 976773165 sectors, data@ 0
[root@wlan-5-122 ~]# dmraid -s
*** Group superset isw_gffbgcdad
--> Active Subset
name   : isw_gffbgcdad_Volume0
size   : 41943040
stride : 128
type   : raid5_la
status : nosync
subsets: 0
devs   : 3
spares : 0
[root@wlan-5-122 ~]# dmsetup status
isw_gffbgcdad_Volume0: 0 41943040 raid45 3 8:16 8:32 8:48 2 AAA 2560/2560 1 core
VolGroup00-LogVol01: 0 20316160 linear
VolGroup00-LogVol00: 0 956235776 linear
==============

Based on the 'dmsetup status' output, the new disk is fully synced.
But the status is still nosync, and a reboot will cause a full re-sync again.
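As an aside, the counters above can be checked mechanically. A minimal shell sketch, operating on the sample status line captured above rather than on live devices, and assuming the "2560/2560" field is the in-sync/total region counter of the raid45 target:

```shell
# Sample raid45 line from the 'dmsetup status' output above (not a live query).
status='isw_gffbgcdad_Volume0: 0 41943040 raid45 3 8:16 8:32 8:48 2 AAA 2560/2560 1 core'

# Pick out the "<done>/<total>" field; only the sync counter contains '/'.
ratio=$(echo "$status" | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+\/[0-9]+$/) print $i}')
done_regions=${ratio%/*}
total_regions=${ratio#*/}

if [ "$done_regions" -eq "$total_regions" ]; then
    echo "kernel view: in sync ($ratio)"
else
    echo "kernel view: rebuilding ($ratio)"
fi
```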

Waiting for your reply.

Comment 41 Gris Ge 2011-06-15 06:18:49 UTC
Heinz,
Can you investigate comment #40?

Comment 42 Heinz Mauelshagen 2011-06-15 13:14:08 UTC
Gris,

do you have the dmraid-events package installed?

That is in charge of updating the metadata to reflect the sync state via its contained DSO.

Heinz

Comment 43 Gris Ge 2011-06-17 09:40:17 UTC
dmraid-events-1.0.0.rc13-65.el5 with kernel-2.6.18-264.el5

Here are the reproduction steps:
1. dmraid -f isw -C Volum5 --type 5 --disk "/dev/sdb /dev/sdc /dev/sdd"
2. dmraid -ay
3. mkfs.ext3 /dev/mapper/isw_bajjbbhafd_Volum5
4. Unplug the cable of one disk.
5. dmraid -R isw_bajjbbhafd_Volum5 /dev/sde
6. Reboot.
7. Wait for dmsetup status to show isw_bajjbbhafd_Volum5: 0 1953527296 raid45 3 8:32 8:48 8:16 2 AAA 7453/7453 1 core
8. Wait some more time for the dmraid event.

I waited for 1 hour after dmsetup status showed the sync was done, but dmraid -s is still in nosync mode.

It's not 100% reproducible. Is there any information you need when I hit the problem?
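Steps 7 and 8 above amount to waiting for the in-sync/total counters to match. A sketch of such a wait loop, under the same field assumptions as above; wait_for_sync is my own hypothetical helper, and it falls back to the sample line from step 7 when dmsetup is not available, so the fragment stays self-contained:

```shell
# Sample status line from step 7, used when dmsetup is unavailable.
sample='isw_bajjbbhafd_Volum5: 0 1953527296 raid45 3 8:32 8:48 8:16 2 AAA 7453/7453 1 core'

get_ratio() {
    # Extract the "<done>/<total>" sync field from a raid45 status line.
    echo "$1" | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+\/[0-9]+$/) print $i}'
}

wait_for_sync() {
    while :; do
        if command -v dmsetup >/dev/null 2>&1; then
            line=$(dmsetup status "$1" 2>/dev/null)
        else
            line=$sample
        fi
        r=$(get_ratio "$line")
        # Counters equal (or no counter found) -> stop polling.
        [ "${r%/*}" = "${r#*/}" ] && break
        sleep 10
    done
    echo "kernel sync counters settled: $r"
}

wait_for_sync isw_bajjbbhafd_Volum5
```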

Comment 45 Gris Ge 2011-06-23 10:09:41 UTC
No reply.
Sanity Only.

Comment 46 errata-xmlrpc 2011-07-21 07:49:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1020.html