Bug 626417
| Summary: | [Intel 5.7 Bug] Kernel panic when rebuilding and reboot occurs on IMSM RAID5 volume. |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | dmraid |
| Version: | 5.4 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Ignacy Kasperowicz <ignacy.kasperowicz> |
| Assignee: | Heinz Mauelshagen <heinzm> |
| QA Contact: | Gris Ge <fge> |
| CC: | agk, artur.wojcik, bdonahue, chyang, dave.jiang, dwysocha, ed.ciechanowski, fge, fnadge, haipao.fan, heinzm, ignacy.kasperowicz, jane.lv, jbrassow, jvillalo, jwilleford, keve.a.gabbert, krzysztof.wojcik, luyu, marcin.labun, mbroz, prockai, rdoty, wojciech.neubauer, zkabelac |
| Target Milestone: | rc |
| Target Release: | 5.7 |
| Hardware: | All |
| OS: | Linux |
| Fixed In Version: | dmraid-1.0.0.rc13-64.el5 |
| Doc Type: | Bug Fix |
| Doc Text: | Cause / Consequence / Fix / Result (template placeholder, not filled in) |
| Story Points: | --- |
| Clones: | 649788 (view as bug list) |
| Last Closed: | 2011-07-21 07:49:28 UTC |
| Regression: | --- |
| Bug Blocks: | 629795, 649788 |

Attachments:
Created attachment 440392 [details]
dmraid -rD
Created attachment 440393 [details]
console log
Created attachment 440394 [details]
dmraid -rD output
Created attachment 440395 [details]
dmsetup table output
Created attachment 440396 [details]
dmsetup -n output
The problem occurs because event monitoring for the isw volume is disabled when the reboot starts. Normally, event monitoring is enabled together with volume activation. For a boot volume, however, event monitoring cannot be enabled during activation because activation happens too early in the boot phase. Consequently, when a rebuild starts with event monitoring disabled, dmraid updates the metadata on the disks at the beginning of the rebuild process; if a reboot occurs during the rebuild, this corrupts the volume (the metadata reports a normal volume state even though the rebuild never completed). This is the behavior as designed, under the assumption that event monitoring is always enabled: with event monitoring enabled, the metadata is only updated after the rebuild completes. Right now dmraid always tries to enable event monitoring before a rebuild, but this always fails when the volume is degraded. The fix will be to allow enabling event monitoring for degraded volumes. However, I see that there is also a problem in the event daemon, which does not enable volume monitoring even though it returns success from the register function call.

Created attachment 440974 [details]
Allow registration to event monitoring for volumes in a degraded state
Input for the discussion on the solution.
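Until the degraded-volume registration fix lands, the analysis above implies a workaround ordering: make sure the volume is registered with the event monitor before triggering the rebuild, so the on-disk metadata is only set to the OK state once the rebuild really completes. A dry-run sketch of that ordering (commands are echoed, not executed; the volume and disk names are placeholders, not taken from this bug):

```shell
#!/bin/sh
# Dry-run sketch of the rebuild ordering described above: register the
# volume with the event monitor BEFORE starting the rebuild, so metadata
# is only flagged clean once the rebuild actually finishes.
# Every command is echoed instead of executed.
run() { echo "$@"; }

VOL=isw_example_Volume0   # placeholder volume name
NEW_DISK=/dev/sde         # placeholder replacement disk

run dmevent_tool -r "$VOL"               # register for event monitoring first
run dmraid --rebuild "$VOL" "$NEW_DISK"  # then start the rebuild
run dmraid -s                            # status stays "nosync" until rebuild completes
```

The point of the ordering is that dmeventd, not the rebuild trigger, is what may legitimately mark the metadata clean.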
Created attachment 440975 [details]
Misleading message after unsuccessful volume registration
If the registration command does not succeed, we still get the misleading message:
"device "xyz" is now registered with dmeventd for monitoring"
In fact the volume is not registered for event monitoring.
Reason: bad return value in the function _register_for_event.
Input for the discussion on the solution.
Created attachment 441208 [details]
Allow registration to the event monitor for a volume in an inconsistent state

Originally, registration to event monitoring was not allowed for a volume in an inconsistent state. This causes a problem when a user has an inconsistent, unregistered volume, triggers a rebuild on it, and the rebuild is then interrupted: after this sequence the user is told that the volume is in the OK state. That information is wrong, because the rebuild has not finished and the data may be corrupted. This patch enables registration for such volumes in the libdmraid-events-isw.so library.

Heinz, could you look at the patches in comment 8 and comment 9? They should resolve the issue described in this Bugzilla. Please review them.

More complex tests show that the issue has not been fully fixed. With different timing between the reproduction steps, the issue is still reproducible. Investigation is in progress...

Created attachment 442669 [details]
Allow registration to the event monitor, v2

Another issue has been found: when the user gives the full path to the DSO library during volume registration, dmraid updates the metadata to the OK state directly after the rebuild starts. This patch fixes that issue.
The last patch fixes the issue described in this BZ, but only if the user manually registers the RAID volume for event monitoring before running the rebuild command, so it is not a complete solution. The reason is another issue in the event service: after a volume is registered for event monitoring, a spurious "end_of_rebuild" event is triggered. This happens every time a volume is registered - after manual registration, after auto-registration during volume activation, after the --rebuild command, etc. This spurious event causes the metadata to be updated to the OK state regardless of the actual state of the volume.

Created attachment 442893 [details]
Correction of the "allow registration of degraded volumes" patch
A small correction to the last patch: replace dso_name with dso_lib_name.
BTW:
Heinz,
Do you know why the "in-sync" event is triggered after volume registration regardless of the actual volume status?
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

A new bug connected with this issue has been discovered: Bug 635995 - Data corruption during activation of a volume marked for rebuild.

Patch reviewed: looks ok. Intel: please confirm, in conjunction with the bz#635995 fix, that there are no regressions, thanks.

5.7 requested. 5.6.z optional.

Created attachment 449160 [details]
Complete patch enabling registration of degraded volumes
Created attachment 449161 [details]
Cleanup some compilation warnings
Created attachment 449162 [details]
Patch introducing an option (-u) that postpones any metadata updates
At an early stage of OS booting there is no possibility to register
a volume for event monitoring. As a result, the metadata on the disks
belonging to the volume is updated to the OK state just after a rebuild
is triggered.
This can be confusing: if the rebuild is interrupted, the user gets the
"OK" state (it should be the "nosync" state).
This patch introduces an option that postpones any metadata updates.
To finally resolve this issue, the following patches have to be applied:
- comment 8
- comment 19
- comment 20
- comment 21
- the patch resolving Bug 635995

Moreover, two changes are needed:
1. Volume registration to event monitoring has to be added to the startup scripts: dmevent_tool -r <vol-name>
2. The initrd image should use the '-u' option during volume activation.

I've verified that with the modifications from #c22, the issue is not reproducible on DM. The fix works; the next step after fix verification is to perform full regression tests on DM with these changes.

Regression tests on DM with these patches have passed. Thanks, Ignacy

Thanks Ignacy, let us double-check the list of patches which were applied in the regression tests: the ones attached to comments 8, 19, 20, 21 in this bz *and* the one fixing bz#635995?

I confirm. All patches were applied in the tests.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

*** Bug 649788 has been marked as a duplicate of this bug. ***

Fixed in dmraid-1.0.0.rc13-64.el5
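The two deployment changes in the resolution above (startup-script registration and the '-u' activation option) can be sketched as follows. This is a dry-run illustration: commands are echoed, not executed; the volume name is a placeholder, and pairing -u with -ay is an assumption - only dmevent_tool -r <vol-name> and the existence of the -u option come from the comments in this bug.

```shell
#!/bin/sh
# Dry-run sketch of the two deployment changes described in this bug.
run() { echo "$@"; }

VOL=isw_example_Volume0   # placeholder volume name

# 1. Startup scripts: register the volume for event monitoring, so
#    dmeventd owns the metadata state transitions.
run dmevent_tool -r "$VOL"

# 2. Initrd: activate volumes with -u so metadata updates are postponed
#    until the event monitor can take over (pairing with -ay is assumed).
run dmraid -ay -u
```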
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause / Consequence / Fix / Result (template placeholder, not filled in)
Reproduced the panic on RHEL 5.5 with isw dmraid. The problem is gone on RHEL 5.7 Beta (kernel -264, dmraid-1.0.0.rc13-65.el5).

Zdenek, I found a (possible) problem: if I replace a disk with a new one, the volume always stays in nosync mode and the system keeps rebuilding that disk. Check this:

==============
[root@wlan-5-122 ~]# dmraid -r
/dev/sdb: isw, "isw_gffbgcdad", GROUP, ok, 976773165 sectors, data@ 0
/dev/sdc: isw, "isw_gffbgcdad", GROUP, ok, 976773165 sectors, data@ 0
/dev/sdd: isw, "isw_gffbgcdad", GROUP, ok, 976773165 sectors, data@ 0
[root@wlan-5-122 ~]# dmraid -s
*** Group superset isw_gffbgcdad
--> Active Subset
name   : isw_gffbgcdad_Volume0
size   : 41943040
stride : 128
type   : raid5_la
status : nosync
subsets: 0
devs   : 3
spares : 0
[root@wlan-5-122 ~]# dmsetup status
isw_gffbgcdad_Volume0: 0 41943040 raid45 3 8:16 8:32 8:48 2 AAA 2560/2560 1 core
VolGroup00-LogVol01: 0 20316160 linear
VolGroup00-LogVol00: 0 956235776 linear
==============

Based on the 'dmsetup status' output, the new disk is fully synced. But the status is still nosync, and a reboot will cause a full re-sync again. Waiting for your reply.

Heinz, can you investigate comment #40?

Gris, do you have the dmraid-events package installed? That is in charge of updating the metadata to reflect the sync state via its contained DSO. Heinz

dmraid-events-1.0.0.rc13-65.el5 with kernel-2.6.18-264.el5

Here are the reproduction steps:
1. dmraid -f isw -C Volum5 --type 5 --disk "/dev/sdb /dev/sdc /dev/sdd"
2. dmraid -ay
3. mkfs.ext3 /dev/mapper/isw_bajjbbhafd_Volum5
4. Unplug the cable of one disk.
5. dmraid -R isw_bajjbbhafd_Volum5 /dev/sde
6. Reboot.
7. Wait for dmsetup status to show: isw_bajjbbhafd_Volum5: 0 1953527296 raid45 3 8:32 8:48 8:16 2 AAA 7453/7453 1 core
8. Wait longer for the dmraid event. I waited for 1 hour after dmsetup status showed the sync done, but dmraid -s was still in nosync mode.

It's not 100% reproducible. Is there any information you need when I hit the problem?
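The mismatch reported above can be narrowed down by checking the kernel-level sync state directly from the raid45 "dmsetup status" line. A minimal sketch, under the assumption (not confirmed in this bug) that the "2560/2560"-style field means synced/total regions:

```shell
#!/bin/sh
# Sketch: decide from a raid45 "dmsetup status" line whether the kernel
# reports the set as fully synced. Assumes the "<n>/<m>" field is
# synced/total regions; the sample line is taken from the bug comments.
fully_synced() {
    # extract the single field containing '/', e.g. "2560/2560"
    # (device numbers like "8:16" use ':', so they do not match)
    ratio=$(printf '%s\n' "$1" | tr ' ' '\n' | grep / | head -n 1)
    case $ratio in */*) ;; *) return 1 ;; esac   # no ratio field found
    [ "${ratio%/*}" = "${ratio#*/}" ]
}

STATUS='isw_gffbgcdad_Volume0: 0 41943040 raid45 3 8:16 8:32 8:48 2 AAA 2560/2560 1 core'
if fully_synced "$STATUS"; then
    echo "kernel reports regions fully synced"
fi
```

With the sample line above, the kernel side is fully synced, which is exactly why the persistent "nosync" in dmraid -s points at the metadata-update path (dmeventd/DSO) rather than the kernel rebuild.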
No reply. Sanity Only.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1020.html
Description of problem:
The OS does not boot from the RAID volume after an interrupted rebuild operation.

How reproducible: Always

HW: ICH9 - DQ35J0E, one CPU, 4 HDDs

Steps to Reproduce:
1. Boot into the RAID ROM (9.5.0.1037)
2. Create a RAID5 volume with 4 HDDs
3. Boot from the RHEL 5.4 DVD
4. Full installation with all packages
5. Boot into the OS after the installation finishes
6. Remove one of the RAID volume HDDs
7. Insert a new empty HDD
8. Run the command "dmraid --rebuild {raid volume name} {new hdd location}"
9. Run the command "dmraid -n >> log.txt" for the dmraid log
10. Reboot the system; it cannot boot into the OS (see console.log)
11. Reboot (reset) the system and go into the RAID ROM (the RAID volume status is Normal)

Actual results:
The system is not bootable; a kernel panic occurs: "Kernel panic - not syncing: Attempted to kill init!"

Expected results:
The OS should be fully operational after the rebuild.

Additional info attached:
- log.txt (dmraid -n) from step 10
- metadata dump (dmraid -rD; tar+bzip2 of the dmraid.isw directory created), saved during step 10 (rebuild process)
- the mapping table generated on the test system (dmsetup table --target=raid45), saved during step 10 (rebuild process)
- console log when the kernel panic occurs