Bug 450939
Summary: | panic in cluster_log_ser during resync of two legged core log mirrors
---|---
Product: | [Retired] Red Hat Cluster Suite
Component: | cmirror-kernel
Version: | 4
Hardware: | All
OS: | Linux
Status: | CLOSED ERRATA
Severity: | high
Priority: | high
Keywords: | Regression
Reporter: | Corey Marthaler <cmarthal>
Assignee: | Jonathan Earl Brassow <jbrassow>
QA Contact: | Cluster QE <mspqa-list>
CC: | edamato
Fixed In Version: | RHBA-2008-0803
Doc Type: | Bug Fix
Last Closed: | 2008-07-25 19:28:02 UTC
Description
Corey Marthaler
2008-06-11 20:13:14 UTC
Created attachment 308986
log from taft-01
Created attachment 308987
log from taft-02
Created attachment 308989
log from taft-03
Created attachment 308990
log from taft-04
Reproduced this again last night while running the same test case: Kill the secondary legs of two synced core log mirrors.

lvm2-2.02.37-3.el4
lvm2-cluster-2.02.37-3.el4

This now seems fairly easy to reproduce, and since this test case worked before, it appears to be a regression.

Hit this during machine recovery testing as well. Shot hayes-01, and hayes-03 panicked.

GFS: fsid=HAYES:2.0: Done
dm-cmirror: Error listening for server(2) response for KbfWJO2B: -110
dm-cmirror: Error listening for server(2) response for KbfWJO2B: -110
GFS: fsid=HAYES:2.0: jid=1: Trying to acquire journal lock...
GFS: fsid=HAYES:1.2: jid=0: Trying to acquire journal lock...
GFS: fsid=HAYES:2.0: jid=1: Looking at journal...
GFS: fsid=HAYES:1.2: jid=0: Busy
dm-cmirror: Recovery blocked by outstanding write on region 6899/KbfWJO2B
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at dm_cmirror_server:696
[...]

FWIW, this can be reproduced by failing the primary leg as well; it doesn't have to be the secondary one.

Judging from comment #1, you should be able to reproduce this bug just by doing linear -> (corelog) mirror upconverts... This should be true because there is no mirror state carried along when a failed mirror goes to linear.

I'm not sure I see a way for this panic to be triggered during normal operation... I think I see a way during machine/disk failure, so it will be important to know exactly when this panic is happening.

(BTW, thanks for including the build dates of the rpms - that's helpful.)
I hit something like this while running a related cmirror failure case with the following bits:

cmirror-1.0.1-1
cmirror-kernel-2.6.9-41.4
lvm2-2.02.37-3.el4
lvm2-cluster-2.02.37-3.el4

Senario: Kill secondary leg of non synced core log 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name: nonsyncd_secondary_core_2legs
* sync: 0
* mirrors: 1
* disklog: 0
* failpv: /dev/sdd1
* legs: 2
* pvs: /dev/sde1 /dev/sdd1
************************************************

During a resync morph-02 panicked with the following messages:

dm-cmirror: Mark attempted to recovering region by 3: 679/fA32IibI
dm-cmirror: lc->recovering_region = 679
dm-cmirror: ru->ru_rw = 2
dm-cmirror: ru->ru_nodeid = 4
dm-cmirror: ru->ru_region = 679
------------[ cut here ]------------
kernel BUG at /builddir/build/BUILD/cmirror-kernel-2.6.9-41/hugemem/src/dm-cmirror-server.c:574!
invalid operand: 0000 [#1]
SMP
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc cpufreq_powersave button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 ata_piix libata qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 1
EIP: 0060:[<82c15820>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-71.ELhugemem)
EIP is at server_mark_region+0x198/0x1bb [dm_cmirror]
eax: 0000002d ebx: 79ad8200 ecx: 75578f34 edx: 82c18b67
esi: 7961ff20 edi: 39c6c375 ebp: 00000000 esp: 75578f30
ds: 007b es: 007b ss: 0068
Process cluster_log_ser (pid: 24176, threadinfo=75578000 task=773a3330)
Stack: 82c18b67 000002a7 00000000 82c18b3a 00000004 82c18b0d 00000002 00000003
 81da5180 39c6c200 75578000 fffffffa ff000000 39c6c200 82c1664a 022d6202
 00000043 75578000 7d5d85b0 773a3330 7e692d00 81da5220 00000000 c73e0002
Call Trace:
 [<82c1664a>] process_log_request+0x28f/0x47b [dm_cmirror]
 [<022d6202>] schedule+0x8c2/0x8ee
 [<82c169e1>] cluster_log_serverd+0x1ab/0x21f [dm_cmirror]
 [<82c16836>] cluster_log_serverd+0x0/0x21f [dm_cmirror]
 [<021041f5>] kernel_thread_helper+0x5/0xb
Code: 20 ff 73 08 68 0d 8b c1 82 e8 66 d3 50 7f ff 73 0c 68 3a 8b c1 82 e8 59 d3 50 7f ff 73 14 ff 73 10 68 67 8b c1 82 e8 49 d3 50 7f <0f> 0b 3e 02 95 8b c1 82 83 c4 1c eb 0c 8b 00 89 70 04 89 06 89
 <0>Fatal exception: panic in 5 seconds
dm-cmirror: Client finishing recovery: 679/fA32IibI
Kernel panic - not syncing: Fatal exception

It appears that the kernel mod that Jon gave me fixes this issue.

Fix checked into CVS:

commit a0e6a6d02a4a55b98078dc874204c5555dbf74a4
Author: Jonathan Brassow <jbrassow>
Date: Fri Jun 20 13:17:44 2008 -0500

dm-cmirror.ko: Fix for bug 450939
- must reset 'in_sync' var upon resume
-- Clean-up of some compile warnings
-- Additional debugging statements

Looks like this is still reproducible after all. Moving back to 'assigned'.

No device or machine failure should be needed to reproduce this bug. Let's also try:
1) Create corelog mirror - using '--nosync' is fine
2) Get I/O going on all machines except the server
3) Deactivate the server (and only the server) node via 'lvchange -aln vg/lv'

This should trigger the bug. The server will migrate, and the outstanding I/O will cause the mirror to appear 'not-in-sync', but the in_sync var will still be set. Once recovery collides with a write, the BUG will trigger. The failure should happen within 10 minutes. If it doesn't, you can try again.

My test is now hitting this on non corelog mirrors.

Senario: Kill primary leg of synced 3 leg mirror(s)
[...]
# well after the failure was dealt with
Enabling device sdg on taft-01
Enabling device sdg on taft-02
Enabling device sdg on taft-03
Enabling device sdg on taft-04
Recreating PV /dev/sdg1
WARNING: Volume group helter_skelter is not consistent
WARNING: Volume Group helter_skelter is not consistent
WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s) on taft-04 before re-converting back to original mirror(s)
Up converting linear(s) back to mirror(s) on taft-04...
taft-04: lvconvert -m 2 helter_skelter/syncd_primary_3legs_1 /dev/sdg1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-150
taft-04: lvconvert -m 2 helter_skelter/syncd_primary_3legs_2 /dev/sdg1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-150
Verifying the up conversion from linear(s) to mirror(s)
Verifying device /dev/sdg1 *IS* in the mirror(s)
Could not connect to remote host
[PANIC]

Should be fixed with the latest updates:

commit 3ab00427e3eaf45e99d5f40fed6f3b459faccb14
Author: Jonathan Brassow <jbrassow>
Date: Fri Jun 27 09:36:00 2008 -0500

dm-cmirror.ko: Fix for bug 450939, and other minor cleanups
- If a write-recovery conflict is detected, halt recovery rather than calling BUG() (the fix for bug 450939)
- Minor code style cleanups

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0803.html
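
For readers following the fix in commit 3ab00427 above (halt recovery on a write-recovery conflict rather than calling BUG()), the idea can be illustrated with a small userspace sketch in C. This is not the dm-cmirror source: `struct log_state`, `mark_region`, `next_recovery_region`, and `NO_REGION` are hypothetical names made up for this example, and the real log server tracks considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "no region is being recovered". */
#define NO_REGION (-1L)

/* Hypothetical, heavily simplified stand-in for the server's per-log state. */
struct log_state {
    long recovering_region; /* region currently under recovery, or NO_REGION */
    bool recovery_halted;   /* set once a write has collided with recovery */
};

/*
 * A client asks to mark `region` for writing.
 * Returns 1 if the write collided with the region being recovered
 * (recovery is halted instead of crashing the node), 0 otherwise.
 */
int mark_region(struct log_state *lc, long region)
{
    assert(lc != NULL);

    if (lc->recovering_region != NO_REGION &&
        lc->recovering_region == region) {
        /* Per the bug report, the older code hit BUG() in this situation. */
        lc->recovery_halted = true;
        lc->recovering_region = NO_REGION;
        return 1;
    }
    return 0;
}

/*
 * Hand out `candidate` as the next region to recover.
 * Returns NO_REGION once recovery has been halted.
 */
long next_recovery_region(struct log_state *lc, long candidate)
{
    assert(lc != NULL);

    if (lc->recovery_halted)
        return NO_REGION;

    lc->recovering_region = candidate;
    return candidate;
}
```

In this sketch, a write to an unrelated region leaves recovery untouched, while a write to the region under recovery (679 in the trace above) stops further recovery work instead of taking the machine down. How and when the real module resumes recovery is not covered here.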