Bug 435515

Summary: RHEL5 cmirror tracker: Error while processing invalid request
Product: Red Hat Enterprise Linux 5
Component: cmirror
Version: 5.0
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, dwysocha, edamato, heinzm, mbroz
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 21:27:12 UTC
Bug Blocks: 444983

Description Corey Marthaler 2008-02-29 20:51:23 UTC
Description of problem:
While testing down conversions after leg failures, clogd got an invalid request
which appears to have led to clogd seg faulting. This caused the machine
(taft-03) to lock up and end up fenced. I'll attempt to reproduce this and
gather a core file from the segfault.

Scenario: Kill primary leg of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_primary_3legs
* sync:      1
* mirrors:   1
* disklog:   1
* failpv:    /dev/sdg1
* legs:      3
* pvs:       /dev/sdg1 /dev/sdd1 /dev/sdh1 /dev/sdc1
************************************************

Creating mirror(s) on taft-03...
taft-03: lvcreate -m 2 -n syncd_primary_3legs_1 -L 800M helter_skelter
/dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdc1:0-150

Waiting until all mirrors become fully syncd...
        0/1 mirror(s) are fully synced: ( 1=0.50% )
        0/1 mirror(s) are fully synced: ( 1=29.50% )
        0/1 mirror(s) are fully synced: ( 1=58.50% )
        0/1 mirror(s) are fully synced: ( 1=87.00% )
        1/1 mirror(s) are fully synced: ( 1=100.00% )

Creating gfs on top of mirror(s) on taft-03...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        14631
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit

        ---- taft-02 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        16344
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit

        ---- taft-03 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        14595
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit

        ---- taft-04 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        13895
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit


<start name="taft-01_1" pid="16285" time="Fri Feb 29 11:10:00 2008" type="cmd" />
<start name="taft-02_1" pid="16287" time="Fri Feb 29 11:10:00 2008" type="cmd" />
<start name="taft-03_1" pid="16289" time="Fri Feb 29 11:10:00 2008" type="cmd" />
<start name="taft-04_1" pid="16291" time="Fri Feb 29 11:10:00 2008" type="cmd" />

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-03
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.095782 seconds, 438 MB/s
Didn't receive heartbeat for 120 seconds
<fail name="taft-03_1" pid="16289" time="Fri Feb 29 11:14:11 2008" type="cmd"
duration="1204305251" ec="127" />
ALL STOP!
Verifying the down conversion of the failed mirror(s)
Could not connect to remote host
down conversion didn't appear to work as a simple lvs fails
run_iteration() method failed at
/home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.


Taft-03:
[...]
Feb 29 11:09:30 taft-03 lvm[6906]: Locking LV
TWLi86MHiEsOLk9W50JnGnP8lg2e0DhTujj6BU6LPpmpI2r1sib9R9GNjYoIZcSw PW C (0x9c)
Feb 29 11:09:30 taft-03 clogd[6692]: Invalid request_type
Feb 29 11:09:30 taft-03 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_PRESUSPEND]: -22
Feb 29 11:09:30 taft-03 kernel: device-mapper: raid1: log presuspend failed
Feb 29 11:09:45 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:3271391 - retrying
Feb 29 11:09:45 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_MARK_REGION:3275707 - retrying
Feb 29 11:10:04 taft-03 clogd[6692]: Invalid request_type
Feb 29 11:10:19 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_POSTSUSPEND:3293624 - retrying
Feb 29 11:10:19 taft-03 clogd[6692]: Forced to refuse checkpoint for nodeid 4 -
log not valid yet
Feb 29 11:10:19 taft-03 kernel: clogd[6692]: segfault at 0000000000000000 rip
0000000000404d38 rsp 00007fffb848e5b0 error 4
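
Error -22 here is -EINVAL. The pairing of clogd's "Invalid request_type" message
with the kernel's "Server error while processing request ... : -22" lines is
consistent with the userspace log server refusing a request whose type field it
did not recognize, for example because the message arrived mangled. A minimal
sketch of that kind of dispatch path follows; the names and values are
hypothetical, not the actual clogd source.

/* Hypothetical sketch, not the actual clogd source: a userspace log server
 * dispatching cluster-log requests by type.  An unrecognized request_type
 * is answered with -EINVAL (-22), which the kernel then reports as
 * "Server error while processing request [...]: -22". */
#include <errno.h>
#include <stdio.h>

enum {                          /* assumed request codes, for illustration only */
        DM_CLOG_GET_SYNC_COUNT = 1,
        DM_CLOG_MARK_REGION,
        DM_CLOG_PRESUSPEND,
        DM_CLOG_POSTSUSPEND,
};

struct clog_request {
        int request_type;
        /* ... payload ... */
};

static int do_request(struct clog_request *rq)
{
        switch (rq->request_type) {
        case DM_CLOG_GET_SYNC_COUNT:
        case DM_CLOG_MARK_REGION:
        case DM_CLOG_PRESUSPEND:
        case DM_CLOG_POSTSUSPEND:
                /* handle the request and queue a reply to the kernel */
                return 0;
        default:
                /* a zeroed or corrupted message lands here */
                fprintf(stderr, "Invalid request_type\n");
                return -EINVAL;
        }
}

int main(void)
{
        /* e.g. a message whose payload (including the type) arrived as zeros */
        struct clog_request bogus = { .request_type = 0 };

        return do_request(&bogus) == -EINVAL ? 0 : 1;
}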


Version-Release number of selected component (if applicable):
2.6.18-71.el5
cmirror-1.1.15-1.el5
kmod-cmirror-0.1.8-1.el5

Comment 1 Corey Marthaler 2008-03-03 18:43:10 UTC
I wasn't able to reproduce the segfault, but I was able to reproduce a bunch of
the "Invalid request_type" messages, which in this case appear to have led to a
possibly corrupted gfs filesystem.

/dev/mapper/helter_skelter-syncd_primary_2legs_3 on /mnt/syncd_primary_2legs_3
type gfs (rw,hostdata=jid=0:id=13107201:first=1)

[root@taft-01 /]# umount /mnt/syncd_primary_2legs_3
/sbin/umount.gfs: there isn't a GFS filesystem on
/dev/mapper/helter_skelter-syncd_primary_2legs_3
/sbin/umount.gfs: there isn't a GFS filesystem on
/dev/mapper/helter_skelter-syncd_primary_2legs_3

[root@taft-01 /]# lvs -a -o +devices
  LV                               VG             Attr   LSize   Origin Snap% 
Move Log                        Copy%  Convert Devices         
  LogVol00                         VolGroup00     -wi-ao  66.19G               
                                              /dev/sda2(0)    
  LogVol01                         VolGroup00     -wi-ao   1.94G               
                                              /dev/sda2(2118) 
  syncd_primary_2legs_1            helter_skelter mwi-a- 800.00M               
    syncd_primary_2legs_1_mlog 100.00        
syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] helter_skelter iwi-ao 800.00M               
                                              /dev/sdh1(0)    
  [syncd_primary_2legs_1_mimage_1] helter_skelter iwi-ao 800.00M               
                                              /dev/sdc1(0)    
  [syncd_primary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M               
                                              /dev/sdg1(0)    
  syncd_primary_2legs_2            helter_skelter mwi-a- 800.00M               
    syncd_primary_2legs_2_mlog 100.00        
syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] helter_skelter iwi-ao 800.00M               
                                              /dev/sdh1(200)  
  [syncd_primary_2legs_2_mimage_1] helter_skelter iwi-ao 800.00M               
                                              /dev/sdc1(200)  
  [syncd_primary_2legs_2_mlog]     helter_skelter lwi-ao   4.00M               
                                              /dev/sdg1(1)    
  syncd_primary_2legs_3            helter_skelter mwi-ao 800.00M               
    syncd_primary_2legs_3_mlog 100.00        
syncd_primary_2legs_3_mimage_0(0),syncd_primary_2legs_3_mimage_1(0)
  [syncd_primary_2legs_3_mimage_0] helter_skelter iwi-ao 800.00M               
                                              /dev/sdh1(400)  
  [syncd_primary_2legs_3_mimage_1] helter_skelter iwi-ao 800.00M               
                                              /dev/sdc1(400)  
  [syncd_primary_2legs_3_mlog]     helter_skelter lwi-ao   4.00M               
                                              /dev/sdg1(2)    



Mar  3 11:03:39 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:25421158 - retrying
Mar  3 11:03:39 taft-03 clogd[7129]: Invalid request_type
Mar  3 11:03:39 taft-03 lvm[7442]: Unable to parse mirror status string.
Mar  3 11:03:39 taft-03 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_STATUS_INFO]: -22


Mar  3 11:02:45 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:39255142 - retrying
Mar  3 11:02:45 taft-04 clogd[7129]: Invalid request_type

Comment 2 Jonathan Earl Brassow 2008-05-15 18:30:22 UTC
commit a21d1c03600e011cf1ee132100ac823bf3478877
Author: Jonathan Brassow <jbrassow>
Date:   Thu May 15 13:02:19 2008 -0500

    clogd + dm-log-clustered: Improve communication error handling (K <-> U)

    Connector (netlink) is an unreliable protocol.  From time to time, I get
    a connector header in userspace that has all the right fields (seq #, size
    of data, etc), but the data is all zeros.  When this happens, I use the seq
    number in the header to request the kernel to resend the request.  This
    should avoid those nasty 'Invalid request' messages... which are then
    followed by 'Request timed out' (because the invalid request was ignored).
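
The recovery described in the commit message can be summarized in a few lines.
The sketch below is illustrative only, with hypothetical structure and function
names rather than the actual clogd/dm-log-clustered code: when the connector
header looks plausible but the payload arrived zeroed, the daemon asks the
kernel to retransmit that sequence number instead of dispatching a bogus
request.

/* Illustrative only; hypothetical names, not the actual clogd code. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct cn_hdr {                 /* simplified stand-in for the connector header */
        uint32_t seq;           /* sequence number assigned by the kernel */
        uint16_t len;           /* length of the data that should follow */
};

static int request_resend(uint32_t seq)
{
        /* In the real daemon this would ask the kernel to resend the request;
         * here it only reports what would happen. */
        printf("resend requested for seq %u\n", seq);
        return 0;
}

static int payload_is_zeroed(const unsigned char *data, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                if (data[i])
                        return 0;
        return 1;
}

static int handle_message(const struct cn_hdr *hdr, const unsigned char *data)
{
        /* Header looks sane but the payload was lost in transit: recover
         * instead of hitting the "Invalid request_type" path. */
        if (hdr->len && payload_is_zeroed(data, hdr->len))
                return request_resend(hdr->seq);

        /* ... otherwise fall through to normal request dispatch ... */
        return 0;
}

int main(void)
{
        unsigned char zeroed[16] = { 0 };
        struct cn_hdr hdr = { .seq = 42, .len = sizeof(zeroed) };

        return handle_message(&hdr, zeroed);
}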


Comment 3 Corey Marthaler 2008-08-25 16:22:41 UTC
This issue has not been seen with the latest code. Marking verified (will reopen if seen again).

Comment 5 errata-xmlrpc 2009-01-20 21:27:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html