Description of problem:
While testing down conversions after leg failures, clogd got an invalid request which appears to have led to clogd seg faulting. This caused the machine (taft-03) to lock up and end up fenced. I'll attempt to reproduce this and gather a core file from the segfault.

Scenario: Kill primary leg of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    syncd_primary_3legs
* sync:    1
* mirrors: 1
* disklog: 1
* failpv:  /dev/sdg1
* legs:    3
* pvs:     /dev/sdg1 /dev/sdd1 /dev/sdh1 /dev/sdc1
************************************************

Creating mirror(s) on taft-03...
taft-03: lvcreate -m 2 -n syncd_primary_3legs_1 -L 800M helter_skelter /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdc1:0-150

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 1=0.50% )
   0/1 mirror(s) are fully synced: ( 1=29.50% )
   0/1 mirror(s) are fully synced: ( 1=58.50% )
   0/1 mirror(s) are fully synced: ( 1=87.00% )
   1/1 mirror(s) are fully synced: ( 1=100.00% )

Creating gfs on top of mirror(s) on taft-03...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        14631
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit
---- taft-02 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        16344
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit
---- taft-03 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        14595
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit
---- taft-04 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        13895
Verify XIOR Stream: /tmp/checkit_syncd_primary_3legs_1
Working dir:        /mnt/syncd_primary_3legs_1/checkit

<start name="taft-01_1" pid="16285" time="Fri Feb 29 11:10:00 2008" type="cmd" />
<start name="taft-02_1" pid="16287" time="Fri Feb 29 11:10:00 2008" type="cmd" />
<start name="taft-03_1" pid="16289" time="Fri Feb 29 11:10:00 2008" type="cmd" />
<start name="taft-04_1" pid="16291" time="Fri Feb 29 11:10:00 2008" type="cmd" />

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-03
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.095782 seconds, 438 MB/s

Didn't receive heartbeat for 120 seconds
<fail name="taft-03_1" pid="16289" time="Fri Feb 29 11:14:11 2008" type="cmd" duration="1204305251" ec="127" />
ALL STOP!

Verifying the down conversion of the failed mirror(s)
Could not connect to remote host
down conversion didn't appear to work as a simple lvs fails
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.

Taft-03:
[...]
Feb 29 11:09:30 taft-03 lvm[6906]: Locking LV TWLi86MHiEsOLk9W50JnGnP8lg2e0DhTujj6BU6LPpmpI2r1sib9R9GNjYoIZcSw PW C (0x9c)
Feb 29 11:09:30 taft-03 clogd[6692]: Invalid request_type
Feb 29 11:09:30 taft-03 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_PRESUSPEND]: -22
Feb 29 11:09:30 taft-03 kernel: device-mapper: raid1: log presuspend failed
Feb 29 11:09:45 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:3271391 - retrying
Feb 29 11:09:45 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_MARK_REGION:3275707 - retrying
Feb 29 11:10:04 taft-03 clogd[6692]: Invalid request_type
Feb 29 11:10:19 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_POSTSUSPEND:3293624 - retrying
Feb 29 11:10:19 taft-03 clogd[6692]: Forced to refuse checkpoint for nodeid 4 - log not valid yet
Feb 29 11:10:19 taft-03 kernel: clogd[6692]: segfault at 0000000000000000 rip 0000000000404d38 rsp 00007fffb848e5b0 error 4

Version-Release number of selected component (if applicable):
2.6.18-71.el5
cmirror-1.1.15-1.el5
kmod-cmirror-0.1.8-1.el5
I wasn't able to reproduce the segfault, but I was able to reproduce a bunch of the "Invalid request_type" messages, which in this case appears to have led to a possibly corrupted gfs filesystem.

/dev/mapper/helter_skelter-syncd_primary_2legs_3 on /mnt/syncd_primary_2legs_3 type gfs (rw,hostdata=jid=0:id=13107201:first=1)

[root@taft-01 /]# umount /mnt/syncd_primary_2legs_3
/sbin/umount.gfs: there isn't a GFS filesystem on /dev/mapper/helter_skelter-syncd_primary_2legs_3
/sbin/umount.gfs: there isn't a GFS filesystem on /dev/mapper/helter_skelter-syncd_primary_2legs_3

[root@taft-01 /]# lvs -a -o +devices
  LV                                VG             Attr   LSize   Origin Snap% Move Log                        Copy%  Convert Devices
  LogVol00                          VolGroup00     -wi-ao  66.19G                                                             /dev/sda2(0)
  LogVol01                          VolGroup00     -wi-ao   1.94G                                                             /dev/sda2(2118)
  syncd_primary_2legs_1             helter_skelter mwi-a- 800.00M              syncd_primary_2legs_1_mlog 100.00             syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0]  helter_skelter iwi-ao 800.00M                                                            /dev/sdh1(0)
  [syncd_primary_2legs_1_mimage_1]  helter_skelter iwi-ao 800.00M                                                            /dev/sdc1(0)
  [syncd_primary_2legs_1_mlog]      helter_skelter lwi-ao   4.00M                                                            /dev/sdg1(0)
  syncd_primary_2legs_2             helter_skelter mwi-a- 800.00M              syncd_primary_2legs_2_mlog 100.00             syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0]  helter_skelter iwi-ao 800.00M                                                            /dev/sdh1(200)
  [syncd_primary_2legs_2_mimage_1]  helter_skelter iwi-ao 800.00M                                                            /dev/sdc1(200)
  [syncd_primary_2legs_2_mlog]      helter_skelter lwi-ao   4.00M                                                            /dev/sdg1(1)
  syncd_primary_2legs_3             helter_skelter mwi-ao 800.00M              syncd_primary_2legs_3_mlog 100.00             syncd_primary_2legs_3_mimage_0(0),syncd_primary_2legs_3_mimage_1(0)
  [syncd_primary_2legs_3_mimage_0]  helter_skelter iwi-ao 800.00M                                                            /dev/sdh1(400)
  [syncd_primary_2legs_3_mimage_1]  helter_skelter iwi-ao 800.00M                                                            /dev/sdc1(400)
  [syncd_primary_2legs_3_mlog]      helter_skelter lwi-ao   4.00M                                                            /dev/sdg1(2)

Mar  3 11:03:39 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:25421158 - retrying
Mar  3 11:03:39 taft-03 clogd[7129]: Invalid request_type
Mar  3 11:03:39 taft-03 lvm[7442]: Unable to parse mirror status string.
Mar  3 11:03:39 taft-03 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_STATUS_INFO]: -22

Mar  3 11:02:45 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:39255142 - retrying
Mar  3 11:02:45 taft-04 clogd[7129]: Invalid request_type
commit a21d1c03600e011cf1ee132100ac823bf3478877
Author: Jonathan Brassow <jbrassow>
Date:   Thu May 15 13:02:19 2008 -0500

    clogd + dm-log-clustered: Improve communication error handling (K <-> U)

    Connector (netlink) is an unreliable protocol.  From time to time, I get
    a connector header in userspace that has all the right fields (seq #,
    size of data, etc), but the data is all zeros.  When this happens, I use
    the seq number in the header to request the kernel to resend the request.
    This should avoid those nasty 'Invalid request' messages... which are
    then followed by 'Request timed out' (because the invalid request was
    ignored).
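For illustration, a minimal userspace sketch of the idea described in the commit message: if a connector message arrives with a plausible header but an all-zero payload, ask the kernel to retransmit using the header's sequence number instead of dispatching the garbage request (which is what produced the "Invalid request_type" errors). The struct cn_msg layout comes from <linux/connector.h>; request_resend() and dispatch_request() are hypothetical stand-ins, not the actual clogd functions.

/*
 * Illustrative sketch only, not the real clogd code.
 */
#include <stdint.h>
#include <linux/connector.h>

extern int request_resend(uint32_t seq);            /* hypothetical helper */
extern int dispatch_request(struct cn_msg *msg);    /* hypothetical helper */

static int payload_is_zeroed(const struct cn_msg *msg)
{
    uint16_t i;

    if (msg->len == 0)
        return 0;

    for (i = 0; i < msg->len; i++)
        if (msg->data[i])
            return 0;

    return 1;
}

int handle_cn_msg(struct cn_msg *msg)
{
    if (payload_is_zeroed(msg)) {
        /*
         * Header fields (seq, len) look sane but the data is all
         * zeros: connector garbled the payload.  Rather than process
         * it, ask the kernel to resend by sequence number.
         */
        return request_resend(msg->seq);
    }

    /* Payload looks intact; hand it to normal request processing. */
    return dispatch_request(msg);
}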
This issue has not been seen with the latest code. Marking verified (will reopen if seen again).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html