Bug 624148

Summary: 'dm_task_run failed' when attempting to deactivate cmirror after successful device failure iteration
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Alasdair Kergon <agk>
lvm2 sub component: Clustered Mirror / cmirrord (RHEL6) QA Contact: Corey Marthaler <cmarthal>
Status: CLOSED WONTFIX Docs Contact:
Severity: low    
Priority: low CC: agk, cmarthal, coughlan, dwysocha, heinzm, jbrassow, joe.thornber, msnitzer, prajnoha, prockai, zkabelac
Version: 6.0   
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-02 15:57:43 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 844493    
Bug Blocks: 756082, 960054, 1268411    
Attachments:
Description Flags
Logs from 3 machines that were running tests when problem occurred none

Description Corey Marthaler 2010-08-13 20:11:23 UTC
Description of problem:
From helter_skelter output:

./helter_skelter -l /home/msp/cmarthal/work/rhel6/sts-root -r /usr/tests/sts-rhel6.0/ -R ../../var/share/resource_files/taft.xml -e kill_primary_synced_2_legs -i 3

[...] # all the failure stuff worked

Stopping the io load (collie/xdoio) on mirror(s)
Unmounting gfs and removing mnt point on taft-01...
Unmounting gfs and removing mnt point on taft-02...
Unmounting gfs and removing mnt point on taft-03...
Unmounting gfs and removing mnt point on taft-04...
Deactivating and removing mirror(s)
  Error locking on node taft-01: Unable to deactivate open helter_skelter-syncd_primary_2legs_1_mlog (253:3)
couldn't deactivate mirror syncd_primary_2legs_1

From log:
Aug 13 20:00:59 taft-01 qarshd[9870]: Running cmdline: lvchange -an /dev/helter_skelter/syncd_primary_2legs_1
Aug 13 20:00:59 taft-01 lvm[2371]: dm_task_run failed, errno = 22, Invalid argument
Aug 13 20:00:59 taft-01 lvm[2371]: No longer monitoring mirror device helter_skelter-syncd_primary_2legs_1 for events.


Version-Release number of selected component (if applicable):
2.6.32-59.1.el6.x86_64

lvm2-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010

Comment 1 Corey Marthaler 2010-08-16 16:35:13 UTC
This is reproducible.

Unmounting gfs and removing mnt point on taft-01...
Unmounting gfs and removing mnt point on taft-02...
Unmounting gfs and removing mnt point on taft-03...
Unmounting gfs and removing mnt point on taft-04...
Deactivating and removing mirror(s)
  Error locking on node taft-01: Unable to deactivate open helter_skelter-nonsyncd_log_4legs_1_mlog (253:3)
couldn't deactivate mirror nonsyncd_log_4legs_1

Aug 14 00:30:53 taft-01 lvm[2311]: dm_task_run failed, errno = 22, Invalid argument
Aug 14 00:30:53 taft-01 lvm[2311]: No longer monitoring mirror device helter_skelter-nonsyncd_log_4legs_1 for events.


Version-Release number of selected component (if applicable):
2.6.32-59.1.el6.x86_64

lvm2-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010

Comment 3 Peter Rajnoha 2010-09-09 11:06:14 UTC
(In reply to comment #1)
> This is reproducible.
> Deactivating and removing mirror(s)
>   Error locking on node taft-01: Unable to deactivate open
> helter_skelter-nonsyncd_log_4legs_1_mlog (253:3)

Well, at first, does this happen with udevd stopped?

Comment 5 Corey Marthaler 2011-02-10 21:28:04 UTC
I just repo'ed this with the latest 6.1 lvm rpms. 

Feb 10 15:22:17 taft-03 lvm[1235]: dm_task_run failed, errno = 22, Invalid argument
Feb 10 15:22:17 taft-03 lvm[1235]: No longer monitoring mirror device helter_skelter-syncd_primary_2legs_1 for events.
Feb 10 15:22:17 taft-03 lvm[1235]: No longer monitoring mirror device helter_skelter-syncd_primary_2legs_2 for events.

2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011

Comment 6 Corey Marthaler 2011-02-10 21:34:35 UTC
I'm not sure how to attempt this with udev stopped since nothing works w/ out it. I can't even get a mirror created with that process killed.

Comment 11 Jonathan Earl Brassow 2011-06-03 17:14:22 UTC
Ok, we'll try to reproduce this again and I'll see if prajnoha can help me figure it out.

Also, is this bug limited to 'after successful device failure iteration'?

Comment 12 Peter Rajnoha 2011-06-06 09:44:40 UTC
Corey, just to make sure - udisks is *not* installed, right?

Comment 13 Corey Marthaler 2011-06-06 19:02:02 UTC
Peter, if by "udisks" you mean udev, then yes. See comment #6, udev *is* involved. 

If you mean something else by "udisks", then I don't think udisks are involved.

[root@hayes-01 ~]# rpm -q udisks
package udisks is not installed

Comment 14 Peter Rajnoha 2011-06-07 08:26:49 UTC
Yes, I meant udisks package - this one contains 80-udisks.rules with the OPTIONS+="watch" set for DM devices. So that minimizes a chance that this is a udev interaction problem now.

Comment 15 Corey Marthaler 2011-09-16 16:40:27 UTC
I'm still seeing the dm_task_failed messages in the syslog during device failure testing. However, I didn't see the deactivation failure due to the locking errors.

[root@taft-01 ~]# grep dm_task_run /var/log/messages | grep failed
Sep 15 16:42:38 taft-01 lvm[3083]: dm_task_run failed, errno = 22, Invalid argument
Sep 15 16:45:49 taft-01 lvm[3083]: dm_task_run failed, errno = 22, Invalid argument
Sep 15 16:49:46 taft-01 lvm[3083]: dm_task_run failed, errno = 22, Invalid argument
Sep 15 16:57:09 taft-01 lvm[3083]: dm_task_run failed, errno = 22, Invalid argument
Sep 15 17:03:51 taft-01 lvm[3083]: dm_task_run failed, errno = 22, Invalid argument
Sep 15 17:07:16 taft-01 lvm[3083]: dm_task_run failed, errno = 22, Invalid argument

2.6.32-195.el6.x86_64

lvm2-2.02.87-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-libs-2.02.87-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-cluster-2.02.87-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
udev-147-2.38.el6    BUILT: Fri Sep  9 16:25:50 CDT 2011
device-mapper-1.02.66-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-libs-1.02.66-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-1.02.66-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-libs-1.02.66-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
cmirror-2.02.87-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011

Comment 25 Jonathan Earl Brassow 2013-05-15 20:56:33 UTC
Corey, can you test to see if this bug is still present?

Bug 844493 has been closed - asserting that there are no more 'dm_task_run failed' errors.  I'm dubious.  Since those errors are what were causing this bug to trigger, this bug is probably a good place to ensure it is fixed.

Comment 26 Jonathan Earl Brassow 2013-05-15 20:59:38 UTC
moving to 6.6.  I will not prioritize this bug until it is reproduced and found that bug 844493 is not the cause.  If bug 844493 is still occurring, then it should be reopened and /that/ bug should be prioritized.

Comment 27 Corey Marthaler 2013-05-17 19:30:02 UTC
These still exist while doing failure testing.

TAFT-02:
May 17 13:41:28 taft-02 lvm[14338]: Mirror status: 1 of 2 images failed.
May 17 13:41:28 taft-02 lvm[14338]: Repair of mirrored device helter_skelter-syncd_primary_2legs_3 finished successfully.
May 17 13:41:28 taft-02 lvm[14338]: Primary mirror device 253:4 has failed (D).
May 17 13:41:28 taft-02 lvm[14338]: Device failure in helter_skelter-syncd_primary_2legs_1.
May 17 13:41:30 taft-02 lvm[14338]: device-mapper: waitevent ioctl on  failed: Interrupted system call
May 17 13:41:30 taft-02 lvm[14338]: dm_task_run failed, errno = 22, Invalid argument


TAFT-04:
May 17 13:41:35 taft-04 lvm[31955]: Repair of mirrored device helter_skelter-syncd_primary_2legs_3 finished successfully.
May 17 13:41:35 taft-04 lvm[31955]: Primary mirror device 253:4 read failed.
May 17 13:41:35 taft-04 lvm[31955]: Primary mirror device 253:8 read failed.
May 17 13:41:35 taft-04 lvm[31955]: Primary mirror device 253:4 has failed (D).
May 17 13:41:35 taft-04 lvm[31955]: Device failure in helter_skelter-syncd_primary_2legs_1.
May 17 13:41:37 taft-04 lvm[31955]: device-mapper: waitevent ioctl on  failed: Interrupted system call
May 17 13:41:37 taft-04 lvm[31955]: dm_task_run failed, errno = 22, Invalid argument
May 17 13:41:37 taft-04 lvm[31955]: No longer monitoring mirror device helter_skelter-syncd_primary_2legs_3 for events.
May 17 13:41:37 taft-04 lvm[31955]: No longer monitoring mirror device helter_skelter-syncd_primary_2legs_2 for events.
May 17 13:41:40 taft-04 lvm[31955]: Couldn't find device with uuid eeDBek-f0JX-Rb4R-0RgM-9CeD-PGbM-TMvHzN.
May 17 13:41:40 taft-04 lvm[31955]: Repair of mirrored device helter_skelter-syncd_primary_2legs_1 finished successfully.
May 17 13:41:42 taft-04 lvm[31955]: No longer monitoring mirror device helter_skelter-syncd_primary_2legs_1 for events.



2.6.32-354.el6.x86_64
lvm2-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
lvm2-libs-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
lvm2-cluster-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
udev-147-2.46.el6    BUILT: Tue Jan 15 06:41:59 CST 2013
device-mapper-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
device-mapper-libs-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
device-mapper-event-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
device-mapper-event-libs-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
cmirror-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013

Comment 29 Jonathan Earl Brassow 2014-08-26 15:56:49 UTC
Ok, so the messages are still there, but the original problem is not (mirror unable to deactivate)?

I will have to push this out to 6.7 for a possible fix.

I am also seeing more of these "device-mapper: waitevent ioctl on  failed: Interrupted system call" messages in bug 1127231, which is another failure testing scenario involving RAID.  I'm not yet sure why those are happening, but they seem to accompany these issues - not cause them.

Comment 34 Jonathan Earl Brassow 2015-02-25 23:53:48 UTC
I still see the messages:
Feb 25 18:39:32 bp-01 lvm[9299]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Feb 25 18:39:32 bp-01 lvm[9299]: dm_task_run failed, errno = 22, Invalid argument

However, they have no ill effect.  I am unclear what is causing them, but I'm setting priority and severity to "low".  Likely, the cause won't be discovered and fixed in 6.7.

Comment 35 Jonathan Earl Brassow 2015-03-02 15:20:15 UTC
need verbose logs from:
- dmeventd
- and other LVM commands

Comment 36 Jonathan Earl Brassow 2015-03-17 22:06:18 UTC
Created attachment 1002962 [details]
Logs from 3 machines that were running tests when problem occurred

This tarball contains the logs from three machines that were performing the tests at the time the message was printed.  The error only seemed to be produced from bp-02, but the other logs are provided for completeness.

Comment 37 Alasdair Kergon 2015-04-29 16:15:04 UTC
Mar 17 12:28:06 bp-02 lvm[33921]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Mar 17 12:28:06 bp-02 lvm[33921]: dm_task_run failed, errno = 22, Invalid argument


Could we try to reproduce this with debug logging enabled in dmeventd to get a better idea of the sequence of events that might be occurring?  Add -ddd (making sure syslog captures it)

Comment 38 Alasdair Kergon 2015-05-18 19:08:39 UTC
(We're wondering whether the original errno gets overwritten before the caller reads it, so we'll change the code to store and return it explicitly.)

Comment 40 Alasdair Kergon 2015-10-16 18:29:01 UTC
The errno code has been improved (upstream) since the last reports of this here, so once we have 6.8 builds, we need to see whether the problem is still there or not, and if it is, obtain the updated diagnostics that the new code provides.

Comment 44 Jonathan Earl Brassow 2016-03-02 15:57:43 UTC
This bug no longer exists in its original harmful form.  We had seen harmless error messages that we were looking to address with this bug, but we haven't seen them lately either.  closing this bug...