Bug 662433

Summary:

iscsid preventing machine shutdown or reboot

Product:

Red Hat Enterprise Linux 6

Reporter:

Vitaly Karasik <linux.il>

Component:

device-mapper-multipath

Assignee:

Ben Marzinski <bmarzins>

Status:

CLOSED ERRATA

QA Contact:

Gris Ge <fge>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

6.0

CC:

agajania, agk, anthony.cheung, anthony.cheung, bdonahue, bmarzins, coughlan, czhang, devin.bougie, dwysocha, ekuric, fge, harald, heinzm, jwest, marting, mbroz, mchristi, notting, pknirsch, prajnoha, prockai, robert.w.love, steven.ellis, xdl-redhat-bugzilla, zkabelac

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

device-mapper-multipath-0.4.9-48.el6

Doc Type:

Bug Fix

Doc Text:

Cause: Multipathd was not disabling queue_if_no_path on multipath devices when it stopped, by default. Consequence: Once multipathd stopped on shutdown, if all paths to a device were lost, the IO to that device would queue, causing shutdown to hang Fix: multipath now sets the queue_without_daemon option to "no" by default, causing all multipath devices to stop queueing when multipathd is stopped Result: multipath devices no longer queue IO during shutdown, and the nodes no longer hang

Story Points:

---

Clone Of:

Environment:

Last Closed:

2012-06-20 14:36:09 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

658636, 676970, 756082, 758850

Attachments:

Description	Flags
screenshot of shutdown problem	none
System paused on this iscsi message for a couple of minutes	none
Device mapper kernel panic during shutdown	none
Screen shot of failure to mount LVM partition from iSCSI	none
Output of /var/log/messages showing a valid boot process	none
Output of /var/log/messages where iscsi hasn't started before network devices are mounted.	none
kernel panic on shutdown with upgraded iscsi package.	none
Multipath failing over	none
kernel panic on shutdown with iscsi-initiator-utils-6.2.0.872-17.el6.x86_64.rpm package.	none

Description Vitaly Karasik 2010-12-12 16:18:17 UTC

Created attachment 468237 [details]
screenshot of shutdown problem

Description of problem:
Problem seems to be similar to Bug 583218 under RHEL5

 
How reproducible:
always

Steps to Reproduce:
1. activate iscsi volume on RHEL6
2. run reboot  command
 
  
Actual results:
system stuck on shutdown - the last line it shows "shutting down loppback interface"

Expected results:
  reboot  

Additional info:

Comment 2 Mike Christie 2010-12-13 17:25:42 UTC

I think this is a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=634021


To login to the target did you do

service iscsi start

or did you manually login to the target:

iscsiadm -m node ... -l/-L

Comment 3 Vitaly Karasik 2010-12-13 19:22:06 UTC

I'm not sure I understand your question well. If you're asking how I activate iSCSI - the answer is using "service iscsi start", automatically:

# chkconfig --list|grep scs
iscsi           0:off   1:off   2:off   3:on    4:on    5:on    6:off
iscsid          0:off   1:off   2:off   3:on    4:on    5:on    6:off

Comment 4 Mike Christie 2010-12-14 03:23:53 UTC

You can start the iscsi service by doing "service iscsi start" and that will log into target portals found in the iscsi db. But you can also log into targets manually by just running iscsiadm. So if you have discovered some portals that are not yet logged into you can do

iscsiadm -m node -T target -p ip -l
or
iscsiadm -m node -l

and this will log into the target portals. There is bug with this though. If you run iscsiadm by hand to login then when you reboot or run "service iscsi stop" then the sessions might not get logged out of. And iIf you are rebooting that can cause a hang.

So just to confirm it is iscsi that is causing the problem, before you do the reboot, do

iscsiadm -m node -u

then reboot and report if it hangs.


Also you are not doing iscsi root, right?

Comment 5 Vitaly Karasik 2010-12-14 07:10:04 UTC

# iscsiadm -m node -u
Logging out of session [sid: 1, target: iqn.2001-05.com.equallogic:0-8a0906-0b9469e03-def000ece0d4cfd5-vringo-1, portal: 172.30.2.2,3260]
Logging out of session [sid: 2, target: iqn.2001-05.com.equallogic:0-8a0906-0b9469e03-def000ece0d4cfd5-vringo-1, portal: 172.30.2.2,3260]
Logout of [sid: 1, target: iqn.2001-05.com.equallogic:0-8a0906-0b9469e03-def000ece0d4cfd5-vringo-1, portal: 172.30.2.2,3260] successful.
Logout of [sid: 2, target: iqn.2001-05.com.equallogic:0-8a0906-0b9469e03-def000ece0d4cfd5-vringo-1, portal: 172.30.2.2,3260] successful.


But system hang anyway.

Comment 6 Mike Christie 2010-12-14 20:13:59 UTC

Initscript developers,

It does not look like a iscsi issue. In comment #5 we manually shutdown iscsi before rebooting so during shutdown there is not iscsi activity or sessions running.

Comment 7 Bill Nottingham 2010-12-14 20:23:30 UTC

If you enable sysrq, what does 'sysrq-t' or 'sysrq-p' show? (You may need a serial console to catch it all.)

Comment 8 Vitaly Karasik 2010-12-18 20:30:45 UTC

Unfortunately, I wasn't able to get any dump using 'sysrq-t' or 'sysrq-p' [even I enabled sysrq]. Probably it because some DRAC (Dell's remote console) issue.
I'll ask my hosting company for help, may be they will be able to provide more details.

thank you,
Vitaly

Comment 9 Steven Ellis 2010-12-23 02:42:14 UTC

I'm also seeing this issue on my RHEL6 64bit test bed. During shutdown the system locks on

Not stopping iscsid: iscsi sessions still active

If I manually run

 iscsiadm -m node -u

Then the following exits correctly
 service iscsid stop

I still have an issue on shutdown where there is a huge delay after displaying
 stopping cgconfig service [OK]

If I then hit CTRL-ALT-Delete I get the following but the machine won't power off
 Could not log bootup: Address already in use

This has been tracked in Fedora under
 - https://bugzilla.redhat.com/show_bug.cgi?id=654762
 - https://bugzilla.redhat.com/show_bug.cgi?id=642347

Comment 10 Mike Christie 2010-12-27 04:02:23 UTC

(In reply to comment #9)
> I'm also seeing this issue on my RHEL6 64bit test bed. During shutdown the
> system locks on
> 
> Not stopping iscsid: iscsi sessions still active
> 
> If I manually run
> 
>  iscsiadm -m node -u
> 
> Then the following exits correctly
>  service iscsid stop

Steven,

Your bug looks like
https://bugzilla.redhat.com/show_bug.cgi?id=634021
or at least will be fixed with the change in there.

Comment 11 Steven Ellis 2011-01-04 23:43:43 UTC

Hi Mike

Afraid I don't have access to https://bugzilla.redhat.com/show_bug.cgi?id=634021 so I can't fully comment.

On shutdown the last displayed message is
 Not stopping iscsid: iscsi sessions still active

I've disable quiet boot and rhgb.

Any suggestions on how getting any additional output?

Comment 12 Harald Hoyer 2011-01-14 14:20:05 UTC

(In reply to comment #11)
> Hi Mike
> 
> Afraid I don't have access to
> https://bugzilla.redhat.com/show_bug.cgi?id=634021 so I can't fully comment.
> 
> On shutdown the last displayed message is
>  Not stopping iscsid: iscsi sessions still active
> 
> I've disable quiet boot and rhgb.
> 
> Any suggestions on how getting any additional output?

you could add 
set +x
after the first line in /etc/init.d/halt /etc/rc /etc/init.d/iscsi and /etc/init.d/iscsid

of course this will give a lot of output, but you will see the last one

Comment 13 Steven Ellis 2011-01-18 01:34:42 UTC

Just tried a reboot with the above suggestion for adjusting the halt/iscsi/iscid scripts but didn't get much if any additional output.

I did get a big pause on iscsi during shutdown and then I got a kernel panic. I'll attach a pair of screen shots showing the output.

Comment 14 Steven Ellis 2011-01-18 01:36:02 UTC

Created attachment 473954 [details]
System paused on this iscsi message for a couple of minutes

Comment 15 Steven Ellis 2011-01-18 01:37:39 UTC

Created attachment 473955 [details]
Device mapper kernel panic during shutdown

Once the iscsi error times out shutdown continued until this kernel panic appeared

Comment 16 Mike Christie 2011-01-18 01:54:59 UTC

You are hitting the bug in 634021.

I will send you a patch to test in a email so this bz is not cluttered and I will figure out what I am supposed to do so you can see the other bz.

Comment 17 Mike Christie 2011-02-01 08:49:27 UTC

Steven,

Could you try the rpm here
http://people.redhat.com/mchristi/iscsi/rhel6.1/iscsi-initiator-utils/

Comment 19 Steven Ellis 2011-02-18 01:28:22 UTC

Hi Mike.

Tried the updated version that you suggested and I'm afraid it actually made things worse.

Currently installed package

    * iscsi-initiator-utils-6.2.0.872-10.el6.x86_64 

Proposed newer package

    * iscsi-initiator-utils-6.2.0.872-15.el6.x86_64 

I upgraded to iscsi-initiator-utils-6.2.0.872-15.el6.x86_64 and rebooted the machine. It had the same shutdown issues as before and the same kernel panic messages. I forced a reboot and then brought the machine up in single user mode to confirm no local storage issues.

I then tried a clean reboot with the updated version of iscsi-initiator-utils and the system hung trying to mount an LVM filesystem sitting on iSCSI storage (see iscsi-lvm-error.png).

Logging in showed that the iscsi daemon hadn't been started before the OS tried to mount filesystems tagged with _netdev in /etc/fstab, and thus causing the error. If I manually started iscsi at this time the system found the storage.

I've attached two text files boot.broken and boot.works showing the section of /var/log/messages around the boot for the old and new versions of iscsi-initiator-utils.

I then tried a normal shutdown of the system with your new version of iscsi-initiator-utils, and I still get the same errors and kernel panic on shutdown.

For the moment I've reverted the version back to iscsi-initiator-utils-6.2.0.872-10.el6.x86_64 and I've confirmed that I now have a clean boot process.

Comment 20 Steven Ellis 2011-02-18 01:29:50 UTC

Created attachment 479432 [details]
Screen shot of failure to mount LVM partition from iSCSI

Comment 21 Steven Ellis 2011-02-18 01:30:21 UTC

Created attachment 479433 [details]
Output of /var/log/messages showing a valid boot process

Comment 22 Steven Ellis 2011-02-18 01:30:58 UTC

Created attachment 479434 [details]
Output of /var/log/messages where iscsi hasn't started before network devices are mounted.

Comment 23 Steven Ellis 2011-02-18 01:31:41 UTC

Created attachment 479435 [details]
kernel panic on shutdown with upgraded iscsi package.

Comment 24 Mike Christie 2011-02-19 03:38:19 UTC

(In reply to comment #23)
> Created attachment 479435 [details]
> kernel panic on shutdown with upgraded iscsi package.

Do you still see this the hang in "Not stopping iscsid" like here:
https://bugzilla.redhat.com/show_bug.cgi?id=662433#c14

Comment 25 Mike Christie 2011-02-20 04:46:47 UTC

(In reply to comment #19)
> 
> Logging in showed that the iscsi daemon hadn't been started before the OS tried
> to mount filesystems tagged with _netdev in /etc/fstab, and thus causing the
> error

Sorry about that. That looks like a dumb mistake on my part. I fixed the startup bug now in iscsi-initiator-utils-6.2.0.872-17.el6. It can be downloaded here:
http://people.redhat.com/mchristi/iscsi/rhel6.1/iscsi-initiator-utils/

Let me know if we still see the "Not stopping iscsid" message or if we hang in something like the network interface shutdown like here https://bugzilla.redhat.com/attachment.cgi?id=468237.

Comment 26 Steven Ellis 2011-02-21 00:02:00 UTC

Still got some issues with your newer package.

The boot issues have now been resolved but on shutdown the system still throws kernel errors

The shutdown gets as far as the multipath failing messages (see iscis-multipath-failing-over.png) and then pauses for around 5 minutes.

Then I get the usual kernel panic error (iscsi-new-kernel-panic.png).

I'll see if I can arrange a Serial console into the blade where I'm testing this as at the moment I can't capture the full shutdown output as I have limited physical access to the hardware.

Comment 27 Steven Ellis 2011-02-21 00:02:34 UTC

Created attachment 479800 [details]
Multipath failing over

Comment 28 Steven Ellis 2011-02-21 00:03:23 UTC

Created attachment 479801 [details]
kernel panic on shutdown with iscsi-initiator-utils-6.2.0.872-17.el6.x86_64.rpm package.

Comment 29 Steven Ellis 2011-02-21 05:25:28 UTC

I've now managed to reproduce the issue from inside a KVM guest so I've got full console output.

First the output before I've patched the system - here is what mutipath can see

[root@rh6test02 ~]# multipath -ll
mpathc (360a98000503363775a5a43453338594d) dm-3 NETAPP,LUN
size=150G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 3:0:0:0 sdb 8:16 active ready running
  `- 2:0:0:0 sdc 8:32 active ready running
mpathb (360a98000503363775a5a434c6e564a6b) dm-2 NETAPP,LUN
size=180G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 3:0:0:2 sdg 8:96 active ready running
  `- 2:0:0:2 sdf 8:80 active ready running
mpatha (360a98000503363775a5a434533435976) dm-1 NETAPP,LUN
size=150G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 3:0:0:1 sde 8:64 active ready running
  `- 2:0:0:1 sdd 8:48 active ready running


If we perform a shutdown we get

Stopping Red Hat Network Daemon: [  OK  ]
Stopping atd: [  OK  ]
Stopping abrt daemon: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down postfix: [  OK  ]
Stopping crond: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping system message bus: [  OK  ]
Stopping multipathd daemon: [  OK  ]
Stopping auditd: type=1305 audit(1298261350.484:14067): audit_pid=0 old=1537 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[  OK  ]
Not stopping iscsid: iscsi sessions still active[WARNING]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
IPv6 over IPv4 tunneling driver
sit0: Disabled Privacy Extensions
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295875817, last ping 4295880817, now 4295885817
 connection2:0: detected conn error (1011)
 connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295876635, last ping 4295881635, now 4295886635
 connection1:0: detected conn error (1011)
 session2: session recovery timed out after 120 secs
 session1: session recovery timed out after 120 secs
sd 2:0:0:1: [sdd] Unhandled error code
sd 2:0:0:1: [sdd] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
sd 2:0:0:1: [sdd] CDB: Read(10): 28 00 12 bf ff 80 00 00 08 00
end_request: I/O error, dev sdd, sector 314572672
device-mapper: multipath: Failing path 8:48.
sd 3:0:0:1: [sde] Unhandled error code
sd 3:0:0:1: [sde] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
sd 3:0:0:1: [sde] CDB: Read(10): 28 00 12 bf ff 80 00 00 08 00
end_request: I/O error, dev sde, sector 314572672
device-mapper: multipath: Failing path 8:64.
INFO: task vgs:2713 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgs           D 0000000000000000     0  2713   2712 0x00000080
 ffff880037bc5b88 0000000000000086 ffff880000000000 0000024380c8b294
 ffff88003c8142c0 ffff88003bea0ec0 000000000001ead8 ffffffffb29e1f53
 ffff880037bf70a8 ffff880037bc5fd8 0000000000010518 ffff880037bf70a8
Call Trace:
 [<ffffffff8109bae9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff814c9533>] io_schedule+0x73/0xc0
 [<ffffffff811a688e>] __blockdev_direct_IO+0x70e/0xc40
 [<ffffffff8125b82a>] ? kobject_get+0x1a/0x30
 [<ffffffff811a44d7>] blkdev_direct_IO+0x57/0x60
 [<ffffffff811a36c0>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff8110d8cb>] generic_file_aio_read+0x6db/0x730
 [<ffffffff812071b1>] ? avc_has_perm+0x71/0x90
 [<ffffffff81208c02>] ? selinux_inode_permission+0x72/0xb0
 [<ffffffff8116ceda>] do_sync_read+0xfa/0x140
 [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811a3a8c>] ? block_ioctl+0x3c/0x40
 [<ffffffff8117fa12>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff8120c70b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ffb76>] ? security_file_permission+0x16/0x20
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff810d42b2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

Comment 30 Steven Ellis 2011-02-21 05:26:20 UTC

Trying the newer iscsi-initiator-utils-6.2.0.872-17.el6.x86_64.rpm we get similar output on shutdown


Stopping Red Hat Network Daemon: [  OK  ]
Stopping atd: [  OK  ]
Stopping abrt daemon: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down postfix: [  OK  ]
Stopping crond: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping system message bus: [  OK  ]
Stopping multipathd daemon: [  OK  ]
Stopping auditd: type=1305 audit(1298263802.100:16777): audit_pid=0 old=1698 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[  OK  ]
Stopping iscsi: [  OK  ]
Stopping iscsid: 
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
IPv6 over IPv4 tunneling driver
sit0: Disabled Privacy Extensions
device-mapper: multipath: Failing path 8:80.
device-mapper: multipath: Failing path 8:96.
INFO: task vgs:2240 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgs           D 0000000000000000     0  2240   2239 0x00000080
 ffff88003b175b88 0000000000000086 ffff880000000000 000000efbfb3dd2e
 ffff880037e642c0 ffff88003d42ef30 0000000000011b9e ffffffffb29e12af
 ffff880037fac5f8 ffff88003b175fd8 0000000000010518 ffff880037fac5f8
Call Trace:
 [<ffffffff8109bae9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff814c9533>] io_schedule+0x73/0xc0
 [<ffffffff811a688e>] __blockdev_direct_IO+0x70e/0xc40
 [<ffffffff8125b82a>] ? kobject_get+0x1a/0x30
 [<ffffffff811a44d7>] blkdev_direct_IO+0x57/0x60
 [<ffffffff811a36c0>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff8110d8cb>] generic_file_aio_read+0x6db/0x730
 [<ffffffff812071b1>] ? avc_has_perm+0x71/0x90
 [<ffffffff81208c02>] ? selinux_inode_permission+0x72/0xb0
 [<ffffffff8116ceda>] do_sync_read+0xfa/0x140
 [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811a3a8c>] ? block_ioctl+0x3c/0x40
 [<ffffffff8117fa12>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff8120c70b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ffb76>] ? security_file_permission+0x16/0x20
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff810d42b2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
INFO: task vgs:2240 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgs           D 0000000000000000     0  2240   2239 0x00000080
 ffff88003b175b88 0000000000000086 ffff880000000000 000000efbfb3dd2e
 ffff880037e642c0 ffff88003d42ef30 0000000000011b9e ffffffffb29e12af
 ffff880037fac5f8 ffff88003b175fd8 0000000000010518 ffff880037fac5f8
Call Trace:
 [<ffffffff8109bae9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff814c9533>] io_schedule+0x73/0xc0
 [<ffffffff811a688e>] __blockdev_direct_IO+0x70e/0xc40
 [<ffffffff8125b82a>] ? kobject_get+0x1a/0x30
 [<ffffffff811a44d7>] blkdev_direct_IO+0x57/0x60
 [<ffffffff811a36c0>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff8110d8cb>] generic_file_aio_read+0x6db/0x730
 [<ffffffff812071b1>] ? avc_has_perm+0x71/0x90
 [<ffffffff81208c02>] ? selinux_inode_permission+0x72/0xb0
 [<ffffffff8116ceda>] do_sync_read+0xfa/0x140
 [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811a3a8c>] ? block_ioctl+0x3c/0x40
 [<ffffffff8117fa12>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff8120c70b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ffb76>] ? security_file_permission+0x16/0x20
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff810d42b2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
INFO: task vgs:2240 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgs           D 0000000000000000     0  2240   2239 0x00000080
 ffff88003b175b88 0000000000000086 ffff880000000000 000000efbfb3dd2e
 ffff880037e642c0 ffff88003d42ef30 0000000000011b9e ffffffffb29e12af
 ffff880037fac5f8 ffff88003b175fd8 0000000000010518 ffff880037fac5f8
Call Trace:
 [<ffffffff8109bae9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff814c9533>] io_schedule+0x73/0xc0
 [<ffffffff811a688e>] __blockdev_direct_IO+0x70e/0xc40
 [<ffffffff8125b82a>] ? kobject_get+0x1a/0x30
 [<ffffffff811a44d7>] blkdev_direct_IO+0x57/0x60
 [<ffffffff811a36c0>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff8110d8cb>] generic_file_aio_read+0x6db/0x730
 [<ffffffff812071b1>] ? avc_has_perm+0x71/0x90
 [<ffffffff81208c02>] ? selinux_inode_permission+0x72/0xb0
 [<ffffffff8116ceda>] do_sync_read+0xfa/0x140
 [<ffffffff81091de0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811a3a8c>] ? block_ioctl+0x3c/0x40
 [<ffffffff8117fa12>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff8120c70b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ffb76>] ? security_file_permission+0x16/0x20
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff810d42b2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

Comment 31 Steven Ellis 2011-02-21 05:27:15 UTC

If I want a clean shutdown I need to perform the following steps


service multipathd stop
dmsetup remove /dev/mapper/mpathc
dmsetup remove /dev/mapper/mpathb
dmsetup remove /dev/mapper/mpatha
service iscsi stop
service iscsid stop
shutdown


If I don't use dmsetup to remove the device-mapper entries I always get the kernel errors on shutdown.

Comment 32 Ben Marzinski 2011-02-22 18:32:50 UTC

In the defaults section of /etc/multipath.conf, can you add the line

queue_without_daemon no

This turns off queue_if_no_path when multipathd is shutdown.  That way, when all your scsi device get removed, any outstanding IO will simply be failed back.

Let me know if that helps.

Comment 33 Steven Ellis 2011-02-23 02:40:34 UTC

Hi Ben

Yes that does appear to fix things with regards to the shutdown.

I am a little bit worried about the I/O errors that are thrown, although the system does boot cleanly after a shutdown.

Here is the output of the shutdown

Stopping Red Hat Network Daemon: [  OK  ]
Stopping atd: [  OK  ]
Stopping abrt daemon: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down postfix: [  OK  ]
Stopping crond: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping system message bus: [  OK  ]
Stopping multipathd daemon: [  OK  ]
Stopping auditd: type=1305 audit(1298416533.079:9404): audit_pid=0 old=1184 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[  OK  ]
Stopping iscsi:  connection2:0: detected conn error (1020)
 connection1:0: detected conn error (1020)
[  OK  ]
Stopping iscsid: 
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
IPv6 over IPv4 tunneling driver
sit0: Disabled Privacy Extensions
device-mapper: multipath: Failing path 8:80.
device-mapper: multipath: Failing path 8:96.
end_request: I/O error, dev dm-1, sector 377495424
end_request: I/O error, dev dm-1, sector 377495536
end_request: I/O error, dev dm-1, sector 0
end_request: I/O error, dev dm-1, sector 8
end_request: I/O error, dev dm-1, sector 0
device-mapper: multipath: Failing path 8:32.
device-mapper: multipath: Failing path 8:16.
end_request: I/O error, dev dm-2, sector 314572672
end_request: I/O error, dev dm-2, sector 314572784
end_request: I/O error, dev dm-2, sector 0
end_request: I/O error, dev dm-2, sector 8
end_request: I/O error, dev dm-2, sector 0
device-mapper: multipath: Failing path 8:48.
device-mapper: multipath: Failing path 8:64.
end_request: I/O error, dev dm-3, sector 314572672
end_request: I/O error, dev dm-3, sector 314572784
end_request: I/O error, dev dm-3, sector 0
end_request: I/O error, dev dm-3, sector 8
end_request: I/O error, dev dm-3, sector 0
end_request: I/O error, dev dm-2, sector 227459328
end_request: I/O error, dev dm-2, sector 227459440
end_request: I/O error, dev dm-2, sector 227246464
end_request: I/O error, dev dm-2, sector 227246472
end_request: I/O error, dev dm-2, sector 227246464
Stopping monitoring for VG myvg: end_request: I/O error, dev dm-1, sector 377495424
  /dev/mapper/mpathb: read failed after 0 of 4096 at 193277657088: Input/output error
end_request: I/O error, dev dm-1, sector 377495536
  /dev/mapper/mpathb: read failed after 0 of 4096 at 193277714432: Input/output error
end_request: I/O error, dev dm-1, sector 0
  /dev/mapper/mpathb: read failed after 0 of 4096 at 0: Input/output error
end_request: I/O error, dev dm-1, sector 8
  /dev/mapper/mpathb: read failed after 0 of 4096 at 4096: Input/output error
end_request: I/O error, dev dm-1, sector 0
end_request: I/O error, dev dm-2, sector 314572672
  /dev/mapper/mpathc: read failed after 0 of 4096 at 161061208064: Input/output error
end_request: I/O error, dev dm-2, sector 314572784
  /dev/mapper/mpathc: read failed after 0 of 4096 at 161061265408: Input/output error
end_request: I/O error, dev dm-2, sector 0
  /dev/mapper/mpathc: read failed after 0 of 4096 at 0: Input/output error
end_request: I/O error, dev dm-2, sector 8
  /dev/mapper/mpathc: read failed after 0 of 4096 at 4096: Input/output error
end_request: I/O error, dev dm-2, sector 0
end_request: I/O error, dev dm-3, sector 314572672
  /dev/mapper/mpatha: read failed after 0 of 4096 at 161061208064: Input/output error
end_request: I/O error, dev dm-3, sector 314572784
  /dev/mapper/mpatha: read failed after 0 of 4096 at 161061265408: Input/output error
end_request: I/O error, dev dm-3, sector 0
  /dev/mapper/mpatha: read failed after 0 of 4096 at 0: Input/output error
end_request: I/O error, dev dm-3, sector 8
  /dev/mapper/mpatha: read failed after 0 of 4096 at 4096: Input/output error
end_request: I/O error, dev dm-3, sector 0
end_request: I/O error, dev dm-2, sector 227459328
  /dev/LinuxBackupVG/rhsat_boot: read failed after 0 of 4096 at 108986368: Input/output error
end_request: I/O error, dev dm-2, sector 227459440
  /dev/LinuxBackupVG/rhsat_boot: read failed after 0 of 4096 at 109043712: Input/output error
end_request: I/O error, dev dm-2, sector 227246464
  /dev/LinuxBackupVG/rhsat_boot: read failed after 0 of 4096 at 0: Input/output error
end_request: I/O error, dev dm-2, sector 227246472
  /dev/LinuxBackupVG/rhsat_boot: read failed after 0 of 4096 at 4096: Input/output error
end_request: I/O error, dev dm-2, sector 227246464
  1 logical volume(s) in volume group "myvg" unmonitored
[  OK  ]
Sending all processes the TERM signal... [  OK  ]
Sending all processes the KILL signal... [  OK  ]
Saving random seed:  [  OK  ]
Syncing hardware clock to system time type=1111 audit(1298416538.502:9405): user pid=23543 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:hwclock_t:s0 msg='changing system time: exe="/sbin/hwclock" hostname=? addr=? terminal=console res=success'
[  OK  ]
Turning off swap:  [  OK  ]
Turning off quotas:  [  OK  ]
Unmounting file systems:  [  OK  ]
init: Re-executing /sbin/init
Halting system...
type=1128 audit(1298416539.657:9406): user pid=23117 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:initrc_t:s0 msg='init: exe="/sbin/reboot" hostname=? addr=? terminal=console res=success'
md: stopping all md devices.
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Stopping disk
ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
Power down.

Comment 34 Mike Christie 2011-02-24 22:15:10 UTC

(In reply to comment #33)
> I am a little bit worried about the I/O errors that are thrown, although the
> system does boot cleanly after a shutdown.
> 

Hi,

Where you using a filesystem over the dm multipath device or using the dm device directly?

Comment 36 David Aquilina 2011-03-17 12:46:59 UTC

Anthony @ HP, can you please try the suggestion in Comment #32? thanks!

Comment 37 Anthony Cheung 2011-03-17 21:14:40 UTC

@David, yes with comment #32 added to multipath.conf does help.

Comment 41 Anthony Cheung 2011-05-06 20:10:46 UTC

Just to document this issue is still happening on RHEL6 Update1 Snapshot5

Comment 47 Tom Coughlan 2011-08-25 18:33:59 UTC

In reply to comment #37)

> @David, yes with comment #32 added to multipath.conf does help.

(In reply to comment #41)
> Just to document this issue is still happening on RHEL6 Update1 Snapshot5

Anthony, Elvir, Stephen,

If I understand the situation correctly, when you add the line

queue_without_daemon no

to the defaults section of /etc/multipath.conf, then the system shuts down reliably, without hanging. The remaining open questions are: 

1) is there something we can do to avoid the I/O errors shown in comment 33?

2) should we make "queue_without_daemon no" the default? 

Is this a correct summary of the situation? 

It would be nice to fix #1, but that may not be possible, since those I/Os are already queued to the device. We have to fail them. Right Ben? 

If this is true, we could document this as an unavoidable consequence of selecting queue_if_no_path.

WRT #2, we do not normally change defaults in a minor release because this can cause a surprise change of behavior for people who already have this running. This case might be an exception, though, since no one would really have any reason to want to run with "queue_without_daemon yes". Is that true Ben?

Comment 48 Anthony Cheung 2011-08-25 21:51:19 UTC

Tom

yes queue_without_daemon=no help the situation.  It has been a while but I don't recall seeing errors like those in comment 33 in my setup.  Systems just shutdown like normally would.

Comment 49 Ben Marzinski 2011-08-25 21:55:55 UTC

(In reply to comment #47)
> 1) is there something we can do to avoid the I/O errors shown in comment 33?

The issue is that with "queue_without_daemon no" multipath is no longer queueing
IO when there are no available paths.  This means that if you don't have any
valid paths to the multipath device, and there is IO to the device, you will see these errors.  If the paths are gone and never coming back, there is no way to avoid this.  All queued IO will be failed.
 
> 2) should we make "queue_without_daemon no" the default? 
> 
> Is this a correct summary of the situation? 
> 
> It would be nice to fix #1, but that may not be possible, since those I/Os are
> already queued to the device. We have to fail them. Right Ben? 
> 
> If this is true, we could document this as an unavoidable consequence of
> selecting queue_if_no_path.
> 
> WRT #2, we do not normally change defaults in a minor release because this can
> cause a surprise change of behavior for people who already have this running.

Correct.

> This case might be an exception, though, since no one would really have any
> reason to want to run with "queue_without_daemon yes". Is that true Ben?

The onlyl issue is that if someone needed to stop multipathd, even to restart it (for example when upgrading the package).  If they had a path with all of it paths failed, then the IO would get failed back, instead of being queued like
they intended. Since you don't normally stop multipathd, this probably isn't
a big issue, and I willing to consider changing the default.  We could certainly put it into the multipath.conf template.  This way, new installs
will have it, but upgrades won't.

Comment 53 Ben Marzinski 2011-10-03 18:50:48 UTC

We should be able to solve this for all users.  Like I mentioned in Comment 49, the problem is with running

# service mutipathd restart

This gets run when you upgrade you RPM.  In this case, multipathd will be coming right back, so you don't want to fail the queued IOs. To solve this, I can add a multipathd command that overrides the queue_without_daemon setting, and make
the multipathd init script use this when running a restart.  With that change in place, I don't see much risk of surprising customers with this change.

Comment 57 Ben Marzinski 2012-03-05 17:33:56 UTC

multipathd how has queue_without_daemon disabled by default. However, when doing

# service multipathd restart

queue_without_daemon will always be enforced, so paths don't fail back IO when
you restart the daemon.

# service multipathd restart

will now also work with mutipathed root filesystems.

There are also two new multipathd interactive commands

forcequeueing daemon

and

restorequeueing daemon

forcequeueing will force multipathd to set queue_without_daemon.  restorequeueing will revert it to the configured value.

Comment 59 Gris Ge 2012-05-02 06:57:15 UTC

Reproduce this problem with this line in RHEL 6.2 GA (which enable queue_without_daemon by default):
====
features "1 queue_if_no_path"
====

As RHEL 6.3 disable "queue_without_daemon" by default, reboot will not be blocked.


Ben,

Once more concern, if user specify "queue_without_daemon yes" in their configure, os shutdown will still be blocked.
Will this not a bug or we will fix it later?

Thanks.

Comment 60 Ben Marzinski 2012-05-02 17:51:31 UTC

I don't consider this a bug. If the user manually sets queue_without_daemon, then yes, it will get blocked, but it's only doing what it was told to do.

Comment 61 Ben Marzinski 2012-05-02 21:26:48 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:  Multipathd was not disabling queue_if_no_path on multipath devices when it stopped, by default.

Consequence:  Once multipathd stopped on shutdown, if all paths to a device were lost, the IO to that device would queue, causing shutdown to hang

Fix: multipath now sets the queue_without_daemon option to
"no" by default, causing all multipath devices to stop queueing when multipathd is stopped

Result: multipath devices no longer queue IO during shutdown, and the nodes no longer hang

Comment 62 Gris Ge 2012-05-03 01:43:02 UTC

Thanks for clarification.

verify this bug.

Bug #818404 for documents update for this change.

Comment 64 errata-xmlrpc 2012-06-20 14:36:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0946.html