Bug 848425

Summary: [bisected] System won't shut down with unreachable iSCSI target
Product: [Fedora] Fedora Reporter: Stefan Ring <stefanrin>
Component: kernelAssignee: Mike Christie <mchristi>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 17CC: dan.j.williams, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-10-10 11:12:38 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:

Description Stefan Ring 2012-08-15 10:43:17 EDT
Description of problem:
System does not shut down

Version-Release number of selected component (if applicable):
3.5.1-1.fc17

How reproducible:
I don't know. It happens whenever I try to reboot. This did not happen with version 3.4.3-1.fc17. There is an active LVM volume group on the iSCSI target, and the target is unreachable (because it is actually a KVM guest living on the same machine).


Actual results:

Hangs forever, spitting out stack traces once in a while.

Expected results:

Clean shutdown.

Additional info:

After upgrading my F17 system from kernel 3.4.3-1.fc17 to 3.5.1-1.fc17, it is unable to shutdown or reboot at all. It gets stuck in some iscsi wind-down code. Ironically, I was having another shutdown-related problem with NFS (#830862), which I hoped would get resolved by the upgrade, but now it's much worse than it was before :(.

Pid: 745, comm: iscsid Not tainted 3.5.1-1.str.fc17.x86_64
RIP __scsi_host_match+0x20

Call Trace:
? scsi_target_reap+0x28
scsi_remove_target+0x199
__iscsi_unbind_session+0xbf
iscsi_remove_session+0xe2
iscsi_destroy_session+0x16
iscsi_session_teardown+0xa6
iscsi_sw_tcp_session_destroy+0x4b
iscsi_if_rx+0xab0
? __kfree_skb+0x47
? kfree+0x49
? kmem_cache_alloc_node+0x38
? __kmalloc_node_track_caller+0x63
? netlink_sendmsg+0x22a
netlink_unicast+0x19d
netlink_sendmsg+0x2bd
sock_sendmsg+0xf8
? sock_recvmsg+0xfd
? move_addr_to_kernel+0x5a
? verify_iovec+0x56
__sys_sendmsg+0x3ac
? sysfs_refresh_inode+0x90
? sysfs_getattr+0x52
? cp_new_stat+0x120
sys_sendmsg+0x49
system_call_fastpath+0x16

The exact location varies from slightly, but it's always below scsi_remove_target.

Meanwhile, I've become quite adept at typing in stack traces from screen photographs ;).
Comment 1 Stefan Ring 2012-08-15 17:14:31 EDT
I don't know the actual message because it scrolled off the screen, and I couldn't scroll back. After a while, multiple "CPU#x stuck for xxs" messages appear.
Comment 2 Stefan Ring 2012-08-16 14:26:39 EDT
A few more observations:

• ssh sessions stay alive; I can still watch "top".
• systemd gets stuck in "D" state (as does apparently every other process trying to access a disk).
• The volume group must be activated for this to happen. Fortunately, this means that I now know how to work around this issue (deactivate it before shutting down).
Comment 3 Stefan Ring 2012-08-22 07:53:16 EDT
I can reproduce it reliably now, using only two KVM virtual guests.

One is the iSCSI target. For this I'm using CentOS 6.2 (c6). The other is the problematic iSCSI consumer (initiator) running F17 (f17):

c6$ sudo service tgtd start
f17$ sudo vgchange -ay && sudo mount -o barrier=0 /dev/mapper/vg_scsi-lvtest1 /xfs
f17$ sudo umount /xfs
c6$ sudo service tgtd force-stop
(wait for 120 seconds for f17 to complain that it cannot reconnect)
f17$ sudo reboot

The mounted file system is (predictably) XFS.
Comment 4 Stefan Ring 2012-08-22 10:31:42 EDT
(In reply to comment #2)
> Fortunately, this
> means that I now know how to work around this issue (deactivate it before
> shutting down).

... which also doesn't work :(.
Comment 5 Stefan Ring 2012-08-22 10:32:48 EDT
It is caused by this commit [1]. This is the stable series, shortly after v3.5. Mainline mailing list thread seems to be [2].

What happens is this. Don't be confused by the version number; it's actually a build of the aforementioned version (10f8d5b86743):

[  136.160148]  session1: session recovery timed out after 120 secs
[  159.903518] sd 2:0:0:1: [sda] Synchronizing SCSI cache
[  159.904013] sd 2:0:0:1: [sda]  
[  159.904249] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  184.023076] BUG: soft lockup - CPU#0 stuck for 22s! [iscsid:648]
[  184.023076] Modules linked in: xfs lockd sunrpc be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter(-) ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack tpm_bios snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_timer snd microcode pcspkr virtio_balloon virtio_net soundcore i2c_piix4 i2c_core virtio_blk floppy [last unloaded: scsi_wait_scan]
[  184.023076] CPU 0 
[  184.023076] Modules linked in:[  184.023076]  xfs lockd sunrpc be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter(-) ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack tpm_bios snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_timer snd microcode pcspkr virtio_balloon virtio_net soundcore i2c_piix4 i2c_core virtio_blk floppy [last unloaded: scsi_wait_scan]

[  184.023076] Pid: 648, comm: iscsid Not tainted 3.4.0-0.rc0.git10.1.el6.x86_64 #1 Red Hat KVM
[  184.023076] RIP: 0010:[<ffffffff81608221>]  [<ffffffff81608221>] _raw_spin_unlock_irqrestore+0x11/0x20
[  184.023076] RSP: 0018:ffff88007b1c38d8  EFLAGS: 00000286
[  184.023076] RAX: 0000000000000286 RBX: ffffffff81077022 RCX: 000000000000002a
[  184.023076] RDX: ffff88007b1f1018 RSI: 0000000000000286 RDI: 0000000000000286
[  184.023076] RBP: ffff88007b1c38d8 R08: ffffffff81cd8060 R09: 0000000000000080
[  184.023076] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88007b2ec810
[  184.023076] R13: ffff88007b2ed138 R14: ffff8800368a53b8 R15: ffff8800368a5078
[  184.023076] FS:  00007f38dfc53740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  184.023076] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  184.023076] CR2: 0000000000f369f0 CR3: 000000007b1ae000 CR4: 00000000000006f0
[  184.023076] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  184.023076] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  184.023076] Process iscsid (pid: 648, threadinfo ffff88007b1c2000, task ffff88007b1a5c80)
[  184.023076] Stack:
[  184.023076]  ffff88007b1c3908 ffffffff813dc447 0000000000000296 ffff88007b1f1000
[  184.023076]  ffff88007b2ec7f0 ffff88007b2ec800 ffff88007b1c3958 ffffffff813dfb59
[  184.023076]  0000000000000286 ffff88007b2ec800 0000000000000000 ffff88007b2ed000
[  184.023076] Call Trace:
[  184.023076]  [<ffffffff813dc447>] scsi_target_reap+0x77/0xe0
[  184.023076]  [<ffffffff813dfb59>] scsi_remove_target+0x1a9/0x1c0
[  184.023076]  [<ffffffffa011bf34>] __iscsi_unbind_session+0xb4/0x170 [scsi_transport_iscsi]
[  184.023076]  [<ffffffffa011c0ce>] iscsi_remove_session+0xde/0x1b0 [scsi_transport_iscsi]
[  184.023076]  [<ffffffffa011c1b6>] iscsi_destroy_session+0x16/0x60 [scsi_transport_iscsi]
[  184.023076]  [<ffffffffa0132739>] iscsi_session_teardown+0xa9/0xd0 [libiscsi]
[  184.023076]  [<ffffffffa013e160>] iscsi_sw_tcp_session_destroy+0x50/0x70 [iscsi_tcp]
[  184.023076]  [<ffffffffa011c725>] iscsi_if_rx+0x295/0xd10 [scsi_transport_iscsi]
[  184.023076]  [<ffffffff814f9ec1>] ? __skb_recv_datagram+0x121/0x2f0
[  184.023076]  [<ffffffff811768ec>] ? kfree+0x2c/0x120
[  184.023076]  [<ffffffff814f4b5b>] ? __alloc_skb+0x4b/0x170
[  184.023076]  [<ffffffff8152bb58>] netlink_unicast+0x1a8/0x220
[  184.023076]  [<ffffffff8152c94c>] netlink_sendmsg+0x20c/0x300
[  184.023076]  [<ffffffff814ec5a8>] sock_sendmsg+0xf8/0x130
[  184.023076]  [<ffffffff811a794b>] ? mntput_no_expire+0x4b/0x130
[  184.023076]  [<ffffffff811a7a53>] ? mntput+0x23/0x40
[  184.023076]  [<ffffffff811950a2>] ? path_put+0x22/0x30
[  184.023076]  [<ffffffff814eb07e>] ? move_addr_to_kernel+0x4e/0x90
[  184.023076]  [<ffffffff814f92c5>] ? verify_iovec+0x85/0xf0
[  184.023076]  [<ffffffff814edcdd>] __sys_sendmsg+0x3fd/0x420
[  184.023076]  [<ffffffff811f9f30>] ? sysfs_refresh_inode+0x90/0xa0
[  184.023076]  [<ffffffff8118f436>] ? cp_new_stat+0x116/0x130
[  184.023076]  [<ffffffff814edf09>] sys_sendmsg+0x49/0x90
[  184.023076]  [<ffffffff816104e9>] system_call_fastpath+0x16/0x1b
[  184.023076] Code: 40 00 f3 90 0f b6 07 38 c2 75 f7 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 80 07 01 48 89 f7 57 9d <66> 66 90 66 90 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 


[  159.9] is when I issued reboot.


[1] http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=10f8d5b86743b33d841a175303e2bf67fd620f42
[2] https://lkml.org/lkml/2012/6/22/31
Comment 6 Stefan Ring 2012-08-31 09:58:18 EDT
This should solve it:

http://marc.info/?t=134621728400002&r=1&w=2&n=3
Comment 8 Josh Boyer 2012-10-08 08:54:32 EDT
3.5.6 is build in koji and filed as an update.  It hasn't hit the repos yet, but if you could test it that would be appreciated:

https://admin.fedoraproject.org/updates/kernel-3.5.6-1.fc17
Comment 9 Stefan Ring 2012-10-08 12:28:50 EDT
I left an anonymous comment. Fixed.
Comment 10 Josh Boyer 2012-10-10 11:12:38 EDT
(In reply to comment #8)
> 3.5.6 is build in koji and filed as an update.  It hasn't hit the repos yet,
> but if you could test it that would be appreciated:
> 
> https://admin.fedoraproject.org/updates/kernel-3.5.6-1.fc17

This is in the stable repo.  Closing.