Bug 768835 - Server Crash " iSCSI " failure [NEEDINFO]
Summary: Server Crash " iSCSI " failure
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: iscsi-initiator-utils
Version: 5.2
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Chris Leech
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-12-19 05:35 UTC by venkateswara reddy
Modified: 2014-06-02 13:05 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-02 13:05:38 UTC
Target Upstream Version:
pm-rhel: needinfo? (venkatreddy4g)


Attachments (Terms of Use)

Description venkateswara reddy 2011-12-19 05:35:54 UTC
Description of problem:

We have a Server running " Enterprise Linux Enterprise Linux Server release 5.2 (Carthage)" connected to iSCSI SAN Storage and server is rebooting every day after executing system cron jobs " cron.daily ". below is the log of the server.

Dec  6 04:07:13 prematics33 multipathd: sde: readsector0 checker reports path is down
Dec  6 04:07:53 prematics33 syslogd 1.4.1: restart.


on Dec 7th the same cron jobs executed at 04:02:01 and finished at 04:02:41 and than again sde drive got failed ( multipathd: sde: readsector0 checker reports path is down )
and than iscsi initiator iscsid deamon recoverd the failed sde drive. here system is not effected as drive recoverd.

 
Dec  7 04:03:53 prematics33 iscsid: connection2:0 is operational after recovery (2 attempts)
Dec  7 04:03:53 prematics33 iscsid: connection1:0 is operational after recovery (2 attempts)
Dec  7 04:03:58 prematics33 multipathd: sde: readsector0 checker reports path is up


on Dec 8th the cron jobs executed again at 04:02:01 as sheduled by prematics admins, and failed sde harddrive to recognise
iscsid deamon failed to recover the failed sde drive this issue caused system crash.
below is the log.

 Dec  8 04:02:50 prematics33 logger: cron.daily: tmpwatch finished
Dec  8 04:03:19 prematics33 multipathd: sde: readsector0 checker reports path is down
Dec  8 04:03:19 prematics33 multipathd: checker failed path 8:64 in map pstor05
Dec  8 04:03:19 prematics33 multipathd: pstor05: remaining active paths: 1
Dec  8 04:03:19 prematics33 kernel: device-mapper: multipath: Failing path 8:64.
Dec  8 04:03:22 prematics33 multipathd: dm-0: add map (uevent)
Dec  8 04:03:22 prematics33 multipathd: dm-0: devmap already registered
Dec  8 04:04:07 prematics33 syslogd 1.4.1: restart. 

Version-Release number of selected component (if applicable):

2.6.18-92.1.13.0.1.el5xen


How reproducible:

Not sure, is the issue from iSCSI bug or Kernal bug.

Additional info:

Please let us know if any information is required to identify the root cause.

Thank you

Comment 2 Mike Christie 2011-12-20 02:07:31 UTC
Could you try upgrading rhel and xen like in here:
https://bugzilla.redhat.com/show_bug.cgi?id=429469

Not sure what happened. We were working on debugging it, then people just upgraded at some point and something in some update fixed it. And they ended up closing the bug.

Comment 3 venkateswara reddy 2011-12-20 13:31:49 UTC
So as per "429469" users comments they have fixed the issue with kernel upgradation.

If so please give me advice on kernel installation like, present kernel is 
2.6.18-92.1.13.0.1.el5xen, so which kernel i have to choose for this issue to fix, which one is stable to work with oracle db, etc.

Please let me know as i need to fix this issue :)

Thank you,
Venkat. V

Comment 4 Mike Christie 2011-12-21 05:03:40 UTC
Could try rhel 5.7 (2.6.18-274.*).

Comment 5 venkateswara reddy 2011-12-21 07:23:24 UTC
Ok thank you for the update Mike, we will try to upgrade.

Comment 6 Gregory Lee Bartholomew 2012-02-14 18:20:33 UTC
FYI, I'm getting this issue on Fedora 16 (3.1.6-1.fc16.x86_64) as well.  Here is what is showing up in the logs:

Feb  1 22:31:12 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
Feb  1 22:34:31 node-01 kernel: [809067.808031]  connection2:0: detected conn error (1021)
Feb  1 22:34:32 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
Feb  1 22:34:34 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
Feb  1 22:34:35 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
Feb  1 22:52:44 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
Feb  8 15:14:26 node-01 kernel: [1387491.744033]  connection2:0: detected conn error (1021)
Feb  8 15:14:27 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
Feb  8 15:14:30 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
Feb  8 15:14:30 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
Feb  8 15:14:57 node-01 kernel: [1387522.444018] vmx_handle_exit: Breaking out of NMI-blocked state on VCPU 1 after 1 s timeout

I suspect that this issue is being triggered during times of high file IO on the host server (transferring tens of gigs of data from local to iSCSI storage).

On a side note, this bug seems to have created a disconnect between a KVM virtual machine's disk-image file (vda.img) and the actual iSCSI storage backend in such a way that the virtual machine continued to run for 24 hours in the first instance and 55 hours in the second instance but the virtual machine hard-crashed (powered-off) after those respective time periods elapsed.  Because of the disconnect, 24 hours and 55 hours of activity respectively "disappeared".  This was particular bad for me in this case because the VM was running Moodle (an open-source course management system) for the Computer Science department here at SIUE.  Effectively, all of the student's homework submissions (there are about 40 classes being hosted in this CMS) over a 24-hour period and a 55-hour period disappeared :-(.

Comment 7 Gregory Lee Bartholomew 2012-02-14 18:30:22 UTC
My mistake about the disappearing files in the previous post -- a student claimed that their submission disappeared but upon further investigation, it looks like user error (other files are present from that time period).

Comment 8 Gregory Lee Bartholomew 2012-02-14 18:32:41 UTC
P.P.S. I'm not sure if the qemu-kvm crashes are related to this kernel bug, but there is definitely a strange delayed correspondence.

Comment 9 Mike Christie 2012-02-14 23:45:57 UTC
(In reply to comment #6)
> FYI, I'm getting this issue on Fedora 16 (3.1.6-1.fc16.x86_64) as well.  Here
> is what is showing up in the logs:
> 
> Feb  1 22:31:12 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
> Feb  1 22:34:31 node-01 kernel: [809067.808031]  connection2:0: detected conn
> error (1021)
> Feb  1 22:34:32 node-01 iscsid: Kernel reported iSCSI connection 2:0 error
> (1021) state (3)
> Feb  1 22:34:34 node-01 iscsid: connection2:0 is operational after recovery (1
> attempts)
> Feb  1 22:34:35 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
> Feb  1 22:52:44 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
> Feb  8 15:14:26 node-01 kernel: [1387491.744033]  connection2:0: detected conn
> error (1021)
> Feb  8 15:14:27 node-01 iscsid: Kernel reported iSCSI connection 2:0 error
> (1021) state (3)
> Feb  8 15:14:30 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
> Feb  8 15:14:30 node-01 iscsid: connection2:0 is operational after recovery (1
> attempts)
> Feb  8 15:14:57 node-01 kernel: [1387522.444018] vmx_handle_exit: Breaking out
> of NMI-blocked state on VCPU 1 after 1 s timeout
> 

Is this a log of when you hit a crash? Are you running iscsi in the vm that crashed?

You are mostly like correct in the high load theory. 1021 indicates the iscsi layer had to drop the session and reconnect, because a command took too long and we could not get it unstuck using a abort or device reset. We ended up dropping the session then reconnecting.

Are you seeing a panic from the FS too, or is the box just locking up and the messages you sent the last ones you saw?

Oh yeah, if you see the 1020 message and your storage/san is ok (you also see that when the disks are bad or the network is flakey) then you should increase the cmd timeout or you might want to also decrease the node.session.queue_depth or node.session.cmds_max.

Comment 10 Gregory Lee Bartholomew 2012-02-15 15:55:37 UTC
(In reply to comment #9)
> (In reply to comment #6)
> > FYI, I'm getting this issue on Fedora 16 (3.1.6-1.fc16.x86_64) as well.  Here
> > is what is showing up in the logs:
> > 
> > Feb  1 22:31:12 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
> > Feb  1 22:34:31 node-01 kernel: [809067.808031]  connection2:0: detected conn
> > error (1021)
> > Feb  1 22:34:32 node-01 iscsid: Kernel reported iSCSI connection 2:0 error
> > (1021) state (3)
> > Feb  1 22:34:34 node-01 iscsid: connection2:0 is operational after recovery (1
> > attempts)
> > Feb  1 22:34:35 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
> > Feb  1 22:52:44 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
> > Feb  8 15:14:26 node-01 kernel: [1387491.744033]  connection2:0: detected conn
> > error (1021)
> > Feb  8 15:14:27 node-01 iscsid: Kernel reported iSCSI connection 2:0 error
> > (1021) state (3)
> > Feb  8 15:14:30 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
> > Feb  8 15:14:30 node-01 iscsid: connection2:0 is operational after recovery (1
> > attempts)
> > Feb  8 15:14:57 node-01 kernel: [1387522.444018] vmx_handle_exit: Breaking out
> > of NMI-blocked state on VCPU 1 after 1 s timeout
> > 
> 
> Is this a log of when you hit a crash? Are you running iscsi in the vm that
> crashed?
> 
> You are mostly like correct in the high load theory. 1021 indicates the iscsi
> layer had to drop the session and reconnect, because a command took too long
> and we could not get it unstuck using a abort or device reset. We ended up
> dropping the session then reconnecting.
> 
> Are you seeing a panic from the FS too, or is the box just locking up and the
> messages you sent the last ones you saw?
> 
> Oh yeah, if you see the 1020 message and your storage/san is ok (you also see
> that when the disks are bad or the network is flakey) then you should increase
> the cmd timeout or you might want to also decrease the node.session.queue_depth
> or node.session.cmds_max.

The host system never crashed and seems fine.  The VM is not running iscsi.  The VM is a virtfs client and the virtfs mounted filesystem is also coming from the storage that is mounted via iscsi on the host system (SAN <--iSCSI--> Host <--VirtFS--> VM).  There is a seeming correlation between these hiccups and the VM crashes, but the correlation is delayed (the VM continues to run for tens of hours; maybe the issue is with how qemu-kvm handles iscsi session reconnects?).

"... if you see the 1020 message ..." Do you mean the 1021 message?

I'm not seeing any FS panic and the box is not locking up.  I'm just seeing a LogWatch report of "WARNING:  Kernel Errors Present\n\tconnection2:0: detected conn error (1021) ...:  1 Time(s)" several hours prior to when my VM crashes.

"... then you should increase the cmd timeout ..." I don't see anything in /etc/iscsi/iscsid.conf labeled "cmd_timeout".  Are you perhaps referring to node.conn[0].timeo.noop_out_timeout or node.session.err_timeo.lu_reset_timeout?

Is this a "Kernel Error", a poor configuration, or expected behaviour on iSCSI's part and perhaps it is qemu-kvm that is not handling the reset properly?

Thanks,
gb

Comment 11 Gregory Lee Bartholomew 2012-02-19 19:21:20 UTC
My VM just crashed again today.  This is the third time this has happened.  This time though, there was no corresponding kernel error reported on the host system.  I guess the VM crashes are not being caused by the iscsi issues.  At this point, I am suspecting the VirtFS filesystem simply because that is the most recent addition to this Virtual Machine.  I don't get any logs on the Virtual Machine itself to indicate what the problem is.  It seems like it is just humming along peacefully and then BAM -- sudden death out of nowhere.

Though I have had this error in my logs on the host server, at this point, it seems that it is not the cause of my VM troubles -- the perceived correlation must have either been coincidence or perhaps indirect (i.e. maybe high file IO on the host system also destabilizes VirtFS somehow but it isn't iscsi that was afflicting VirtFS).

Thanks anyway,
gb

Comment 12 RHEL Program Management 2014-03-07 13:37:28 UTC
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 13 RHEL Program Management 2014-06-02 13:05:38 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).


Note You need to log in before you can comment on or make changes to this bug.