Bug 768835
Summary: Server Crash "iSCSI" failure

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | venkateswara reddy <venkatreddy4g> |
| Component: | iscsi-initiator-utils | Assignee: | Chris Leech <cleech> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | Version: | 5.2 |
| CC: | coughlan, gregory.lee.bartholomew, leiwang, mchristi, mrezanin, qwan, venkatreddy4g, xen-maint | Target Milestone: | rc |
| Target Release: | --- | Hardware: | x86_64 |
| OS: | Linux | Whiteboard: | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-02 13:05:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (venkateswara reddy, 2011-12-19 05:35:54 UTC)

Could you try upgrading RHEL and Xen as in https://bugzilla.redhat.com/show_bug.cgi?id=429469? Not sure what happened there. We were working on debugging it, then people just upgraded at some point and something in some update fixed it, and they ended up closing the bug.

So, per the user comments on bug 429469, they fixed the issue with a kernel upgrade. If so, please advise me on the kernel installation: the present kernel is 2.6.18-92.1.13.0.1.el5xen, so which kernel should I choose to fix this issue, which one is stable for running an Oracle DB, etc.? Please let me know, as I need to fix this issue. :)

Thank you, Venkat. V

Could try RHEL 5.7 (2.6.18-274.*).

Ok, thank you for the update Mike, we will try to upgrade.

FYI, I'm getting this issue on Fedora 16 (3.1.6-1.fc16.x86_64) as well. Here is what is showing up in the logs:

Feb 1 22:31:12 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
Feb 1 22:34:31 node-01 kernel: [809067.808031] connection2:0: detected conn error (1021)
Feb 1 22:34:32 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
Feb 1 22:34:34 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
Feb 1 22:34:35 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
Feb 1 22:52:44 node-01 sbd: [13025]: WARN: Latency: 4 exceeded threshold 3
Feb 8 15:14:26 node-01 kernel: [1387491.744033] connection2:0: detected conn error (1021)
Feb 8 15:14:27 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
Feb 8 15:14:30 node-01 sbd: [13025]: WARN: Latency: 19 exceeded threshold 3
Feb 8 15:14:30 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
Feb 8 15:14:57 node-01 kernel: [1387522.444018] vmx_handle_exit: Breaking out of NMI-blocked state on VCPU 1 after 1 s timeout

I suspect that this issue is being triggered during times of high file IO on the host server (transferring tens of gigs of data from local to iSCSI storage).
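The "detected conn error (1021)" lines are the key symptom in the log above. A quick way to see how often the initiator is hitting such errors, and with which codes, is to tally them out of syslog. A minimal sketch (it writes a two-line inline sample taken from the log above; in practice, point the grep at /var/log/messages instead):

```shell
# Tally iSCSI "detected conn error (NNNN)" events by error code.
# Sketch only: uses an inline sample file instead of the real syslog.
cat <<'EOF' > /tmp/messages.sample
Feb  1 22:34:31 node-01 kernel: [809067.808031] connection2:0: detected conn error (1021)
Feb  8 15:14:26 node-01 kernel: [1387491.744033] connection2:0: detected conn error (1021)
EOF
grep -o 'conn error ([0-9]*)' /tmp/messages.sample | sort | uniq -c
# prints:       2 conn error (1021)
```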
On a side note, this bug seems to have created a disconnect between a KVM virtual machine's disk-image file (vda.img) and the actual iSCSI storage backend, in such a way that the virtual machine continued to run for 24 hours in the first instance and 55 hours in the second, but then hard-crashed (powered off) after those respective periods elapsed. Because of the disconnect, 24 hours and 55 hours of activity respectively "disappeared". This was particularly bad for me in this case because the VM was running Moodle (an open-source course management system) for the Computer Science department here at SIUE. Effectively, all of the students' homework submissions (there are about 40 classes hosted in this CMS) over a 24-hour period and a 55-hour period disappeared. :-(

My mistake about the disappearing files in the previous post -- a student claimed that their submission disappeared, but upon further investigation it looks like user error (other files are present from that time period).

P.P.S. I'm not sure if the qemu-kvm crashes are related to this kernel bug, but there is definitely a strange delayed correspondence.

(In reply to comment #6)
> FYI, I'm getting this issue on Fedora 16 (3.1.6-1.fc16.x86_64) as well. Here is what is showing up in the logs:
> [log snipped; quoted in full above]

Is this a log of when you hit a crash? Are you running iscsi in the vm that crashed?

You are most likely correct in the high-load theory. 1021 indicates the iscsi layer had to drop the session and reconnect, because a command took too long and we could not get it unstuck using an abort or device reset. We ended up dropping the session and then reconnecting.

Are you seeing a panic from the FS too, or is the box just locking up, with the messages you sent being the last ones you saw?

Oh yeah, if you see the 1020 message and your storage/SAN is ok (you also see that when the disks are bad or the network is flaky), then you should increase the cmd timeout, or you might want to decrease node.session.queue_depth or node.session.cmds_max.

(In reply to comment #9)
> [quoted exchange snipped; see above]

The host system never crashed and seems fine. The VM is not running iscsi.
The VM is a virtfs client, and the virtfs-mounted filesystem is also coming from the storage that is mounted via iscsi on the host system (SAN <--iSCSI--> Host <--VirtFS--> VM). There is a seeming correlation between these hiccups and the VM crashes, but the correlation is delayed (the VM continues to run for tens of hours; maybe the issue is with how qemu-kvm handles iscsi session reconnects?).

"... if you see the 1020 message ..." Do you mean the 1021 message?

I'm not seeing any FS panic, and the box is not locking up. I'm just seeing a LogWatch report of "WARNING: Kernel Errors Present\n\tconnection2:0: detected conn error (1021) ...: 1 Time(s)" several hours prior to when my VM crashes.

"... then you should increase the cmd timeout ..." I don't see anything in /etc/iscsi/iscsid.conf labeled "cmd_timeout". Are you perhaps referring to node.conn[0].timeo.noop_out_timeout or node.session.err_timeo.lu_reset_timeout?

Is this a "Kernel Error", a poor configuration, or expected behaviour on iSCSI's part, with qemu-kvm perhaps not handling the reset properly?

Thanks, gb

My VM just crashed again today. This is the third time this has happened. This time, though, there was no corresponding kernel error reported on the host system, so I guess the VM crashes are not being caused by the iscsi issues. At this point I suspect the VirtFS filesystem, simply because it is the most recent addition to this virtual machine. I don't get any logs on the virtual machine itself to indicate what the problem is. It seems like it is just humming along peacefully and then BAM -- sudden death out of nowhere. Though I have had this error in my logs on the host server, at this point it seems that it is not the cause of my VM troubles -- the perceived correlation must have been either coincidence or perhaps indirect (i.e. maybe high file IO on the host system also destabilizes VirtFS somehow, but it isn't iscsi that was afflicting VirtFS).

Thanks anyway, gb

This bug/component is not included in scope for RHEL-5.11.0, which is the last RHEL 5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of the RHEL 5.11 development phase, Apr 22, 2014). Please contact your account manager or support representative in case you need to escalate this bug.

Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the RHEL 5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
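As a closing note for anyone landing here with the same symptoms: the tuning suggested in the thread (raising the command timeout, lowering the queue depth and max commands) maps onto the following open-iscsi settings. This is a sketch with illustrative values, not a recommendation; also, my reading (not stated in the thread) is that the "cmd timeout" itself is the per-device SCSI command timer under /sys/block/<dev>/device/timeout, not an iscsid.conf entry.

```
# /etc/iscsi/iscsid.conf (excerpt) -- illustrative values only
#
# Seconds to wait for session re-establishment before failing commands
# back up to the SCSI layer; raising it rides out longer recoveries.
node.session.timeo.replacement_timeout = 120
#
# Fewer outstanding commands makes a loaded target less likely to time out.
node.session.cmds_max = 64
node.session.queue_depth = 16
```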