Description of problem:
We have a number of boxes connecting to an iSCSI NetApp target, and those running RHEL 5 are rebooting from time to time (between 7 and 10 times a week). We get these errors in the logs:

Jan 20 13:14:56 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:00 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)
Jan 20 13:15:21 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:22 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:26 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)
Jan 20 13:15:35 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:37 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:41 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)

Version-Release number of selected component (if applicable):
2.6.18-53.1.4.el5xen

How reproducible:
We can't reproduce it on demand, but it keeps happening. We're running Xen instances over LVM over iSCSI. The Xen dom0s are rebooting, so I'm not really sure whether this is a kernel bug or an iSCSI bug, but since it's just a reboot (we're not even getting an oops) we think it's a kernel issue.

Additional info:
This is actively causing outages for Fedora, and we'd be happy to help provide access to machines, etc. if needed.
Created attachment 292299 [details]
Messages from NetApp Release 7.2.1.
It's been suggested that this may be a duplicate of bug 245823. I'm going to test the fix suggested there now.
Further info on this: the last thing we see before the reboot in syslog is:

Feb 20 02:52:15 xen2 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping

We have other servers logging the same message, but those machines stay up. ipmitool sel list shows:

47 | 02/18/2008 | 14:44:02 | OS Stop/Shutdown #0x46 | Run-time critical stop | Asserted
48 | Linux kernel panic: Fatal excep
49 | Linux kernel panic: tion
4a | 02/19/2008 | 04:34:33 | OS Stop/Shutdown #0x46 | Run-time critical stop | Asserted
4b | Linux kernel panic: Fatal excep
4c | Linux kernel panic: tion
4d | 02/20/2008 | 02:53:34 | OS Stop/Shutdown #0x46 | Run-time critical stop | Asserted
4e | Linux kernel panic: Fatal excep
4f | Linux kernel panic: tion
The iSCSI initiator sends an iSCSI ping (NOP-Out) to the target every node.conn[0].timeo.noop_out_interval seconds (see /etc/iscsi/iscsid.conf) to check that the target is OK. If we do not get a response within node.conn[0].timeo.noop_out_timeout seconds, you will see the nop-out message (in RHEL 5.2 the message is a little different: instead of mentioning a nop-out timing out, it says that a ping timed out):

Feb 20 02:52:15 xen2 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping

This means that we tried to ping the target and it did not reply within 15 seconds, so the iSCSI initiator drops the session and tries to rebuild it.

These other messages in the logs indicate there was a connection error. You can see this when a nop-out/ping times out and we drop the session/connection. You would also see it if there were some connection problem (someone pulled a cable, or for some reason we could not send the IO on the network and the network layer returned an error to us), or if IO was not executing fast enough and the SCSI command timer fired:

Jan 20 13:14:56 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)

This message indicates that we reconnected and relogged in a couple of seconds later:

Jan 20 13:15:00 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)

Now this is a little weird. Every once in a while you might see a nop/ping timeout because of some problem in the network, but I am not sure what is happening here. It looks like we keep dropping the session and reconnecting a lot. When you see this, are there any messages about a nop/ping timing out, or any other errors?
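For reference, the two timers described above live in the initiator configuration. A minimal sketch of the relevant /etc/iscsi/iscsid.conf settings (the values shown are illustrative examples, not recommended or shipped defaults):

```
# /etc/iscsi/iscsid.conf -- example values only, tune for your network
# Send an iSCSI NOP-Out (ping) to the target every N seconds.
node.conn[0].timeo.noop_out_interval = 10
# If the target does not answer the ping within N seconds, drop the
# session and let iscsid rebuild it (this produces the
# "Nop-out timedout ... Dropping" and "conn error (1011)" messages).
node.conn[0].timeo.noop_out_timeout = 15
```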
Jan 20 13:15:21 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:22 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:26 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)
Jan 20 13:15:35 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:37 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:41 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)

I have never seen the above errors cause a box to reboot. Is iSCSI being run in DomU or Dom0, and is DomU or Dom0 rebooting? What type of workload is going on when you see these problems: is there heavy IO, or no IO except the iSCSI pings? And is iSCSI being used for root?

I do not think bug 245823 is your issue. Bug 179887 is probably not your issue either, because with that one we are not able to reconnect and do IO again (we are not able to allocate memory at the network layer). I have also never seen it cause a box to reboot, though maybe that just has something to do with the IPMI use here.

I cannot see the messages.gz file. Can you email it to me privately, and can I get access to one of the boxes causing problems? Thanks.
(In reply to comment #5)
> Is iSCSI being run in DomU or Dom0, and is DomU or Dom0 rebooting?

Ignore this; I see the answer in the original comment. One question, though, because I am not familiar with Xen: when you say "Xen instance", does that mean iSCSI was running in what VMware would call the guest? And does "Dom0 rebooting" mean that the guest is rebooting or that the real box is rebooting?
Oh, sorry, yes. The dom0 is rebooting. Our setup is such that dom0 has an iSCSI mount (LVM) and the domUs are on that LVM.
Coincidentally, we've been seeing a similar problem (with the entire machine rebooting; I'm not sure if that's what's happening here, as I'm not too familiar with Xen terminology) on one of the servers we're going to use to host the rpmfusion repo. We've tracked it down to (most likely) problems with an Adaptec (formerly DPT) I2O RAID (V) controller. Do you happen to have one of those in this box? I've read that you use iSCSI, but maybe you use something else for the rootfs. Anyway, if you have one of those controllers, we might have a solution (we still need to test it).
In our case dom0 is all local; only the domUs run over iSCSI, so an outage there, in theory at least, shouldn't take the whole machine down like that. It's interesting because the log noted in comment #4 is happening on many servers, but they don't reboot.
What architecture is this? Could you post your kernel command line options and dmesg output?
Created attachment 295816 [details]
dmesg from xen2

Attached as requested.
This is an x86_64 box. The current running grub config is:

serial --unit=0 --speed=9600
terminal --timeout=5 serial console
title Red Hat Enterprise Linux Server (2.6.18-72.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-72.el5 com1=9600n8
        module /vmlinuz-2.6.18-72.el5xen ro root=/dev/VolGroup00/root console=ttyS0,9600n8 console=tty0 rhgb quiet
        module /initrd-2.6.18-72.el5xen.img
Created attachment 295889 [details]
Oops from crash

I've finally caught an oops from the crash. Here it is.
Side note about this: it happened during a time of high NFS load, so it's possible that's a contributor. We didn't see much load this past weekend and had no issues.
Created attachment 296285 [details]
Another oops

This is another oops we just got a bit ago.
I am adding Herbert Xu, because he knows a lot about Xen and networking, and the oopses in comments 13 and 15 have lots of both in them :)
The oops indicates that one of the skb page frags is bogus. So you should try to identify why that's the case.
(In reply to comment #17) > The oops indicates that one of the skb page frags is bogus. So you should try > to identify why that's the case. How would one do that?
For a start find out who's generating the packet. Is the TCP payload coming from user-space or kernel-space iSCSI?
We've never been able to re-create this through any of our tests.
Created attachment 297166 [details]
Another oops

This is another oops that happened this morning.
OK, both oopses say the same thing: one of the skb page frags has been unmapped. Interestingly, both backtraces are triggered by TCP retransmits. So, going out on a limb here, let's assume that the packet in question is generated in-kernel by iSCSI; what this says is that the page reference count on the skb page frag is broken. That's where I would start looking (of course, I have way too much on my plate right now to actually look, so someone else will need to do that :)
Thanks Herbert. I am looking at it. Mike, I will send you a patch this weekend that should spit out some more info if it is iscsi. Mike, is it also possible to take nfs or iscsi out of the picture so we can narrow this down by one major network user?
We can take iSCSI out of the picture; taking NFS out would probably require replacing it with something else. I'll work on taking iSCSI out.
This host is no longer running iSCSI at all.
(In reply to comment #23)
> Thanks Herbert. I am looking at it.
>
> Mike, I will send you a patch this weekend that should spit out some more info
> if it is iscsi.
>
> Mike, is it also possible to take nfs or iscsi out of the picture so we can
> narrow this down by one major network user?

If you've got that patch, let me know. We had another host (one that used to reboot but had stopped) reboot. The last bit in the logs:

Mar 11 09:37:04 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 11 09:37:08 xen1 iscsid: connection2:0 is operational after recovery (2 attempts)
Mar 11 09:41:59 xen1 syslogd 1.4.1: restart.
This happened again on our xen1 box. Though it happens less frequently than on xen2, we use iSCSI heavily and we really need to get this bug tracked down. Do we have that patch for more iSCSI debugging yet?
Am I being ignored?
Mike, so you got a crash without iSCSI? Could you please attach the stack trace with that? Thanks!
It only crashes with iSCSI. We could only afford to remove iSCSI from one server (the one that was rebooting frequently), and since we did that, it has not rebooted.
OK so it really seems to be an iSCSI issue. I don't have any time right now but if it's still unresolved in a couple of weeks time I can start looking at the iSCSI code.
(In reply to comment #28)
> Am I being ignored?

I have been busy with RHEL 5 stuff. I will try to get a patch out soon.
Thanks. I don't mean to sound stern, but this is actively causing outages in Fedora, regularly at night (which wakes me up).
FWIW, I have not had a single unexpected reboot in months. I'd forgotten this ticket was even open. Feel free to close it.
I think I saw this last night on a Xen/iSCSI box running 2.6.18-194.26.1.el5xen:

Jan 6 23:51:32 stevens kernel: connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 307293912, last ping 307295162, now 307296412
Jan 6 23:51:32 stevens kernel: connection3:0: detected conn error (1011)
Jan 6 23:51:33 stevens iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
Jan 6 23:51:33 stevens kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 307294168, last ping 307295418, now 307296668
Jan 6 23:51:33 stevens kernel: connection1:0: detected conn error (1011)
Jan 6 23:51:34 stevens kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 307294385, last ping 307295635, now 307296885
Jan 6 23:51:34 stevens kernel: connection2:0: detected conn error (1011)
Jan 6 23:51:34 stevens iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 6 23:51:35 stevens iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)

By the time I got to it, the machine had rebooted of its own accord, and there wasn't an oops in the kernel log (syslog writes to a local disk rather than iSCSI). I'm out of town, so I'm moderately pleased it came back up on its own, but I didn't expect the reboot either. On this machine, some of the Xen VMs use iSCSI targets, but none of the dom0 resources are on iSCSI.
I'm having a similar issue to the one described in the previous comments. A Xen server with dom0 connected to an iSCSI SAN seems to be having kernel panics because of the iSCSI driver. Any ideas?

2.6.18-128.2.1.4.25.el5xen

Apr 24 02:42:14 NJOVS02 kernel: connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 260541169, last ping 260542419, now 260543669
Apr 24 02:42:14 NJOVS02 kernel: connection3:0: detected conn error (1011)
Apr 24 02:42:14 NJOVS02 iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
Apr 24 02:42:14 NJOVS02 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
Apr 24 02:42:14 NJOVS02 kernel: connection1:0: detected conn error (1011)
Apr 24 02:42:14 NJOVS02 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
Apr 24 02:42:14 NJOVS02 kernel: connection2:0: detected conn error (1011)
Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Apr 24 02:42:20 NJOVS02 iscsid: connect to 172.16.16.101:3260 failed (No route to host)
Apr 24 02:42:52 NJOVS02 last message repeated 15 times
Apr 24 02:45:59 NJOVS02 syslogd 1.4.1: restart.
(In reply to comment #36)
> the iSCSI driver. Any ideas?
>
> 2.6.18-128.2.1.4.25.el5xen
>
> Apr 24 02:42:14 NJOVS02 kernel: connection3:0: ping timeout of 5 secs expired,
> recv timeout 5, last rx 260541169, last ping 260542419, now 260543669
> Apr 24 02:42:14 NJOVS02 kernel: connection3:0: detected conn error (1011)
> Apr 24 02:42:14 NJOVS02 iscsid: Kernel reported iSCSI connection 3:0 error
> (1011) state (3)

The iSCSI layer did not see any traffic, so it sent an iSCSI ping to the target. The target did not respond, so the initiator dropped the connection.

> Apr 24 02:42:14 NJOVS02 kernel: connection1:0: ping timeout of 5 secs expired,
> recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
> Apr 24 02:42:14 NJOVS02 kernel: connection1:0: detected conn error (1011)
> Apr 24 02:42:14 NJOVS02 kernel: connection2:0: ping timeout of 5 secs expired,
> recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
> Apr 24 02:42:14 NJOVS02 kernel: connection2:0: detected conn error (1011)
> Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 1:0 error
> (1011) state (3)
> Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 2:0 error
> (1011) state (3)
> Apr 24 02:42:20 NJOVS02 iscsid: connect to 172.16.16.101:3260 failed (No route
> to host)

Could not connect to the target: no route to host. Probably a network problem, or the target could be down.

> Apr 24 02:42:52 NJOVS02 last message repeated 15 times
> Apr 24 02:45:59 NJOVS02 syslogd 1.4.1: restart.

We probably retried for replacement_timeout seconds (see iscsid.conf) and could not reconnect, so your FS got IO errors, or maybe a hangcheck/watchdog timer fired and caused the system to reboot. Normally there is something in the logs, though.
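The replacement timeout mentioned above is also set in the initiator configuration. A sketch of the relevant /etc/iscsi/iscsid.conf line (the value is an example for illustration, not a recommended setting):

```
# /etc/iscsi/iscsid.conf -- example value only
# After a connection drops, queue IO and keep retrying the login for this
# many seconds before failing outstanding commands back to the SCSI and
# filesystem layers (where they surface as IO errors).
node.session.timeo.replacement_timeout = 120
```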
@Mike Sun: FWIW, I upgraded my Dom0 to Fedora 14 with 2.6.32 Xen4 and it's been stable. I don't have my Xen 3 config any longer but I did gain more stability by playing with the scheduler credits to give Dom0 more time. I got the impression there was a race/deadlock, but that's not backed by empirical data.
Hi Mike,

We are also facing a system crash issue like the one Mike Sun described. Below is the system log:

Dec 6 04:07:13 prematics33 multipathd: sde: readsector0 checker reports path is down
Dec 6 04:07:53 prematics33 syslogd 1.4.1: restart.

On Dec 7th the same cron jobs ran at 04:02:01 and finished at 04:02:41. The sde drive failed again (multipathd: sde: readsector0 checker reports path is down), but this time the iscsid daemon recovered the failed path, so the system was not affected:

Dec 7 04:03:53 prematics33 iscsid: connection2:0 is operational after recovery (2 attempts)
Dec 7 04:03:53 prematics33 iscsid: connection1:0 is operational after recovery (2 attempts)
Dec 7 04:03:58 prematics33 multipathd: sde: readsector0 checker reports path is up

On Dec 8th the cron jobs ran again at 04:02:01 as scheduled, sde failed again, and this time iscsid did not recover the failed path, which caused the system to crash. Below is the log:

Dec 8 04:02:50 prematics33 logger: cron.daily: tmpwatch finished
Dec 8 04:03:19 prematics33 multipathd: sde: readsector0 checker reports path is down
Dec 8 04:03:19 prematics33 multipathd: checker failed path 8:64 in map pstor05
Dec 8 04:03:19 prematics33 multipathd: pstor05: remaining active paths: 1
Dec 8 04:03:19 prematics33 kernel: device-mapper: multipath: Failing path 8:64.
Dec 8 04:03:22 prematics33 multipathd: dm-0: add map (uevent)
Dec 8 04:03:22 prematics33 multipathd: dm-0: devmap already registered
Dec 8 04:04:07 prematics33 syslogd 1.4.1: restart.

Do we have any solution for this issue? Please let me know.

Thank you,
Venkat. V
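For the multipath side of this log, path-failure handling is governed by /etc/multipath.conf. A minimal illustrative fragment (the settings and values below are assumptions for illustration, not a verified fix for this bug):

```
# /etc/multipath.conf -- illustrative fragment, not a verified fix
defaults {
    # Seconds between path-checker runs (the "readsector0 checker" above).
    polling_interval  5
    # When all paths in a map are down, queue IO instead of failing it,
    # giving iscsid time to recover the session.
    no_path_retry     queue
}
```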
Venkat, What kernel and rhel version are you using?
Thanks for the reply, Mike. I have updated the details in bug 768835.

Thank you,
Venkat. V