Description of problem:
We have a number of boxes connecting to an iSCSI NetApp target, and those running RHEL 5 are rebooting from time to time (between 7 and 10 times a week). We get these errors in the logs:

Jan 20 13:14:56 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:00 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)
Jan 20 13:15:21 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:22 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:26 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)
Jan 20 13:15:35 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:37 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:41 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)

Version-Release number of selected component (if applicable):
2.6.18-53.1.4.el5xen

How reproducible:
We can't reproduce it on demand, but it keeps happening. We're running Xen instances over LVM over iSCSI. The Xen dom0s are rebooting, so I'm not really sure whether this is a kernel bug or an iSCSI bug, but since it's just a reboot (we're not even getting an oops) we think it's a kernel issue.

Additional info:
This is actively causing outages for Fedora, and we'd be happy to help provide access to machines, etc. if needed.
Created attachment 292299 [details]
Messages from NetApp Release 7.2.1.
It's been suggested that this may be a duplicate of bug 245823. I'm going to test the fix suggested there now.
Further info on this: the last thing we see before the reboot in syslog is:

Feb 20 02:52:15 xen2 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping

We have other servers logging the same message, but those machines stay up. ipmitool sel list shows:

47 | 02/18/2008 | 14:44:02 | OS Stop/Shutdown #0x46 | Run-time critical stop | Asserted
48 | Linux kernel panic: Fatal excep
49 | Linux kernel panic: tion
4a | 02/19/2008 | 04:34:33 | OS Stop/Shutdown #0x46 | Run-time critical stop | Asserted
4b | Linux kernel panic: Fatal excep
4c | Linux kernel panic: tion
4d | 02/20/2008 | 02:53:34 | OS Stop/Shutdown #0x46 | Run-time critical stop | Asserted
4e | Linux kernel panic: Fatal excep
4f | Linux kernel panic: tion
The iSCSI initiator sends an iSCSI ping (NOP-Out) to the target every node.conn[0].timeo.noop_out_interval seconds (see /etc/iscsi/iscsid.conf) to check that the target is OK. If we do not get a response within node.conn[0].timeo.noop_out_timeout seconds, you will see the nop-out message (in RHEL 5.2 the message is a little different: instead of mentioning a nop-out timing out, it says that a ping timed out):

Feb 20 02:52:15 xen2 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping

This means that we tried to ping the target and it did not reply within 15 seconds, so the iSCSI initiator drops the session and tries to rebuild it.

These other messages in the logs indicate there was a connection error. You can see this when a nop-out/ping times out and we drop the session/connection. You would also see it if there were some connection problem (someone pulled a cable, or for some reason we could not send the IO on the network and the network layer returned an error to us), or if IO was not executing fast enough and the SCSI command timer fired:

Jan 20 13:14:56 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)

This message indicates that we reconnected and relogged in a couple of seconds later:

Jan 20 13:15:00 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)

Now this is a little weird. Every once in a while you might see a nop/ping timeout because of some problem in the network, but I am not sure what is happening here. It looks like we keep dropping the session and reconnecting a lot. When you see this, are there any messages about a nop/ping timing out, or any other errors?
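For reference, the two timers described above live in the initiator configuration. A minimal sketch of the relevant /etc/iscsi/iscsid.conf settings (the values shown are illustrative examples, not recommended or shipped defaults):

```
# /etc/iscsi/iscsid.conf -- example values only, tune for your network
# Send an iSCSI NOP-Out (ping) to the target every N seconds.
node.conn[0].timeo.noop_out_interval = 10
# If the target does not answer the ping within N seconds, drop the
# session and let iscsid rebuild it (this produces the
# "Nop-out timedout ... Dropping" and "conn error (1011)" messages).
node.conn[0].timeo.noop_out_timeout = 15
```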
Jan 20 13:15:21 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:22 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:26 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)
Jan 20 13:15:35 xen1 kernel: connection1:0: iscsi: detected conn error (1011)
Jan 20 13:15:37 xen1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 20 13:15:41 xen1 iscsid: connection1:0 is operational after recovery (2 attempts)

I have never seen the above errors cause a box to reboot. Is iSCSI being run in DomU or Dom0, and is DomU or Dom0 rebooting? What type of workload is going on when you see these problems: is there heavy IO, or no IO except the iSCSI pings? And is iSCSI being used for root?

I do not think bug 245823 is your issue. Bug 179887 is probably not your issue either, because with that one we are not able to reconnect and do IO again (we are not able to allocate memory at the network layer). I have also never seen it cause a box to reboot, though maybe that just has something to do with the IPMI use here.

I cannot see the messages.gz file. Can you email it to me privately, and can I get access to one of the boxes causing problems? Thanks.
(In reply to comment #5)
> Is iSCSI being run in DomU or Dom0, and is DomU or Dom0 rebooting?

Ignore this; I see the answer in the original comment. One question, though, because I am not familiar with Xen: when you say "Xen instance", does that mean iSCSI was running in what VMware would call the guest? And does "Dom0 rebooting" mean that the guest is rebooting or that the real box is rebooting?
Oh, sorry, yes. The dom0 is rebooting. Our setup is such that dom0 has an iSCSI mount (LVM) and the domUs are on that LVM.
Coincidentally, we've been seeing a similar problem (with the entire machine rebooting; I'm not sure if that's what's happening here, as I'm not too familiar with Xen terminology) on one of the servers we're going to use to host the rpmfusion repo. We've tracked it down to (most likely) problems with an Adaptec (formerly DPT) I2O RAID (V) controller. Do you happen to have one of those in this box? I've read that you use iSCSI, but maybe you use something else for the rootfs. Anyway, if you have one of those controllers, we might have a solution (we still need to test it).
In our case dom0 is all local; only the domUs run over iSCSI, so an outage there, in theory at least, shouldn't take the whole machine down like that. It's interesting because the log noted in comment #4 is happening on many servers, but they don't reboot.
What architecture is this? Could you post your kernel command line options and dmesg output?
Created attachment 295816 [details]
dmesg from xen2

Attached as requested.
This is an x86_64 box. The current running grub config is:

serial --unit=0 --speed=9600
terminal --timeout=5 serial console
title Red Hat Enterprise Linux Server (2.6.18-72.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-72.el5 com1=9600n8
        module /vmlinuz-2.6.18-72.el5xen ro root=/dev/VolGroup00/root console=ttyS0,9600n8 console=tty0 rhgb quiet
        module /initrd-2.6.18-72.el5xen.img
Created attachment 295889 [details]
Oops from crash

I've finally caught an oops from the crash. Here it is.
Side note about this: it happened during a time of high NFS load, so it's possible that's a contributor. We didn't see much load this past weekend and had no issues.
Created attachment 296285 [details]
Another oops

This is another oops we just got a bit ago.
I am adding Herbert Xu, because he knows a lot about Xen and networking, and the oopses in comments 13 and 15 have lots of both in them :)
The oops indicates that one of the skb page frags is bogus. So you should try to identify why that's the case.
(In reply to comment #17) > The oops indicates that one of the skb page frags is bogus. So you should try > to identify why that's the case. How would one do that?
For a start find out who's generating the packet. Is the TCP payload coming from user-space or kernel-space iSCSI?
We've never been able to re-create this through any of our tests.
Created attachment 297166 [details]
Another oops

This is another oops that happened this morning.
OK, both oopses say the same thing: one of the skb page frags has been unmapped. Interestingly, both backtraces are triggered by TCP retransmits. So, going out on a limb here, let's assume that the packet in question is generated in-kernel by iSCSI; what this says is that the page reference count on the skb page frag is broken. That's where I would start looking (of course, I have way too much on my plate right now to actually look, so someone else will need to do that :)
Thanks Herbert. I am looking at it. Mike, I will send you a patch this weekend that should spit out some more info if it is iscsi. Mike, is it also possible to take nfs or iscsi out of the picture so we can narrow this down by one major network user?
We can take iSCSI out of the picture; taking NFS out would probably require replacing it with something else. I'll work on taking iSCSI out.
This host is no longer running iSCSI at all.
(In reply to comment #23)
> Thanks Herbert. I am looking at it.
>
> Mike, I will send you a patch this weekend that should spit out some more info
> if it is iscsi.
>
> Mike, is it also possible to take nfs or iscsi out of the picture so we can
> narrow this down by one major network user?

If you've got that patch, let me know. We had another host (one that used to reboot but had stopped) reboot. The last bit in the logs:

Mar 11 09:37:04 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 11 09:37:08 xen1 iscsid: connection2:0 is operational after recovery (2 attempts)
Mar 11 09:41:59 xen1 syslogd 1.4.1: restart.
This happened again on our xen1 box. Though it happens less frequently than on xen2, we use iSCSI heavily and we really need to get this bug tracked down. Do we have that patch for more iSCSI debugging yet?
Am I being ignored?
Mike, so you got a crash without iSCSI? Could you please attach the stack trace with that? Thanks!
It only crashes with iSCSI. We could only afford to remove iSCSI from one server (the one that was rebooting frequently), and since we did that, it has not rebooted.
OK so it really seems to be an iSCSI issue. I don't have any time right now but if it's still unresolved in a couple of weeks time I can start looking at the iSCSI code.
(In reply to comment #28)
> Am I being ignored?

I have been busy with RHEL 5 stuff. I will try to get a patch out soon.
Thanks. I don't mean to sound stern, but this is actively causing outages in Fedora, regularly at night (which wakes me up).
FWIW, I have not had a single unexpected reboot in months. I'd forgotten this ticket was even open. Feel free to close it.
I think I saw this last night on a Xen/iSCSI box running 2.6.18-194.26.1.el5xen:

Jan 6 23:51:32 stevens kernel: connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 307293912, last ping 307295162, now 307296412
Jan 6 23:51:32 stevens kernel: connection3:0: detected conn error (1011)
Jan 6 23:51:33 stevens iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
Jan 6 23:51:33 stevens kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 307294168, last ping 307295418, now 307296668
Jan 6 23:51:33 stevens kernel: connection1:0: detected conn error (1011)
Jan 6 23:51:34 stevens kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 307294385, last ping 307295635, now 307296885
Jan 6 23:51:34 stevens kernel: connection2:0: detected conn error (1011)
Jan 6 23:51:34 stevens iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 6 23:51:35 stevens iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)

By the time I got to it, the machine had rebooted of its own accord, and there wasn't an oops in the kernel log (syslog writes to a local disk rather than iSCSI). I'm out of town, so I'm moderately pleased it came back up on its own, but I didn't expect the reboot either. On this machine, some of the Xen VMs use iSCSI targets, but none of the dom0 resources are on iSCSI.
I'm having a similar issue to the one described in the previous comments. A Xen server with dom0 connected to an iSCSI SAN seems to be having kernel panics because of the iSCSI driver. Any ideas?

2.6.18-128.2.1.4.25.el5xen

Apr 24 02:42:14 NJOVS02 kernel: connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 260541169, last ping 260542419, now 260543669
Apr 24 02:42:14 NJOVS02 kernel: connection3:0: detected conn error (1011)
Apr 24 02:42:14 NJOVS02 iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
Apr 24 02:42:14 NJOVS02 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
Apr 24 02:42:14 NJOVS02 kernel: connection1:0: detected conn error (1011)
Apr 24 02:42:14 NJOVS02 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
Apr 24 02:42:14 NJOVS02 kernel: connection2:0: detected conn error (1011)
Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Apr 24 02:42:20 NJOVS02 iscsid: connect to 172.16.16.101:3260 failed (No route to host)
Apr 24 02:42:52 NJOVS02 last message repeated 15 times
Apr 24 02:45:59 NJOVS02 syslogd 1.4.1: restart.
(In reply to comment #36)
> the iSCSI driver. Any ideas?
>
> 2.6.18-128.2.1.4.25.el5xen
>
> Apr 24 02:42:14 NJOVS02 kernel: connection3:0: ping timeout of 5 secs expired,
> recv timeout 5, last rx 260541169, last ping 260542419, now 260543669
> Apr 24 02:42:14 NJOVS02 kernel: connection3:0: detected conn error (1011)
> Apr 24 02:42:14 NJOVS02 iscsid: Kernel reported iSCSI connection 3:0 error
> (1011) state (3)

The iSCSI layer did not see any traffic, so it sent an iSCSI ping to the target. The target did not respond, so the initiator dropped the connection.

> Apr 24 02:42:14 NJOVS02 kernel: connection1:0: ping timeout of 5 secs expired,
> recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
> Apr 24 02:42:14 NJOVS02 kernel: connection1:0: detected conn error (1011)
> Apr 24 02:42:14 NJOVS02 kernel: connection2:0: ping timeout of 5 secs expired,
> recv timeout 5, last rx 260541282, last ping 260542532, now 260543782
> Apr 24 02:42:14 NJOVS02 kernel: connection2:0: detected conn error (1011)
> Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 1:0 error
> (1011) state (3)
> Apr 24 02:42:15 NJOVS02 iscsid: Kernel reported iSCSI connection 2:0 error
> (1011) state (3)
> Apr 24 02:42:20 NJOVS02 iscsid: connect to 172.16.16.101:3260 failed (No route
> to host)

Could not connect to the target: no route to host. Probably a network problem, or the target could be down.

> Apr 24 02:42:52 NJOVS02 last message repeated 15 times
> Apr 24 02:45:59 NJOVS02 syslogd 1.4.1: restart.

We probably retried for replacement_timeout seconds (see iscsid.conf) and could not reconnect, so your FS got IO errors, or maybe a hangcheck/watchdog timer fired and caused the system to reboot. Normally there is something in the logs, though.
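The replacement timeout mentioned above is also set in the initiator configuration. A sketch of the relevant /etc/iscsi/iscsid.conf line (the value is an example for illustration, not a recommended setting):

```
# /etc/iscsi/iscsid.conf -- example value only
# After a connection drops, queue IO and keep retrying the login for this
# many seconds before failing outstanding commands back to the SCSI and
# filesystem layers (where they surface as IO errors).
node.session.timeo.replacement_timeout = 120
```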
@Mike Sun: FWIW, I upgraded my Dom0 to Fedora 14 with 2.6.32 Xen4 and it's been stable. I don't have my Xen 3 config any longer but I did gain more stability by playing with the scheduler credits to give Dom0 more time. I got the impression there was a race/deadlock, but that's not backed by empirical data.
Hi Mike,

We are also facing a system crash issue like the one Mike Sun described. Below is the system log:

Dec 6 04:07:13 prematics33 multipathd: sde: readsector0 checker reports path is down
Dec 6 04:07:53 prematics33 syslogd 1.4.1: restart.

On Dec 7th the same cron jobs ran at 04:02:01 and finished at 04:02:41. The sde drive failed again (multipathd: sde: readsector0 checker reports path is down), but this time the iscsid daemon recovered the failed path, so the system was not affected:

Dec 7 04:03:53 prematics33 iscsid: connection2:0 is operational after recovery (2 attempts)
Dec 7 04:03:53 prematics33 iscsid: connection1:0 is operational after recovery (2 attempts)
Dec 7 04:03:58 prematics33 multipathd: sde: readsector0 checker reports path is up

On Dec 8th the cron jobs ran again at 04:02:01 as scheduled, sde failed again, and this time iscsid did not recover the failed path, which caused the system to crash. Below is the log:

Dec 8 04:02:50 prematics33 logger: cron.daily: tmpwatch finished
Dec 8 04:03:19 prematics33 multipathd: sde: readsector0 checker reports path is down
Dec 8 04:03:19 prematics33 multipathd: checker failed path 8:64 in map pstor05
Dec 8 04:03:19 prematics33 multipathd: pstor05: remaining active paths: 1
Dec 8 04:03:19 prematics33 kernel: device-mapper: multipath: Failing path 8:64.
Dec 8 04:03:22 prematics33 multipathd: dm-0: add map (uevent)
Dec 8 04:03:22 prematics33 multipathd: dm-0: devmap already registered
Dec 8 04:04:07 prematics33 syslogd 1.4.1: restart.

Do we have any solution for this issue? Please let me know.

Thank you,
Venkat. V
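For the multipath side of this log, path-failure handling is governed by /etc/multipath.conf. A minimal illustrative fragment (the settings and values below are assumptions for illustration, not a verified fix for this bug):

```
# /etc/multipath.conf -- illustrative fragment, not a verified fix
defaults {
    # Seconds between path-checker runs (the "readsector0 checker" above).
    polling_interval  5
    # When all paths in a map are down, queue IO instead of failing it,
    # giving iscsid time to recover the session.
    no_path_retry     queue
}
```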
Venkat, What kernel and rhel version are you using?
Thanks for the reply, Mike. I have updated the details in bug 768835.

Thank you,
Venkat. V