Bug 500857 - [RHEL5 U4] Systems seems to hang on reboot
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL: http://rhts.redhat.com/cgi-bin/rhts/t...
Keywords: Regression
Depends On:
Blocks: 475528
Reported: 2009-05-14 11:09 EDT by Jeff Burke
Modified: 2014-06-29 19:01 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 04:14:52 EDT


Attachments
gospos proposed patch (2.23 KB, patch)
2009-05-29 14:00 EDT, Neil Horman

Description Jeff Burke 2009-05-14 11:09:50 EDT
Description of problem:
 Systems hang when rebooting.

Version-Release number of selected component (if applicable):
2.6.18-147.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.4 tree anything post 20099412 nightly on cisco-ca-blade2.rhts.bos.redhat.com
2. Configure kdump and reboot
  
Actual results:

INIT: INIT: Sending processes the TERM signal
Shutting down Avahi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Stopping yum-updatesd: [  OK  ]
Stopping anacron: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down xfs: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down sm-client: [  OK  ]
Shutting down sendmail: [  OK  ]
Stopping xinetd: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping crond: [  OK  ]
Stopping autofs:  Stopping automount: [  OK  ]
[  OK  ]
Shutting down ntpd: [  OK  ]
Stopping ipsec:  could not open include filename: '/etc/ipsec.d/*.conf' (tried  and )
ipsec_setup: Stopping Openswan IPsec...
[  OK  ]
Stopping system message bus: [  OK  ]
Stopping RPC idmapd: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping mcstransd: [  OK  ]
Stopping portmap: [  OK  ]
Shutting down restorecond: [  OK  ]
Stopping auditd: audit(1241796881.862:6): audit_pid=0 old=10427 by auid=4294967295 subj=system_u:system_r:auditd_t:s0
[  OK  ]
Stopping PC/SC smart card daemon (pcscd): [  OK  ]
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]
Shutting down hidd: [  OK  ]
[  OK  ][  OK  ]Stopping Bluetooth services:[  OK  ][  OK  ]
Disabling ondemand cpu frequency scaling: [  OK  ]
Starting killall:  [  OK  ]
Sending all processes the TERM signal... type=1701 audit(1241796884.912:7): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=system_u:system_r:iscsid_t:s0 pid=9980 comm="iscsid" sig=11

Sending all processes the KILL signal... 
Saving random seed:  
Syncing hardware clock to system time type=1111 audit(1241796890.999:8): user pid=14631 uid=0 auid=4294967295 subj=system_u:system_r:hwclock_t:s0 msg='changing system time: exe="/sbin/hwclock" (hostname=?, addr=?, terminal=console res=success)'

Turning off swap:  
Turning off quotas:  
Unmounting pipe file systems:  
Unmounting file systems:  
Please stand by while rebooting the system...
md: stopping all md devices.
Synchronizing SCSI cache for disk sda: 
ACPI: PCI interrupt for device 0000:06:00.1 disabled
usb 6-1: new full speed USB device using uhci_hcd and address 2
usb 6-1: not running at top speed; connect to a high speed hub
usb 6-1: configuration #1 chosen from 1 choice
hub 6-1:1.0: USB hub found
hub 6-1:1.0: 2 ports detected
usb 4-2: new full speed USB device using uhci_hcd and address 3
usb 4-2: not running at top speed; connect to a high speed hub
usb 4-2: configuration #1 chosen from 1 choice
scsi4 : SCSI emulation for USB Mass Storage devices
[-- MARK -- Fri May  8 11:35:00 2009]
usb 4-2: reset full speed USB device using uhci_hcd and address 3
  Vendor: Cisco     Model: Virtual CD/DVD    Rev: 1.10
  Type:   CD-ROM                             ANSI SCSI revision: 00
sr1: scsi3-mmc drive: 0x/0x cd/rw caddy
sr 4:0:0:0: Attached scsi generic sg3 type 5
  Vendor: Cisco     Model: Virtual FD/HD     Rev: 1.10
  Type:   Direct-Access                      ANSI SCSI revision: 00
sd 4:0:0:1: Attached scsi removable disk sdc
sd 4:0:0:1: Attached scsi generic sg4 type 0

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=57782 <- Pass
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=57938 <- Fail

Expected results:
System should reboot properly 

Additional info:
Slightly different hang output with host sun-v40z-01.rhts.bos.redhat.com; this one hangs without kdump configured.

Starting monitoring for VG VolGroup00:   2 logical volume(s) in volume group "VolGroup00" monitored
[  OK  ]
Starting background readahead: [  OK  ]
Checking for hardware changes [  OK  ]
Enabling ondemand cpu frequency scaling: [  OK  ]
Turning off network shutdown.

http://rhts.redhat.com/testlogs/58610/195971/1633267/console.txt

Cisco system information:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8070041
Comment 3 Scott Feldman 2009-05-14 14:10:55 EDT
I can't access any of the http://rhts.redhat.com URLs mentioned in the description.  Problem on my end?
Comment 4 Andrius Benokraitis 2009-05-14 14:21:16 EDT
Scott - no, that's correct - rhts is an internal site, walled off from the interweb.

Some folks here say there could be an issue with the virtual CDROM doing some flaking out at shutdown. I'll leave it to the experts to confirm that and reply back.
Comment 5 Don Zickus 2009-05-14 14:51:26 EDT
Actually, we still believe it is a network issue, only because if we do an 'ifdown eth0' before the reboot, everything works fine.

This goes along the lines of some other bugs we have seen, because we have supposedly changed the way NICs shut down for 5.4.
Comment 22 Mike Christie 2009-05-15 15:50:03 EDT
Oh yeah, one other thing: at some point I was thinking that I could just add some network and iSCSI shutdown code to the /etc/init.d/halt script. I would have to load the /sbin/halt or kexec command into memory, then I could stop the network and stop iSCSI (NFS is mounted read-only in /etc/init.d/halt, so it does not need a special shutdown, right?).

Would that be better? How do you load a program in memory without running it? Would I just have to create a ram based FS and run it from there?
Comment 24 Neil Horman 2009-05-15 18:45:35 EDT
Mike, in response to your comments: following along with what you said, I see why you've done it the way you did. And while it's not the safest idea to rely on the network while iSCSI is shutting down, I guess you need to for now. We also shouldn't be hanging during shutdown, so to that end I'm trying to figure out why that might be via a dissection of our git tree. I'll put some thought into how we might bring iSCSI to a stop in a slightly safer fashion.

To answer your question above: yes, the way you access a program after you have unmounted your rootfs is to make a ramdisk and do a pivot_root to it.

I'll post here when I know more about the hang.
Comment 25 Neil Horman 2009-05-20 15:35:40 EDT
Update: I bisected the point where this started happening and, as Andy thought it might, it started occurring with the latest ixgbe update. Given that the last warning tells us that the device's PCI interrupt is getting disabled, I wonder if we're processing an outstanding interrupt when this happens and that's causing a problem. I'm going to try adding some disable_irq calls prior to our call to pci_disable_device to see if that fixes us up.
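
The bisection mentioned above follows the same idea as `git bisect`: repeatedly test the midpoint of an ordered range until the first failing revision is isolated. A minimal sketch of that search, assuming the builds are ordered oldest to newest and that every build after the first bad one also hangs. The function name `first_bad`, the `is_bad` callback (which would stand in for "boot this build, reboot, and see whether it hangs"), and the build numbers in the usage line are all hypothetical and not taken from this report:

```python
from typing import Callable, Optional, Sequence


def first_bad(builds: Sequence[int], is_bad: Callable[[int], bool]) -> Optional[int]:
    """Return the earliest build for which is_bad() is true, or None.

    Assumes builds are ordered oldest to newest and that once a build is
    bad, every later build is bad too (the usual bisection assumption).
    """
    lo, hi = 0, len(builds)          # search window is builds[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid                 # first bad build is at mid or earlier
        else:
            lo = mid + 1             # first bad build is after mid
    return builds[lo] if lo < len(builds) else None


# Hypothetical usage: pretend builds 145 and later hang on reboot.
print(first_bad(range(140, 148), lambda build: build >= 145))  # prints 145
```

Each probe halves the remaining range, so even a long list of nightly builds or commits needs only a handful of reboot tests.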
Comment 33 Neil Horman 2009-05-28 20:58:08 EDT
I think we can ignore the SCSI stuff (unless you have further comment, Pete). I just tried removing all the USB modules prior to halt (which stops the khub_thread that prints out the above), and we still hung.
Comment 36 Neil Horman 2009-05-29 14:00:14 EDT
Created attachment 345936
gospos proposed patch
Comment 37 RHEL Product and Program Management 2009-05-29 15:10:40 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 38 Neil Horman 2009-05-29 15:31:51 EDT
 http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1820023

I'm building a test kernel with Andy's patch (minus the MSI-X chunk at the end, at his request) for testing when the Cisco box is feeling better and can see link on its interfaces.
Comment 39 Andrius Benokraitis 2009-05-30 01:55:30 EDT
Neil/Gospo - can you all detail the way to reproduce this issue? I'd like to keep Cisco in the loop to test this on their side as well.
Comment 40 Neil Horman 2009-05-30 19:29:42 EDT
Fire up the box, make sure the ixgbe network driver is installed and loaded, and reboot.  It will hang during shutdown.
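
Before rebooting, it can help to confirm that the driver really is loaded. A minimal sketch, assuming a Linux system where `/proc/modules` lists one loaded module per line with the module name as the first field; the helper names `module_loaded` and `ixgbe_loaded` are illustrative and not part of any tool mentioned in this report:

```python
from pathlib import Path


def module_loaded(name: str, modules_text: str) -> bool:
    """Return True if `name` is the first field of any line in /proc/modules-style text."""
    for line in modules_text.splitlines():
        fields = line.split()
        # Compare only the first field, so a module that merely lists
        # `name` as a dependency does not count as a match.
        if fields and fields[0] == name:
            return True
    return False


def ixgbe_loaded(path: str = "/proc/modules") -> bool:
    """Check the running system (Linux only) for a loaded ixgbe module."""
    return module_loaded("ixgbe", Path(path).read_text())
```

Calling `ixgbe_loaded()` on the test box before issuing the reboot gives a quick yes/no answer, which is roughly what running `lsmod` and looking for ixgbe would show.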

I've tested Andy's patch from comment #36, and it passed several times for me. So I think we've found a winner. Interestingly, removing the last chunk to back out the MSI-X vector fix in the patch caused the box to continue to hang, so even though the NAPI poll bug is still valid, it doesn't seem related to this hang. I'm going to test just the MSI-X fix on Sunday to confirm, but regardless, I think the whole patch needs to go in.

Since it's gospo's patch, I'm reassigning this to him to post for 5.4 Monday morning. Thanks, Andy!
Comment 47 Don Zickus 2009-06-04 12:07:51 EDT
in kernel-2.6.18-152.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 49 Chris Ward 2009-07-03 14:43:59 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 50 Chris Ward 2009-07-10 15:13:18 EDT
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~

RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching.

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Comment 51 Jan Tluka 2009-07-20 11:16:28 EDT
Patch is in -158.el5. Adding SanityOnly.
Comment 52 Andrius Benokraitis 2009-08-04 13:39:08 EDT
Tested/Verified based on conversations with Shrijeet @ Cisco.
Comment 54 errata-xmlrpc 2009-09-02 04:14:52 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
