Description of problem:
When rebooting systems they are hanging.

Version-Release number of selected component (if applicable):
2.6.18-147.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.4 tree anything post 20099412 nightly on cisco-ca-blade2.rhts.bos.redhat.com
2. Configure kdump and reboot

Actual results:
INIT: INIT: Sending processes the TERM signal
Shutting down Avahi daemon: [ OK ]
Stopping HAL daemon: [ OK ]
Stopping yum-updatesd: [ OK ]
Stopping anacron: [ OK ]
Stopping atd: [ OK ]
Stopping cups: [ OK ]
Shutting down xfs: [ OK ]
Shutting down console mouse services: [ OK ]
Stopping sshd: [ OK ]
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]
Stopping xinetd: [ OK ]
Stopping acpi daemon: [ OK ]
Stopping crond: [ OK ]
Stopping autofs: Stopping automount: [ OK ] [ OK ]
Shutting down ntpd: [ OK ]
Stopping ipsec: could not open include filename: '/etc/ipsec.d/*.conf' (tried and )
ipsec_setup: Stopping Openswan IPsec... [ OK ]
Stopping system message bus: [ OK ]
Stopping RPC idmapd: [ OK ]
Stopping NFS statd: [ OK ]
Stopping mcstransd: [ OK ]
Stopping portmap: [ OK ]
Shutting down restorecond: [ OK ]
Stopping auditd: audit(1241796881.862:6): audit_pid=0 old=10427 by auid=4294967295 subj=system_u:system_r:auditd_t:s0 [ OK ]
Stopping PC/SC smart card daemon (pcscd): [ OK ]
Shutting down kernel logger: [ OK ]
Shutting down system logger: [ OK ]
Shutting down hidd: [ OK ] [ OK ][ OK ]Stopping Bluetooth services:[ OK ][ OK ]
Disabling ondemand cpu frequency scaling: [ OK ]
Starting killall: [ OK ]
Sending all processes the TERM signal...
type=1701 audit(1241796884.912:7): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=system_u:system_r:iscsid_t:s0 pid=9980 comm="iscsid" sig=11
Sending all processes the KILL signal...
Saving random seed:
Syncing hardware clock to system time
type=1111 audit(1241796890.999:8): user pid=14631 uid=0 auid=4294967295 subj=system_u:system_r:hwclock_t:s0 msg='changing system time: exe="/sbin/hwclock" (hostname=?, addr=?, terminal=console res=success)'
Turning off swap:
Turning off quotas:
Unmounting pipe file systems:
Unmounting file systems:
Please stand by while rebooting the system...
md: stopping all md devices.
Synchronizing SCSI cache for disk sda:
ACPI: PCI interrupt for device 0000:06:00.1 disabled
usb 6-1: new full speed USB device using uhci_hcd and address 2
usb 6-1: not running at top speed; connect to a high speed hub
usb 6-1: configuration #1 chosen from 1 choice
hub 6-1:1.0: USB hub found
hub 6-1:1.0: 2 ports detected
usb 4-2: new full speed USB device using uhci_hcd and address 3
usb 4-2: not running at top speed; connect to a high speed hub
usb 4-2: configuration #1 chosen from 1 choice
scsi4 : SCSI emulation for USB Mass Storage devices
[-- MARK -- Fri May 8 11:35:00 2009]
usb 4-2: reset full speed USB device using uhci_hcd and address 3
Vendor: Cisco Model: Virtual CD/DVD Rev: 1.10
Type: CD-ROM ANSI SCSI revision: 00
sr1: scsi3-mmc drive: 0x/0x cd/rw caddy
sr 4:0:0:0: Attached scsi generic sg3 type 5
Vendor: Cisco Model: Virtual FD/HD Rev: 1.10
Type: Direct-Access ANSI SCSI revision: 00
sd 4:0:0:1: Attached scsi removable disk sdc
sd 4:0:0:1: Attached scsi generic sg4 type 0

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=57782 <- Pass
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=57938 <- Fail

Expected results:
System should reboot properly

Additional info:
Slightly different hang output with host sun-v40z-01.rhts.bos.redhat.com this one hang without kdump configured.

Starting monitoring for VG VolGroup00: 2 logical volume(s) in volume group "VolGroup00" monitored [ OK ]
Starting background readahead: [ OK ]
Checking for hardware changes [ OK ]
Enabling ondemand cpu frequency scaling: [ OK ]
Turning off network shutdown.

http://rhts.redhat.com/testlogs/58610/195971/1633267/console.txt

Cisco system information: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8070041
I can't access any of the http://rhts.redhat.com URLs mentioned in the description. Problem on my end?
Scott - no, that's correct - rhts is an internal site, walled off from the interweb. Some folks here say there could be an issue with the virtual CDROM doing some flaking out at shutdown. I'll leave it to the experts to confirm that and reply back.
Actually, we still believe it is a network issue, only because if we do an 'ifdown eth0' before the reboot, everything works fine. This is in line with some other bugs we have seen, because we have supposedly changed the way NICs shut down for 5.4.
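For anyone who wants to repeat the 'ifdown eth0' workaround in a scripted way, here is a rough sketch in Python. It only builds and prints the command sequence by default. The function names (plan_reboot, run_plan) and the dry_run flag are made up for illustration and are not part of any Red Hat tool. It assumes the box does not depend on the network for its root filesystem.

```python
import subprocess


def plan_reboot(interfaces=("eth0",)):
    """Return the commands in order: ifdown each interface, then reboot."""
    commands = [["ifdown", iface] for iface in interfaces]
    commands.append(["reboot"])
    return commands


def run_plan(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check_call raises if a command fails, so we never reboot
            # after a failed ifdown.
            subprocess.check_call(cmd)


if __name__ == "__main__":
    # Dry run: shows what would be executed without touching the system.
    run_plan(plan_reboot(), dry_run=True)
```

Running it with dry_run=False as root would bring the interface down first and then reboot, which is the order that avoided the hang here.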
Oh yeah, one other thing, at some point I was thinking that I could just add some network and iscsi shutdown code to the /etc/init.d/halt script. I would have to load the /sbin/halt or kexec command in memory then I could stop the network and stop iscsi (nfs is mounted read only in /etc/init.d/halt so it does not need a special shutdown, right?). Would that be better? How do you load a program in memory without running it? Would I just have to create a ram based FS and run it from there?
Mike, in response to your comments: following along with what you said, I see why you did things the way you did. I also think that, while it's not the safest idea to rely on the network while iscsi is shutting down, I guess you need to for now. We also shouldn't be hanging during shutdown, so to that end I'm trying to figure out why that might be via a dissection of our git tree. I'll put some thought into how we might bring iscsi to a stop in a slightly safer fashion. To answer your question above: yes, the way you access a program after you have unmounted your rootfs is to make a ramdisk and do a pivot root to it. I'll post here when I know more about the hang.
Update: I bisected the point where this started happening, and as Andy thought it might, it started occurring with the latest ixgbe update. Given that the last warning is telling us that the device's PCI interrupt is getting disabled, I wonder if we're not processing an outstanding interrupt when this is happening and that's causing a problem. I'm going to try adding some disable_irq's prior to our call to pci_disable_device to see if that fixes us up.
I think we can ignore the scsi stuff (unless you have further comment, Pete). I just tried to remove all the usb modules prior to halt (which stops the khub_thread that prints out the above), and we still hung.
Created attachment 345936 [details] gospos proposed patch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1820023 I'm building a test kernel with Andy's patch (minus the msix chunk at the end, at his request) for testing when the cisco box is feeling better and can see link on its interfaces.
Neil/Gospo - can you all detail the way to reproduce this issue? I'd like to keep Cisco in the loop to test this on their side as well.
Fire up the box, make sure the ixgbe network driver is installed and loaded, and reboot. It will hang during shutdown. I've tested Andy's patch from comment #36, and it passed several times for me. So I think we've found a winner. Interestingly, removing the last chunk to back out the msix vector fix in the patch caused the box to continue to hang, so even though the napi poll bug is still valid, it doesn't seem related to this hang. I'm going to test just the msix fix on Sunday to confirm, but regardless, I think the whole patch needs to go in. Since it's Gospo's patch, I'm reassigning this to him to post for 5.4 Monday morning. Thanks Andy!
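To confirm the "ixgbe is loaded" precondition before rebooting, something like the following Python sketch could be used. It parses /proc/modules-style text, where the first field on each line is the module name. The helper names and the SAMPLE text are invented for illustration; the sizes and addresses in SAMPLE are not real values from the affected box.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The module name is the first whitespace-separated field.
            names.add(fields[0])
    return names


def is_loaded(module, path="/proc/modules"):
    """Check the running system (Linux only) for a loaded module."""
    with open(path) as f:
        return module in loaded_modules(f.read())


# Made-up sample in /proc/modules layout, for illustration only.
SAMPLE = """\
ixgbe 100000 0 - Live 0x0000000000000000
sd_mod 50000 2 - Live 0x0000000000000000
"""

if __name__ == "__main__":
    print("ixgbe" in loaded_modules(SAMPLE))
```

On the test box, is_loaded("ixgbe") returning True would mean the reproduction precondition is met and a plain reboot should show the hang on an unpatched kernel.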
in kernel-2.6.18-152.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~ RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching. If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Patch is in -158.el5. Adding SanityOnly.
Tested/Verified based on conversations with Shrijeet @ Cisco.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html