Bug 500857

Summary: [RHEL5 U4] Systems seem to hang on reboot
Product: Red Hat Enterprise Linux 5 Reporter: Jeff Burke <jburke>
Component: kernel Assignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.4 CC: abdulkh, abjoglek, agospoda, andriusb, cward, dzickus, gcase, jim, jjarvis, jtluka, lwang, lwoodman, maurizio.antillon, mchristi, mgahagan, nhorman, pbunyan, peterm, savbu-lnx-drivers, scofeldm, zaitcev
Target Milestone: rc Keywords: Regression
Target Release: 5.4   
Hardware: All   
OS: Linux   
URL: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8127400
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 08:14:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 475528    
Attachments:
Description Flags
gospos proposed patch none

Description Jeff Burke 2009-05-14 15:09:50 UTC
Description of problem:
 Systems hang when they are rebooted.

Version-Release number of selected component (if applicable):
2.6.18-147.el5

How reproducible:
Always

Steps to Reproduce:
1. Install any RHEL5.4 tree later than the 20099412 nightly on cisco-ca-blade2.rhts.bos.redhat.com
2. Configure kdump and reboot
  
Actual results:

INIT: INIT: Sending processes the TERM signal
Shutting down Avahi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Stopping yum-updatesd: [  OK  ]
Stopping anacron: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down xfs: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down sm-client: [  OK  ]
Shutting down sendmail: [  OK  ]
Stopping xinetd: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping crond: [  OK  ]
Stopping autofs:  Stopping automount: [  OK  ]
[  OK  ]
Shutting down ntpd: [  OK  ]
Stopping ipsec:  could not open include filename: '/etc/ipsec.d/*.conf' (tried  and )
ipsec_setup: Stopping Openswan IPsec...
[  OK  ]
Stopping system message bus: [  OK  ]
Stopping RPC idmapd: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping mcstransd: [  OK  ]
Stopping portmap: [  OK  ]
Shutting down restorecond: [  OK  ]
Stopping auditd: audit(1241796881.862:6): audit_pid=0 old=10427 by auid=4294967295 subj=system_u:system_r:auditd_t:s0
[  OK  ]
Stopping PC/SC smart card daemon (pcscd): [  OK  ]
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]
Shutting down hidd: [  OK  ]
[  OK  ][  OK  ]Stopping Bluetooth services:[  OK  ][  OK  ]
Disabling ondemand cpu frequency scaling: [  OK  ]
Starting killall:  [  OK  ]
Sending all processes the TERM signal... type=1701 audit(1241796884.912:7): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=system_u:system_r:iscsid_t:s0 pid=9980 comm="iscsid" sig=11

Sending all processes the KILL signal... 
Saving random seed:  
Syncing hardware clock to system time type=1111 audit(1241796890.999:8): user pid=14631 uid=0 auid=4294967295 subj=system_u:system_r:hwclock_t:s0 msg='changing system time: exe="/sbin/hwclock" (hostname=?, addr=?, terminal=console res=success)'

Turning off swap:  
Turning off quotas:  
Unmounting pipe file systems:  
Unmounting file systems:  
Please stand by while rebooting the system...
md: stopping all md devices.
Synchronizing SCSI cache for disk sda: 
ACPI: PCI interrupt for device 0000:06:00.1 disabled
usb 6-1: new full speed USB device using uhci_hcd and address 2
usb 6-1: not running at top speed; connect to a high speed hub
usb 6-1: configuration #1 chosen from 1 choice
hub 6-1:1.0: USB hub found
hub 6-1:1.0: 2 ports detected
usb 4-2: new full speed USB device using uhci_hcd and address 3
usb 4-2: not running at top speed; connect to a high speed hub
usb 4-2: configuration #1 chosen from 1 choice
scsi4 : SCSI emulation for USB Mass Storage devices
[-- MARK -- Fri May  8 11:35:00 2009]
usb 4-2: reset full speed USB device using uhci_hcd and address 3
  Vendor: Cisco     Model: Virtual CD/DVD    Rev: 1.10
  Type:   CD-ROM                             ANSI SCSI revision: 00
sr1: scsi3-mmc drive: 0x/0x cd/rw caddy
sr 4:0:0:0: Attached scsi generic sg3 type 5
  Vendor: Cisco     Model: Virtual FD/HD     Rev: 1.10
  Type:   Direct-Access                      ANSI SCSI revision: 00
sd 4:0:0:1: Attached scsi removable disk sdc
sd 4:0:0:1: Attached scsi generic sg4 type 0

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=57782 <- Pass
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=57938 <- Fail

Expected results:
System should reboot properly 

Additional info:
Slightly different hang output on host sun-v40z-01.rhts.bos.redhat.com; this one hangs without kdump configured.

Starting monitoring for VG VolGroup00:   2 logical volume(s) in volume group "VolGroup00" monitored
[  OK  ]
Starting background readahead: [  OK  ]
Checking for hardware changes [  OK  ]
Enabling ondemand cpu frequency scaling: [  OK  ]
Turning off network shutdown.

http://rhts.redhat.com/testlogs/58610/195971/1633267/console.txt

Cisco system information:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8070041

Comment 3 Scott Feldman 2009-05-14 18:10:55 UTC
I can't access any of the http://rhts.redhat.com URLs mentioned in the description.  Problem on my end?

Comment 4 Andrius Benokraitis 2009-05-14 18:21:16 UTC
Scott - no, that's correct - rhts is an internal site, walled off from the interweb.

Some folks here say there could be an issue with the virtual CDROM doing some flaking out at shutdown. I'll leave it to the experts to confirm that and reply back.

Comment 5 Don Zickus 2009-05-14 18:51:26 UTC
Actually, we still believe it is a network issue, if only because running 'ifdown eth0' before the reboot makes everything work fine.

This is in line with some other bugs we have seen, because we have supposedly changed the way NICs shut down for 5.4.
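
For anyone wanting to script the workaround described above, a minimal sketch follows. It only encodes the ordering (interface down first, then reboot). The interface name eth0 comes from this comment; the function names and the dry_run flag are hypothetical, added for illustration.

```python
import subprocess


def reboot_commands(iface="eth0"):
    """Return the workaround commands in order: interface down, then reboot."""
    return [["ifdown", iface], ["reboot"]]


def run_workaround(iface="eth0", dry_run=True):
    """Run the commands in order, or only print them when dry_run is set."""
    for cmd in reboot_commands(iface):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Stops at the first failing command instead of rebooting anyway.
            subprocess.check_call(cmd)
```

Calling run_workaround() with the default dry_run=True only prints the two commands, so it is safe to try on a test box before passing dry_run=False as root.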

Comment 22 Mike Christie 2009-05-15 19:50:03 UTC
Oh yeah, one other thing, at some point I was thinking that I could just add some network and iscsi shutdown code to the /etc/init.d/halt script. I would have to load the /sbin/halt or kexec command in memory then I could stop the network and stop iscsi (nfs is mounted read only in  /etc/init.d/halt so it does not need a special shutdown, right?).

Would that be better? How do you load a program in memory without running it? Would I just have to create a ram based FS and run it from there?

Comment 24 Neil Horman 2009-05-15 22:45:35 UTC
Mike, in response to your comments: following along with what you said, I see why you've done it the way you did. I also think that, while it's not the safest idea to rely on the network while iscsi is shutting down, you need to for now. We also shouldn't be hanging during shutdown, so to that end I'm trying to figure out why that might be via a dissection of our git tree. I'll put some thought into how we might bring iscsi to a stop in a slightly safer fashion.

To answer your question above: yes, the way you access a program after you have unmounted your rootfs is to make a ramdisk and do a pivot root to it.

I'll post here when I know more about the hang.

Comment 25 Neil Horman 2009-05-20 19:35:40 UTC
Update: I bisected the point where this started happening and, as Andy thought it might, it started occurring with the latest ixgbe update. Given that the last warning tells us that the device's PCI interrupt is getting disabled, I wonder whether we're processing an outstanding interrupt when this happens and that's causing a problem. I'm going to try adding some disable_irq calls prior to our call to pci_disable_device to see if that fixes us up.
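
For readers unfamiliar with the bisection mentioned above, the idea is a binary search over an ordered list of builds for the first one that hangs. This is a minimal sketch, assuming the builds are ordered oldest to newest and that every build after the first bad one is also bad. The build labels and the hangs predicate are hypothetical placeholders, not taken from the actual tree.

```python
def first_bad(builds, is_bad):
    """Return the first build for which is_bad() is true, or None.

    Assumes builds are ordered oldest to newest and that once a build is
    bad, every later build is bad too (the usual bisection assumption).
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid        # first bad build is at mid or earlier
        else:
            lo = mid + 1    # first bad build is after mid
    return builds[lo] if lo < len(builds) else None


# Hypothetical labels; "c" stands in for the build that picked up the driver update.
builds = ["a", "b", "c", "d", "e"]
hangs = lambda build: build >= "c"
print(first_bad(builds, hangs))  # prints c
```

Each step halves the remaining range, so even a long list of nightly builds needs only a handful of reboot tests.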

Comment 33 Neil Horman 2009-05-29 00:58:08 UTC
I think we can ignore the SCSI stuff (unless you have further comment, Pete). I just tried removing all the USB modules prior to halt (which stops the khub_thread that prints out the above), and we still hung.

Comment 36 Neil Horman 2009-05-29 18:00:14 UTC
Created attachment 345936 [details]
gospos proposed patch

Comment 37 RHEL Program Management 2009-05-29 19:10:40 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 38 Neil Horman 2009-05-29 19:31:51 UTC
 http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1820023

I'm building a test kernel with Andy's patch (minus the msix chunk at the end, at his request) for testing when the Cisco box is feeling better and can see link on its interfaces.

Comment 39 Andrius Benokraitis 2009-05-30 05:55:30 UTC
Neil/Gospo - can you all detail the way to reproduce this issue? I'd like to keep Cisco in the loop to test this on their side as well.

Comment 40 Neil Horman 2009-05-30 23:29:42 UTC
Fire up the box, make sure the ixgbe network driver is installed and loaded, and reboot.  It will hang during shutdown.
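
Before rebooting, it may help to confirm that the driver really is loaded. A minimal sketch, assuming the standard /proc/modules layout in which the first whitespace-separated field of each line is the module name; the function names are hypothetical.

```python
def module_loaded(name, modules_text):
    """Return True if `name` is listed in /proc/modules-style text."""
    for line in modules_text.splitlines():
        fields = line.split()
        # Compare the whole first field so "ixgb" does not match "ixgbe".
        if fields and fields[0] == name:
            return True
    return False


def ixgbe_loaded(path="/proc/modules"):
    """Check the running system for the ixgbe module."""
    with open(path) as f:
        return module_loaded("ixgbe", f.read())
```

If ixgbe_loaded() returns False, the reproduction above would not be exercising the driver at all.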

I've tested Andy's patch from comment #36, and it passed several times for me. So I think we've found a winner. Interestingly, removing the last chunk to back out the msix vector fix in the patch caused the box to continue to hang, so even though the napi poll bug is still valid, it doesn't seem related to this hang. I'm going to test just the msix fix on Sunday to confirm, but regardless, I think the whole patch needs to go in.

Since it's gospo's patch, I'm reassigning this to him to post for 5.4 Monday morning. Thanks, Andy!

Comment 47 Don Zickus 2009-06-04 16:07:51 UTC
in kernel-2.6.18-152.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 49 Chris Ward 2009-07-03 18:43:59 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 50 Chris Ward 2009-07-10 19:13:18 UTC
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~

RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching.

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.

Comment 51 Jan Tluka 2009-07-20 15:16:28 UTC
Patch is in -158.el5. Adding SanityOnly.

Comment 52 Andrius Benokraitis 2009-08-04 17:39:08 UTC
Tested/Verified based on conversations with Shrijeet @ Cisco.

Comment 54 errata-xmlrpc 2009-09-02 08:14:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html