Bug 727640
Summary: | BUG: soft lockup on Dell C6145 with stock installation | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Mark Nipper <nipsy> |
Component: | kernel | Assignee: | Shyam Iyer <shiyer> |
Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Priority: | unspecified |
Version: | 6.1 | CC: | arozansk, jburke, jdonohue, linux-bugs, peterm, shiyer |
Target Milestone: | rc | Target Release: | --- |
Hardware: | x86_64 | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2013-08-09 19:56:56 UTC |

Description
Mark Nipper
2011-08-02 17:15:13 UTC
Version-Release number of selected component (if applicable):
Linux snuffles.la.utexas.edu 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Forgot that part.

Using both nohz=off and hpet=disable didn't help the soft lockup problem. I'm still seeing the messages while installing the Dell OMSA tools, although at a different point in the process:
---
e initialization sequence of SAS components failed during system startup. SAS management and monitoring is not possible.
Aug 2 13:09:42 snuffles kernel: Clocksource tsc unstable (delta = 209714104 ns)
Aug 2 13:13:10 snuffles kernel: Switching to clocksource acpi_pm
Aug 2 13:13:10 snuffles kernel: BUG: soft lockup - CPU#0 stuck for 67s! [sasdupie:5344]
Aug 2 13:13:10 snuffles kernel: Modules linked in: mpt2sas scsi_transport_sas raid_class mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler dell_rbu ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 sr_mod cdrom hed igb dca sg dcdbas snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc amd64_edac_mod edac_core edac_mce_amd k10temp hwmon i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core video output megaraid_sas ata_generic pata_acpi pata_atiixp ahci dm_mod [last unloaded: scsi_wait_scan]
Aug 2 13:13:10 snuffles kernel: CPU 0:
Aug 2 13:13:10 snuffles kernel: Modules linked in: mpt2sas scsi_transport_sas raid_class mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler dell_rbu ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 sr_mod cdrom hed igb dca sg dcdbas snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc amd64_edac_mod edac_core edac_mce_amd k10temp hwmon i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core video output megaraid_sas ata_generic pata_acpi pata_atiixp ahci dm_mod [last unloaded: scsi_wait_scan]
Aug 2 13:13:10 snuffles kernel: Pid: 5344, comm: sasdupie Not tainted 2.6.32-131.6.1.el6.x86_64 #1 PowerEdge C6145
Aug 2 13:13:10 snuffles kernel: RIP: 0010:[<ffffffff81279be0>] [<ffffffff81279be0>] pci_user_read_config_dword+0x90/0xc0
Aug 2 13:13:10 snuffles kernel: RSP: 0018:ffff881023469dc8 EFLAGS: 00000206
Aug 2 13:13:10 snuffles kernel: RAX: ffffffff81f04340 RBX: ffff881023469df8 RCX: 00000000000005a0
Aug 2 13:13:10 snuffles kernel: RDX: 0000000000000000 RSI: 0000000000300000 RDI: 0000000000000000
Aug 2 13:13:10 snuffles kernel: RBP: ffffffff8100bc8e R08: 0000000000000004 R09: ffff881023469dd4
Aug 2 13:13:10 snuffles kernel: R10: 0000000000000000 R11: ffff882824ee2e40 R12: ffff881824b4bec0
Aug 2 13:13:10 snuffles kernel: R13: ffff884024ca9e80 R14: ffff88081673a080 R15: ffff882024cc90e8
Aug 2 13:13:10 snuffles kernel: FS: 00007f0c2772c720(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 2 13:13:10 snuffles kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 2 13:13:10 snuffles kernel: CR2: 00000000026ed1d8 CR3: 0000002823eb0000 CR4: 00000000000006f0
Aug 2 13:13:10 snuffles kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 2 13:13:10 snuffles kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 2 13:13:10 snuffles kernel: Call Trace:
Aug 2 13:13:10 snuffles kernel: [<ffffffff81279bcd>] ? pci_user_read_config_dword+0x7d/0xc0
Aug 2 13:13:10 snuffles kernel: [<ffffffff81182aaa>] ? do_filp_open+0x71a/0xd90
Aug 2 13:13:10 snuffles kernel: [<ffffffff8128358e>] ? pci_read_config+0xbe/0x230
Aug 2 13:13:10 snuffles kernel: [<ffffffff811e8027>] ? read+0x127/0x210
Aug 2 13:13:10 snuffles kernel: [<ffffffff81172fb5>] ? vfs_read+0xb5/0x1a0
Aug 2 13:13:10 snuffles kernel: [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0
Aug 2 13:13:10 snuffles kernel: [<ffffffff811730f1>] ? sys_read+0x51/0x90
Aug 2 13:13:10 snuffles kernel: [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b
---
I guess I'll try this all from scratch specifying clocksource=acpi_pm to see if using that from the start helps to avoid the problem.

Adding clocksource=acpi_pm (and notsc) did not fix the problem. Seemingly regardless of the timing source used, there's always a problem when going to install the Dell OMSA package and loading the IPMI drivers specifically.

Reposting what I just sent to the Dell Linux PowerEdge list. Just to note, the problem is not entirely fixed, but seems to be much better with the newer firmware. But given the problem at this point seems to be an interaction with some of the Dell provided software, I'm not sure that this is a Red Hat specific issue anymore.
---
A little update on this. Dell folks told me to update to the latest versions of the BIOS and the BMC on the C6145:
---
BMC motherboard firmware Version, 1.03, A02, released 8/1/11, Recommended
Windows: http://ftp.us.dell.com/esm/PECC6145_BMC_FRMW_WIN_R305210.EXE
Red Hat Linux: http://ftp.us.dell.com/esm/PECC6145_BMC_FRMW_LX_R305210.BIN
GnuPG: http://ftp.us.dell.com/esm/PECC6145_BMC_FRMW_LX_R305210.BIN.sign

BIOS Version 1.9.0 update, released 8/1/11, update package, requires immediate reboot:
Windows Update Package: http://ftp.us.dell.com/bios/PECC6145_BIOS_WIN_1.9.0.EXE
Red Hat Linux: http://ftp.us.dell.com/bios/PECC6145_BIOS_LX_1.9.0.BIN
GnuPG: http://ftp.us.dell.com/bios/PECC6145_BIOS_LX_1.9.0.BIN.sign
---
all of which helped. I'm still seeing soft lockups:
---
Aug 8 15:05:36 snuffles kernel: BUG: soft lockup - CPU#47 stuck for 68s! [migration/47:192]

as the Dell OMSA software stack loads, and specifically when either the IPMI drivers or the Fusion MPT drivers are being loaded. But afterward, the nVidia cards I have in the C410x expansion are visible, and running commands like "nvidia-smi -q" doesn't freeze the machine completely like it did before.

Hi Mark,

I saw your posts on the poweredge list. Could you post the more detailed lockup messages? Are they almost the same?

Also, you mention that this happens with the OMSA stack only. OMSA uses the ipmi and dcdbas drivers, among others, so we would have to eliminate those from the picture as well. Can you stop the ipmi service or unload the dcdbas driver to rule out these drivers as the cause?

Also, attaching a sosreport that contains all the log files would be beneficial.

Thanks,
Shyam Iyer
Dell Linux Engineering

Created attachment 517426
dmesg output

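For comparing lockups across boots or firmware versions, the soft lockup lines in a dmesg or syslog capture can be summarized with a short script. This is only a sketch: it assumes the message format seen in the logs quoted in this report ("BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]"), and the function name find_soft_lockups is made up for illustration.

```python
import re

# Pattern for lines such as:
#   Aug 2 13:13:10 snuffles kernel: BUG: soft lockup - CPU#0 stuck for 67s! [sasdupie:5344]
# Assumption: the format matches the messages quoted in this report.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < dmesg.txt
    for event in find_soft_lockups(sys.stdin.read()):
        print(event)
```

Grouping the results by "comm" would show whether the lockups are confined to the OMSA helper processes or also hit kernel threads such as migration/47.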
The dmesg output I just posted shows all of the lockups which occur right when the machine first boots and the Dell OMSA stack is loaded. No lockups have occurred since then as I have nohz=off and all the latest firmware updates (BIOS specifically, 1.9.0).

I have a separate incident open with Dell and the people working on that case have valid user credentials to access this machine directly. Would that be of more help to you than logs?

(In reply to comment #8)
> The dmesg output I just posted shows all of the lockups which occur right when
> the machine first boots and the Dell OMSA stack is loaded. No lockups have
> occurred since then as I have nohz=off and all the latest firmware updates
> (BIOS specifically, 1.9.0).
>

Ok, let me try to clarify. What you are saying is this:
- You don't see the deadlocks when you boot the system with nohz=off and the latest firmware updates, provided OMSA is not installed.
- You do see the deadlocks when you have OMSA installed, but they stop after a while, when the system has apparently stabilized, and the nvidia driver is then able to load as well.

Now, in this state with OMSA installed, is your system still functioning well? I am trying to determine whether the deadlocks occur only while the OMSA stack is being loaded and whether the system is usable afterwards.

> I have a separate incident open with Dell and the people working on that case
> have valid user credentials to access this machine directly. Would that be of
> more help to you than logs?

I guess it follows a different process there, so I would try to help here in determining if we indeed have a kernel issue.

So I just installed a stock RHEL 6.1 system without the OMSA stack and was able to load the binary nVidia drivers (275.21) available via EPEL and talk to the cards without any soft lockups occurring.

After the BIOS update from 1.7.0 to 1.9.0, that seems to have gone a long way in fixing the problems I was seeing initially on this particular hardware. At this point, I can load the OMSA stack to induce the soft lockup most likely. Is there something you wanted me to do to blacklist certain modules or some kind of logging you want me to be capturing while I install the OMSA stack?

I'm using the standard approach to install the Dell OMSA stack:
---
wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash

(In reply to comment #10)
> So I just installed a stock RHEL 6.1 system without the OMSA stack and was able
> to load the binary nVidia drivers (275.21) available via EPEL and talk to the
> cards without any soft lockups occurring.
>
> After the BIOS update from 1.7.0 to 1.9.0, that seems to have gone a long way
> in fixing the problems I was seeing initially on this particular hardware. At
> this point, I can load the OMSA stack to induce the soft lockup most likely.
> Is there something you wanted me to do to blacklist certain modules or some
> kind of logging you want me to be capturing while I install the OMSA stack?
>
> I'm using the standard approach to install the Dell OMSA stack:
> ---
> wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash

Mark,

I just realized that OMSA is not supported/tested on this platform. You should also get the following error when you start the OMSA services:

[root@dell-pec6145-01 sbin]# ./srvadmin-services.sh start
Starting Systems Management Device Drivers:
Starting dell_rbu:[ OK ]
Starting ipmi driver: Already started[ OK ]
Starting Systems Management Data Engine:
Failed to start because system is not supported
dsm_om_shrsvc: DSM SA Shared Services cannot start on an unsupported system. See the Dell Systems Software Support Matrix for a list of supported systems.
Starting DSM SA Connection Service: [ OK ]

Thanks,
Shyam

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
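
For anyone retracing the troubleshooting in this report, it can help to confirm which of the timing-related boot parameters tried above (nohz=off, hpet=disable, clocksource=acpi_pm, notsc) were actually in effect for a given boot. The following is a minimal sketch, not something taken from the thread: it assumes the kernel command line is read from /proc/cmdline, and the helper names parse_cmdline and timing_params are hypothetical.

```python
# Boot parameters that were tried in this report.
TIMING_KEYS = ("nohz", "hpet", "clocksource", "notsc")


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags such as "notsc" map to True. Quoted values containing
    spaces are not handled; this is a deliberately simple sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def timing_params(cmdline):
    """Return only the timing-related parameters present on the command line."""
    parsed = parse_cmdline(cmdline)
    return {key: parsed[key] for key in TIMING_KEYS if key in parsed}


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as handle:
        print(timing_params(handle.read()))
```

Recording this alongside each dmesg capture makes it easier to tell which parameter combination a given set of lockup messages belongs to.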