Bug 717623 - FC15 Fresh install on Dell PowerEdge T110 will mostly never reboot correctly after install
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 15
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2011-06-29 12:22 UTC by Philippe
Modified: 2012-06-06 15:46 UTC (History)

Last Closed: 2012-06-06 15:46:43 UTC
Type: ---


Attachments
boot.log when boot works ! (5.58 KB, text/plain)
2011-06-29 12:22 UTC, Philippe
lspci -nnvvv [when boot works] (35.65 KB, text/plain)
2011-07-01 15:43 UTC, Philippe
dmesg [ When boot works ] (55.08 KB, text/plain)
2011-07-01 15:46 UTC, Philippe
Boot-Still-Crashes-Without-2nd_NIC (585.81 KB, image/pjpeg)
2011-07-03 12:13 UTC, Philippe
BootStuckWithOnlyBnx2Enabled (33.59 KB, text/plain)
2011-07-05 00:41 UTC, Philippe
ServeurTest\SerialLog-BootStuckWithOnlyTg3Enabled (33.76 KB, text/plain)
2011-07-05 00:42 UTC, Philippe

Description Philippe 2011-06-29 12:22:35 UTC
Created attachment 510449 [details]
boot.log when boot works !

Description of problem:
FC15 Fresh install on Dell PowerEdge T110 will mostly never reboot correctly after install.

Version-Release number of selected component (if applicable):
FC15 including latest patches (even without latest patches)

How reproducible:
Very easy (it took 6 boot attempts this morning before it worked once !)

Steps to Reproduce:
1. Install
2. Reboot after install
3. It does not boot (freezes halfway through with no error message)
4. Power Off
5. Power On and see if the 3 bars (white & blue) finally lead to the login screen
6. Repeat from step 4 until it boots (might have something to do with device order)

  
Actual results:
Will not boot.


Expected results:
Just boot :)


Additional info:
================
PowerEdge T110
2 SATA HardDisk (Hardware RAID1 seen as sdb)
1 Internal RD1000 (backup device seen as sda)

/etc/fstab
==========
/dev/mapper/vg_epf-lv_root /                       ext4    defaults        1 1
UUID=f816092f-c39c-49c2-b0b1-f09f72b1b930 /boot                   ext4    defaults        1 2
/dev/mapper/vg_epf-lv_swap swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0


scsi_id
=======
# scsi_id -g -u -d /dev/sda
1DELL_RD1000_102214V0C328042011
# scsi_id -g -u -d /dev/sdb
3600508e000000000f267c853fce2520c

fdisk -l
========
Disk /dev/sdb: 999.7 GB, 999653638144 bytes
255 heads, 63 sectors/track, 121534 cylinders, total 1952448512 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xae052516

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        2048      821247      409600   83  Linux
/dev/sdb2          821248  1952448511   975813632   8e  Linux LVM

Disk /dev/mapper/vg_epf-lv_swap: 17.2 GB, 17179869184 bytes
255 heads, 63 sectors/track, 2088 cylinders, total 33554432 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_epf-lv_swap doesn't contain a valid partition table

Disk /dev/mapper/vg_epf-lv_root: 262.1 GB, 262144000000 bytes
255 heads, 63 sectors/track, 31870 cylinders, total 512000000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_epf-lv_root doesn't contain a valid partition table


Questions
=========
- Disk identifier: 0x00000000 => Is this normal ?

- Disk /dev/mapper/vg_epf-lv_root doesn't contain a valid partition table => Is this normal ?

- I don't know how to "debug" this: when the system finally boots to the login screen, I don't get the /var/log/boot.log* of the previous failed boots.

Please find my latest boot.log attached.
Also I noticed (by pressing Escape during the boot sequence) that the boot stops just after displaying the message : "Starting File System Check on /dev/disk/by-uuid/f816092f-c39c-49c2-b0b1-f09f72b1b930"

Thanks for your help
Philippe

Comment 1 Michal Schmidt 2011-06-29 16:18:53 UTC
> - Disk identifier: 0x00000000 => Is this normal ?
> - Disk /dev/mapper/vg_epf-lv_root doesn't contain a valid partition table => Is
> this normal ?

Yes, both are normal. The disk identifier and the partition table are stored in the Master Boot Record. Logical volumes do not usually contain an MBR.

Have you tried waiting for about 5 minutes? There's a chance that whatever is hanging may timeout and the boot would continue.

Does it react to CTRL+ALT+DEL when it's apparently frozen?
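
To illustrate the answer above: in the classic MBR layout the 4-byte disk identifier sits at byte offset 440 of the first sector, the partition table follows at offset 446, and the sector ends with the 0x55 0xAA signature at offset 510. The following is only a minimal sketch, assuming a POSIX shell with `od`; the function name `mbr_disk_id` is made up for illustration and is not part of any Fedora tool.

```shell
#!/bin/sh
# mbr_disk_id FILE: print the 4-byte MBR disk identifier of a block device
# or image file as 0xXXXXXXXX. Hypothetical helper for illustration only.
mbr_disk_id() {
    # Read the 4 bytes at offset 440 as individual hex bytes.
    set -- $(od -An -tx1 -j440 -N4 "$1")
    # Fewer than 4 bytes means the file is too short to hold an MBR.
    [ $# -eq 4 ] || return 1
    # The identifier is stored little-endian, so print the bytes reversed.
    printf '0x%s%s%s%s\n' "$4" "$3" "$2" "$1"
}

# Example (as root): mbr_disk_id /dev/sdb
```

Run against a whole disk, this should print the value that fdisk shows as "Disk identifier". A logical volume holds a filesystem or swap area directly, so fdisk just reports whatever bytes happen to be at that offset, which was zero for both volumes in this report.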

Comment 2 Philippe 2011-06-29 17:30:49 UTC
Yes, I have.

CTRL+ALT+DEL has no effect.
CAPS LOCK does not respond either.
The machine is like dead: I can no longer switch screens with CTRL+Shift...

Comment 3 Michal Schmidt 2011-06-30 10:50:43 UTC
Add "sysrq_always_enabled" to the kernel boot parameters and see if the system reboots when the Alt+SysRq+B combo is pressed.

Comment 4 Philippe 2011-06-30 11:51:22 UTC
Well, as soon as I get back (probably in a few hours) I will run your test. But IMHO the system is frozen (otherwise the CAPS LOCK key light would be responsive).

What else can I do to give you more information and move the diagnosis forward ? Have you ever run into this ? What do you think the problem could be ?

Comment 5 Harald Hoyer 2011-06-30 14:45:05 UTC
first thing: remove "rhgb quiet" from the kernel command line to see more messages
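
After a boot that does succeed, it can be worth confirming that the edited command line is what the kernel actually received. The sketch below assumes a POSIX shell and reads `/proc/cmdline`; the helper name `has_kparam` is hypothetical. It matches whole words only, so a mistyped parameter is not mistaken for the real one.

```shell
#!/bin/sh
# has_kparam FILE NAME: succeed if the kernel command line stored in FILE
# contains NAME, or NAME=value, as a whole word. Hypothetical helper.
has_kparam() {
    # One parameter per line, then require a full-line match.
    tr ' ' '\n' < "$1" | grep -qxE "$2(=.*)?"
}

# Example, after booting with the edited command line:
#   has_kparam /proc/cmdline sysrq_always_enabled && echo "sysrq forced on"
#   has_kparam /proc/cmdline rhgb || echo "graphical boot disabled"
```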

Comment 6 Philippe 2011-06-30 17:52:39 UTC
Replaced "rhgb quiet" with rq_always_enabled
=> got all the messages
boot freezes after line :
"(C0) PCI Express found at mem dc000000, IRQ 17, node addr 00:10:18:a6:aa:d2"

CTRL+ALT+DEL has no effect as expected.

Comment 7 Philippe 2011-06-30 18:02:01 UTC
After this I tried again with the same parameter change => this time the system booted.

Do you have any idea what is happening ?
Do you need any log file ?

Comment 8 Philippe 2011-07-01 08:00:07 UTC
Would it be possible to upgrade severity / priority ?

I have a similar system at customer site under Fedora Core 13 with similar hardware (PowerEdge T110 + RD1000 backup) which sometimes would not reboot, never found why but it might be the same issue.

Comment 9 Michal Schmidt 2011-07-01 11:02:05 UTC
(In reply to comment #6)
> Replaced "rhgb quiet" with rq_always_enabled

It's 'sysrq_always_enabled'. I don't know if you just mistyped it here in the bugzilla comment or in the actually used command line too.

> CTRL+ALT+DEL has no effect as expected.

And Alt+SysRq+B ?

> boot freezes after line :
> "(C0) PCI Express found at mem dc000000, IRQ 17, node addr 00:10:18:a6:aa:d2"

If you repeat the experiment and manage to reproduce the hang, will the hang happen at this same line again?
Could you take a picture of it so we can see more of the preceding lines?

Anyway, this looks more like a kernel/hw issue. Reassigning.

Comment 10 Philippe 2011-07-01 11:20:04 UTC
Yes, sorry, I mistyped it in the post only.

Alt+SysRq+B => I will try when I get back to the server in exactly 4 hours.

Same line, yes: I did it twice, but I will verify it again and also take a picture.

Is there anything else i could do to provide you some more details / logs / debug ?

Comment 11 Michal Schmidt 2011-07-01 11:27:54 UTC
(In reply to comment #10)
> Is there anything else i could do to provide you some more details / logs /
> debug ?

After a successful boot, get the output of
 lspci -nnvvv
 dmesg

Comment 12 Philippe 2011-07-01 15:43:44 UTC
Created attachment 510887 [details]
lspci -nnvvv   [when boot works]

Comment 13 Philippe 2011-07-01 15:46:53 UTC
Created attachment 510889 [details]
dmesg     [ When boot works ]

Comment 14 Philippe 2011-07-01 16:07:37 UTC
I tried again ; this time it froze just after :

14.928582 bnx2 0000:02:00.0: eth1: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem da000000, IRQ 16, node addr 00:10:18:a6:aa:d0
14.929113 bnx2 0000:02:00.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
******** Frozen Here ********
******** Alt+SysRq+B IS UNRESPONSIVE

Compared with the working dmesg, this is just before the latency timer is set to 64
and before eth2 is detected (eth1 + eth2 are the dual Broadcom card).

But this is also just before systemd-fsck on /dev/sdb1 ..
and I still believe the problem might be somewhere around the disks...

At the start, when I only used the ESC key during the boot sequence, it was hanging after displaying the message : "Starting File System Check on
/dev/disk/by-uuid/f816092f-c39c-49c2-b0b1-f09f72b1b930"

Thanks for your help.

Comment 15 Philippe 2011-07-03 12:12:36 UTC
Finally I removed the dual NIC card and booted the kernel with sysrq_always_enabled and without rhgb quiet, and it froze at "Starting File System Check on /dev/disk/by-uuid/f816092f-c39c-49c2-b0b1-f09f72b1b930". Please find attached the screenshot "Boot-Still-Crashes-Without-2nd_NIC.jpg". I hope this helps.
Could you please upgrade the severity ?

Kind Regards
Philippe.

Comment 16 Philippe 2011-07-03 12:13:17 UTC
Created attachment 511049 [details]
Boot-Still-Crashes-Without-2nd_NIC

Comment 17 Michal Schmidt 2011-07-03 22:02:25 UTC
You could try blacklisting some modules to see if one of them is responsible
for the hang.
The NMI watchdog might help point out the fault:
http://www.mjmwired.net/kernel/Documentation/nmi_watchdog.txt

For other debugging tips see:

https://fedoraproject.org/wiki/Common_kernel_problems#Diagnosing_.22My_machine_locked_up.22

https://fedoraproject.org/wiki/Common_kernel_problems#Crashes.2FHangs

Comment 18 Philippe 2011-07-04 07:42:24 UTC
Replacing rhgb quiet with nmi_watchdog=1 || nmi_watchdog=2 , the problem remains the same and the kernel still freezes :(

Do you have an idea of what happens regarding the whole thread ?
As this would occur for anyone using this Dell Box same config and default install ..

I am really stuck & disappointed at this point. Please tell me precisely what i need to do / what could help to solve this ?

If this can help i can use a serial cable to a second box for capturing kernel messages. Will we see more messages than we already have ?

Regards
Philippe

Comment 19 Michal Schmidt 2011-07-04 12:03:09 UTC
(In reply to comment #18)
> Replacing rhgb quiet with nmi_watchdog=1 || nmi_watchdog=2 , the problem
> remains the same and the kernel still freezes :(

Just a clarification: The NMI watchdog is not supposed to fix anything. Its purpose is merely to produce a kernel stack trace when there's a hang with IRQs disabled.

> Do you have an idea of what happens regarding the whole thread ?

Not yet.

> As this would occur for anyone using this Dell Box same config and default
> install ..

We don't know if that's the case.

> I am really stuck & disappointed at this point. Please tell me precisely what
> i need to do / what could help to solve this ?
> 
> If this can help i can use a serial cable to a second box for capturing kernel
> messages. Will we see more messages than we already have ?

There is a chance there might be something interesting logged to the serial console, but I cannot promise you that.

I suggest you try a few things before you get the other box:
 - Blacklisting of modules non-essential for bootup (I can see bnx2, tg3,
   dcdbas, ...). To get a list of modules use 'lsmod' when you get
   a successful boot. To blacklist modules, write their names on separate lines
   in /etc/modprobe.d/local.conf:  blacklist <modulename>
   The idea is to narrow the problem down to a specific module. If you can get a
   reliable boot after the blacklisting, try to modprobe the modules back
   manually and see if any of them causes the lockup.
   Alternatively you can boot with "init=/bin/sh" (hopefully at least such a
   minimal system will start reliably) and then try loading various modules
   with modprobe to see if any of them causes the lockup.
 - Some of the tips from the links in comment #17 (at the very least try
   acpi=off, nolapic, noapic)
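
The blacklisting step above can be sketched as follows. This is only an illustration, assuming a POSIX shell; the helper name `blacklist_lines` is hypothetical, and the module names in the usage comment are the ones mentioned in this report.

```shell
#!/bin/sh
# blacklist_lines MODULE...: print one "blacklist <module>" line per
# argument, in the format used by files under /etc/modprobe.d/.
# Hypothetical helper for illustration only.
blacklist_lines() {
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Example (as root):
#   blacklist_lines bnx2 tg3 dcdbas >> /etc/modprobe.d/local.conf
```

A blacklist entry only stops a module from being loaded automatically through its aliases; running `modprobe <modulename>` by hand still loads it, which is what the manual re-test described above relies on.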

Comment 20 Philippe 2011-07-04 12:23:49 UTC
With the NMI watchdog I expected the NMI handler to generate an oops and kill the process, i.e. a 'controlled crash' (and the resulting kernel messages) as mentioned in the linked documentation. This was not the case: the kernel still froze / became unresponsive without any new message. This was just feedback on that.
Still talking on NMI , where/how should i get the kernel stack trace you talk about ?

I will try blacklisting bnx2, tg3, dcdbas and give you feedback on this.

If the problem has something to do with mpt2sas , how should i deal with this as it will not be possible to exclude it ?

Comment 21 Michal Schmidt 2011-07-04 12:39:47 UTC
(In reply to comment #20)
> Still talking on NMI , where/how should i get the kernel stack trace you talk
> about ?

It would be printed on the console. Since it isn't, the system is apparently even deader than the NMI watchdog can handle.

> I will try blacklisting bnx2, tg3, dcdbas and give you feedback on this.
> 
> If the problem has something to do with mpt2sas , how should i deal with this
> as it will not be possible to exclude it ?

Is there no possibility to connect only a SATA disk to the AHCI controller and setup a test system on that?

Comment 22 Philippe 2011-07-04 17:19:11 UTC
After removing ipv6, bnx2, tg3, dcdbas the system seems to boot always.
Now trying to narrow excluded modules ..

Comment 23 Philippe 2011-07-05 00:41:16 UTC
Short summary of tests :
- Boot will NEVER freeze when blacklisting both bnx2 and tg3
- Boot will often freeze when blacklisting only tg3
- Boot will often freeze when blacklisting only bnx2
- Keep in mind boot was still freezing without the dual NIC (bnx2) : see Comment 15
- Tried booting with acpi=off noapic nolapic : no positive effect

- Attached serial console debug logs :
SerialLog-BootStuckWithOnlyBnx2Enabled.txt
SerialLog-BootStuckWithOnlyTg3Enabled.txt

Comment 24 Philippe 2011-07-05 00:41:58 UTC
Created attachment 511231 [details]
BootStuckWithOnlyBnx2Enabled

Comment 25 Philippe 2011-07-05 00:42:35 UTC
Created attachment 511232 [details]
ServeurTest\SerialLog-BootStuckWithOnlyTg3Enabled

Comment 26 Philippe 2011-07-05 08:47:34 UTC
In addition, loading the modules with modprobe after boot never led to a freeze.

I tried disabling the embedded NIC in the BIOS and booting with no modules blacklisted
=> No effect

I also tried setting all IRQs to "default" (which means they are chosen by the BIOS) instead of the fixed values the system came with.
=> No effect

In a few days/week i should have been delivering the working system to customer.

As i am not really used to bugzilla , i wonder how long it might take to solve this issue ? Even if it is difficult , could you give me an idea on schedule and how things will occur now ?

How long it might take if something had to be fixed in the kernel. (which i hope will not be necessary)

Regards

Comment 27 Philippe 2011-07-05 10:58:01 UTC
This morning my other customer, who uses almost the same PowerEdge T110 server, confirmed that the server often does not reboot properly.
This server uses only the tg3 module, as it has 2 Broadcom NICs (BCM5721 & BCM5722).

Moreover the server OS (Fedora 13, kernel 2.6.34.8-68) sometimes hangs after 3 weeks in production.
The hardware has already been verified by Dell & diagnostic tools.

Just in case it helps :
I am running Fedora on many other Dell servers (PowerEdge R710 / PE 2950 / PE 2850) without any problem until now.

Comment 28 Philippe 2011-07-05 11:17:59 UTC
I think we are in the same issue as :
https://bugzilla.redhat.com/show_bug.cgi?id=552288

Comment 29 Philippe 2011-07-05 11:23:25 UTC
Do you have any idea why I am still getting the boot freeze shown in BootStuckWithOnlyBnx2Enabled.txt, even with the onboard NIC disabled in the BIOS ?

Comment 30 Philippe 2011-07-05 11:42:33 UTC
Sorry my fault, not the same issue for bug 552288, only tg3 related.

Comment 31 Michal Schmidt 2011-07-05 12:20:15 UTC
(In reply to comment #23)
> [   15.790196] bnx2 0000:02:00.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> [   15.820592] bnx2 0000:02:0

Hmm, a hang in the middle of a printk, that's an interesting moment.

Can you also try blacklisting only dcdbas? The module can produce System Management Interrupts. They are handled fully by the BIOS, outside the operating system's control. There may be a bug in the BIOS showing itself depending on the timing and other operations going on in the system.

(In reply to comment #26)
> In a few days/week i should have been delivering the working system to
> customer.
> 
> As i am not really used to bugzilla , i wonder how long it might take to solve
> this issue ? Even if it is difficult , could you give me an idea on schedule
> and how things will occur now ?

Sorry, I cannot give you any schedule. Fedora is a volunteer community-supported distribution. No warranty, no promises, no service agreements.
For users needing commercial support there's Red Hat Enterprise Linux.

Comment 32 Philippe 2011-07-05 12:44:47 UTC
> Hmm, a hang in the middle of a printk, that's an interesting moment
I did that twice. The second time, the second line was complete. The whole message may simply not have had time to be sent to the port.

> Can you also try blacklisting only dcdbas?
I'll do that.
I am also trying to build the tg3 module from the Broadcom sources tg3-3.116j.tar.gz,
as I don't really know what else to do.
Anything more you would suggest I work on / try ?


> Sorry, I cannot give you any schedule ..
Sure, I know those rules ; I'm also trying to contribute through this thread and by using Fedora in a production environment.
I am just asking whether, from your experience, you think we will be able to solve it in, let's say, a week or two, or whether you think I should switch to another OS. Which I would hate.

Comment 33 Philippe 2011-07-05 14:41:25 UTC
- Re-Enabled integrated NIC into the BIOS
- upgraded tg3 driver from 3.116 to 3.116j
- blacklisted only dcdbas

=> 1st reboot OK
=> 2nd reboot failed, frozen after the IRQ 17 line as before

Comment 34 Philippe 2011-07-05 18:00:49 UTC
Tried again with only dcdbas && tg3 blacklisted :
the system booted once and then froze on each of the next 5 or 6 attempts.
I used rescue mode to set the blacklist back to tg3 && bnx2.

Comment 35 Philippe 2011-07-06 00:19:03 UTC
Tests done after booting the server with both tg3 && bnx2 blacklisted :

- running command "rmmod dcdbas ; rmmod bnx2 ; rmmod tg3; modprobe dcdbas ; modprobe bnx2 ; modprobe tg3"
did not fail. (run 50 times, less than 1 second between launches)

- whereas running command "rmmod bnx2 ; rmmod tg3; modprobe bnx2 ; modprobe tg3"
crashed after 3 or 4 attempts

- did "rmmod dcdbas" and re-ran command "rmmod bnx2 ; rmmod tg3; modprobe bnx2 ; modprobe tg3"
crashed after more than 40 attempts (less than 1 second between launches)

- just to verify it again I removed the blacklist for tg3 && bnx2 and kept only blacklist dcdbas
=> System froze at startup as before
 => reverted to the blacklist on both tg3 && bnx2

- running command "rmmod dcdbas ; rmmod bnx2 ; rmmod tg3; modprobe dcdbas ; modprobe tg3 ; modprobe bnx2"
did not fail. (run 50 times, less than 1 second between launches)
=> it seems that the module load order has no effect

- after this tried running command "rmmod bnx2 ; rmmod tg3; modprobe tg3 ; modprobe bnx2"     (with dcdbas loaded)
did not fail in 50 attempts

=> did "rmmod dcdbas" and re-ran command  "rmmod bnx2 ; rmmod tg3; modprobe tg3 ; modprobe bnx2"
crashed after 6 attempts


I am not sure we can really conclude anything, but :
- it is obvious that things get better when dcdbas is loaded first. Without it the freeze happens fast.
- in the end it seems that the load order of tg3 & bnx2 is not that important

Hope this will help.
Do you need anything more ?
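
For anyone wanting to repeat the reload tests above in a controlled way, here is a minimal sketch of such a loop, assuming a POSIX shell. The names `reload_cycle` and `RUN` are hypothetical. By default the sketch only prints the commands (dry run), and unlike some of the tests above it unloads the modules in the same order it loads them.

```shell
#!/bin/sh
# RUN is a hypothetical hook: it defaults to "echo" so the sketch only
# prints the commands. Set RUN= (empty) to really execute them as root.
RUN=${RUN-echo}

# reload_cycle COUNT MODULE...: unload all listed modules, load them again
# in the given order, and repeat COUNT times. The iteration number is
# printed first so the last line on a serial console shows how far it got.
reload_cycle() {
    count=$1; shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i"
        for mod in "$@"; do $RUN rmmod "$mod"; done
        for mod in "$@"; do $RUN modprobe "$mod"; done
        i=$((i + 1))
    done
}

# Example dry run: reload_cycle 50 bnx2 tg3
```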

Comment 36 Kyle McMartin 2011-07-07 18:26:08 UTC
http://kyle.fedorapeople.org/kernel-2.6.40-0.rc5.git0.1.fc16.x86_64.rpm

Try booting this kernel and see if the problem persists in the latest upstream?

Comment 37 Philippe 2011-07-08 16:45:44 UTC
Hi Kyle, thanks for your interest.

I installed the 2.6.40 kernel and booted with the modules dcdbas, tg3 & bnx2 disabled.

- On the first run of "rmmod dcdbas ; rmmod bnx2 ; rmmod tg3; modprobe dcdbas ;
modprobe tg3 ; modprobe bnx2" the OS froze immediately, same behaviour as before.

- And "rmmod dcdbas ; rmmod bnx2 ; rmmod tg3; modprobe dcdbas ;
modprobe bnx2 ; modprobe tg3" does not fail unless it is run continuously (without a minimum pause), in which case it fails after something like 30 attempts

Comment 38 Chuck Ebbert 2011-07-11 19:12:26 UTC
Did you try putting "sleep 1" after loading each module?

Comment 39 Philippe 2011-07-11 19:59:49 UTC
Yes, I tried ; it worked better with a specific load order and sleep 5
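
The idea of pausing after each module load can be sketched as below, assuming a POSIX shell. `load_with_delay`, `RUN` and `DELAY` are hypothetical names, and the module order in the usage comment is only an example, since the thread does not say which order worked better.

```shell
#!/bin/sh
# RUN and DELAY are hypothetical knobs for this sketch: RUN defaults to
# "echo" (dry run), DELAY is the pause in seconds after each module load.
RUN=${RUN-echo}
DELAY=${DELAY-1}

# load_with_delay MODULE...: load the modules one at a time in the given
# order, sleeping DELAY seconds after each load.
load_with_delay() {
    for mod in "$@"; do
        $RUN modprobe "$mod"
        sleep "$DELAY"
    done
}

# Example dry run with a 5 second pause (order chosen arbitrarily):
#   DELAY=5 load_with_delay dcdbas bnx2 tg3
```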

Comment 40 Harald Hoyer 2011-07-14 11:45:51 UTC
Just one shot in the dark: does adding "biosdevname=0" to the kernel command line help?

Comment 41 Michal Schmidt 2011-07-15 16:04:10 UTC
I found a PowerEdge T110 in our test lab, provisioned it with Fedora 15. Rebooted it three times and ran hundreds of iterations of
 rmmod bnx2; rmmod tg3; modprobe bnx2; modprobe tg3
and
 rmmod bnx2; rmmod tg3; modprobe tg3; modprobe bnx2
with and without dcdbas loaded. There was no hang.

There are some HW differences compared to Philippe's configuration:
 - a different SAS controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS [1000:0058], mptsas driver (not mpt2sas)
 - a few extra NICs (Intel, igb)

Comment 42 Philippe 2011-07-16 00:33:52 UTC
- Finally I switched to CentOS 6 ; I could not get OpenManage to detect the H200 SAS controller even though all the services were running fine. The freeze problem is gone under CentOS ; if you need some logs to compare, tell me.

- FYI I installed the system at least 5 times in different ways (from DVD-ROM and from a kickstart file + network download) and every time I had the problem, whatever the method. (I believe it is specific to the H200 + both NICs)
and I still do not understand why disabling the embedded tg3 NIC in the BIOS had no effect !

- As I mentioned earlier, I have another PowerEdge T110 which looks like yours :
01:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
with mptsas ; it does not have the freeze-on-boot problem. But that system also freezes after 1 or 2 weeks. It is running Fedora 13 + 2.6.34.9-69.fc13.i686.PAE.
Dell tested the complete hardware ; there was no problem. This server is not doing much : serving files (smb) + a pppoe connection.
The freeze looks the same : no error in the logs, power and fans running, OS frozen.

The PowerEdge T110 is the only "server" hardware I have had problems with running Fedora.
No problem with R710 , PE2950, PE2850, PE750

Best Regards,
Philippe

Comment 43 Josh Boyer 2012-06-06 15:46:43 UTC
F15 is going EOL in less than a month.  We aren't going to get this fixed.

