From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; Linux i686; U) Opera 7.51 [en]

Description of problem:
We have three Dell PowerEdge 6450 servers that were upgraded from RHEL AS 2.1 to 3.0 several months ago. After the upgrade we discovered that all of these servers would hang when attempting to reboot. After some research we found several reports on the web about the same issue, and the fix seemed to be to add "reboot=b,s" to the boot command line. This did fix the issue for two of the three servers; however, the third server continued to fail to reboot. The only difference between the servers that reboot and the one that doesn't is that the failing server has 4 CPUs while the others have only 2.

I continued to try several variations of the "reboot=" option, such as "reboot=b,s0", "reboot=b,s1", etc., hoping that perhaps Linux was simply selecting the incorrect processor to perform the reboot; however, no option that I tried corrected the issue. We also tried several combinations with other "reboot=" options such as w, c, and h. Nothing has succeeded in getting this issue resolved.

For additional testing I tried the following kernels and list their success or failure:

Redhat AS 2.1 -- 2.4.9-e.38 -- Works
Redhat 9 -- 2.4.20-31.9 -- Fails
Fedora Core 1 -- 2.4.22-1.2197.nptl -- Works
Redhat AS 3 -- 2.4.21-15.EL UP -- Works

I tested several variants of the Redhat AS kernels. All SMP versions failed, from 2.4.21-4.EL through the latest 2.4.21-15.0.3.EL; all UP kernels rebooted without issues.

There are other reports of the issue that can be turned up with a quick search on Google; some have success with "reboot=b,s", others do not. I strongly suspect that the people who do not have success are people with 4 CPUs. Please let me know what other information needs to be provided.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-15.EL

How reproducible:
Always

Steps to Reproduce:
1. Boot Dell PowerEdge 6450 with four processors with any AS 3 kernel
2. Type 'reboot' at command line

Actual Results: System will hang at "System Rebooting..."

Expected Results: System should reboot

Additional info:
We have worked around this issue by installing Dell Server Administrator, which can detect a hung OS and use the system's embedded service processor to power cycle the system. Interestingly, it detects this state as a hung OS and performs the recovery. It's a crude workaround that shouldn't be required and adds an extra five minutes to an already long reboot process (these systems POST very slowly), but at least it allows us to reboot the server remotely even with this kernel bug.
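For readers following the "reboot=" variations tried above, here is a minimal user-space sketch of how such a comma-separated option string could be interpreted. It is loosely modeled on how 2.4-era i386 kernels are understood to handle the option, and it is not the kernel's code. The struct, its field names, and the function name are hypothetical, and the letter meanings (w = warm, c = cold, b = via BIOS, h = hard reset, s plus an optional CPU number = SMP reboot on that CPU) are stated as assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical container for the parsed settings (names are illustrative). */
struct reboot_opts {
    int warm;       /* assumed: 'w' sets 1, 'c' sets 0 */
    int thru_bios;  /* assumed: 'b' sets 1, 'h' sets 0 */
    int smp;        /* 1 once an 's' option has been seen */
    int cpu;        /* CPU number following 's', or -1 if none was given */
};

/* Walk a string such as "b,s1" one comma-separated option at a time. */
void parse_reboot_opts(const char *str, struct reboot_opts *o)
{
    assert(str != NULL && o != NULL);
    o->warm = 0;
    o->thru_bios = 0;
    o->smp = 0;
    o->cpu = -1;

    for (;;) {
        switch (*str) {
        case 'w': o->warm = 1; break;
        case 'c': o->warm = 0; break;
        case 'b': o->thru_bios = 1; break;
        case 'h': o->thru_bios = 0; break;
        case 's':
            o->smp = 1;
            /* Accept an optional one- or two-digit CPU number. */
            if (isdigit((unsigned char)str[1])) {
                o->cpu = str[1] - '0';
                if (isdigit((unsigned char)str[2]))
                    o->cpu = o->cpu * 10 + (str[2] - '0');
            }
            break;
        default:
            break;  /* unknown letters are ignored in this sketch */
        }
        str = strchr(str, ',');
        if (str == NULL)
            break;
        str++;  /* step past the comma to the next option */
    }
}
```

Under these assumptions, "reboot=b,s1" asks for a BIOS reboot carried out on CPU 1, and the order of the options ("b,s" versus "s,b") makes no difference to the parsed result.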
Tom, I don't have one of these machines to work with, so I'll have to work through you. One question re: the Dell Server Administrator. Is it possible for it to report the PC of each processor? If this is a kernel-specific problem, I would first like to rule out the possibility that the IPI sent out by the rebooting cpu is not being received by one of the other cpus. If any of the processors for whatever reason are sitting in a spin_lock_irq(), then they won't ever respond to the IPI, and the rebooting system would block forever in machine_restart() and act as you describe. If you can get the PC of each cpu, it's possible that one of the cpus may show that it is operating in an address range that can be identified as a spin lock text area. If not, will you be able to run debug RHEL3 kernels that I create? I'd like to add a bunch of printk's in the machine_restart() function to figure out what's going on. Dave Anderson
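The failure mode described above can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not kernel code: `struct sim_cpu`, `first_running()` and `stop_other_cpus()` are hypothetical names, and a real kernel would wait on hardware state rather than poll an array. A CPU that is spinning with interrupts disabled never handles the stop IPI, so an unbounded wait would never finish. The sketch bounds the wait and reports which CPU failed to respond, which is the kind of information the proposed printk's would surface.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated per-CPU state (hypothetical; for illustration only). */
struct sim_cpu {
    int irqs_disabled;  /* 1 = spinning with interrupts off, as in spin_lock_irq() */
    int stopped;        /* 1 = this CPU has handled the stop IPI */
};

/* Index of the first CPU other than `self` that is still running, or -1. */
static int first_running(const struct sim_cpu *cpus, size_t n, size_t self)
{
    for (size_t i = 0; i < n; i++)
        if (i != self && !cpus[i].stopped)
            return (int)i;
    return -1;
}

/*
 * Send a stop IPI to every other CPU and poll up to max_polls times.
 * Returns -1 if all other CPUs stopped, otherwise the index of a CPU
 * that never responded.
 */
int stop_other_cpus(struct sim_cpu *cpus, size_t n, size_t self, int max_polls)
{
    assert(cpus != NULL && self < n);

    for (int poll = 0; poll < max_polls; poll++) {
        for (size_t i = 0; i < n; i++) {
            /* The IPI is only taken by CPUs that have interrupts enabled. */
            if (i != self && !cpus[i].irqs_disabled)
                cpus[i].stopped = 1;
        }
        if (first_running(cpus, n, self) < 0)
            return -1;
    }
    return first_running(cpus, n, self);
}
```

With four simulated CPUs and CPU 2 marked as having interrupts disabled, `stop_other_cpus()` called from CPU 0 returns 2 instead of waiting indefinitely.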
Unfortunately I don't think that Dell Server Admin can get at that level of information, at least via any user-accessible method that I can find. I guess that leaves us with the option of running a debug kernel, which I can do, but only during limited times as the system is a production Oracle box. That being said, we plan to upgrade the other two systems to 4 CPUs this week and I'm anticipating that after we do that they will experience the same issue. If that turns out to be the case I can probably move the services of one of the servers to one of our lab servers temporarily, which would free up a system to test with. In the meantime I can schedule times to test the reboot functionality on the existing server, but that probably means only one good test a day. I'm almost sure that the original beta kernels for RHEL 3 didn't have this problem. I may see if I still have one of those lying around just to test the reboot functionality, as it might give us another data point that is closer to the current kernel than the RH9 or FC1 kernels. Then we could run some diffs to see what changed. Later, Tom
Ok -- if you want to test an earlier RHEL3 kernel version, I can make it available for you.
This is a duplicate of bug 102504 (haven't tried with the betas).
Thanks, Greg -- closing this as a duplicate. *** This bug has been marked as a duplicate of 102504 ***
How do I get access to that bug? I can view it but cannot add comments or add myself to the CC: list. It appears to be restricted to group members. I missed it during my search because it was filed against the Beta. Sorry. Thanks, Tom
You are already on its cc: list, so you'll receive all subsequent input into the case. As to the restriction, it does appear to be restricted to Red Hat development, but since you are now on the cc: list, you are allowed to view it. I don't personally know how to change that behavior, but I can add your comments.
I am still unable to post comments on Bug 102504, presumably because it is for the Beta (I get the message "You are not permitted to edit bugs in product Red Hat Enterprise Linux Beta"). I am interested to know what steps I should take next to assist with resolving this issue. We are upgrading two of our 6450s from 2 to 4 CPUs tonight. Currently both of these systems will reboot with the "reboot=s,b" parameter but our 4 CPU system will not. We are anticipating that after the upgrade we will have 3 systems that fail to reboot. Is there a debug kernel we need to try? Thanks, Tom
I, too, am experiencing this problem. I have several Dell 6450s with 4 processors in each that fail to recycle after outputting the 'restarting system' message. They are running RH Enterprise Linux AS 3 Update 2. Is there a solution for this problem, perhaps in bug #102504, which I cannot access at present?
Not yet.
Has this been resolved in update 3? I have to install a 6450 with 4 cpus at a customer location soon. If this is still an issue, I'll just install RHEL2.1...
I don't believe the issue is resolved; it certainly doesn't seem to be for me. On top of this, I've had random lockups on multiple servers after upgrading to the 2.4.21-20.EL kernels in U3 and am in the process of reverting to the previous kernels. You can easily work around the reboot issue with the Dell Server Administrator Auto Recovery feature, but I can't argue with running RHEL 2.1 unless you really need some of the RHEL 3 features. I ran 2.1 for quite a while on my 6450s and they were solid. Since upgrading to RHEL 3 over nine months ago we've had nothing but trouble, with every kernel release having some bug that seems to make it worse than the last one. I sometimes wish it was easy to go back. Later, Tom
I took a drive that had AS 3.2 installed in a Dell PE 1650 and installed in my 6450. The 6450 would reboot with the 1650 drive. The 1650 would NOT reboot with the 6450 drive.
In addition to seeing the previously reported behavior on 6450s, I'm also seeing this on our 1600s, all with 4 processors. This is a "resolved duplicate"? Someone please tell Redhat's support staff so they can tell me the fix.
This bugzilla was closed as a duplicate of another open bugzilla. Unfortunately the problem at hand is not resolved.
PING: metoo. Please unclassify the tracker for this.
Have you guys fixed this problem? I am running CentOS 3.6 and it is doing the same thing to me with 2 CPUs, and I just added 4.
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.9.EL).
(In reply to comment #22)
> A fix for this problem has just been committed to the RHEL3 U8
> patch pool this evening (in kernel version 2.4.21-40.9.EL).

How did you fix it?
Created attachment 128122
fix committed to RHEL3 U8 for this bug

Hi, Greg. The attached patch is what was committed to U8. It simply adds "black list" entries for the Dell PowerEdge 6400 and 6450 systems that make reboots go through the BIOS (by setting "reboot_thru_bios").
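The attachment itself is not reproduced in this thread. As a rough illustration of the blacklist approach described above, here is a self-contained user-space sketch and not the actual patch. Only the flag name `reboot_thru_bios` comes from the comment; the table layout, the function name `apply_reboot_quirks()`, and the vendor and product strings are assumptions, and matching by substring is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag named in the comment above; when set, reboots go through the BIOS. */
int reboot_thru_bios = 0;

/* Hypothetical blacklist entry: substrings expected in the firmware IDs. */
struct quirk_entry {
    const char *vendor;   /* substring of the system vendor string */
    const char *product;  /* substring of the product name string */
};

/* Assumed identification strings; the real patch may match differently. */
static const struct quirk_entry bios_reboot_list[] = {
    { "Dell", "PowerEdge 6400" },
    { "Dell", "PowerEdge 6450" },
};

/*
 * Compare the machine's vendor/product strings against the blacklist.
 * On a match, force BIOS reboots and return 1; otherwise return 0.
 */
int apply_reboot_quirks(const char *vendor, const char *product)
{
    assert(vendor != NULL && product != NULL);

    size_t count = sizeof(bios_reboot_list) / sizeof(bios_reboot_list[0]);
    for (size_t i = 0; i < count; i++) {
        if (strstr(vendor, bios_reboot_list[i].vendor) != NULL &&
            strstr(product, bios_reboot_list[i].product) != NULL) {
            reboot_thru_bios = 1;
            return 1;
        }
    }
    return 0;
}
```

The appeal of a table-driven quirk like this is that affected machines get the working reboot method by default, without every administrator having to add a "reboot=" option by hand.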
Adding a couple dozen bugs to CanFix list so I can complete the stupid advisory.