Red Hat Bugzilla – Bug 58671
Tux 2.0, Samba 2.2.x, Apache 1.3.19
Last modified: 2007-04-18 12:39:08 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR
Description of problem:
I have a Tux web server with Apache and Samba running. Occasionally the kernel
locks up (oops), occasionally httpd hangs, and occasionally a Tux thread on
our dual-processor machines runs at 100% utilization on only one of the
processors, #1. I have read in various places on the Internet that gcc 2.96
creates a kernel and user space programs that may cause file corruption (see
http://www.mysql.com/downloads/mysql-3.23.html bottom of the page). Samba 2.2.x
uses a database as a scoreboard for its sub-processes now, and it appears that
it is being corrupted somehow. When this happens, the file server stops
responding on the network. When the kernel oops happens (I hope to capture the
full oops output next time), the machine appears to have
been performing a paging operation. Furthermore, I believe the Apache problem
is due to the scoreboard file becoming corrupt. We also have a Red Hat 6.2
machine with Samba 2.0.7 that has experienced 100% uptime in over 300 days. The
Red Hat 7.1 machines seem to fail every 18-35 days. The question becomes: is
gcc 2.96 the root of these problems? I upgraded the kernel to 2.4.3-12 when
that erratum was released by Red Hat. This did not fix the problem. I am
reluctant to update further if the cause is gcc 2.96 as these machines are
mission critical to the operation of our website.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Normal operation, under loads ranging from none to heavy.
Actual Results: During normal operation, Samba dies because it can't find its own
processes in the scoreboard database. The same occurs with Apache, but not at
the same time (thus far). Tux seems to have a thread running at 100% on processor
#1, as reported by top. At times the whole system stops responding, and the
machine's reset button must be used. These problems do not occur at the same
time.
Expected Results: Normal operation.
A previous problem I had with these servers was data corruption on an array
controlled by a MegaRAID card. The systems are HP LPr machines, one with dual
PIII 550MHz and another with 850MHz. The Red Hat 6.2 machine running Samba 2.0.7
is the same model (550MHz) and has had 0% downtime. I scrapped the MegaRAID cards
and had to go with the built-in Symbios controller in the LPr machines to get
anything working. After reading reports about gcc 2.96, I feel it is the cause of
the MegaRAID problem, the Apache scoreboard file being corrupt, the Samba
scoreboard database being corrupt, Tux locking up on processor #1, and the system
completely locking up at times. I work in an environment that requires 99.995%
uptime and absolutely do not have time to sort through the oops printout or
anything else for that matter during down periods. I will, however, grab what I
can during the next episode and update this report.
The cause is not gcc 2.96. 2.96 has proven to be a very stable compiler and is
even recommended (next to *one* other gcc version) by Linus Torvalds for kernel
use (in fact, Linus himself uses 2.96).
We released a 2.4.9 kernel for 7.1 with a much upgraded TUX; it might be worth
upgrading to that...
Do current kernels still produce this problem?
I have upgraded both machines to 7.2 and then applied the 2.4.9-31 kernel along
with the needed modutils and newer Tux userspace RPMs. So far, everything is
fine. I would say the problem was in the older Tux; this can/should be closed.