From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)

Description of problem:
I have a Tux webserver with Apache and Samba running. Occasionally the kernel locks up (oops), occasionally httpd hangs, and occasionally, on our dual-processor machines, a Tux thread runs at 100% utilization on only one of the processors, #1. I have read in various places on the Internet that gcc 2.96 creates kernels and user-space programs that may cause file corruption (see http://www.mysql.com/downloads/mysql-3.23.html, bottom of the page). Samba 2.2.x now uses a database as a scoreboard for its sub-processes, and it appears that this database is being corrupted somehow. When this happens, the file server stops responding on the network. When the kernel oops happens (I hope to get a listing of all the oops info next time), the machine appears to have been performing a paging operation. Furthermore, I believe the Apache problem is due to the scoreboard file becoming corrupt. We also have a Redhat 6.2 machine with Samba 2.0.7 that has had 100% uptime for over 300 days. The Redhat 7.1 machines seem to fail every 18-35 days. The question becomes: is gcc 2.96 the root of these problems? I upgraded the kernel to 2.4.3-12 when that errata was released by Redhat. This did not fix the problem. I am reluctant to update further if the cause is gcc 2.96, as these machines are mission critical to the operation of our website.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Normal operation, non-existent to heavy loads.

Actual Results:
During normal operation, Samba dies because it can't find its own processes in the scoreboard database. The same occurs with Apache, but not at the same time (thus far). Tux seems to have a thread running at 100% on processor #1, as reported by top. At times the whole system will not respond, and the machine's reset button must be used.
These problems do not occur at the same time.

Expected Results:
Normal operation.

Additional info:
A previous problem I had with these servers was data corruption on a MegaRaid-controlled array. The systems are HP Lpr's, one with dual PIII 550MHz, and another with 850MHz. The Redhat 6.2 machine running Samba 2.0.7 is the same (550MHz) and has had 0% downtime. I scrapped the MegaRaid cards and had to go with the built-in Symbios in the Lpr's to get anything working. After reading reports about gcc 2.96, I feel it is the cause of the MegaRaid problem, the Apache scoreboard file corruption, the Samba scoreboard database corruption, Tux locking up on processor #1, and the system completely locking up at times. I work in an environment that requires 99.995% uptime and absolutely do not have time to sort through the oops printout or anything else for that matter during down periods. I will, however, grab what I can during the next episode and update this report.
The cause is not gcc 2.96. 2.96 has proven to be a very stable compiler and is even recommended (next to *one* other gcc version) by Linus Torvalds for kernel use (in fact, Linus himself uses 2.96). We released a 2.4.9 kernel for 7.1 with a much upgraded TUX; it might be worth upgrading to that.
Do current kernels still produce this problem?
I have upgraded both machines to 7.2 and then applied the 2.4.9-31 kernel along with the required modutils and newer TUX userspace RPMs. So far, everything is fine. I would say the problem was in the older TUX; this can/should be closed.