Red Hat Bugzilla – Bug 125091
Last modified: 2008-02-27 02:15:27 EST
Description of problem:
I just configured a production server for my company to run a large
web application using Apache2, PHP 4.3.6 and Oracle. I use tux to
improve system performance but it seems the system is crashing every
I just installed the latest FC2 updates.
My system is a 4-way SMP Compaq, 700MHz each, with 2GB of RAM. I first
installed FC2T3 and then updated to FC2 Final with yum.
Version-Release number of selected component (if applicable):
PHP 4.3.6 (I compiled it myself)
Oracle 9i Client (18.104.22.168.0)
It seems to happen under a relatively high load, say a load average of
13 or more, but that is not always the case; the system usually
stays between 3 and 8. We have 630 concurrent users and 30 dynamic
requests per second.
I'll post fragments of /var/log/messages in an attachment.
Let me add that 2.6 rocks: it reduced the average load from 40 to 8
compared with 2.4 (RH9). 2.4 was a lot more stable, though.
Created attachment 100799 [details]
This is taken from /var/log/messages
could you try the latest FC2 kernel rpm from:
there's one page-count bug that has been fixed since .358, let's make
sure you didn't get bitten by that one.
Mmmm... There are just 2.6.6 kernels and I have 2.6.5. Are those
kernels supposed to be production quality?? Or at least better than
the one I have now?
Anyway, right now I am downloading kernel-smp-2.6.6-1.411.i686.rpm
from the URL you gave me.
Is there another package that I need to upgrade along with the kernel?
My uptime is almost a full day and everything is fine so far.
08:06:18 up 23:07, 2 users, load average: 5.42, 4.70, 4.07
Just after I hit commit on comment number three, there was another
crash on the server. But there is nothing about it in
/var/log/messages. Is there another log that I can check?
I tried kernel-smp-2.6.6-1.411.i686.rpm this morning and it
didn't work well. Now Apache2 (FC2 version) crashes very easily with
that kernel, and it is always reproducible by stopping httpd
(/etc/rc.d/init.d/httpd stop). During normal operation it
crashes after a few minutes.
I must say that after those crashes, restarting httpd doesn't work
anymore: even if you kill -9 the remaining httpd processes, it
won't start again after the crash. The only option is to reboot the system.
So, I had to go back to 2.6.5-358 smp.
As a side note, I made a stupid little mistake upgrading the kernel:
I did rpm -Uvh instead of rpm -ivh, causing the loss of the "stable"
smp kernel. It was a nightmare to get it back, but I did it.
I'll attach the relevant log information later. I wonder if I should
post the log messages of the boot process.
Created attachment 100870 [details]
This is from the smp kernel 2.6.6-1.411 from Arjan
Could you temporarily switch off Tux and see whether Apache alone
produces the crash? (using the 2.6.6-1.411 kernel.)
one other thing in your crashes are shared-memory related functions -
are you using ramfs or shmfs in any way to store/cache content, or is
this Apache's usual shm use?
kernel BUG at mm/shmem.c:614!
that is caused by a bug in a 411 patch; please try a later kernel
Ingo, I'll turn tux off on friday because monday is a very busy day...
I'll try to do it earlier if I can.
And yes, I'm using some shared memory functions. Our PHP application
uses an in-house mechanism written in PHP to store objects and result sets
from the database in shared memory, boosting execution speed. Our
library uses the PHP Shared Memory functions:
Since we have a lot of data to cache, I had to increase
/proc/sys/kernel/shmmax from 32 to 64MB. I don't think there is a
problem with that.
I installed apache 1.3.31 from sources and compiled PHP 4.3.7. I
want to see if apache2 (from FC2) is the offender in this case. In
fact, this was the first time I ran my app with apache2 and, as far as
I can see, apache2 executes much faster than apache 1.3, so I would
like to switch back to apache2 ASAP.
On friday too, I'm willing to test the latest kernel from Arjan and
see what happens.
Created attachment 100917 [details]
With kernel 2.6.5-1.358smp
Bad news, I had another crash... like I said before I am using Apache 1.3.31
Any comments about the log?
the crash from June 7 you attached still seems to have Tux related
functions in them. Please try the latest post-411 kernel from Arjan
and try to exclude Tux to simplify things.
another independent thing to try would be to exclude the shm
extension. Cannot that feature work over a normal filesystem? Linux is
nearly as good caching files on ext3 as using shmfs, there should be
no noticeable performance difference. (as long as you turn atime and
diratime off in that filesystem.)
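Whether a filesystem is already mounted with atime updates turned off can be read from /proc/mounts. The following is only a minimal sketch in Python, assuming the usual /proc/mounts layout (device, mount point, type, comma-separated options) and the noatime/nodiratime mount option names; the helper name atime_options is hypothetical and not part of any tool mentioned in this report.

```python
def atime_options(mounts_text, mount_point):
    """Return (noatime, nodiratime) for mount_point, or None if it is not mounted.

    mounts_text is the content of /proc/mounts (or text in the same format).
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
            # Keep the last match: a later mount shadows an earlier one.
            result = ("noatime" in options, "nodiratime" in options)
    return result
```

On a live system it could be called as atime_options(open("/proc/mounts").read(), "/var/www"), where the mount point is whatever filesystem holds the cache files.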
Ingo, right now I'm downloading kernel-smp-2.6.6-1.422.i686.rpm from
Arjan to install it tonight.
Since I will disable tux, I'm going back to apache2 to minimize the
performance loss, because now apache will have to serve static content.
BTW, is the new sendfile() syscall as fast as tux?
Regarding SHM, yes, it can be done on the file system, but that
implementation is not as robust as SHM. If that keeps bugging me I'll
make a better implementation and then replace it.
Today has been the worst day... 8 crashes so far.
Ingo, is there a way to get a core dump or something? have you seen
something weird in tux?
Tomorrow's test will be:
- Apache2 (from FC2)
- PHP 4.3.7 (solves a bug with apache2 it seems)
- Tux disabled
- latest kernel from Arjan
I hope performance doesn't suffer too much tomorrow. Wish me luck!
Good news... This configuration worked well the whole day. Not a
single crash, not even a single syslog message. So the problem seems to be
specific to TUX. Should I enable it again with this new kernel?
Ingo, is this problem solvable?? Do you need more info? I really want
to enable TUX because it is very important for our application
performance. It lowers memory consumption and reduces the number of
http processes, thus, lowering the number of connections to the
database. With tux we needed about 40-60 processes, now without it we
need 140-190... a huge difference.
Same thing with memory consumption: 400-600MB against 0.9-1.2GB.
Talking about other things... Arjan, Does this new kernel ship SELinux
on by default?? Should I turn it off?
I got this in syslog during boot up:
kernel: Security Scaffold v1.0.0 initialized
kernel: SELinux: Initializing.
kernel: SELinux: Starting in permissive mode
kernel: There is already a security framework initialized,
kernel: selinux_register_security: Registering secondary module
kernel: Capability LSM initialized as secondary
kernel: Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Could you try three things:
- try the new kernel with Tux enabled - does it still crash?
- if it still crashes, how hard would it be to disable the shmfs-using
component of your setup? If it's possible then does Tux still crash
in a non-shmfs setup too?
- if Tux crashes in this way too, could you try to disable
zerocopy_header and zerocopy_sendfile, in /proc/sys/net/tux/? [you can
set these while Tux is running, the flags take effect immediately.]
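As a sketch of that last step, the two flags can be cleared by writing 0 into the corresponding files. This is only an illustration in Python, assuming both files exist under /proc/sys/net/tux/ on a kernel with Tux loaded and that the script runs as root; the function name disable_tux_zerocopy is hypothetical.

```python
import os


def disable_tux_zerocopy(sysctl_dir="/proc/sys/net/tux"):
    """Write 0 to zerocopy_header and zerocopy_sendfile; return the paths written."""
    written = []
    for name in ("zerocopy_header", "zerocopy_sendfile"):
        path = os.path.join(sysctl_dir, name)
        # Each sysctl is a small text file holding a single integer.
        with open(path, "w") as f:
            f.write("0\n")
        written.append(path)
    return written
```

The sysctl_dir parameter exists only so the function can be tried against a scratch directory before pointing it at the real /proc tree.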
the oopses themselves don't show anything incriminating - code is
crashing that hasn't changed for some time. Maybe there's some 2.6
kernel bug lurking around, maybe it's the recent objrmap changes,
maybe it's something else.
Any chance to try your current (apparently flawless) setup with Tux
reenabled? (but no other change.) Do the crashes reappear?
I've been monitoring the newest kernel 2.6.6-1.422 (from Arjan)
without TUX and it seems to work fine so far.
The fact that this is a production system with lots of users makes it
very hard to do some testing.
But I have some plans for this week... First of all, I already made a
first test. gkrellm was crashing the first kernel too (2.6.5-1.358)
and now it works fine (15 days of uptime so far).
I already installed the official 2.6.6-1.435 but I haven't rebooted yet.
I plan to test it this Thursday, and if everything goes well I will
enable TUX on Friday, which is a (relatively) easy day for the system.
So, stay tuned. ;-)
Is there any reason that the component of this bug is set to 4Suite?
No, I selected tux when I created it, I wonder why this happened.
OTOH, I'm still trying to enable tux and run some tests on the
production machine with the new official 2.6.6-1.435 kernel so we can
close this bug. I hope I can do this tomorrow.
Created attachment 102141 [details]
I'm sorry for the lateness,
At last I could do some testing of this, with bad results: it locked up again very
early with kernel 2.6.6-1.435smp.
It was fast. I configured the web server to run tux last night at 11:30
and it locked up today at 7:30AM, just a few minutes after I got to my workplace.
Apache2 (from FC2) alone works flawlessly, very stable.
I hope this log helps.
hm, your latest oops shows:
ecx: 00200200 edx: 00100100
this is roughly a single bit away from a NULL-ptr derived offset. How
reliable is the hardware?
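To make those two register values concrete, here is a small Python sketch (not part of the original report) that lists which bits are set in each one. The helper name set_bits is hypothetical. Each value has exactly two bits set, and ecx is edx shifted left by one bit. As a side observation, in 2.6 kernels these values also appear to match the LIST_POISON1 (0x00100100) and LIST_POISON2 (0x00200200) constants that list_del() stores in a removed list entry, so a stale list entry may be worth ruling out alongside the hardware.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest bit first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


ecx = 0x00200200  # register values copied from the oops
edx = 0x00100100

print(set_bits(ecx))    # [9, 21]
print(set_bits(edx))    # [8, 20]
print(ecx == edx << 1)  # True
```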
another thing: do you have gzip compression enabled in Tux? If yes
then please turn it off.
ecx, edx? what does that mean?
The hardware has a good reliability, I haven't seen any problem so
far. Sometimes it gets more than 30 days of uptime. Right now it
shows an uptime of 6 days because of a kernel update. The server is
under heavy stress almost every day of the week, especially now that
it's not running with tux. During the day it goes from an avg. load
of 4 to 60 depending on the hour. Even so our web app performance is
good, but I had to configure Apache to use 320 httpd processes to
handle the load, and even with that it reaches the limit sometimes.
How can I test reliability problems?? We have been with that server
for a very long time now and every thing works as expected.
And yes, I have gzip compression enabled in tux. So I'll turn it off
and see if that crashes the newest kernel (2.6.7-1.494.2.2smp) again.
I haven't tested that kernel with tux yet, so we'll see.
It seems that the latest kernel (2.6.7-1.494.2.2smp) doesn't have the
tux kernel module. Because of this tux doesn't start; the
initialization script fails on the line where it does a 'modprobe tux'.
Every previous kernel has the tux module except for 2.6.7.
Did you forget to ship it? What can I do? I had to suspend today's
Just for the record, 2.6.8-1.521 doesn't have tux either. What can I
do? I need to test.
A load average of about 82 for most of the morning. What
happened to tux? Is it unsupported now?
the symbol fixes that made the tux module unavailable should be fixed
in the next (today's?) rpm in Arjan's directory.
Arjan's latest is .549 and still doesn't have the tux.ko module.
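A quick way to confirm whether a given kernel package ships the module is to search its module tree for tux.ko. The sketch below is a plain file search in Python, assuming modules are installed as uncompressed .ko files under /lib/modules/<kernel release>; the function name find_module is hypothetical, and this does not replace what modprobe does with modules.dep.

```python
import os


def find_module(module_name, modules_root):
    """Return the sorted paths of <module_name>.ko files found under modules_root."""
    target = module_name + ".ko"
    hits = []
    for dirpath, _dirnames, filenames in os.walk(modules_root):
        if target in filenames:
            hits.append(os.path.join(dirpath, target))
    return sorted(hits)
```

For the running kernel it could be called as find_module("tux", "/lib/modules/" + os.uname().release); an empty list means the module is not there.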
Created attachment 104014 [details]
Another crash with Arjan's latest kernel (2.6.8-1.579smp)
It crashed again, but this time it was a little bit different. Everything was
working really great and suddenly I saw some syslog messages. They were from TUX
at 07:28:42 AM. But it didn't crash immediately; it kept working fine for
several minutes. Almost 20 minutes according to syslog.
I used kernel 2.6.8-1.579smp from Arjan. I made the configuration
changes that we agreed on: 1) Disabled compression and 2) Replaced the
symlink with a real directory in the document root.
I hope this trace helps.
I managed to do more testing in the production environment and tux still
There are several changes now. First of all, our traditional server
just got busted and we are using a brand new and powerful server (IBM
xSeries 445) with 2.2 GB of RAM and 4 SMP Hyperthreaded processors,
2.2 GHz each, so Linux sees 8 logical processors. Actually the server
comes with 8 real processors (16 logical) but the extra ones are
disconnected right now.
I installed Fedora Core 3 with the latest official updates including
the kernel (2.6.9-1.681_FC3smp). Then I installed the Oracle 10g client,
which ran flawlessly on FC3.
Apache and PHP are from FC3 and I downloaded the PHP source from
www.php.net to compile the oci8 module and add it to the PHP module
directory (/usr/lib/php4). Everything works great with Apache alone.
Then I configured tux and when I started the service, it crashed
as soon as I hit Enter. Then I restarted the server which,
BTW, takes like 10 minutes just to get GRUB loaded. TUX started during
the boot process and this time it ran fine for about 8 minutes
and then crashed again. Since then I have been running with apache alone.
With the first crash I got a huge syslog, it seems it registered each
thread crash (8 threads). The second syslog was very small in
I'll attach the two syslogs in a very short time. I hope this really
Created attachment 108259 [details]
Crash with kernel 2.6.9-1.681_FC3smp
This one is the crash that I got after I started tux manually as soon as I
Created attachment 108261 [details]
Second crash with kernel 2.6.9-1.681_FC3smp
After 8 minutes of normal work I got this syslog. TUX was started
automatically during boot after the first crash.
BTW, this is my tux configuration:
# TUXTHREADS=1 <-- By default it starts 8 threads
DOCROOT=/var/www/html/ <-- The same as apache
#LOGFILE=/var/log/tux <-- I'm not logging the requests
# TUX Configuration
That's all... I hope.
Created attachment 112484 [details]
First crash with kernel 2.6.10-1.760_FC3smp
After a very long time I was able to reproduce this bug on a different system.
This time I had to set up a test server for a new web application with an
architecture similar to the one originally used to open this bug report; in fact, this
is the very same server (recently we switched to an IBM xSeries 445, which
presents the same problems with tux).
I installed Fedora Core 3 with the latest updates available at that time (March
01 2005 approx.) and configured an Oracle 9i database along with the web servers
(tux and apache). Everything was working fine for a few days until tux started
This time the behaviour is a bit different. All tux threads crashed in less
than 2 minutes (it's an SMP 4x) but the system stayed operational; the only
thing that broke was tux. I could ssh in and tried to kill tux but I couldn't.
Then I reconfigured apache to work without tux, but it didn't work because tux was
blocking port 80, although the apache start script said that it launched
I had to reboot. After that, I left the server with the default
configuration (tux + apache) and hoped for the best... It crashed again later.
There are some comments within the log.
Created attachment 112488 [details]
Second crash with kernel 2.6.10-1.760_FC3smp
Hello there, this one is a continuation of the previous attachment. It's the
second crash, which blocked the machine completely; we had to wait until after the
weekend to reboot the server.
This time only two threads crashed.
Created attachment 112490 [details]
Crash log with 2.6.10-1.770_FC3 UP kernel
This one is from my workstation. I installed Fedora Core 3 with the latest
updates and it crashed on me twice until I deactivated tux to get some work
done on my projects. This is the first time I can reproduce the problem on
my PC, but I still can't recognize a pattern.
But hey, this means we can use my box to do heavy testing of the problem with
deep changes like development kernels. I remember Ingo was complaining about
my web app using a shared memory based cache to store data. Since this is not a
production system and performance is not critical, we can use a file based
cache or a fake cache or something like that just to make sure there wasn't any
problem with that.
I hope this time we can nail this problem down.
Created attachment 113415 [details]
Crash with latest stable kernel 2.6.11-1.14_FC3 UP
I got another crash from tux this morning after reboot (this is my
workstation). Looking at the backtrace, I'm not sure if it is caused by a symlink,
which I seriously doubt because I removed them from my document_root many
The other reason could be the SHMCache mechanism in my web app. For
several weeks now I have been running with a NullCache (which is a dumb object
saying 'always go to the database') as you suggested, and in fact it has been
more stable than SHM, but today's crash happened while using NullCache.
Please take a look at the backtrace. Is it complete? I guess there are some
entries missing. However, I compared it with previous backtraces in the bug
report and it looks a bit different.
Another thing: this kernel (2.6.11-1.14_FC3) has been locking up on me a couple
of times without any information about it in the logs. I can ping the computer
and it responds, but I can't connect to it (ssh, telnet, ftp).
The only thing I see in the logs at boot time is this:
Apr 20 09:52:55 nalwalovaton kernel: PCI: Found IRQ 10 for device 0000:00:02.0
Apr 20 09:52:55 nalwalovaton kernel: [drm] Initialized i810 1.4.0 20030605 on
Apr 20 09:52:55 nalwalovaton kernel: mtrr: base(0x44000000) is not aligned on a
Apr 20 09:52:55 nalwalovaton kernel: [drm] Using v1.4 init.
It's me again with hot news.
I developed a filesystem based cache mechanism to use with my web app instead of
the SHM cache mechanism as Ingo suggested. I tried it for a few days in my
workstation and it was a lot more stable, tux used to crash once a day with SHM
Then I tried the filesystem cache in the production environment (only apache) and it
worked fine; the performance is comparable. I scheduled a tux test for
Saturday, April 23, because there are fewer users but there is still enough load
(about 40% of the usual load).
I reconfigured apache to be tux friendly, then I started tux and my web app was
up and running. The requests seemed to be flowing normally but there was a
little detail: Tux was hogging one of the CPUs, 100% all the time. The server
has 4 HT processors. After 20 minutes of work it crashed. I had to reboot.
After the reboot, 15 minutes later (damn xSeries 445!), I felt a bit brave and
decided to start tux again. I noticed the very same problem, one of the CPUs at
100%, and about 30 seconds later it crashed again... the fastest crash ever. :-(
This time I played safe and started only with apache.
After the reboot, ipcs showed clean shared memory, so there is no way the SHM cache
could be causing this problem. One little thing I didn't change was
"kernel.shmmax = 536870912" in /etc/sysctl.conf. Maybe this is a problem, but
during both tests there wasn't any shm usage.
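For reference, kernel.shmmax is expressed in bytes, so the value above works out to 512 MiB. A minimal Python sketch of the conversion follows; the helper names bytes_to_mib and read_shmmax are hypothetical, and read_shmmax assumes the usual /proc/sys/kernel/shmmax file is present.

```python
MIB = 1024 * 1024


def bytes_to_mib(n_bytes):
    """Convert a byte count, as used by kernel.shmmax, to MiB."""
    return n_bytes / MIB


def read_shmmax(path="/proc/sys/kernel/shmmax"):
    """Return the running kernel's shmmax limit in bytes."""
    with open(path) as f:
        return int(f.read().strip())


sysctl_value = 536870912           # value quoted from /etc/sysctl.conf
print(bytes_to_mib(sysctl_value))  # 512.0
```

Calling bytes_to_mib(read_shmmax()) on the server would show whether the running kernel actually picked up the sysctl.conf setting.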
I'll attach both backtraces soon.
Created attachment 113694 [details]
First crash with production server (2.6.10-1.770_FC3smp)
Created attachment 113695 [details]
Second crash with production server
Ingo, I hope this info helps. At least I made a filesystem based cache ;-)
Hi Ingo, after seven months away I am back again with a new backtrace.
This time the box is a very old low end server with one processor (Pentium II)
with 350MHz, 512 MB of RAM and a SCSI storage controller: Adaptec AIC-7880U
according to lspci.
The OS is an up-to-date (as of 14 Nov) Fedora Core 3 system; this includes the
latest glibc update and the latest kernel update. After the update the system
was prelinked (I don't know if that counts).
The machine was configured for testing and training of new enterprise web
applications that we are developing in our company. Right now it has 25
concurrent users (in the classroom) and I took the opportunity to give tux
another test. The performance of tux was great as always, but after an hour it
crashed in the very same way as before (I'll attach the log later).
Unfortunately we couldn't install FC4 on that machine because it was required to
be configured exactly the same as our production server, which is a Fedora
Core 3 system, but there are lots of updates pending to be applied.
After this training session (which ends next Friday), there are going to be
training sessions all over the country, so we can use this server for testing
patches or something. Since the machine is very low end, tux is going to be a
In our production environment we don't need tux yet because the servers are very
powerful, but tux is very important for us because it helps us to reduce the
number of persistent connections to the databases server.
Thanx again for your support.
Created attachment 121098 [details]
tux crash on a low end server
This is the backtrace from the uniprocessor server; maybe using old technology
and avoiding SMP and HyperThreading will help you to understand the problem...
I have had a lot of trouble with the 2.6.x kernel and TUX.
I made a patch for TUX on 2.6.12.
The TUX module options are below.
# TUX options
# CONFIG_TUX_EXTCGI is not set
# CONFIG_TUX_EXTENDED_LOG is not set
# CONFIG_TUX_DEBUG is not set
Don't use TUX's log or debug messages.
I had trouble between TUX and syslogd.
The TUX logger and redirection have some trouble, too.
I tested on a 1-CPU PC.
If you have any trouble, please capture the kernel's messages and send to
I attach a patch file & TUX configure file.
NOTICE: This patch is experimental. ;)
Created attachment 121179 [details]
Don't use TUX logger and debug messages.
I have been hacking on TUX on Linux 2.6.12.
I found that some routines end up in a strange state.
For example, the do_send_abuf() function in abuf.c.
If tcp_sendpage()'s return value was smaller than 0, req->in_file lost
I think that if req->in_file has lost its pointer, accessing req->in_file->f_pos
is meaningless.
I added a check of req->in_file in the source code, and then TUX didn't panic.
Is this correct? I don't think so :D
I wonder why req->in_file lost its pointer.
If I run my patched TUX and kernel, what else will happen?
I'd like your opinion. Have a nice day~
I managed to reproduce another crash from tux using FC4 on my home machine. Maybe
it's a different problem from the one I'm seeing on the production server, but
it's still a crash.
I am using an almost up-to-date FC4 system with 512 MB of RAM, an Athlon XP
2000 CPU and kernel 2.6.14-1.1637_FC4, but I had to use kernel 2.6.13-1.1532_FC4 because
the other one didn't come with tux (hint: upstream it?). I hope tomorrow I can
reproduce this problem with the latest FC4 kernel update.
The thing is that I was trying to reproduce my server (FC3) problem at home
(FC4). I configured tux and used only tux for the test; there was no apache or
any other user space web server. Then I created a very simple static html file
which reloaded itself every 2 seconds and left it running for a while.
After an hour it was still working fine, so I decided to stop tux (while the
browser was still running) and that's when it crashed. Then I rebooted the
machine and re-ran the test, but before stopping tux I closed the browser; this
time tux didn't crash (I tried that several times). Then I tried to stop tux
while the browser was running and it crashed again (very easy to reproduce).
Since my computer uses the proprietary driver from nVidia, I changed the video
driver to "vesa generic", rebooted the machine and retried the test to get a
backtrace with a non-tainted kernel... it still crashed.
The test environment was basically a simple gnome desktop plus tux.
My tux configuration is as follows:
# TUX Configuration
# TUXMODULES="demo.tux demo2.tux demo3.tux demo4.tux"
In the last file I "redirected" php requests to the user space web server, but
there is no need for it in this test since there are no php files involved.
I'll post the html file from the test I made and the backtraces I got.
Like I said, the page reloads itself every 2 seconds and the
MAX_KEEPALIVE_TIMEOUT parameter is set to 10; maybe that's what is
causing the crash, but I'm just guessing.
Please Ingo, try to reproduce this on your machine and let's see what happens.
Created attachment 121826 [details]
This is the html page I made for the test
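The attachment itself is not reproduced in this text. For anyone who wants to retry the test, a page that reloads itself every 2 seconds can be generated with something like the Python sketch below. It assumes a meta refresh tag is used for the reload (the original page may have used another mechanism), and the function name reload_page and the page title are hypothetical.

```python
def reload_page(seconds=2, title="TUX reload test"):
    """Return a static HTML page that asks the browser to reload it every `seconds`."""
    return (
        "<html>\n<head>\n"
        f'<meta http-equiv="refresh" content="{seconds}">\n'
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"<p>This page reloads itself every {seconds} seconds.</p>\n"
        "</body>\n</html>\n"
    )
```

Writing the returned string to a file in the tux DOCROOT and opening it in a browser keeps a steady trickle of requests going while tux is stopped.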
Created attachment 121827 [details]
First crash with a tainted kernel
Created attachment 121828 [details]
Second crash with a tainted kernel, slightly different than the first one, I don't know why
Created attachment 121829 [details]
Crash with a NON tainted kernel
Please Ingo, try to reproduce this problem, it's very easy on my side.
Created attachment 121849 [details]
Same crash on FC3 (2.6.12-1.1381_FC3)
Created attachment 121850 [details]
Same crash on FC3 with no keep alive
I tried the same test with the keep-alive option deactivated in TUX but it still
crashed, so the bug seems to be a little more complex than that. I hope you
can reproduce this on your side.
I have to say that in these crashes the system keeps working fine (desktop
wise). TUX is the only thing that gets blocked.
Created attachment 122530 [details]
Crash with rawhide 21 Dec 2005
I installed a Rawhide test box on a VMware 5.5 machine. It was FC4 at first,
but then I updated it with yum to the latest devel tree as of 21 Dec 2005 (kernel
2.6.14-1.1777_FC5.i686). I managed to reproduce the latest crashes I posted in
this bug report.
1st test: start and stop tux... [Ok]
2nd test: start tux, open firefox, request page, close firefox, stop tux...
3rd test: start tux, open firefox, request page, stop tux... [Failed]
In the third test tux crashed like in FC3 and FC4 mentioned in the previous
comments... see the attachment for the details.
Any thoughts on the backtrace?
PS. I'll try this on a real machine if I can get one.
Created attachment 122628 [details]
Simple Patch for tux3-2.6.14-A1
I tested tux3-22.214.171.124-A1 and 126.96.36.199 kernel (downloaded from kernel.org).
The problem of req->in_file losing its pointer in do_send_abuf() didn't occur.
What Ingo said is correct.
My test method is as follows. My TUX box is a Pentium III 800MHz (1 CPU).
1. TUX serves only JPG files (no TUX log, no TUX CGI, no TUX DEBUG).
2. Apache+PHP serves the dynamic PHP page.
3. I made a PHP program that sends 10 random JPG files out of 10000 files to user's
Some of the files do not actually exist.
4. Then the user's access is redirected to Apache, and TUX serves the 10 random JPG
5. On another Linux client box, 512 wget processes send requests to the TUX+Apache server
and the wget process count is kept near 512.
6. This goes on for over 12 hours.
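The random selection in step 3 can be sketched as follows. This is only an illustration in Python, not the PHP program used in the test; the file naming scheme (00000.jpg to 09999.jpg) and the function name pick_images are assumptions made for the example.

```python
import random


def pick_images(total=10000, count=10, rng=random):
    """Pick `count` distinct JPG names out of `total` numbered candidates."""
    # rng.sample draws without replacement, so no name is repeated in one page.
    return [f"{n:05d}.jpg" for n in rng.sample(range(total), count)]
```

Because names are drawn from the full numbered range regardless of what is on disk, some picks can point at files that do not exist, which matches the note under step 3.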
But William's crash occurred on my system.
I tracked the crash and found that it is the same problem with req->in_file.
It is located in the tux_flush_workqueue() function in input.c.
I think it is safer to check req->in_file before use.
I attach my simple patch. Could you test my patch?
1. Download 188.8.131.52 from kernel.org
2. Apply the tux3-184.108.40.206-A1 patch
3. Apply my patch.
4. Compile the kernel & test.
Created attachment 144158 [details]
Oops from tux
Today I got this oops. I was playing with s-c-services. I can't reproduce it.
ip6tables, selinux and audit are disabled...
Oops, I forgot the important part:
Tux has been removed from Fedora.