Description of problem:
I just configured a production server for my company to run a large web application using Apache2, PHP 4.3.6 and Oracle. I use tux to improve system performance, but the system seems to crash every few days. I just installed the latest FC2 updates. My system is a 4-way SMP Compaq, 700MHz per CPU, with 2GB of RAM. I first installed FC2T3 and then updated to FC2 Final with yum.

Version-Release number of selected component (if applicable):
tux-3.2.18-1
httpd-2.0.49-4
PHP 4.3.6 (I compiled it myself)
Oracle 9i Client (9.2.0.4.0)

How reproducible:
It seems to happen under relatively high load, say a load average of 13 or more, but that is not always the case; the system usually stays between 3 and 8. We have 630 concurrent users and 30 dynamic requests per second. I'll post fragments of /var/log/messages in an attachment.

Additional info:
Let me add that 2.6 rocks: it reduced the average load from 40 to 8 compared with 2.4 (RH9). 2.4 was a lot more stable, though.
Created attachment 100799 [details] This is taken from /var/log/messages
could you try the latest FC2 kernel rpm from: http://people.redhat.com/arjanv/2.6/RPMS.kernel/ there's one page-count bug that has been fixed since .358, let's make sure you didn't get bitten by that one.
Mmmm... There are only 2.6.6 kernels there and I have 2.6.5. Are those kernels supposed to be production quality? Or at least better than the one I have now? Anyway, right now I am downloading kernel-smp-2.6.6-1.411.i686.rpm from the URL you gave me. Is there another package that I need to upgrade along with the kernel? My uptime is almost a full day and everything is fine so far:
08:06:18 up 23:07, 2 users, load average: 5.42, 4.70, 4.07
Damn!
Just after I hit commit on comment number three, the server crashed again. But there is nothing about it in /var/log/messages. Is there another log that I can check?
I tried kernel-smp-2.6.6-1.411.i686.rpm this morning and it didn't work well. With that kernel Apache2 (the FC2 version) crashes very easily; it is always reproducible by stopping httpd (/etc/rc.d/init.d/httpd stop), and during normal operation it crashes after a few minutes. After those crashes, restarting httpd doesn't work anymore: even if you kill -9 the remaining httpd processes, it won't start again. The only option is to reboot the system. So I had to go back to 2.6.5-358 smp. As a side note, I made a stupid little mistake while upgrading the kernel: I did rpm -Uvh instead of rpm -ivh, which caused the loss of the "stable" smp kernel. It was a nightmare to get it back, but I did it. I'll attach the relevant log information later. I wonder if I should post the log messages of the boot process.
Created attachment 100870 [details] This is from the smp kernel 2.6.6-1.411 from Arjan
Could you temporarily switch off Tux and see whether Apache alone produces the crash? (using the 2.6.6-1.411 kernel.) one other thing in your crashes are shared-memory related functions - are you using ramfs or shmfs in any way to store/cache content, or is this Apache's usual shm use?
kernel BUG at mm/shmem.c:614! that is caused by a bug in a 411 patch; please try a later kernel
Ingo, I'll turn tux off on Friday because Monday is a very busy day... I'll try to do it earlier if I can. And yes, I'm using some shared memory functions. Our PHP application uses an in-house mechanism written in PHP to store objects and result sets from the database in shared memory, which boosts execution speed. Our library uses the PHP shared memory functions: http://www.php.net/sem Since we have a lot of data to cache, I had to increase /proc/sys/kernel/shmmax from 32 to 64MB. I don't think there is a problem with that. I installed apache 1.3.31 from sources and compiled PHP 4.3.7... I want to see if apache2 (from FC2) is the offender in this case. In fact, this was the first time I ran my app with apache2, and as far as I can see apache2 executes much faster than apache 1.3... So I would like to switch back to apache2 ASAP. On Friday I'm also willing to test the latest kernel from Arjan and see what happens. Regards.
Created attachment 100917 [details]
With kernel 2.6.5-1.358smp

Bad news, I had another crash... as I said before, I am using Apache 1.3.31 this time. Any comments about the log?
the crash from June 7 you attached still seems to have Tux related functions in it. Please try the latest post-411 kernel from Arjan and try to exclude Tux to simplify things. another independent thing to try would be to exclude the shm extension. Can't that feature work over a normal filesystem? Linux is nearly as good at caching files on ext3 as it is with shmfs, so there should be no noticeable performance difference. (as long as you turn atime and diratime off in that filesystem.)
Ingo, right now I'm downloading kernel-smp-2.6.6-1.422.i686.rpm from Arjan to install it tonight. Since I will disable tux, I'm going back to apache2 to minimize the performance loss, because apache will now have to serve static content. BTW, is the new sendfile() syscall as fast as tux? Regarding SHM, yes, it can be done in the file system, but that implementation is not as robust as SHM. If this keeps bugging me I'll write a better implementation and then replace it.
Today has been the worst day... 8 crashes so far. Ingo, is there a way to get a core dump or something? Have you seen something weird in tux? Tomorrow's test will be:
- Apache2 (from FC2)
- PHP 4.3.7 (it seems to solve a bug with apache2)
- Tux disabled
- latest kernel from Arjan
I hope performance doesn't suffer too much tomorrow. Wish me luck!
Hey guys, good news... This configuration worked well the whole day. Not a single crash, not even a single syslog message. So the problem seems to be specific to TUX. Should I enable it again with this new kernel? Ingo, is this problem solvable? Do you need more info? I really want to enable TUX because it is very important for our application's performance. It lowers memory consumption and reduces the number of httpd processes, thus lowering the number of connections to the database. With tux we needed about 40-60 processes; now, without it, we need 140-190... a huge difference. Same thing with memory consumption: 400-600MB against 0.9-1.2GB.

Talking about other things... Arjan, does this new kernel ship with SELinux on by default? Should I turn it off? I got this in syslog during boot:
kernel: Security Scaffold v1.0.0 initialized
kernel: SELinux: Initializing.
kernel: SELinux: Starting in permissive mode
kernel: There is already a security framework initialized, register_security failed.
kernel: selinux_register_security: Registering secondary module capability
kernel: Capability LSM initialized as secondary
kernel: Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Could you try three things:
- try the new kernel with Tux enabled - does it still crash?
- if it still crashes, how hard would it be to disable the shmfs-using component of your setup? If it's possible then does Tux still crash in a non-shmfs setup too?
- if Tux crashes in this way too, could you try to disable zerocopy_header and zerocopy_sendfile, in /proc/sys/net/tux/? [you can set these while Tux is running, the flags take effect immediately.]
the oopses themselves don't show anything incriminating - code is crashing that hasn't changed for some time. Maybe there's some 2.6 kernel bug lurking around, maybe it's the recent objrmap changes, maybe it's something else.
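For reference, the third step above could be scripted roughly as follows. This is only a sketch: the directory and the two flag names are taken from the comment above, the helper name set_tux_flags is made up for illustration, and writing 0 to switch a flag off is an assumption based on how boolean sysctls usually behave. It needs root and only does anything while the tux module is loaded.

```python
import os

# Directory quoted in the comment above; present only while tux is loaded.
TUX_SYSCTL_DIR = "/proc/sys/net/tux"


def set_tux_flags(flags, base_dir=TUX_SYSCTL_DIR):
    """Write each flag value to its sysctl file and return the values read back.

    set_tux_flags is a hypothetical helper, not part of any tux tooling.
    """
    result = {}
    for name, value in flags.items():
        path = os.path.join(base_dir, name)
        with open(path, "w") as f:
            f.write(f"{value}\n")
        with open(path) as f:
            result[name] = f.read().strip()
    return result


if os.path.isdir(TUX_SYSCTL_DIR):
    # Assumption: 0 disables the feature.
    print(set_tux_flags({"zerocopy_header": 0, "zerocopy_sendfile": 0}))
```

Reading the values back after writing makes it easy to paste the resulting state into the bug report together with the next oops, if there is one.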
Any change to try your current (apparently flawless) setup with Tux reenabled? (but no other change.) Do the crashes reappear?
s/change/chance/
Hi Ingo, I've been monitoring the newest kernel, 2.6.6-1.422 (from Arjan), without TUX and it seems to work fine so far. The fact that this is a production system with lots of users makes it very hard to do any testing. But I have some plans for this week... First of all, I already ran a first test: gkrellm was crashing the first kernel too (2.6.5-1.358) and now it works fine (15 days of uptime so far). I already installed the official 2.6.6-1.435 but I haven't rebooted yet. I plan to test it this Thursday, and if everything goes well I will enable TUX on Friday, which is a (relatively) easy day for the system. So, stay tuned. ;-)
Is there any reason that the component of this bug is set to 4Suite?
No, I selected tux when I created it; I wonder why this happened. OTOH, I'm still trying to enable tux and run some tests on the production machine with the new official 2.6.6-1.435 kernel so we can close this bug. I hope I can do this tomorrow.
Created attachment 102141 [details]
Another crash...

I'm sorry for the delay. I could finally do some testing of this, with bad results: it locked up again very early with kernel 2.6.6-1.435smp. It was fast. I configured the web server to run tux last night at 11:30 and it locked up today at 7:30AM, just a few minutes after I got to work. Apache2 (from FC2) alone works flawlessly and is very stable. I hope this log helps.
hm, your latest oops shows: ecx: 00200200 edx: 00100100 this is roughly a single bit away from a NULL-ptr derived offset. How reliable is the hardware? another thing: do you have gzip compression enabled in Tux? If yes then please turn it off.
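To make the "single bit away" remark concrete, here is a small sketch that lists which bits are set in the two register values quoted from the oops. The helper name set_bits is made up for illustration. Each value has exactly two bits set: a small offset (0x200 or 0x100) plus one high bit, which is why a flipped bit on top of a NULL-based offset is one possible reading. As a side observation that is not stated anywhere in this report: the same two values are the LIST_POISON2 and LIST_POISON1 constants from include/linux/list.h in 2.6 kernels, so a stale, already-deleted list entry may be another explanation besides bad hardware.

```python
# Register values quoted from the oops in the comment above.
registers = {"ecx": 0x00200200, "edx": 0x00100100}


def set_bits(value):
    """Return the positions of the bits that are set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


for name, value in registers.items():
    bits = set_bits(value)
    high = 1 << bits[-1]
    # Clearing the highest set bit leaves the small offset.
    print(f"{name}: {value:#010x} bits={bits}, without bit {bits[-1]}: {value ^ high:#x}")
```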
ecx, edx? What does that mean? The hardware has been reliable; I haven't seen any problem so far. Sometimes it gets more than 30 days of uptime. Right now it shows an uptime of 6 days because of a kernel update. The server is under heavy stress almost every day of the week, especially now that it's not running tux. During the day the average load goes from 4 to 60 depending on the hour. Even so, our web app's performance is good, but I had to configure Apache to use 320 httpd processes to handle the load, and even then it reaches the limit sometimes. How can I test for reliability problems? We have had that server for a very long time now and everything works as expected. And yes, I have gzip compression enabled in tux. So I'll turn it off and see whether tux crashes the newest kernel (2.6.7-1.494.2.2smp) again. I haven't tested that kernel with tux yet, so we'll see.
Hi guys, it seems that the latest kernel (2.6.7-1.494.2.2smp) doesn't have the tux kernel module. Because of this tux doesn't start; the initialization script fails on the line where it does a 'modprobe tux'. Every previous kernel had the tux module, except for 2.6.7. Did you forget to ship it? What can I do? I had to suspend today's crash test.
Just for the record, 2.6.8-1.521 doesn't have tux either. What can I do? I need to test.
The load average was about 82 for most of the morning. What happened to tux? Is it unsupported now?
the symbol fixes that made the tux module unavailable should be fixed in the next (today's?) rpm in Arjan's directory.
Arjan's latest is .549 and still doesn't have the tux.ko module.
Created attachment 104014 [details]
Another crash with Arjan's latest kernel (2.6.8-1.579smp)

It crashed again, but this time it was a little bit different. Everything was working really well and suddenly I saw some syslog messages. They were from TUX at 07:28:42 AM. But it didn't crash immediately; it kept working fine for several minutes, almost 20 minutes according to syslog. I used kernel 2.6.8-1.579smp from Arjan. I made the configuration changes that we agreed on: 1) disabled compression and 2) replaced the symlink in the document root with a real directory. I hope this trace helps.
I managed to do more testing in the production environment and tux still crashes. There are several changes now. First of all, our traditional server just died and we are using a brand new and powerful server (IBM xSeries 445) with 2.2 GB of RAM and 4 SMP Hyperthreaded processors, 2.2 GHz each, so linux sees 8 logical processors. The server actually comes with 8 real processors (16 logical) but the extra ones are disconnected right now.

I installed Fedora Core 3 with the latest official updates, including the kernel (2.6.9-1.681_FC3smp). Then I installed the Oracle 10g client, which ran flawlessly on FC3. Apache and PHP are from FC3, and I downloaded the PHP source from www.php.net to compile the oci8 module and add it to the PHP module directory (/usr/lib/php4). Everything works great with Apache alone.

Then I configured tux, and when I started the service it crashed as soon as I hit enter. I then restarted the server (which, BTW, takes about 10 minutes just to get GRUB loaded). TUX started during the boot process, and this time it ran fine for about 8 minutes and then crashed again. Since then I have been running with apache alone.

With the first crash I got a huge syslog; it seems it registered the crash of each thread (8 threads). The second syslog was very small in comparison. I'll attach the two syslogs very shortly. I hope this really helps.
Created attachment 108259 [details]
Crash with kernel 2.6.9-1.681_FC3smp

This is the crash that I got when I started tux manually, as soon as I hit enter.
Created attachment 108261 [details] Second crash with kernel 2.6.9-1.681_FC3smp After 8 minutes of normal work I got this syslog. TUX was started automatically during boot after the first crash.
BTW, this is my tux configuration:

/etc/sysconfig/tux:
# TUXTHREADS=1 <-- By default it starts 8 threads
DOCROOT=/var/www/html/ <-- The same as apache
#LOGFILE=/var/log/tux <-- I'm not logging the requests
DAEMON_UID=apache
DAEMON_GID=apache
MAX_KEEPALIVE_TIMEOUT=15

/etc/sysctl.tux:
# TUX Configuration
#net.tux.serverport=80
net.tux.logging=0
net.tux.redirect_logging=0
net.tux.referer_logging=0
net.tux.compression=0

That's all... I hope.
Created attachment 112484 [details]
First crash with kernel 2.6.10-1.760_FC3smp

Hi again. After a very long time I was able to reproduce this bug on a different system. This time I had to set up a test server for a new web application with an architecture similar to the one originally used to open this bug report; in fact, this is the very same server (we recently switched to an IBM xSeries 445, which shows the same problems with tux). I installed Fedora Core 3 with the latest updates available at that time (March 01 2005 approx.) and configured an Oracle 9i database along with the web servers (tux and apache). Everything was working fine for a few days until tux started to crash.

This time the behaviour is a bit different. All tux threads crashed in less than 2 minutes (it's a 4-way SMP) but the system stayed operational; the only thing that broke was tux. I could ssh into it and tried to kill tux, but I couldn't. Then I reconfigured apache to work without tux, but that didn't work because tux was blocking port 80, although the apache start script said that it had launched successfully. I had to reboot. After that, I left the server with the default configuration (tux + apache) and hoped for the best... It crashed again later. :-( There are some comments within the log.
Created attachment 112488 [details]
Second crash with kernel 2.6.10-1.760_FC3smp

Hello there, this one is a continuation of the previous attachment. It's the second crash, which locked up the machine completely; we had to wait until after the weekend to reboot the server. This time only two threads crashed.
Created attachment 112490 [details]
Crash log with 2.6.10-1.770_FC3 UP kernel

This one is from my workstation. I installed Fedora Core 3 with the latest updates and it crashed on me twice, until I deactivated tux to get some work done on my projects. This is the first time I can reproduce the problem on my PC, but I still can't recognize a pattern. But hey, this means we can use my box for heavy testing of the problem with deep changes like development kernels. I remember Ingo was concerned about my web app using a shared memory based cache to store data. Since this is not a production system and performance is not critical, we can use a file based cache or a fake cache or something like that, just to make sure there isn't any problem with that. I hope this time we can nail this problem down.
Created attachment 113415 [details]
Crash with latest stable kernel 2.6.11-1.14_FC3 UP

Hi Ingo, I got another crash from tux this morning after a reboot (this is my workstation). Looking at the backtrace, I'm not sure whether it is caused by a symlink, which I seriously doubt because I got them out of my document_root many moons ago. The other reason could be the SHMCache mechanism in my web app. For several weeks now I have been running with a NullCache (a dumb object saying 'always go to the database') as you suggested, and in fact it has been more stable than SHM, but today's crash happened while using NullCache.

Please take a look at the backtrace. Is it complete? I guess there are some entries missing. However, I compared it with previous backtraces in the bug report and it looks a bit different.

Another thing: this kernel (2.6.11-1.14_FC3) has locked up on me a couple of times without any information about it in the logs. I can ping the computer and it responds, but I can't connect to it (ssh, telnet, ftp). The only thing I see in the logs at boot time is this:
Apr 20 09:52:55 nalwalovaton kernel: PCI: Found IRQ 10 for device 0000:00:02.0
Apr 20 09:52:55 nalwalovaton kernel: [drm] Initialized i810 1.4.0 20030605 on minor 0:
Apr 20 09:52:55 nalwalovaton kernel: mtrr: base(0x44000000) is not aligned on a size(0x180000) boundary
Apr 20 09:52:55 nalwalovaton kernel: [drm] Using v1.4 init.
It's me again with hot news. I developed a filesystem based cache mechanism to use with my web app instead of the SHM cache mechanism, as Ingo suggested. I tried it for a few days on my workstation and it was a lot more stable; tux used to crash once a day with the SHM cache. Then I tried the filesystem cache in the production environment (apache only) and it worked fine; the performance is comparable.

I scheduled a tux test for Saturday, April 23, because there are fewer users but there is still enough load (about 40% of the usual load). I reconfigured apache to be tux friendly, then I started tux and my web app was up and running. The requests seemed to be flowing normally, but there was one little detail: tux was hogging one of the CPUs, 100% all the time. The server has 4 HT processors. After 20 minutes of work it crashed. I had to reboot.

After the reboot, 15 minutes later (damn xSeries 445!), I felt a bit brave and decided to start tux again. I noticed the very same problem, one of the CPUs at 100%, and about 30 seconds later it crashed again... the fastest crash ever. :-( This time I played it safe and started only apache.

After the reboot ipcs showed clean shared memory, so there is no way the SHM cache could be causing this problem. One little thing I didn't change was "kernel.shmmax = 536870912" in /etc/sysctl.conf. Maybe this is a problem, but during both tests there wasn't any shm usage. I'll attach both backtraces soon.
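For readers trying to picture the cache variants mentioned in this thread, here is a minimal sketch of a NullCache and a filesystem based cache behind the same interface. It is not the reporter's code: the real implementation is in PHP and has not been posted here, so every class, method and parameter name below is an assumption chosen for illustration.

```python
import hashlib
import os
import pickle
import tempfile


class NullCache:
    """Never stores anything, so every lookup falls through to the database."""

    def get(self, key):
        return None

    def set(self, key, value):
        pass


class FileCache:
    """Stores each entry as one pickled file under a cache directory."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so arbitrary strings map to safe file names.
        return os.path.join(self.directory, hashlib.sha1(key.encode()).hexdigest())

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def set(self, key, value):
        # Write to a temporary file, then rename, so readers never see a partial entry.
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f)
        os.replace(tmp, self._path(key))


def cached_query(cache, key, run_query):
    """Return the cached value for key, running the query only on a miss."""
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value)
    return value
```

Swapping NullCache for FileCache (or for a shared memory variant) then only changes the object passed to cached_query, which is what makes this kind of A/B test against tux practical.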
Created attachment 113694 [details] First crash with production server (2.6.10-1.770_FC3smp)
Created attachment 113695 [details] Second crash with production server Ingo, I hope this info helps. At least I made a filesystem based cache ;-)
Hi Ingo, after seven months away I am back with a new backtrace. This time the box is a very old low end server with one processor (Pentium II, 350MHz), 512 MB of RAM and a SCSI storage controller: Adaptec AIC-7880U according to lspci. The OS is an up to date (as of 14 Nov) Fedora Core 3 system, including the latest glibc update and the latest kernel update. After the update the system was prelinked (I don't know if that counts).

The machine was configured for testing and training on the new enterprise web applications that we are developing in our company. Right now it has 25 concurrent users (in the classroom) and I took the opportunity to give tux another test. The performance of tux was great as always, but after an hour it crashed in the very same way as before (I'll attach the log later). Unfortunately we couldn't install FC4 on that machine because it had to be configured exactly like our production server, which is a Fedora Core 3 system, though there are lots of updates pending on it.

After this training session (which ends next Friday), there are going to be training sessions all over the country, so we can use this server for testing patches or something. Since the machine is very low end, tux is going to be a must have. In our production environment we don't need tux yet because the servers are very powerful, but tux is very important for us because it helps us reduce the number of persistent connections to the database server. Thanx again for your support.
Created attachment 121098 [details]
tux crash on a low end server

This is the backtrace from the uniprocessor server. Maybe using old technology and avoiding SMP and HyperThreading will help you understand the problem... I hope.
Hello, guys.
I had a lot of trouble with the 2.6.x kernel and TUX, so I made a patch for TUX on 2.6.12. My TUX module options are below.
----------------------------------------
....
CONFIG_TUX=m

#
# TUX options
#
# CONFIG_TUX_EXTCGI is not set
# CONFIG_TUX_EXTENDED_LOG is not set
# CONFIG_TUX_DEBUG is not set
......
----------------------------------------
Don't use TUX's log or debug messages. I had trouble between TUX and syslogd. The TUX logger and redirection have some trouble, too. I tested on a 1-CPU PC. If you have any trouble, please capture the kernel's messages and send them to me. I attach a patch file and a TUX configuration file.
NOTICE: This patch is experimental. ;)
Created attachment 121179 [details] 2.6.12 patch. Don't use TUX logger and debug messages.
Hello, Mingo. I have been hacking on TUX on linux 2.6.12 and found that some routines end up in a strange state. For example, take the do_send_abuf() function in abuf.c: if tcp_sendpage() returns a value smaller than 0, req->in_file has lost its pointer. I think that once req->in_file has lost its pointer, accessing req->in_file->f_pos makes no sense. I added a check of req->in_file in the source code, and then TUX didn't panic. Is this the correct fix? I don't think so :D I wonder why req->in_file loses its pointer. If I run my patched TUX and kernel, what else could happen? I would like your opinion. Have a nice day~
I managed to reproduce another crash from tux using FC4 on my home machine. Maybe it's a different problem from the one I'm seeing on the production server, but it's still a crash. I am using an almost up to date FC4 system with 512 MB of RAM, an Athlon XP 2000 CPU and kernel 2.6.14-1.1637_FC4, but I had to use kernel 2.6.13-1.1532_FC4 because the other one didn't come with tux (hint: upstream it?). I hope I can reproduce this problem tomorrow with the latest FC4 kernel update.

The thing is that I was trying to reproduce my server (FC3) problem at home (FC4). I configured tux and used only tux for the test; there was no apache or any other user space web server. Then I created a very simple static html file which reloads itself every 2 seconds and left it running for a while. After an hour it was still working fine, so I decided to stop tux (while the browser was still running) and that's when it crashed. Then I rebooted the machine and re-ran the test, but before stopping tux I closed the browser; this time tux didn't crash (I tried that several times). Then I tried to stop tux while the browser was running and it crashed again (very easy to reproduce).

Since my computer uses the proprietary driver from nVidia, I changed the video driver to "vesa generic", rebooted the machine and retried the test to get a backtrace with a non tainted kernel... it still crashed. The test environment was basically a simple gnome desktop plus tux.
My tux configuration is as follows:

/etc/sysctl.tux:
# TUX Configuration
#net.tux.serverport=80
net.tux.logging=0
net.tux.redirect_logging=0
net.tux.referer_logging=0
net.tux.compression=0

/etc/sysconfig/tux:
# TUXTHREADS=1
DOCROOT=/var/www/html/
# LOGFILE=/var/log/tux
DAEMON_UID=apache
DAEMON_GID=apache
# CGIROOT=/var/www/html
MAX_KEEPALIVE_TIMEOUT=10
# TUXMODULES="demo.tux demo2.tux demo3.tux demo4.tux"
# MODULEPATH="/"

/etc/tux.mime.types:
TUX/redirect php

In the last file I "redirected" php requests to the user space web server, but it isn't needed for this test since there are no php files involved. I'll post the html file from the test I made and the backtraces I got. Like I said, the page reloads itself every 2 seconds and the MAX_KEEPALIVE_TIMEOUT parameter is set to 10; maybe that's what is causing the crash, but I'm just guessing. Please, Ingo, try to reproduce this on your machine and let's see what happens.
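For anyone who wants to reproduce this without a desktop browser, the self-reloading page could be approximated by a small client that re-requests the same path every 2 seconds over one connection. This is only a sketch under assumptions: the function name poll, the host, port and path in the usage note are made up, and a real browser may handle keep-alive differently from Python's http.client.

```python
import http.client
import time


def poll(host, port, path, count, interval=2.0):
    """Request path `count` times over one connection; return the status codes."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    statuses = []
    try:
        for i in range(count):
            conn.request("GET", path)
            resp = conn.getresponse()
            resp.read()  # drain the body so the connection can be reused
            statuses.append(resp.status)
            if i + 1 < count:
                time.sleep(interval)
    finally:
        conn.close()
    return statuses
```

A call such as poll("localhost", 80, "/test.html", 1800) would keep requests flowing for about an hour; stopping tux from another terminal while it is still running corresponds to the failing case described above.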
Created attachment 121826 [details] This is the html page I made for the test
Created attachment 121827 [details] First crash with a tainted kernel
Created attachment 121828 [details] Second crash with a tainted kernel, slightly different than the first one, I don't know why
Created attachment 121829 [details] Crash with a NON tainted kernel Please Ingo, try to reproduce this problem, it's very easy on my side.
Created attachment 121849 [details] Same crash on FC3 (2.6.12-1.1381_FC3)
Created attachment 121850 [details]
Same crash on FC3 with no keep alive

I tried the same test with the keep alive option deactivated in TUX, but it still crashed, so the bug seems to be a little more complex than that. I hope you can reproduce this on your side. I should mention that during these crashes the system keeps working fine (desktop wise). TUX is the only thing that locks up.
Created attachment 122530 [details]
Crash with rawhide 21 Dec 2005

I installed a RawHide test box on a VMWare 5.5 machine. It was FC4 at first, but then I updated it with yum to the latest devel tree as of 21 Dec 2005 (kernel 2.6.14-1.1777_FC5.i686). I managed to reproduce the latest crashes I posted on this bug report.
1st test: start and stop tux... [Ok]
2nd test: start tux, open firefox, request page, close firefox, stop tux... [Ok]
3rd test: start tux, open firefox, request page, stop tux... [Failed]
In the third test tux crashed as it did on FC3 and FC4 in the previous comments... see the attachment for the details. Any thoughts on the backtrace?
PS. I'll try this on a real machine if I can get one.
Created attachment 122628 [details]
Simple Patch for tux3-2.6.14-A1

Hello. I tested tux3-2.6.14.2-A1 with the 2.6.14.2 kernel (downloaded from kernel.org). The problem of req->in_file losing its pointer in do_send_abuf() didn't occur. What Ingo said is correct.

My test method is as follows. My TUX box is a Pentium III 800MHz (1 CPU).
1. TUX serves only JPG files (no TUX log, no TUX CGI, no TUX DEBUG).
2. Apache+PHP serves the dynamic PHP page.
3. I wrote a PHP program that sends 10 random JPG files out of 10000 to the user's browser. Some of the files do not actually exist.
4. So the user's request is redirected to Apache, and TUX serves the 10 random JPG files.
5. On another Linux client box, 512 wget processes request pages from the TUX+Apache server at once, and the wget process count is kept near 512.
6. This ran for over 12 hours.

However, William's crash does occur on my system. I tracked the crash down and found that it is the same problem with req->in_file. It is located in the tux_flush_workqueue() function in input.c. I think it is safer to check req->in_file before using it. I attach my simple patch. Could you test it?
1. Download 2.6.14.2 from kernel.org.
2. Apply the tux3-2.6.14.2-A1 patch.
3. Apply my patch.
4. Compile the kernel and test.
Thanks.
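As an illustration of step 3 of the test method above, picking 10 random images out of a pool of 10000 could look roughly like this. The original program is PHP and was not posted, so the function name pick_images and the /img/NNNNN.jpg naming scheme are assumptions made purely for this sketch.

```python
import random


def pick_images(total=10000, count=10, rng=random):
    """Return `count` distinct image paths drawn from a pool of `total` names.

    The /img/NNNNN.jpg layout is hypothetical; as in the test described above,
    some of the names may point at files that do not exist on disk.
    """
    return [f"/img/{n:05d}.jpg" for n in rng.sample(range(total), count)]


if __name__ == "__main__":
    for path in pick_images():
        print(path)
```

Passing a seeded random.Random instance as rng makes a run repeatable, which can help when comparing backtraces between kernels.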
Created attachment 144158 [details] Oops from tux
to #56: today i got this oops. i was playing with s-c-services. i can't reproduce it. ip6tables, selinux and audit are disabled...
oops, i forgot the important part: 2.6.18-1.2868.fc6 (x86_64)
Tux has been removed from Fedora.