From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1)
Description of problem:
mailman version: see above
kernel 2.4.22-1.2135.nptl (or 2.4.22-2129.nptl, doesn't matter)
machine: athlon xp1800+ on epox nforce2 board
disks: 2x scsi on sym53c8xx host adapter,
4x scsi on adaptec aic7xxx host adapter
memory: 1gb ddr-266
This is a mail server for about 400 users. It's running postfix and
dovecot (both via fedora installation).
It's up to date with the latest yum updates.
It is intended to run mailman as mailing-list manager, with about 30
lists (mostly low to very low traffic).
The problem is as follows:
As soon as I start up the mailman service (which starts mailman's
qrunner scripts), the whole machine locks up tcp/ip connectivity.
It isn't just postfix; ssh to the box also hangs and eventually times out.
Local activity (running commands via a local terminal) works OK, as do
remote sessions established before the whole mess goes berserk.
"su" etc. also hang for quite a while, even when executed locally.
Running mailman basically disables all Internet services on the machine.
If I shut down mailman (via "service mailman stop"), everything
instantly returns to normal as soon as the qrunners are killed.
Mailman runs at least one qrunner at 100% cpu. As I understand it,
this also shouldn't be the case, but it shouldn't affect the rest of
the system. Postfix and mailman are properly configured, and aware of each other.
Even if they were misconfigured, there shouldn't be a total _lockup_
of all tcp/ip connectivity, including unrelated services, from/to the
machine, just because one process is running havoc.
I tried to rebuild python from srpm but gave up when I saw that it
required, among other things, mesa gl, tcl/tk etc. I'm sure as hell
not going to install that stuff on the mail server just to rebuild the
scripting language the mailing list manager happens to be written in.
I've also seen the machine randomly drop volumes from the software
RAID (I've got a RAID-1 and a RAID-10 configured). This seems to be
somewhat coincidental with the mailman issue (after, say, half a day of
chugging along with mailman running). Memory, after extensive tests,
is ok. As seems to be the rest of the system.
Memory usage is moderate, at best. About 500MB were free at the time,
and no swap space used. Disk I/O went down as well; only every 5
seconds or so was something written to disk (as opposed to basically
continually when mailman is not running). Context switch/interrupt
activity is normal. Network response time (ping) etc. is normal.
I'm at a total loss. I don't understand how a single process, bugged
as it may be, can cause such havoc. It surely shouldn't lock up
tcp/ip connectivity. I don't know if the loss of disks from the s/w
raid was merely coincidental with that. The disks are almost
new and known to be good. They also work well after reconstructing on
them. Loss of disks seems to be random and not fixed on particular
disks (it only happened twice, anyways). The tcp/ip lockup is
reproducible and happens every time.
The machine is now running fine w/o mailman stuff running.
If I start up mailman, it locks up again.
So I don't run it at the moment.
Anybody else seen that problem?
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. see above.
Actual Results: machine locks up all tcp/ip connectivity, for
extended time (probably infinite, most connections timeout, at least).
Expected Results: no major interruptions.
Off the top of my head I'm at a loss to explain the behavior you're
seeing. For starters, mailman keeps several log files in
/var/log/mailman. One of those files is named "error"; is there
anything in that file which looks suspicious? There is another file
called "qrunner"; anything suspicious in that file? If you're not sure
how to interpret the file contents you can attach them to this bug
report. Which version of postfix are you running? postfix-2.0.11 had a
bug with receiving smtp requests from clients.
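If it helps, that log check can be done mechanically with a short script like the following (a minimal sketch: the paths are the stock Red Hat mailman log locations, and the pattern is only a guess at what counts as "suspicious"):

```python
import os
import re

# Stock Red Hat/Fedora mailman log directory (adjust if yours differs).
LOG_DIR = "/var/log/mailman"

# Rough guess at "suspicious" lines; tune the pattern to taste.
SUSPECT = re.compile(r"traceback|error|fail|refused|timed? ?out", re.IGNORECASE)

def suspicious_lines(text):
    """Return only the lines of a log that match the suspect pattern."""
    return [line for line in text.splitlines() if SUSPECT.search(line)]

def scan_logs(log_dir=LOG_DIR, names=("error", "qrunner")):
    """Scan the named mailman logs and collect suspicious lines per file."""
    hits = {}
    for name in names:
        path = os.path.join(log_dir, name)
        if os.path.exists(path):
            with open(path, errors="replace") as fh:
                hits[name] = suspicious_lines(fh.read())
    return hits

if __name__ == "__main__":
    for name, lines in scan_logs().items():
        for line in lines:
            print("%s: %s" % (name, line))
```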
I have a few more thoughts; the existing log files should probably
be able to answer these, but let me throw them out as suggestions.
Take a look at your /etc/hosts file; by any chance is your hostname
listed on the localhost loopback entry?
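That check can also be sketched mechanically (illustrative only; the sample hosts entries are made up):

```python
def hostname_on_loopback(hosts_text, hostname):
    """Return True if `hostname` appears on a 127.x line of an
    /etc/hosts-style text -- the layout that can make daemons resolve
    the machine's own name to 127.0.0.1 instead of its real address."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        if addr.startswith("127.") and hostname in names:
            return True
    return False

# Hypothetical examples of a bad and a good layout:
BAD = "127.0.0.1 mail.example.org localhost.localdomain localhost\n"
GOOD = ("127.0.0.1 localhost.localdomain localhost\n"
        "192.0.2.10 mail.example.org\n")
```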
mailmanctl, which is what launches the qrunners, takes a "quiet" flag
(-q); this is passed by /etc/init.d/mailman. You could turn that flag
off and get more verbose logging.
When the problem occurs what is the system load like?
Is either the mailmanctl or the qrunners sucking up lots of cpu time?
During a "ps ax" do you see the same processes, or does it look like the
processes are dying and immediately being recreated? I sort of doubt
this is happening, as mailmanctl has a check against this, and the log
files should document it if it's happening.
Did you edit your mm_cfg.py file to define your DEFAULT_URL_HOST and
related settings?
When you do "service mailman start" are there any messages emitted?
Have you by any chance upgraded from a prior 2.0.x mailman to this
newer 2.1.x mailman? If so did you follow the upgrade instructions and
run the upgrade utility?
The problem is reproducible on every hardware setup we tried (two other
real systems and a vmware test environment).
- standard postfix rpm from FC1
- standard mailman rpm from FC1 (2.1.2)
Same behaviour as explained here.
Upgraded with this RPM
so it seems to be solely a mailman 2.1.2 bug, most likely the one
described in the changes file for 2.1.3.
Sorry that I can't supply any more information on the bug; we're just
glad that the 2.1.4 development RPM seems to work as expected now.
But the problem seems to be reproducible.
Perhaps someone could just supply an FC1 update with mailman 2.1.4?
I'll probably have some more time this week and will look to see
whether we still have some of the log files.
Thanks. mailman-2.1.4-1 is already in the queue for FC1 due to a
security errata. Don't worry about the log files or spending any more
cycles on this since the current package seems to fix the problem. I'm
going to close this out. I'm glad it's working for you; I was going to
suggest trying the new package but you were one step ahead of me!
I want to reopen the bug for the following reasons:
Maybe the category ought to be changed.
While mailman seems to be the trigger for this bug, it points to a
rather serious problem in the kernel.
No user process should be able to make the kernel stall all
incoming/outgoing connections (i.e., listen/accept on a socket;
already established connections were not affected).
To my, admittedly rather uninformed, evaluation, this smells like
locks being held for longer than they should be (maybe something is
spinning on a lock and thereby stalls all other socket-related
connection handling in the kernel).
From my original bug report (see above):
"Even if they were misconfigured, there shouldn't be a total _lockup_
of all tcp/ip connectivity, including unrelated services, from/to the
machine, just because one process is running havoc."
A newer mailman version seems to have "fixed" the problem in that it
isn't triggered anymore.
Maybe a simple test case can be established that also exhibits the
same syndrome. I don't currently have the free time to craft such a
test case.
However, the problem was not machine dependent. We have tried it on
two totally different machines, and the problem was the same.
OTOH, I don't know how relevant this is for current fedora
development, in the light of linux 2.6 being incorporated.
Please feel free to close it again. I just wanted to make a point
that, in my opinion, the real problem is not solved.
You have a valid point, I will forward this to the kernel group, but
before I do so I need to understand what was provoking this and what
changed to make it go away in 2.1.4. The only comment I found was in
the 2.1.3 changelog which appears below, can you tell me if this is
what you believe the culprit was? Were you seeing 100% cpu
utilization? If this is the culprit I imagine what was going on was
high utilization of TCP port 25 between mailman and postfix and that
the scheduling of TCP port access was not fair starving other TCP
clients. If the change entry is not what you believe was the root
cause, could you please specify what you believe it is?
- When some or all of a message's recipients have temporary
delivery failures, the message is moved to a "retry" queue.
This queue wakes up occasionally and moves the file back to
the outgoing queue for attempted redelivery. This should
fix most observed OutgoingRunner 100% cpu consumption,
especially for bounces to local recipients when using the
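In other words, the fix amounts to a retry queue that sleeps between scans instead of spinning; a rough sketch of the pattern (not mailman's actual code -- the names and the period are invented):

```python
import time

RETRY_PERIOD = 15 * 60  # seconds between retry scans (illustrative value)

def drain_retry_queue(retry_queue, outgoing_queue):
    """One scan: move every temporarily-failed message back to the
    outgoing queue for another delivery attempt."""
    while retry_queue:
        outgoing_queue.append(retry_queue.pop(0))

def retry_runner(retry_queue, outgoing_queue, scans, sleep=time.sleep):
    """Wake up periodically, requeue failed messages, then yield the CPU.
    Sleeping between scans is what avoids a 100% cpu busy loop."""
    for _ in range(scans):
        drain_retry_queue(retry_queue, outgoing_queue)
        sleep(RETRY_PERIOD)
```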
The symptoms were as follows:
One of the qrunners was utilizing 100% cpu.
This was the only process that was runnable all the time.
Postfix master/smtpd resource usage didn't go up.
Actually, the load average, minus the +1.0 from the respective
qrunner, went down, due to the much decreased postfix load.
I _believe_ that the mailman configuration was OK, as well as the
Postfix configuration. I edited the mm_cfg.py file to set the
parameters as per the mailman docs in /usr/share/doc/mailman*, as well
as the postfix config (for example, changing the value of the
unknown_local_recipient_reject_code variable in postfix's main.cf
from 450 to 550 so that mailman doesn't try again and again but
gets a straight rejection).
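For reference, that main.cf change looks like this (a postfix configuration fragment; 550 turns the endlessly retried temporary failure into an immediate hard bounce):

```
# /etc/postfix/main.cf
# 450 = temporary failure: the sending side keeps retrying.
# 550 = permanent rejection: mailman gets a hard bounce right away.
unknown_local_recipient_reject_code = 550
```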
AFAIK the newer installed mailman rpm (which doesn't trigger the
problem) is using the same configuration (Sascha Lenz just replaced
the older version with the newer one).
The symptoms were immediate: as soon as "service mailman start"
succeeded, all connectivity was stalled. As soon as "service mailman
stop" succeeded, things were back to normal again. There weren't any
messages from the init scripts starting up the qrunners, nor anything
in mailman's logfiles (they were empty). Neither have I seen any
other messages related to this in /var/log/messages or /var/log/maillog.
Another thing: typically the system would write constantly from/to
disks with up to 2MB/s (this can be seen by watching "vmstat 1" for
some while). As soon as the qrunner was chugging along at full cpu
load, the output looked different: disk i/o went down to ca. 50K/s for 4
seconds, then every 5th second (approx.) it wrote out a much larger
chunk (perhaps flushing buffers). This may be because postfix network
connections were stalled and hence there simply wasn't that much
to write for a while, but then why the extra i/o every 5th second? I
cannot quite explain the behaviour. Perhaps things other than socket
i/o were affected by the resource starvation.
Plus: when running mailman in that condition for some time (over half
an hour or so) the linux software raid (see raidtab(5)) would have a
tendency to randomly lose disks. The disks are good, relatively new,
and u2w scsi disks, and it never happened when mailman was not
running. However, it always happened when we left mailman running for
some time (typically from half an hour onwards). Disk loss was random
(we have a raid1 and a raid1+0 with 6 disks total on two different
host adapters, and it lost almost every one of the "real" partitions
over several runs) and reversible -- it could always be undone by
hot-removing and hot-adding the failed disk, at which time it was
reconstructed and continued to work just fine. No more than one
"real" partition was lost at a time (thankfully) although we didn't
really try to push it.. maybe it would have lost more after a while.
Other things like for example "su" would also hang. This might be due
to the PAM libs doing some kind of network i/o (although we only have
local users on the machine). Maybe also some fifos or unix-domain
sockets were stalled.
The problem is not related to one single machine. We have installed
fc1 on a second box (that was our previous mail server, and ran
without any issues for over a year, albeit with a different OS and
without mailman) and could reproduce at least the socket lockup there.
That's all that comes to my mind concerning this. It might not be
very helpful; I would go at it with a debugger and a more planned
approach to provoking failures if I had more time (which I
currently don't). Sorry.
Is that reproducible on more current FC with 2.6 kernel?
Closing per lack of response to previous comment. Note that FC1 and FC2 are no
longer supported even by Fedora Legacy. If this still occurs on FC3 or FC4,
please assign to that version and Fedora Legacy. If it still occurs on FC5 or
FC6, please reopen and assign to the correct version.