112875 – mailman qrunners lock up all tcp/ip connectivity

Summary: mailman qrunners lock up all tcp/ip connectivity

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mailman
Sub Component:
Version:	1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Harald Hoyer
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-01-05 07:01 UTC by Matthias Buelow
Modified:	2007-11-30 22:10 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-10-28 16:40:50 UTC
Type:	---
Embargoed:
Dependent Products:

Description Matthias Buelow 2004-01-05 07:01:42 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1)
Gecko/20031030

Description of problem:
mailman version: see above
kernel 2.4.22-1.2135.nptl  (or 2.4.22-2129.nptl, doesn't matter)
python-2.2.3-7

machine: athlon xp1800+ on epox nforce2 board
disks: 2x scsi on sym53c8xx host adapter,
4x scsi on adaptec aic7xxxx host adapter
memory: 1gb ddr-266

This is a mail server for about 400 users.  It's running postfix and
dovecot (both via fedora installation).
It's up to date to the latest yum update.

It is intended to run mailman as mailing-list manager, with about 30
lists (mostly low to very low traffic).

The problem is as follows:

As soon as I start up the mailman service (which starts mailman's
qrunner scripts), the whole machine locks up tcp/ip connectivity.
It isn't just postfix; ssh to the box also waits and eventually times
out.  Local activity (running commands via a local terminal) works ok,
as does a remote session if you logged in before the whole mess went
berserk.  "su" etc. also hang for quite a while, even when executed
locally.

Running mailman basically disables all Internet services on the machine.

If I shut down mailman (via "service mailman stop"), everything
instantly returns to normal as soon as the qrunners are killed.

Mailman runs at least one qrunner with 100% cpu.  As I understand,
this also shouldn't be the case but shouldn't affect the rest of
the system.

Postfix, and mailman, are properly configured, and aware of each other.

Even if they were misconfigured, there shouldn't be a total _lockup_
of all tcp/ip connectivity, including unrelated services, from/to the
machine, just because one process is running havoc.

I tried to rebuild python from srpm but gave up when I saw that it
required, among other things, mesa gl, tcl/tk etc.  I'm sure as hell
not going to install that stuff on the mail server just to rebuild the
scripting language the mailing list manager happens to be written in.

I've also seen the machine randomly drop volumes from the software
RAID (I've got a RAID-1 and a RAID-10 configured).  This seems to be
somewhat incidental to the mailman issue (after, say, half a day of
chugging along with mailman running).  Memory, after extensive tests,
is ok.  As seems to be the rest of the system.

Memory usage is moderate, at best.  About 500MB were free at the time,
and no swap space used.  Disk I/O went down as well: only every 5
seconds or so was something written to disk (as opposed to basically
continually when mailman is not running).  Context switch/interrupt
activity is normal.  Network response time (ping) etc. is normal.

I'm at a total loss.  I don't understand how a single process, bugged
as it may be, can cause such havoc.  It surely shouldn't lock up
tcp/ip connectivity.  I don't know if the loss of disks from the s/w
raid only coincided with that by accident.  The disks are almost
new and known to be good.  They also work well after reconstructing on
them.  Loss of disks seems to be random and not fixed on particular
disks (it only happened twice, anyways).  The tcp/ip lockup is
reproducible and happens every time.

The machine is now running fine w/o mailman stuff running.
If I start up mailman, it locks up again.
So I don't run it at the moment.

Anybody else seen that problem?


Version-Release number of selected component (if applicable):
mailman-2.1.2-2

How reproducible:
Always

Steps to Reproduce:
1. see above.
    

Actual Results:  machine locks up all tcp/ip connectivity for an
extended time (probably indefinitely; at least, most connections time
out).

Expected Results:  no major interruptions.

Additional info:

Comment 1 John Dennis 2004-01-26 20:21:23 UTC

Off the top of my head I'm at a loss to explain the behavior you're
seeing. For starters, mailman keeps several log files in
/var/log/mailman. One of those files is named "error"; is there
anything in that file which looks suspicious? There is another file
called "qrunner"; anything suspicious in that file? If you're not sure
how to interpret the file contents you can attach them to this bug
report. Which version of postfix are you running? postfix-2.0.11 had a
bug with receiving smtp requests from clients.

Comment 2 John Dennis 2004-01-26 23:29:03 UTC

I have a few more thoughts. The existing log files should probably be
able to answer these, but let me throw them out as suggestions.

Take a look at your /etc/hosts file; by any chance is your hostname
listed on the local loopback entry?
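
The /etc/hosts check above could be scripted roughly as follows. This
is only a sketch: the function name hostname_on_loopback and the sample
host names are made up for illustration, and it treats any 127.x.x.x or
::1 line as a loopback entry.

```python
def hostname_on_loopback(hosts_text, hostname):
    """Return True if hostname appears on a 127.x.x.x or ::1 line."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if address.startswith("127.") or address == "::1":
            # Host names are case-insensitive, so compare in lower case.
            if hostname.lower() in (name.lower() for name in names):
                return True
    return False


# Hypothetical hosts file contents, for illustration only.
sample = """127.0.0.1 localhost.localdomain localhost mailhost
192.0.2.10 mailhost.example.com
"""
print(hostname_on_loopback(sample, "mailhost"))
```

On a real system one would pass in the contents of /etc/hosts and the
output of socket.gethostname() instead of the sample text.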

mailmanctl, which is what launches the qrunners, takes a "quiet" flag
(-q). This is passed by /etc/init.d/mailman; you could turn that flag
off and get more verbose logging.

When the problem occurs what is the system load like?

Are either mailmanctl or the qrunners sucking up lots of cpu time?
Doing a "ps ax", do you see the same processes, or does it look like
the processes are dying and immediately being recreated? I sort of
doubt this is happening, as mailmanctl has a check against this and the
log files should document this if it's happening.

Did you edit your mm_cfg.py file to define your DEFAULT_URL_HOST and
DEFAULT_EMAIL_HOST?

When you do "service mailman start" are there any messages emitted?

Have you by any chance upgraded from a prior 2.0.x mailman to this
newer 2.1.x mailman? If so did you follow the upgrade instructions and
run the upgrade utility?

Comment 3 Sascha Lenz 2004-01-27 18:00:59 UTC

The problem is reproducible on every piece of hardware we tried (two
other real systems and a vmware test environment).

- FC1
- standard postfix rpm from FC1
- standard mailman rpm from FC1 (2.1.2)

Same behaviour as explained here.

Fix: 

Upgraded with this RPM

http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/Fedora/RPMS/mailman-2.1.4-1.i386.rpm

so it seems to be solely a mailman 2.1.2 bug, most likely the one
described in the changes file for 2.1.3.

Sorry that I can't supply any more information on the bug; we're just
glad that the 2.1.4 development RPM seems to work as expected now.

But the problem seems to be reproducible.

Perhaps someone could just supply an FC1 update with mailman 2.1.4?

I will probably have some more time this week and will look whether we
have some of the logfiles left.

Comment 4 John Dennis 2004-01-27 18:11:42 UTC

Thanks. mailman-2.1.4-1 is already in the queue for FC1 due to a
security errata. Don't worry about the log files or spending any more
cycles on this since the current package seems to fix the problem. I'm
going to close this out. I'm glad it's working for you; I was going to
suggest trying the new package but you were one step ahead of me!

Comment 5 Matthias Buelow 2004-01-27 19:01:22 UTC

I want to reopen the bug for the following reasons:

Maybe the category ought to be changed.

While mailman seems to be the trigger for this bug, it points to a
rather serious problem in the kernel.
No user process should make the kernel stall all incoming/outgoing
connections (i.e., listen/accept on a socket; already established
connections were not affected).
In my admittedly rather uninformed evaluation, this smells of
locks being held for longer than they should be (maybe something is
spinning on a lock and thereby stalls all other socket-related
connection stuff in the kernel).

From my original bug report (see above):

"Even if they were misconfigured, there shouldn't be a total _lockup_
of all tcp/ip connectivity, including unrelated services, from/to the
machine, just because one process is running havoc."

A newer mailman version seems to have "fixed" the problem in that it
isn't triggered anymore.
Maybe a simple test case can be established that also exhibits the
same syndrome.  I don't have the free time currently to craft such a
test case.
However, the problem was not machine dependent.  We have tried it on
two totally different machines, and the problem was the same.

OTOH, I don't know how relevant this is for current fedora
development, in the light of linux 2.6 being incorporated.
Please feel free to close it again.  I just wanted to make a point
that, in my opinion, the real problem is not solved.

--mkb
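
A first step toward the test case mentioned above might be a probe that
times new TCP connections while mailman is started and stopped. The
sketch below is an assumption about how one could measure the stall,
not something used in this report; connect_latency is a made-up helper
name, and the ports to probe (e.g. 25 for postfix, 22 for ssh) are up
to the tester.

```python
import socket
import time


def connect_latency(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None.

    None means the connection was refused, timed out, or otherwise
    failed within the given timeout.
    """
    start = time.monotonic()
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections and timeouts alike.
        return None
    conn.close()
    return time.monotonic() - start
```

Calling this once a second against the affected machine from another
host, and logging the result, should show whether new connections
start failing at the moment "service mailman start" succeeds.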

Comment 6 John Dennis 2004-01-27 19:20:55 UTC

You have a valid point, I will forward this to the kernel group, but
before I do so I need to understand what was provoking this and what
changed to make it go away in 2.1.4. The only comment I found was in
the 2.1.3 changelog which appears below, can you tell me if this is
what you believe the culprit was? Were you seeing 100% cpu
utilization? If this is the culprit, I imagine what was going on was
high utilization of TCP port 25 between mailman and postfix, and that
the scheduling of TCP port access was not fair, starving other TCP
clients. If the change entry is not what you believe was the root
cause, could you please specify what you believe it is?

        - When some or all of a message's recipients have temporary
          delivery failures, the message is moved to a "retry" queue.
          This queue wakes up occasionally and moves the file back to
          the outgoing queue for attempted redelivery.  This should
          fix most observed OutgoingRunner 100% cpu consumption,
          especially for bounces to local recipients when using the
          Postfix MTA.
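
The idea in the changelog entry above can be illustrated with a toy
queue. This is not Mailman's code: the class RetryingOutbox, its
method names and the retry_delay default are all invented for the
sketch. It only shows why parking failed messages in a delayed retry
queue avoids the busy loop that immediate requeueing would cause.

```python
import heapq


class RetryingOutbox:
    """Toy outgoing queue with a delayed retry queue (not Mailman's code)."""

    def __init__(self, deliver, retry_delay=900.0):
        self.deliver = deliver          # callable(msg) -> True on success
        self.retry_delay = retry_delay  # seconds to wait; arbitrary default
        self.outgoing = []              # messages ready for delivery now
        self.retry = []                 # heap of (due_time, seq, msg)
        self._seq = 0                   # tie-breaker so msgs are never compared

    def enqueue(self, msg):
        self.outgoing.append(msg)

    def run_once(self, now):
        """Process one pass; return the number of delivery attempts."""
        # Move retries whose delay has expired back to the outgoing queue.
        while self.retry and self.retry[0][0] <= now:
            _, _, msg = heapq.heappop(self.retry)
            self.outgoing.append(msg)
        attempts = 0
        pending, self.outgoing = self.outgoing, []
        for msg in pending:
            attempts += 1
            if not self.deliver(msg):
                # Temporary failure: park the message rather than putting
                # it straight back on the outgoing queue, which would make
                # every pass retry it and keep the runner at 100% cpu.
                self._seq += 1
                due = now + self.retry_delay
                heapq.heappush(self.retry, (due, self._seq, msg))
        return attempts
```

With a deliver function that always fails, a message is attempted once
and then left alone until retry_delay has passed, instead of being
retried on every pass.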

Comment 7 Matthias Buelow 2004-01-27 20:05:10 UTC

The symptoms were as follows:

One of the qrunners was utilizing 100% cpu.
This was the only process that was runnable all the time.
Postfix master/smtpd resource usage didn't go up.
Actually, the load average, minus the +1.0 from the respective
qrunner, went down, due to the much decreased postfix load. 

I _believe_ that the mailman configuration was ok, as well as the
Postfix configuration.  I edited the mm_cfg.py file to set the
parameters as per the mailman docs in /usr/share/doc/mailman*, and
likewise the postfix config (for example, changing the value of the
unknown_local_recipient_reject_code variable in postfix main.cf from
450 to 550 so that mailman doesn't retry again and again but gets a
straight reject).
AFAIK the newer installed mailman rpm (which doesn't trigger the
problem) is using the same configuration (Sascha Lenz just replaced
the older version with the newer one).

The symptoms were immediate: as soon as "service mailman start"
succeeded, all connectivity was stalled.  As soon as "service mailman
stop" succeeded, things were back to normal again.  There were no
messages from the init scripts starting up the qrunners, nor in
mailman's logfiles (they were empty).  Nor have I seen any other
messages related to this in /var/log/messages, /var/log/maillog or
dmesg.

Another thing: typically the system would write constantly from/to
disks at up to 2MB/s (this can be seen by watching "vmstat 1" for a
while).  As soon as the qrunner was chugging along with full cpu load,
the output looked different: disk i/o went down to ca. 50K/s for 4
seconds, then every 5th second (approx.) it wrote out a much larger
chunk (perhaps flushing buffers).  This may be due to postfix network
connections being stalled, so that there simply wasn't that much to
write for a while, but why then the extra i/o every 5th second?  I
cannot quite explain the behaviour.  Perhaps things other than socket
i/o were affected by the resource starvation.

Plus: when running mailman in that condition for some time (over half
an hour or so) the linux software raid (see raidtab(5)) would have a
tendency to randomly lose disks.  The disks are good, relatively new,
and u2w scsi disks, and it never happened when mailman was not
running.  However, it always happened when we left mailman running for
some time (typically from half an hour onwards).  Disk loss was random
(we have a raid1 and a raid1+0 with 6 disks total on two different
host adapters and it lost almost any of the "real" partitions over
several runs) and reversible -- it could always be undone by
hot-removing and hot-adding the failed disk, at which time it was
reconstructed and continued to work just fine.  No more than one
"real" partition was lost at a time (thankfully) although we didn't
really try to push it.. maybe it would have lost more after a while.

Other things like for example "su" would also hang.  This might be due
to the PAM libs doing some kind of network i/o (although we only have
local users on the machine).  Maybe also some fifos or unix-domain
sockets were stalled.

The problem is not related to one single machine.  We have installed
fc1 on a second box (that was our previous mail server, and ran
without any issues for over a year, albeit with a different OS and
without mailman) and could reproduce at least the socket lockup there.

That's all that comes to my mind concerning this.  It might not be
very helpful and I would go at it with a debugger and a more planned
approach at trying to provoke failures, if I had more time (which I
currently don't), sorry.

Comment 8 Daniel Roesen 2005-06-04 22:42:12 UTC

Is that reproducible on more current FC with 2.6 kernel?

Comment 9 John Thacker 2006-10-28 16:40:50 UTC

Closing per lack of response to previous comment.  Note that FC1 and FC2 are no
longer supported even by Fedora Legacy.  If this still occurs on FC3 or FC4,
please assign to that version and Fedora Legacy.  If it still occurs on FC5 or
FC6, please reopen and assign to the correct version.
