Bug 212922 - /sbin/service iptables stop hangs on modprobe -r ipt_state
Summary: /sbin/service iptables stop hangs on modprobe -r ipt_state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 430698
 
Reported: 2006-10-30 04:46 UTC by Richard Andrews
Modified: 2008-12-22 21:14 UTC (History)
6 users

Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:12:05 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
alt-sysrq-t output from a system that has modprobe -r ipt_state hung (196.95 KB, text/plain)
2007-07-12 17:05 UTC, Alex Tang
no flags Details
ps -efwww --forest output from the time of the problem (15.72 KB, text/plain)
2007-07-12 17:06 UTC, Alex Tang
no flags Details
lsof output from the 2007-07-18 occurrence of the problem (804.62 KB, text/plain)
2007-07-19 00:08 UTC, Alex Tang
no flags Details
ps -efwww --forest output from the 2007-07-18 occurrence. (23.44 KB, text/plain)
2007-07-19 00:09 UTC, Alex Tang
no flags Details
alt-sysrq-t output (348.16 KB, text/plain)
2007-07-19 00:21 UTC, Alex Tang
no flags Details
patch to avoid deadlock during sockopt unregister (1.58 KB, patch)
2007-08-06 20:11 UTC, Neil Horman
no flags Details | Diff
new patch (1.58 KB, patch)
2007-08-07 01:07 UTC, Alex Tang
no flags Details | Diff
new patch to prevent deadlock (2.06 KB, patch)
2007-08-23 16:16 UTC, Neil Horman
no flags Details | Diff
screenshot showing boot hang (10.22 KB, image/png)
2007-08-27 17:21 UTC, Alex Tang
no flags Details
new sockopt patch (1.12 KB, patch)
2007-08-27 19:55 UTC, Neil Horman
no flags Details | Diff
new patch (1.62 KB, text/x-patch)
2007-08-28 13:22 UTC, Neil Horman
no flags Details
patch to block module unloads with module refcounting in nf_sockopt (7.21 KB, patch)
2007-08-31 20:10 UTC, Neil Horman
no flags Details | Diff
fixed patch (7.21 KB, patch)
2007-08-31 20:22 UTC, Neil Horman
no flags Details | Diff
new, working patch (8.33 KB, patch)
2007-09-02 17:31 UTC, Neil Horman
no flags Details | Diff
final, working patch (8.77 KB, patch)
2007-09-04 18:29 UTC, Neil Horman
no flags Details | Diff
patch to cause modprobe to not block (2.78 KB, patch)
2007-09-04 20:33 UTC, Neil Horman
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2008:0665 0 normal SHIPPED_LIVE Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7 2008-07-24 16:41:06 UTC

Description Richard Andrews 2006-10-30 04:46:55 UTC
This is a continuation of bug 165744 as the problem persists beyond
kernel-2.6.9-34.EL.

From: <jmorris>  2006-10-29 22:37 EST
> Please open a new bugzilla with full details on how to reproduce 
> the problem you're seeing.


Previous reports in bug 165744 state that the problem was fixed in
http://rhn.redhat.com/errata/RHSA-2006-0132.html 
(errata to RHEL4.2) 
however, I have the problem on a system which was installed as RHEL4.3 - so the
kernel is the fixed version.

I have marked this as a kernel fault because it was previously marked as such.  I
don't know if it is really the kernel or the module utilities.

How to reproduce:
/sbin/service iptables restart

Additional info:
I have more than 4 identical boxes (hardware and software) and only one exhibits
this behaviour. On the machine which exhibits the problem it always occurs. On
the other machines it has never been observed.
I cannot experiment much with the faulty system as it went into production in a
remote data centre before the fault was noticed.  If we don't unload the module
then it continues to operate, so it is being left in production.

Same symptoms as described in bug 165744 : removing the ipt_state module gets
stuck and uses all of one CPU in a busy loop. Only way out is a reboot.

Hardware:
Intel TIGI2U : dual Xeon 3.2GHz

Additional software of interest:
Additional software is user-space only - no changes to kernel or kernel modules.

Some proprietary TCP and UDP software.
openvpn uses packet socket.
openSWAN


Here's some dmesg for hardware ID:

<snip>...
Initializing CPU#0
CPU 0 irqstacks, hard=c03ea000 soft=c03ca000
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 3193.920 MHz processor.
Using tsc for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 2073580k/2097024k available (1868k kernel code, 22448k reserved, 755k
data, 180k init, 1179520k highmem)
Calibrating delay using timer specific routine.. 6435.76 BogoMIPS (lpj=3217880)
Security Scaffold v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: After generic identify, caps: bfebfbff 20100000 00000000 00000000
CPU: After vendor identify, caps:  bfebfbff 20100000 00000000 00000000
monitor/mwait feature present.
using mwait in idle threads.
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU0: Initial APIC ID: 0, Physical Processor ID: 0
CPU: After all inits, caps:        bfebfbff 20100000 00000000 00000080
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
CPU0: Thermal monitoring enabled
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: Intel(R) Xeon(TM) CPU 3.20GHz stepping 0a
per-CPU timeslice cutoff: 2009.79 usecs.
task migration cache decay timeout: 3 msecs.
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c03eb000 soft=c03cb000
Initializing CPU#1
Calibrating delay using timer specific routine.. 6383.12 BogoMIPS (lpj=3191562)
CPU: After generic identify, caps: bfebfbff 20100000 00000000 00000000
CPU: After vendor identify, caps:  bfebfbff 20100000 00000000 00000000
monitor/mwait feature present.
CPU: Trace cache: 12K uops, L1 D cache: 16K
... <snip>

Comment 1 Neil Horman 2006-11-07 21:39:24 UTC
Can you provide a sysrq-t from the system when it hangs please?

Comment 2 Richard Andrews 2006-11-07 22:09:29 UTC
I assume you mean alt-SysRq-T key combo.

This is difficult. :(

As I stated the machine is in production in a remote data centre and if we don't
try to unload ipt_state it is running fine. Hence it will be left alone.

Getting to the machine physically is difficult, so console operation is problematic.

I'll report back when someone gets a chance to do this.  It'll probably be a week or more.


Comment 3 Neil Horman 2007-02-09 13:01:52 UTC
Not sure why this got set back to waiting on me; we're still waiting for the
sysrq-t output.

Comment 6 Alex Tang 2007-07-12 17:05:56 UTC
Created attachment 159071 [details]
alt-sysrq-t output from a system that has modprobe -r ipt_state hung

We have systems running 2.6.9-55.ELsmp and iptables-1.2.11-3.1.RHEL4.  We're
intermittently seeing a similar (maybe the same) "modprobe -r ipt_state" hang.

When the problem occurs, all iptables functionality is lost. 

One of the earlier comments said that they were seeing 100% CPU load on one
CPU.  We are not seeing the CPU load spike when this condition occurs.

kill -9 on the modprobe has no effect whatsoever.   The machine is otherwise
functional, albeit without any firewall.  We must reboot the system in order to
re-enable the firewall (and on at least one occasion, shutdown hung while
trying to shut down iptables, go figure).

We've only started seeing this problem since moving our product from RHEL3 to
RHEL4 (makes sense), and we've probably seen this problem occur 7-10 times in
the last 4 months or so.

I have no idea how to reproduce the problem.  

I am including the alt-sysrq-t output from the latest system that had this
problem.  I will shortly be attaching a "ps -ef --forest" output from the
system at the time of the problem.

The first modprobe to hang was PID 434.  We did not realize that this problem
was occurring, so other iptables restarts were attempted and they all hung
(thus the many other modprobes that are hung too).

Comment 7 Alex Tang 2007-07-12 17:06:46 UTC
Created attachment 159072 [details]
ps -efwww --forest output from the time of the problem

Comment 8 John G. Myers 2007-07-13 16:43:21 UTC
Requested info has been provided.

Comment 9 Neil Horman 2007-07-16 20:18:55 UTC
Looking at the sysrq-t output, it appears that the kernel has made a callout to
a userspace helper (modprobe) in an effort to load some iptables rules, and the
modprobe is stuck trying to gain a lock on a file (not sure which file yet).
I'll dig into this in the AM to see if I can find which file is blocking
modprobe.  If we can get past that, I expect that we can move modprobe forward
and unblock all of the iptables backlog.

Comment 10 Neil Horman 2007-07-17 17:39:56 UTC
OK, it definitely looks like we have a deadlock here between the iptables task
that created a need for a new module to be loaded and the modprobe task that
tried to carry that load out.  We have 7 modprobe processes all blocked in
__posix_lock_file, except for one, which is running and doing something else, but
apparently not completing its job.  My guess is the iptables tasks are waiting
on the modprobes to complete, the modprobes are waiting to get the lock on the
module they are trying to load, and the one modprobe that isn't waiting on
posix_lock_file is waiting on a lock that the iptables task is holding in the
kernel.

The good news is that you should be able to work around this problem fairly
easily.  If you can just take the set of iptables modules that you normally use
and load them manually (say, in rc.local on bootup using modprobe <module
name>), we should break the deadlock loop and everything should work.  That
should get you moving for now until we figure this out.
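For illustration, the preload step suggested above could look like the following rc.local fragment.  This is only a sketch: the module names below are examples, not taken from the affected system, and should be adjusted to whatever the ruleset actually uses.

```shell
# /etc/rc.local fragment (sketch): preload the netfilter modules the
# firewall needs so a later demand-load never has to race a module removal.
# The module list here is illustrative.
for mod in ip_tables iptable_filter iptable_nat ip_conntrack ipt_state; do
    /sbin/modprobe "$mod"
done
```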

The bad news is that the sysrq didn't seem to capture some pieces of the puzzle.
 This line:
Jul  6 21:20:48 host2 kernel: modprobe      R running  2980   434    364       
             (NOTLB)
represents the one modprobe process that isn't stuck on posix_lock_file in the
sysrq-t.  This means that we don't know what it's doing (beyond running something
in user space).  To solve this, you will need to recreate the problem, use
sysrq-t to find the modprobe process that is in this running state, and then run
strace or gdb on it so we have some clue as to where it's running.  If you can
provide that data, that would be great.

Also, I note this bug was opened against 4.3; I assume you have confirmed that
this bug still happens on 4.6?

Comment 11 Alex Tang 2007-07-17 17:51:25 UTC
Thanks Neil, i was the person who put in the alt-sysrq-t output from one of our
machines.

Some further information that may help...

The modprobe with pid 434 (the one that sysrq didn't capture any information),
was the first one to hang.  When this problem occurred, doing an strace on the
process showed nothing (literally, nothing...no output after strace says
"attaching...").  I did not try gdb however.

The second thing is that this is not a consistent problem by far.  Calling these
modprobes (the standard way we do it is by calling "/etc/rc.d/init.d/iptables
restart") works almost all of the time.  As I said in comment #6,

  We've only started seeing this problem since moving our product from RHEL3 
  to RHEL4 (makes sense), and we've probably seen this problem occur 7-10 
  times in the last 4 months or so.

This is on a couple dozen dev boxes plus some production boxes, so it works out
to this problem occurring roughly 1-3% of the time.

Also, when the problem does occur, it doesn't happen on boot, it happens when
doing an iptables restart during the normal course of running a server.

I will attempt to get lsof and gdb output when we see this happen again.

If you have any further info, that'd be great.

Comment 12 Neil Horman 2007-07-17 18:24:38 UTC
"The modprobe with pid 434 (the one that sysrq didn't capture any information),
was the first one to hang.  When this problem occurred, doing an strace on the
process showed nothing (literally, nothing...no output after strace says
"attaching...")."

Given that we don't have a stack trace from sysrq-t either, that just tells us
that modprobe is doing something entirely in userspace (not making any system
calls).  You'll have to attach gdb to it to get that backtrace, or alternatively
you can use the pstack utility.


Given the information we have, I can (and will) continue to look at this, but
it's really a tough exercise without more info.  I might find the problem while
reviewing, but more data will help me narrow the search.  Get it to me when you
are able.

It is good to know that that modprobe was the first to hang.  That does further
confirm that it is holding the posix file lock on the module in question, and is
just blocked on something else that we can't see yet.

And don't forget, if this becomes too painful to deal with regularly, you can use
that workaround I mentioned above to avoid the problem until we get it sorted
properly.  Thanks!


Comment 13 Alex Tang 2007-07-19 00:06:43 UTC
This happened again on another machine today.  I am attaching lsof and "ps
-efwww --forest" output.  

Similar (but not identical) conditions occurred: while doing a
"/etc/rc.d/init.d/iptables restart" as part of running our product, "modprobe -r
iptable_nat" hung.

Other modprobes are similarly hung (presumably because the first one still holds
the lock).

I am not able to provide alt-sysrq-t output at this time (I'm waiting to see if
we can get access to the console).

I can (and will) provide lsof and "ps -efwww --forest" output as attachments.

The pid of the original hung "modprobe -r iptable_nat" is 11565.

Just like the previous time this occurred, doing a "gdb" or "pstack" on the
pid also hung.  I was not able to break out of either with a "Ctrl-C"; I had to
suspend and then "kill %1" the gdb or pstack process.

  # gdb /sbin/modprobe -p 11565
  GNU gdb 6.6
  Copyright (C) 2006 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "i686-pc-linux-gnu"...
  (no debugging symbols found)
  Using host libthread_db library "/lib/tls/libthread_db.so.1".
  Attaching to program: /sbin/modprobe, process 11565
      (nothing...)
      (Ctrl-C yields nothing)
      (Ctrl-Z gets me back to a shell...)
  [1]+  Stopped                 gdb /sbin/modprobe -p 11565
  # kill %1
  # Detaching from program: /sbin/modprobe, process 11565
  Quitting: ptrace: No such process.

When doing pstack, I get:

  # pstack 11565
      (nothing...)
      (Ctrl-C, nothing)
      (Ctrl-Z, back to shell)
  # kill %1
  [1]+  Terminated              pstack 11565

Doing "strace" on the process was similar, but I couldn't even "kill" it; I had
to "kill -9" the strace process:

  # strace -p 11565
  Process 11565 attached - interrupt to quit
      (Nothing...)
      (Ctrl-C, nothing)
      (Ctrl-Z, back to shell)

  [1]+  Stopped                 strace -p 11565
  # ps -ef | grep strace
  root     30480 18014  0 16:52 pts/1    00:00:00 strace -p 11565
  root     30495 18014  0 16:52 pts/1    00:00:00 grep strace
  # kill %1
  # !ps
  ps -ef | grep strace
  root     30480 18014  0 16:52 pts/1    00:00:00 strace -p 11565
  root     30531 18014  0 16:52 pts/1    00:00:00 grep strace
  # kill -9 %1
  [1]+  Killed                  strace -p 11565
  # !ps
  ps -ef | grep strace
  root     30542 18014  0 16:52 pts/1    00:00:00 grep strace

lsmod shows:

Module                  Size  Used by
iptable_filter          6977  1 
pcspkr                  7825  0 
mptctl                 66629  0 
mptbase                63393  1 mptctl
sg                     38369  0 
ipmi_devintf           13385  2 
ipmi_si                37449  1 
ipmi_msghandler        32041  2 ipmi_devintf,ipmi_si
dell_rbu               10269  0 
md5                     8129  1 
ipv6                  243425  32 
ip_conntrack           46085  0 
ip_tables              22721  1 iptable_filter
dagmem                 51392  0 
dm_mod                 65001  0 
button                 10705  0 
battery                12997  0 
ac                      8901  0 
uhci_hcd               33241  0 
ehci_hcd               32325  0 
e752x_edac             14917  0 
edac_mc                20185  1 e752x_edac
hw_random               9685  0 
i8xx_tco               11229  0 
e1000                 119633  0 
bonding                65896  0 
ata_piix               19141  0 
libata                106141  1 ata_piix
floppy                 58193  0 
ext3                  119113  3 
jbd                    59609  1 ext3
megaraid_mbox          37329  4 
megaraid_mm            17777  3 megaraid_mbox
sd_mod                 20545  5 
scsi_mod              119757  4 sg,libata,megaraid_mbox,sd_mod

kernel-2.6.9-55.ELsmp
iptables-1.2.11-3.1.RHEL4

Another piece of info: last time, I said that the hung process was not using
100% CPU.  This time, it is indeed using 100% CPU.  This makes me think that I
may not have been right in my previous assertion that it was not using 100% CPU.

Comment 14 Alex Tang 2007-07-19 00:08:32 UTC
Created attachment 159567 [details]
lsof output from the 2007-07-18 occurrence of the problem

Comment 15 Alex Tang 2007-07-19 00:09:27 UTC
Created attachment 159568 [details]
ps -efwww --forest output from the 2007-07-18 occurrence.

Comment 16 Alex Tang 2007-07-19 00:16:22 UTC
A coworker suggested sending SIGQUIT (kill -3) to the process to see if we could
get it to core dump.  This didn't produce any core dump we could find.

I'm still trying to get alt-sysrq-t info, but it may not happen until tomorrow.


Also, in comment #12, Neil mentioned:

  And don't forget, if this becomes too painful to deal with regularly, you can
  use that workaround I mentioned above to avoid the problem until we get it
  sorted properly.  Thanks!

the workaround was specified in comment #10:

  The good news is that you should be able to work around this problem fairly
  easily.  If you can just take the set of iptables modules that you normally
  use and load them manually (say, in rc.local on bootup using modprobe
  <module name>), we should break the deadlock loop and everything should work.
  That should get you moving for now until we figure this out.

I'm confused, and not sure how this applies.

The firewall is running properly, so I'm assuming the modules are loaded
already.  This problem occurs when we're trying to restart iptables.  How does
loading the iptables modules at rc.local time help?

Thanks again.

Comment 17 Alex Tang 2007-07-19 00:21:18 UTC
Created attachment 159571 [details]
alt-sysrq-t output

I spoke too soon.  We were able to get alt-sysrq-t output.  Here it is.

If you need anything else from the machine, let me know asap.  We have to
reboot soon.

Comment 18 Neil Horman 2007-07-19 11:10:02 UTC
Thank you for the output.  Unfortunately, it's the same as before: the sysrq-t
output provides no stack trace for the modprobe operation that's running, which
is most curious.  That would suggest that it's running in the kernel and unable
to process the SIGSTOP/SIGCONT that pstack/gdb sends to it, but the lack of a
backtrace suggests that it is running in userspace, where it _should_ be able to
respond to SIGSTOP/CONT signals.  Do you have netdump set up on these machines?
If you need to reboot after this happens, I'm starting to think that the best
way to move forward here is to have you issue a sysrq-c to crash the system
when it's hung, collect a vmcore via netdump, and send that in for me to look at.

Regarding my workaround, you bring up a good point.  When you restart iptables,
the modules are all unloaded on the stop operation, and then demand-loaded on
the start operation, which is why preloading them will help.  Unfortunately,
rc.local is the wrong place to preload them.  What you should do is add a set of
modprobe operations directly at the top of the start function of
/etc/init.d/iptables to get this workaround to work properly.  Thank you for
pointing that out.
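Concretely, the suggestion amounts to something like the following.  This is a hypothetical sketch: the start() layout of the RHEL4 initscript is assumed rather than copied, and the module list is illustrative.

```shell
# /etc/init.d/iptables excerpt (sketch): preload modules at the top of
# start() so a restart never hits the demand-load path that deadlocks.
start() {
    # Module names are illustrative; list the ones your ruleset uses.
    for mod in ip_tables iptable_filter iptable_nat ip_conntrack ipt_state; do
        /sbin/modprobe "$mod" 2>/dev/null
    done
    # ... original start() body continues unchanged ...
}
```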

Let me know if you can capture a vmcore from these systems, hopefully that will
tell us what we need.  Also, since it wasn't answered previously, I'll ask
again: Are you able to confirm that these problems persist on 4.6?  There are
several updates to the ipt_recent adn ipt_conntrack module sets.  Nothing
conclusive, but if you have a test system, it would be good just to ensure that
we're not chasing a problem that is already fixed.

Also, I note you are running a module called dagmem, which I'm unfamiliar with,
can you tell me a bit more about what that module does?

Comment 19 Alex Tang 2007-07-25 17:10:39 UTC
Hi there.  Thanks for your help.

I am still trying to get this to happen on a machine where we have more
convenient console access.  Unfortunately, the two times this has happened in
the last week have not been on machines where we had console access at all, so I
had to reboot the machine and keep going.

The next time this occurs on a machine I have better access to, I'll try to get
netdump on it and get you the information requested.

As for your other questions: 

  * i'm not sure what you mean by "4.6"? though, I'd be happy to test on 
    whatever new anything you'd like me to.
  * The machines in question may have a tcp capture card made by endace 
    (endace.com) on them.  the dagmem module is one of their drivers i believe.

Thanks again.
  


Comment 20 Alex Tang 2007-08-03 23:49:01 UTC
Hello again.

I finally had the problem occur on a machine that we have physical access to,
with another machine on the same subnet where we were able to run
netdump-server.

The pid of the first hung modprobe (which happened during a
"/etc/rc.d/init.d/iptables restart") is 19017 (ppid is 18856). There were a
bunch more that stacked up on the machine thereafter.

The log and the vmcore (bzipped) can be found at
http://gallery.funkware.com/modprobe-crash/

Sorry, I forgot to click the "I am providing the requested info" box for my last
response, so there's some other information and a question in there for you too.

Please let me know if i can provide you with further information.

Thanks.

Comment 21 Neil Horman 2007-08-03 23:58:52 UTC
I've got the log, but the vmcore.bz2 is giving me a 403: Forbidden.  Can you fix
that up so I can download it?  Thanks!

Comment 22 Alex Tang 2007-08-04 00:39:56 UTC
d'oh!  perms changed with bzip.  fixed now. Thx.

Comment 23 Neil Horman 2007-08-04 00:43:05 UTC
Also, I noted something from the logs:
2.6.9-55.0.2.EL.P1smp
Was this a custom built kernel?  Is this the same as the kernels you've run
previously?  I'm not going to be able to debug this vmcore if you don't have the
debuginfo packages for this kernel to provide to me.  And if this is the same
kernel as what you were running previously, can you confirm that this problem
happens on a stock RHEL kernel please?  Thanks!

Comment 24 Alex Tang 2007-08-04 07:17:04 UTC
Sorry about that, I should have provided that info.

The kernel src and all other RPMs are at the same URL.  The only difference
between the 2.6.9-55.0.2.EL.P1 and the 2.6.9-55.0.2.EL is an attempted patch for
ip_conntrack losing entries
(https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=205966, using the patch
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=148203).  This patch
didn't solve that issue, but I digress.

This is the first time we've seen this problem on this kernel.  The other two
that I reported earlier in the bug comments (as well as one or two more) were on
stock 2.6.9-55 and 2.6.9-55.0.2 kernels.

Thanks.

Comment 25 Neil Horman 2007-08-06 19:22:45 UTC
well, the core is showing me what the problem (at least in this particular
deadlock) is:

Basically what happens is this:

1) process A (modprobe) runs to remove ip_tables.ko

2) process B (iptables-restore) runs and calls setsockopt on a netfilter socket,
increasing the ip_tables socket_ops use count

3) process A acquires a file lock on the file ip_tables.ko, calls remove_module
in the kernel, which in turn executes the ip_tables module cleanup routine,
which calls nf_unregister_sockopt

4) nf_unregister_sockopt, seeing that the use count is non-zero, puts the
calling process into uninterruptible sleep, expecting the process using the
socket option code to wake it up when it exits the kernel

5) the user of the socket option code (process B), in do_ipt_get_ctl, calls
ipt_find_table_lock, which in this case calls request_module to load
ip_tables_nat.ko

6) request_module forks a copy of modprobe (process C) to load the module and
blocks until modprobe exits.

7) Process C, forked by request_module, processes the dependencies of
ip_tables_nat.ko, of which ip_tables.ko is one.

8) Process C attempts to lock the requested module and all its dependencies; it
blocks when it attempts to lock ip_tables.ko (which was previously locked in
step 3)


This is messy.  I'm working on a patch.
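The contention at the heart of this cycle can be illustrated in miniature with flock(1) standing in for the kernel's per-module file lock.  This is a toy sketch, not the bug itself: every file name is hypothetical, and the second locker uses -n so the sketch terminates instead of deadlocking.

```shell
# Toy illustration of the lock contention described above.
lockfile=$(mktemp)

# "Process A": grabs the module file lock and sits in its cleanup routine.
(
    flock 9          # acquire an exclusive lock on fd 9
    sleep 2          # pretend to be stuck in nf_unregister_sockopt
) 9>"$lockfile" &

sleep 0.2            # give A time to take the lock

# "Process C": the modprobe forked by request_module.  It needs the same
# lock; in the real bug it blocks forever.  -n makes the attempt
# non-blocking so this sketch can report the conflict and exit.
if ! flock -n "$lockfile" true 2>/dev/null; then
    echo "modprobe would block: module file lock already held"
fi
wait
rm -f "$lockfile"
```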

Comment 26 Neil Horman 2007-08-06 20:11:36 UTC
Created attachment 160770 [details]
patch to avoid deadlock during sockopt unregister

It's not a fantastic patch, but it should get the job done.  It protects the
whole of nf_sockopt with a read/write lock.  It should allow for parallel
execution of socket options, only serializing on the write lock when we
unregister the socket options for a given module.  I've not tested it yet, but I
wanted you to have it.  If you could build it into your kernel and test it while
I find a system to test on here, that would be great.  Let me know how it goes.


I also think this problem may be upstream, so once you give me a thumbs up,
I'll take this to lkml and get it posted internally.  Thanks!

Comment 27 Alex Tang 2007-08-07 01:07:44 UTC
Created attachment 160787 [details]
new patch

Thanks Neil.

The patch as posted didn't compile properly.  There were complaints about
unresolved symbols with "use_lock".  Not being too familiar with the code, I
changed those two references to nf_use_lock.  It compiled, but I'm looking for
guidance on whether this was the correct thing to do.

I'm attaching my updated version of the patch.

Comment 28 Alex Tang 2007-08-07 01:16:51 UTC
Also, I have taken the new kernel with the updated patch and am testing it.
I'll let you know the status as soon as I can.

Comment 29 Neil Horman 2007-08-07 10:52:22 UTC
Yes, that change was exactly the right thing to do.  Sorry about that.  I'm
building it here myself, but as I've not reproduced the problem previously, I'd
appreciate it if you could hammer this patch as hard as possible.  Let me know
when you're comfortable with it.


Comment 30 RHEL Program Management 2007-08-07 10:54:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 31 Alex Tang 2007-08-23 05:43:20 UTC
Things looked pretty good for a while.  But unfortunately, we had the same
iptables/modprobe hang today.  There was one difference today in that there were
two modprobe processes hung at 99% CPU in that unkillable state.  There were
other modprobes stacked up just like the other times this has happened to us,
but they were merely waiting for a lock and could be killed normally.

We tried getting netdump installed on this server, but had problems with the way
the netdump client tries to look up MAC addresses when you have NIC bonding
turned on, and it hosed the machine before we could get the dump.

I will try to get a dump if/when we see it again on a machine we can get all of
this netdump stuff setup on.

Comment 32 Alex Tang 2007-08-23 08:15:09 UTC
As an aside, a colleague mentioned that in the /etc/sysconfig/iptables-config
file, there is a parameter IPTABLES_MODULES_UNLOAD, which defaults to yes.  

The comments for that parameter say:

  # Unload modules on restart and stop
  # Value: yes|no, default: yes
  # This option has to be 'yes' to get to a sane state for a firewall
  # restart or stop. Only set to 'no' if there are problems unloading netfilter
  # modules.
  IPTABLES_MODULES_UNLOAD="yes"

Although I'd like to work towards a proper resolution to this problem, I was
wondering whether you would recommend using this flag (setting it to "no")?

Thanks.

Comment 33 Neil Horman 2007-08-23 11:38:29 UTC
That option is there so that when you restart iptables you don't have to take
the extra time to unload and reload your modules.  It means the modules don't
get re-initialized, but that shouldn't be a problem, so in your case I'd say it
would be fine to change that option to "no".
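Applying that amounts to a one-line edit of /etc/sysconfig/iptables-config (the path given in comment 32).  The sketch below performs the edit on a scratch copy rather than the live file, just to show the substitution:

```shell
# Sketch: flip IPTABLES_MODULES_UNLOAD to "no".  Demonstrated on a temp
# copy; on a real system the target would be /etc/sysconfig/iptables-config.
cfg=$(mktemp)
printf 'IPTABLES_MODULES_UNLOAD="yes"\n' > "$cfg"
sed -i 's/^IPTABLES_MODULES_UNLOAD=.*/IPTABLES_MODULES_UNLOAD="no"/' "$cfg"
grep IPTABLES_MODULES_UNLOAD "$cfg"
rm -f "$cfg"
```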

As for your problem, I think it may have something to do with the fact that I
converted the code to use read/write spinlocks rather than semaphores.  Let me
update the patch for you and see if I can correct that.

Comment 34 Neil Horman 2007-08-23 16:16:59 UTC
Created attachment 169395 [details]
new patch to prevent deadlock

Here's a new patch to prevent the deadlock.  It works the same as before, but
uses a semaphore instead of spinlocks.  That should correct the 100% CPU
pegging.  It also has some extra code to bughalt if the remove operation decides
to go into uninterruptible sleep.  If we bughalt that way, I've done something
wrong; otherwise, we are looking at a new bug.  Please give it a test and see
how it does.  Thanks!

Comment 35 Alex Tang 2007-08-23 16:57:42 UTC
Thanks for the patch.  I am building and will start testing immediately.

Pardon my confusion, but for clarification, if we use
IPTABLES_MODULES_UNLOAD="no", will that circumvent the problem we're seeing in
this bug?  

Thanks again.

Comment 36 Neil Horman 2007-08-23 17:31:00 UTC
Yes, you have to leave that option set to "yes" to properly test this patch.

Comment 37 Alex Tang 2007-08-23 22:51:15 UTC
Thanks Neil.  I have some good news.  We have been able to produce a simple pair
of test scripts that can (relatively) reliably reproduce the problem on the
previous kernel (we have not tried the latest patch yet).

If we set up two scripts, one restarting iptables ("/etc/rc.d/init.d/iptables
restart") in a tight loop while the other runs an iptables status
("/etc/rc.d/init.d/iptables status") in a tight loop, the problem will happen
eventually.  I've reproduced this in house on three machines (different
hardware).  The fastest machine, a 2x dual-core Xeon with no load, is quick
(25-100 iterations, <30 seconds).  The most loaded machine (single P4, high
load) takes a couple of hours.

The scripts are: 
  #!/bin/bash
  i=0
  while [ -z "" ]; do
  echo "Restarting iptables $i"
  /etc/init.d/iptables restart
  i=$(($i+1))
  done

and 

  #!/bin/bash
  i=0
  while [ -z "" ]; do
  echo "Getting iptables status $i"
  /etc/init.d/iptables status
  i=$(($i+1))
  done

That being said, I do have some followup questions about the
IPTABLES_MODULES_UNLOAD parameter.

In particular, are there any drawbacks to using IPTABLES_MODULES_UNLOAD="no"?
One drawback I can see is that when shutting down iptables, the modules are
not unloaded, but in our case we never run without iptables.  Are there other
things we should consider?

If we can use that flag without unintended side effects, we may choose to do
so, and the priority of this particular bug becomes less severe for us.
Besides, setting IPTABLES_MODULES_UNLOAD="no" makes restarting iptables go much
faster.
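For reference, the workaround being discussed is a one-line change in the
iptables service's sysconfig file (path as on RHEL4; the default is "yes"):

```shell
# /etc/sysconfig/iptables-config
# Keep netfilter modules loaded when the service stops or restarts,
# so "service iptables restart" never issues "modprobe -r".
IPTABLES_MODULES_UNLOAD="no"
```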

Thanks again.  I'll post more info about the new kernel as soon as i can.

Comment 38 Neil Horman 2007-08-24 00:03:49 UTC
"I have some good news...."
That is good, I'll try to dig up a machine to recreate here

"In particualr, are there any drawbacks to using IPTABLES_MODULES_UNLOAD="no"?"
Yes, a few plusses and minuses.  You've found a few already.  Since the modules
stay loaded load times are faster, but at the same time if you permanently stop
the service the modules stay loaded, eating up memory.  There is also the chance
of you uncovering latent bugs that result from the modules not getting
re-initializd when you restart the service.  Not overly likely, but possible. 
Either way we will want to track down this bug to make sure there is one less
way for this to fail.  I understand how the workaround will make this less
important for you, but hopefully I'll be able to recreate here and move forward
on my own.  Please test the patch in comment 34 for me please, though, untill I
can confirm that I can reproduce.

Thanks again, and let me know what you find!

Comment 39 Alex Tang 2007-08-24 00:35:58 UTC
Thanks again Neil.  Even if we go with the workaround, I will continue to
provide whatever info I can.  

For now however, I have built a new patched kernel, but it's not booting
properly.  I'm going to see if I screwed up my build, but I probably won't have
any information until at least tomorrow.

Thanks again.


Comment 40 Neil Horman 2007-08-24 11:46:39 UTC
No problem, let me know how it goes.  Also, if you would, please provide me
with your saved iptables rule set.  I'm guessing I'm going to need it if the
above scripts are going to reproduce the same behavior that you see at your
site.  Thanks!

Comment 41 Alex Tang 2007-08-27 16:39:16 UTC
A bit more info.

First, unfortunately, the second patch as supplied doesn't work.  The
/etc/rc.d/init.d/iptables script cannot start iptables upon boot.  On the
console, it stops when it says "Applying iptables firewall rules". 
ctrl-alt-delete will reboot the machine however.  I could probably boot into
single user to see more about what's going on, but i haven't had time yet.

Secondly, the test (provided in comment 37) that reproduces the problem causes
the modprobe hang when the iptables restart command issues the
"modprobe -r iptable_filter" command.  The problem as originally described, as
well as the one I reported in comment #6, was a hung "modprobe -r ipt_nat".

Does the difference in module make a difference in terms of this bug?  The rest
of the symptoms are identical.

Also, when I use the scripts to reproduce the problem, the "iptables status"
command is stuck at:

  iptables -t filter --list -n

and there is a

  modprobe -q -- iptable_filter

which seems to be the "other half" of the deadlock.

Looking at the "ps -ef --forest" output, i see: 

root         6     1  0 Aug06 ?        00:00:03 [events/0]
root         8     6  0 Aug06 ?        00:00:02  \_ [khelper]
root     22111     8  0 16:00 ?        00:00:00  |   \_ [khelper]
root     22112 22111  0 16:00 ?        00:00:00  |       \_ /sbin/modprobe -q --
iptable_filter

What about the command "iptables -t filter --list -n" is causing
"/sbin/modprobe -q -- iptable_filter" to be running?

I also found that in this scenario, both modprobes are hung and cannot be
killed, but when the "modprobe -q ..." command is killed using "kill -9", it
frees up the "modprobe -r iptable_filter", and the iptables restart command
runs to completion.

I don't know if this helps at all; I just thought I'd report some further
findings.


Comment 42 Neil Horman 2007-08-27 17:02:58 UTC
I'm sorry, clarify this for me.  In the first paragraph you say:
"The /etc/rc.d/init.d/iptables script cannot start iptables upon boot"

And in the second paragraph you say:
"Secondly, the test (provided in comment 37) that reproduces the problem will
cause the modprobe hang when the iptables restart command is issuing the
"modprobe -r iptable_filter" command"

Are you indicating that the patch will hang the system when used on boot, but
will not hang the system if issued manually after the system has completed booting?

I can see how a hang might occur on account of the additional locking in the
nf_register routine, but I would expect that if there were a deadlock there it
would happen universally.  We should be able to remove that.

Also, on the hang with the modprobe -r command, can you check your logs on
reboot (all of them, in case logrotate has run) to make sure that you don't
have any bughalt messages?  That should give us a clue as to what else might
be going on here.
Thanks!


Comment 43 Alex Tang 2007-08-27 17:16:19 UTC
Whoops, sorry for being unclear.

When using the new patch (attachment id 169395), the system will not boot.
I'll attach a screenshot of a system that will not boot.  Since I was not able
to get the system to boot, I went back to the old patch (attachment 160787 [details]) to
see if I could gather more data.  The rest of my previous comment is based on a
kernel using the previous patch.

Comment 44 Alex Tang 2007-08-27 17:21:21 UTC
Created attachment 174041 [details]
screenshot showing boot hang

Comment 45 Neil Horman 2007-08-27 19:44:16 UTC
Ok, that makes more sense, then.  I can fix up this patch and get you past the
boot hang to further test it.  I'll repost it shortly.

Comment 46 Neil Horman 2007-08-27 19:55:04 UTC
Created attachment 174261 [details]
new sockopt patch

I've not built/tested it yet, but am doing so now; I wanted you to have it in
the interim.  It's the same patch as my most recent one, but doesn't do the
write locking of the rwsem on netfilter registration, which I think is what was
causing the deadlock.  If you use this patch instead, you get the extra debug
checks that I coded in.  Also, if you can attach that iptables config so that
I can better try to reproduce here with your script, I would appreciate it.
Thanks!

Comment 47 Neil Horman 2007-08-28 12:15:36 UTC
Hold off on the above patch for a second, I think I missed a chunk.  I'm
fixing it right now.

Comment 48 Neil Horman 2007-08-28 13:22:23 UTC
Created attachment 176401 [details]
new patch

Sorry, there's the correct version of the patch.  It builds and runs properly
in my trivial environment.  Please test it out and let me know, and send me
your iptables config so that I can continue to try to reproduce.  Thanks!

Comment 49 Alex Tang 2007-08-30 22:43:06 UTC
I tried the latest patch, and it's still having the same problem.  If you'd
like me to get a netdump from it, I can.

I changed the status script from comment #37 to issue "iptables -t filter
--list -n" instead of "/etc/rc.d/init.d/iptables status" because it makes the
problem happen more often.

I didn't see anything new in the kernel log.

Here are the iptables rules you asked for.  Sorry for taking so long to get
them to you.

# Generated by iptables-save v1.2.11 on Thu Aug 30 15:20:14 2007
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [167:29794]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp -m icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -p ipv6-crypt -j ACCEPT
-A RH-Firewall-1-INPUT -p ipv6-auth -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A RH-Firewall-1-INPUT -s 10.22.242.8 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 25 -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
# Completed on Thu Aug 30 15:20:14 2007
# Generated by iptables-save v1.2.11 on Thu Aug 30 15:20:14 2007
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [3:240]
:OUTPUT ACCEPT [3:240]
COMMIT
# Completed on Thu Aug 30 15:20:14 2007

Comment 50 Neil Horman 2007-08-31 00:15:45 UTC
Dang, I think I see what's going wrong.  The use of the extra semaphore
doesn't help.  It's still possible for a context to hold a read lock while
another context is blocking on the write lock while holding a flock in user
space, which means the same deadlock can occur.  I'm sorry, I should have seen
that.

Fortunately, I think I have an alternate solution, in which we set a cleanup
flag in place of the cleanup_task pointer.  By setting it, we can defer
deregistration until the use count is zero in nf_sockopt, by pretending that
the nf_sock_ops are already deregistered.  This will let us unregister a
module in a non-blocking fashion, I think.  I'll write up a patch ASAP in the
AM.  Thanks!

Comment 51 Neil Horman 2007-08-31 20:10:59 UTC
Created attachment 183901 [details]
patch to block module unloads with module refcounting in nf_sockopt

Ok, finally, here's a new patch.  It redoes how we fence module reference
counting in the netfilter code.  If you have a chance to try it out, I'd
appreciate it.

Good news: using your scripts and iptables config, I'm able to reproduce the
problem here.  So I'm building the patch here as well and will test it out
Monday morning.  I built a variant of this patch earlier and it seemed to be
working well for me.  Unfortunately I had misbalanced the
module_get/module_put calls, so I wound up never being able to unload the
module.  This patch should correct that.

Let me know how the patch works for you.  I'll post when I have more results
here.

Comment 52 Neil Horman 2007-08-31 20:22:53 UTC
Created attachment 183921 [details]
fixed patch

Sorry, new patch.  Just hit a build break; minor typo.

Comment 53 Neil Horman 2007-09-01 12:37:48 UTC
Dang, I just hit an oops in my testing.  Silly problem.  Hold off on this
patch and I'll have it fixed shortly.  Sorry for the delay.

Comment 56 Neil Horman 2007-09-02 17:31:25 UTC
Created attachment 184891 [details]
new, working patch

Ok, so, good news / bad news.

The good news is that this patch builds/runs and partially works.

The bad news is that at this point I can see no way around the need to modify
the module-init-tools as well.

This patch changes our situation by blocking module unloads via a non-zero
refcount, which we modify in nf_sockopt to indicate the module is in use.
This is the "right thing to do"(tm).  Unfortunately, the delete_module
syscall offers the caller the option to block or not block if the refcount is
non-zero.  The rmmod utility defaults to non-blocking operation, while the
modprobe utility defaults to blocking operation (which makes little sense to
me), and iptables removes modules with the latter utility (of course).

The other bit of good news is that module refcount waits use an interruptible
sleep, meaning you can still hit the deadlock, but you can ctrl-c your way out
of it and resume normal operation without having to reboot.  So this patch
gets us half way there.

I've tested the patch here and am comfortable with it.  Please feel free to do
the same if you like.  I'm going to post an upstream version of this, and then
follow it with a modprobe patch to default its remove operation to
non-blocking.  When that is done, we should have the puzzle completed and the
system will no longer deadlock.

Comment 57 Neil Horman 2007-09-04 12:18:58 UTC
Arrg!  One more corner case!  Apparently the RHEL4 kernel has an independent
bug that prevents non-blocking modprobe/rmmod operation from working
consistently.  I'll have to add that to the patch.

Comment 58 Neil Horman 2007-09-04 18:29:47 UTC
Created attachment 186411 [details]
final, working patch

I'm going to cry.

So, I've finally managed to get the locking straightened out, and it all
seemed to be working quite well.

Then I found that the patch uncovers a leak in the ip_tables module
refcounting.  After a while the refcount for the ip_tables module underflows,
which results in the ip_tables module becoming unremovable.

I'm attaching the patch that I currently have for your testing.  It should
pretty well clear up your production environment, but it will fail under your
test case.  Either way, I think it would be best if we fixed the ip_tables
underflow separately, as this patch is already getting a bit unruly.

Comment 59 Neil Horman 2007-09-04 20:33:15 UTC
Created attachment 186601 [details]
patch to cause modprobe to not block

I've sent upstream versions of the patches out for review.  Before I forget,
you also need this patch for modprobe.  If the patches I sent upstream get
traction, I'll propose them for internal inclusion and tackle the ip_tables
refcount problem.

Comment 60 Alex Tang 2007-09-05 17:28:48 UTC
Thanks Neil.

I was wondering if you could explain more about the ip_tables module
refcounting leak.

Unfortunately, due to another bug (which I can't seem to locate in bugzilla
today) where the ip_conntrack table fills up because some connections aren't
removed from the table appropriately, we end up having to restart iptables
(which clears the ip_conntrack table).  On our most heavily loaded systems
(hundreds of thousands to millions of SMTP connections per day), the conntrack
table can fill up once or twice a day (ip_conntrack_max is 65536), hence we
restart iptables once or twice a day.
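As an aside, the table ceiling mentioned above is tunable; a hypothetical
/etc/sysctl.conf fragment is shown below.  The value is illustrative, and
this is the RHEL4-era ip_conntrack path, which differs on later kernels.

```shell
# /etc/sysctl.conf fragment: raise the conntrack table limit
# from the 65536 noted above (value illustrative).
net.ipv4.netfilter.ip_conntrack_max = 131072
```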

How often will we see this leak, and when it happens, how do we get around it?

Thanks again.

Comment 61 Neil Horman 2007-09-05 18:01:33 UTC
Sure.  From what I'm seeing at the moment, it appears as though we have an
issue (I think during a table replacement operation).  What basically happens
is that we do something to cause a table replacement, which calls the
do_replace function in the module.  This function has some logic to adjust
the module reference count based on the number of rules in the new and old
tables.  It appears this logic attempts a module_put when the reference count
on the module is already zero.  Since we have no check for this error in the
module code, we wind up underflowing the module's reference count, and a zero
refcount becomes a reference count of approximately 4 billion (2^32, since
the reference counter is a 32-bit integer).  The end result is that the
ip_tables module cannot be removed (due to an absurdly high reference count),
although it should still be functional.  I should be able to fix this
subsequent bug, but I'm going to have to handle it in a separate bugzilla to
keep the patch for this original bug from getting too unruly.  Restarting
ip_tables infrequently should be fine; I've only been able to reproduce the
problem using your test scripts.

On the upside, this new problem does not occur upstream, and the patch set
for this original bug seems to be getting received well.


Comment 62 Neil Horman 2007-09-06 13:54:12 UTC
I've posted this internally, and am tracking the remaining underflow issue in
bz280431

Comment 64 Vivek Goyal 2008-03-25 21:04:39 UTC
Committed in 68.25. Released in 68.26. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 68 errata-xmlrpc 2008-07-24 19:12:05 UTC
An advisory has been issued which should help the problem
described in this bug report.  This report is therefore being
closed with a resolution of ERRATA.  For more information
on the solution and/or where to find the updated files,
please follow the link below.  You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

Comment 69 Alex Tang 2008-12-22 21:14:23 UTC
Just as an FYI: we haven't been tracking this problem in a while, but I saw it happen again on a stock 4.7 system (kernel-2.6.9-78.0.1.ELsmp).

sysrq-t output shows:

Dec 22 21:03:05 foo kernel: modprobe      R running  2980  7668   7469                     (NOTLB)

But nothing else for that process.  I can't get a vmcore off of this machine.  If you're still interested, I'll try again the next time we see it.

