Bug 485904

Summary: [RHEL4] Netfilter modules unloading hangs
Product: Red Hat Enterprise Linux 4 Reporter: Tomas Smetana <tsmetana>
Component: kernel Assignee: Jiri Pirko <jpirko>
Status: CLOSED ERRATA QA Contact: Evan McNabb <emcnabb>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.7 CC: anton, davem, mgahagan, nhorman, rkhan, tao, tgraf, tumeya, vgoyal
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 485903 Environment:
Last Closed: 2011-02-16 16:01:01 UTC Type: ---
Bug Depends On: 485903    
Bug Blocks:    
Attachments:
Description Flags
reproducer script none

Description Tomas Smetana 2009-02-17 11:44:08 UTC
+++ This bug was initially created as a clone of Bug #485903 +++

Description of problem:
The unloading of netfilter modules (triggered by e.g. service iptables stop) may hang under certain circumstances.  Please see the reproducer.

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.EL

How reproducible:
always

Steps to Reproduce:
1. set up iptables:

iptables -F
iptables -X
iptables -A OUTPUT -d 192.168.122.254/255.255.255.0 -o eth0 -p tcp -m state --state NEW -m tcp --dport 7365 -j ACCEPT

The 192.168.122.254 host should not exist.

2. run the following script (note that timing matters -- running the commands by hand may not reproduce the problem)

#!/bin/sh
ping 192.168.122.254 -c1 -w1
arp -d 192.168.122.254
/etc/init.d/iptables stop

3. observe the results
  
Actual results:
the initscript never finishes:

PING 192.168.122.254 (192.168.122.254) 56(84) bytes of data.

--- 192.168.122.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms

Flushing firewall rules:                                   [  OK  ]
Setting chains to policy ACCEPT: filter                    [  OK  ]
Unloading iptables modules:

... and nothing else. The output of ps -ef looks like this

root     16344 16272 99 12:13 pts/1    00:06:18 modprobe -r xt_state

Expected results:
clean module unload

Additional info:
Adding a sleep before the arp command in the script prevents the problem.  Also, there are several patches in the upstream kernel that appear to be related:
http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg14393.html
http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg07687.html

The kernel is spinning in the ip_conntrack_cleanup() function:

i_see_dead_people:
       ip_conntrack_flush();
       if (atomic_read(&ip_conntrack_count) != 0) {
               schedule();
               goto i_see_dead_people;
       }

where ip_conntrack_count never drops to zero, so the loop never exits.

Comment 1 Mike Gahagan 2009-02-18 21:33:29 UTC
Were you using some special routing table to reproduce this bug? I have tried on both rhel 4.8 (-81 kernel) as well as rhel 5 (-130 and -92 kernels) and was unable to reproduce the problem.

Comment 2 Tomas Smetana 2009-02-20 10:02:08 UTC
(In reply to comment #1)
> Were you using some special routing table to reproduce this bug? I have tried
> on both rhel 4.8 (-81 kernel) as well as rhel 5 (-130 and -92 kernels) and was
> unable to reproduce the problem.

Yes.  Sorry, I forgot to mention.  I had to 'route add 192.168.122.254 dev eth0' on the testing system to reproduce the behaviour.

Comment 3 Mike Gahagan 2009-02-20 16:15:54 UTC
I could have sworn I had tried that, figuring you had to have done something to the routing table to reproduce this. It reproduces for me in rhel 5.3, I'll try rhel 4 in a bit, but I'll go ahead and give this a qa ack.

Comment 4 Mike Gahagan 2009-02-20 16:17:10 UTC
Created attachment 332725
reproducer script

Reproducer script, updated to add the route to the nonexistent host that is needed to reproduce the bug.

Comment 6 RHEL Program Management 2009-03-12 18:54:31 UTC
Since RHEL 4.8 External Beta has begun, and this bugzilla remains 
unresolved, it has been rejected as it is not proposed as exception or 
blocker.

Comment 11 Neil Horman 2010-02-17 13:05:34 UTC
I apologize, Jiri. I've been busy with other bugs. I'll look at this today.

Comment 12 Neil Horman 2010-02-17 13:50:15 UTC
My first thought, looking at this bug, is that I'd like to confirm what we're seeing.  I see the original comment that we're spinning in the cleanup code.  I think that's quite likely accurate, but it would be nice to have something confirming it here.  A sysrq-t showing that stack trace would be good towards that end.

Also, it would be nice if we could get the contents of /proc/net/netfilter/ip_conntrack and ip_conntrack_expect.  That will tell us a bit more about what isn't getting cleaned from the various lists during cleanup.

Given the reproducer, what I _think_ might be happening is that we're incrementing ip_conntrack_count for entries when we create them, but before we put them on the ip_conntrack_hash lists.  Since we only clean the latter, the addition of a route to an unreachable destination prevents us from ever confirming the conntrack entry, so we can never clean it, and never remove the module.  That's just a theory though; the above data will help confirm it.
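
The theory in this comment can be modeled in user space. What follows is only a rough sketch, not kernel code: struct entry, struct table and every function name are hypothetical stand-ins, and the retry loop is bounded so the model terminates instead of spinning. It assumes, as the theory does, that the counter is bumped at allocation time while the flush only reaches confirmed entries.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ENTRIES 8

/* Hypothetical stand-in for a conntrack entry; not the kernel structure. */
struct entry {
    bool in_use;     /* slot is allocated */
    bool confirmed;  /* true once the entry is linked into the hash lists */
};

struct table {
    struct entry slots[MAX_ENTRIES];
    int count;       /* plays the role of ip_conntrack_count */
};

/* Allocation bumps the counter right away, before any confirmation. */
static int entry_alloc(struct table *t)
{
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = true;
            t->slots[i].confirmed = false;
            t->count++;
            return i;
        }
    }
    return -1; /* table full */
}

/* Confirmation makes the entry visible to the flush. */
static void entry_confirm(struct table *t, int i)
{
    assert(i >= 0 && i < MAX_ENTRIES && t->slots[i].in_use);
    t->slots[i].confirmed = true;
}

/* The flush only walks confirmed entries (the assumption under test). */
static void table_flush(struct table *t)
{
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (t->slots[i].in_use && t->slots[i].confirmed) {
            t->slots[i].in_use = false;
            t->count--;
        }
    }
}

/*
 * Bounded version of the i_see_dead_people loop: flush, check the
 * counter, retry.  Returns true if the counter reached zero.
 */
static bool table_cleanup(struct table *t, int max_retries)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        table_flush(t);
        if (t->count == 0)
            return true;
    }
    return false; /* the unbounded loop would spin here forever */
}
```

In this model, a table holding only confirmed entries cleans up on the first pass, while a single allocated-but-unconfirmed entry keeps the counter at 1 no matter how many times the flush runs. Whether the real kernel behaves this way is exactly what the requested sysrq-t and proc data would need to confirm.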

Comment 14 Jiri Pirko 2010-02-18 15:14:07 UTC
Please use the RHEL 5 bz (bz485903) for comments. Setting dependency.

Comment 15 Neil Horman 2010-02-18 15:59:54 UTC
I made a mistake in the names of the proc files, though: it's nf_conntrack and nf_conntrack_expect you want to examine before and after the hang.  My expectation is that we're seeing something get on the expect list, holding a reference, but never transition to the nf_conntrack list, so it never gets cleaned.  That's likely what we need to look at.

Comment 18 RHEL Program Management 2010-10-12 16:50:10 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Vivek Goyal 2010-10-14 14:39:21 UTC
Committed in 89.43.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 22 Vivek Goyal 2010-10-18 21:28:48 UTC
There were some errors and confusion while adding the bz to the errata tool. Returning the bz to MODIFIED state so that it can be added to the errata.

Comment 25 errata-xmlrpc 2011-02-16 16:01:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html