Bug 485903
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | [RHEL5] Netfilter modules unloading hangs | | |
| Product | Red Hat Enterprise Linux 5 | Reporter | Tomas Smetana <tsmetana> |
| Component | kernel | Assignee | Jiri Pirko <jpirko> |
| Status | CLOSED ERRATA | QA Contact | Red Hat Kernel QE team <kernel-qe> |
| Severity | medium | Docs Contact | |
| Priority | high | | |
| Version | 5.3 | CC | anton, chaot_s, davem, dhoward, eguan, holmes86, jinzishuai, jpirko, jtillots, jwest, masanari_iida, me, mgahagan, nhorman, rkhan, tao, tgraf, tis, tumeya |
| Target Milestone | rc | Keywords | ZStream |
| Target Release | --- | | |
| Hardware | All | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | Bug Fix |
| Doc Text | Calling the "service iptables stop" command causes the iptables init script to unload the netfilter modules. Because a clean-up code path was not taken, an endless loop occurred, which resulted in the init script becoming unresponsive. This update ensures that the clean-up code path is correctly taken, with the result that stopping the iptables service now works as expected. | Story Points | --- |
| Clone Of | | | |
| | 485904 | Environment | |
| Last Closed | 2011-01-13 20:46:00 UTC | Type | --- |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 485904, 533192, 600215 | | |
| Attachments | | | |

Description
Tomas Smetana
2009-02-17 11:34:30 UTC
The kernel is spinning in the ip_conntrack_cleanup() function:

    i_see_dead_people:
        ip_conntrack_flush();
        if (atomic_read(&ip_conntrack_count) != 0) {
            schedule();
            goto i_see_dead_people;
        }

where ip_conntrack_count never reaches zero.

Created attachment 332728
reproducer script

Updated reproducer script; adds a route to the non-existent host.

Just a note: I have tried to backport the patches that touch the RCU usage in netfilter and looked "suspicious" to me. The problem is that the netfilter code has changed quite a lot in recent upstream releases, and any backport I made is either a bit dangerous or incomplete, which was my problem -- I shot more or less blindly and the patches I tried simply didn't work for me. Please let me know if you make any progress on this.

Digging into this, the problem seems to be a missing (unreached) nf_conntrack_put somewhere: the reference count for ct never counts down to 1 and therefore nfct->destroy() (where ip_conntrack_count is decremented) is never called. That's the reason for looping in "goto i_see_dead_people;" (atomic_read(&ip_conntrack_count) == 1 all the time). When I do this:

    ping 192.168.122.254 -c1 -w1
    +sleep 1
    arp -d 192.168.122.254

it does not hang. I'll dig into this more...

Has anyone come up with a work-around? As it stands, system-config-securitylevel cannot complete. Normal shutdown is also problematic.

I found out the following. With eth0 uninitialized, the reproducer script does not hang. After bringing it up with "ifconfig eth0 up" and running the reproducer again, it still does not hang. Once I assign an IP address with "ifconfig eth0 10.0.0.1 netmask 255.255.255.0" and run the reproducer, the hang occurs. Testing this with kernel 2.6.18-187.el5.

Neal, would you please look at this? Thanks

The issue that we are seeing (looping in ip_conntrack_cleanup) happens because ip_conntrack_count never reaches 0. That's because one instance of ip_conntrack is never freed by ip_conntrack_free() (ip_conntrack_count is decremented there). ip_conntrack_free() is called from destroy_conntrack(), which in turn is called from nf_conntrack_put() once the refcount (&nfct->use) reaches zero.
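
The chain described above (last put, then destroy, then counter decrement) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every model_* name is hypothetical, nothing here is kernel code, and the loop is bounded so that the model terminates where the real ip_conntrack_cleanup() would keep spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: one tracked connection. */
struct model_ct {
    int use;      /* reference count, plays the role of nfct->use */
    bool hashed;  /* still linked in the (model) conntrack table */
};

/* Plays the role of ip_conntrack_count. */
int model_ct_count = 0;

/* New entry: the table itself holds the first reference. */
void model_ct_init(struct model_ct *ct)
{
    ct->use = 1;
    ct->hashed = true;
    model_ct_count++;
}

/* Take an extra reference on behalf of some other (unspecified) holder. */
void model_ct_get(struct model_ct *ct)
{
    assert(ct->use > 0);
    ct->use++;
}

/* Drop a reference; only the last put "frees" the entry and decrements
 * the global counter. */
void model_ct_put(struct model_ct *ct)
{
    assert(ct->use > 0);
    if (--ct->use == 0)
        model_ct_count--;
}

/* Unlink every entry and drop the table's own reference to it. */
void model_flush(struct model_ct **table, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i]->hashed) {
            table[i]->hashed = false;
            model_ct_put(table[i]);
        }
    }
}

/* Bounded stand-in for the i_see_dead_people loop: returns the pass on
 * which the counter reached zero, or -1 if it never did. */
int model_cleanup(struct model_ct **table, size_t n, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        model_flush(table, n);
        if (model_ct_count == 0)
            return pass;
    }
    return -1;
}
```

In this model, an entry whose extra reference is never put keeps model_ct_count at 1, so model_cleanup() exhausts its passes and returns -1. Once the missing put is issued, the next call returns on the first pass.
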
Looking at this with printks in appropriate places, when reproducing with "ping 192.168.122.254 -c1 -w1 || sleep x.y && arp -d 192.168.122.254" the mentioned refcount goes up and down (1-4) for ~1 sec and then stays still for ~10 secs. When "sleep x.y" is long enough, it makes it to refcnt=1 before the "arp part" is called. If the "arp part" is called earlier (sleep ~<1s), the refcount stays >1 and then (after ~10 secs) the corresponding ip_conntrack is not freed. In other words, the "arp part" stops the refcnt from changing. The problem in the "arp part" happens somewhere in the neigh_update() function called from arp_req_delete(). Still not sure where exactly or why...

I think my comments from bz 485904 are still valid. I made a mistake in the name of the proc files though, it's nfs_conntrack and nf_contract_expect you want to examine before and after the hang. My expectation is that we're seeing something get on the expect list, holding a reference, but never transition to the nf_conntrack list, so it never gets cleaned up. That's likely what we need to look at.

Hm, I do not see these files there:

    # ls /proc/net/netfilter/
    nf_log  nf_queue

Searching manually in other likely directories, I cannot find them either.

/proc/net/nf_conntrack and /proc/net/nf_conntrack_expect

OK, I found a fix. Indeed the problem was in neigh_update(). The timer was deleted but the references were not put. The following upstream commit fixes this:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5ef12d98a19254ee5dc851bd83e214b43ec1f725

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

We are having issues with this bug as well.
We are using RHEL 5 on a distributed cluster spread across 5 states. The remoteness of several of the cluster pieces makes having reliably rebooting machines a priority. Can this patch be incorporated ASAP?

in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

kernel 203 from http://people.redhat.com/jwilson/el5/203.el5/i686/kernel-2.6.18-203.el5.i686.rpm works for me.

Please note that when installing http://people.redhat.com/jwilson/el5/203.el5/i386/kernel-headers-2.6.18-203.el5.i386.rpm I get the following error:

    [root@hostname ~]# uname -a
    Linux hostname.domain.tld 2.6.18-194.3.1.el5 #1 SMP Thu May 13 13:09:10 EDT 2010 i686 athlon i386 GNU/Linux
    [root@hostname ~]# rpm -ihv kernel-headers-2.6.18-203.el5.i386.rpm
    Preparing...                ########################################### [100%]
            file /usr/include/linux/gfs2_ondisk.h from install of kernel-headers-2.6.18-203.el5.i386 conflicts with file from package kernel-headers-2.6.18-194.3.1.el5.i386
            file /usr/include/linux/taskstats.h from install of kernel-headers-2.6.18-203.el5.i386 conflicts with file from package kernel-headers-2.6.18-194.3.1.el5.i386
    [root@hostname ~]#

before the 203 kernel reloading iptables hung at unloading the netfilters, with the 203 kernel it works just fine.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Calling the "service iptables stop" command causes the iptables init script to unload the netfilter modules. Because a clean-up code path was not taken, an endless loop occurred, which resulted in the init script becoming unresponsive. This update ensures that the clean-up code path is correctly taken, with the result that stopping the iptables service now works as expected.

If I am not mistaken, this one was fixed on 2.6.18-194.6.1.
    * Mon Jun 07 2010 Jiri Pirko [2.6.18-194.6.1.el5]
    - [net] neigh: fix state transitions via Netlink request (Jiri Pirko) [600215 485903]

And 2.6.18-194.11.1 was released on 10 August. I can see BZ#600215 is on the list at the following URL:
http://www.redhat.com/docs/en-US/errata/RHSA-2010-0504/Kernel_Security_Update/index.html

So if someone from RH confirms the release, set this BZ status to CLOSED. Thanks

Event posted on 08-13-2010 02:14pm JST by tumeya

> So if someone from RH confirm the release, set this BZ status to CLOSED.

BZ600215 addressed EUS delivery for this bug. It got pushed out on July 1st btw. This BZ, bz485903, however must stay open until its delivery on 5.6.0.

This event sent from IssueTracker by tumeya
issue 261512

verified by job https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=178158
see case /kernel/errata/5.5.z/600215-netfilter

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0017.html

I am still having exactly the same problem after upgrading to RHEL-5.6 and the 2.6.18-238.1.1.el5 kernel.

(In reply to comment #41)
> I am still having exactly the same problem after upgrading to RHEL-5.6 and the
> 2.6.18-238.1.1.el5 kernel.

That's most probably a different issue which looks alike. Would you please file a new bug with reproducing steps? Thanks.