Created attachment 1229393
perl script to demo this bug; scans all IPs on your /24 LAN (change the subnet value first)
Description of problem:
Note: I've bisected already, see below.
Hi! Since 4.8.x, a script of mine that pings all IPs on my LAN /24 subnet in about 0.5s fails on the send() call with "Operation not permitted", and nmap doing the same sweep fails the same way. This happens after a somewhat random number of packets have already been sent. That number shrinks each time you run the script: on the first run you'll get up to around 200 pings before the error, then it goes down to around 50. If you wait, it goes back up to around 200 pings. It almost never completes all 253 of them.
Interestingly, the problem only occurs when you ping different IPs. If you send the same ping count using my script to just one IP, there is no bug.
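For readers without the attachment, here is a minimal Python sketch of the kind of sweep described above. It is not the attached Perl script. The function names (icmp_checksum, build_echo_request, sweep) and the ICMP identifier 0x1234 are made up for illustration. It needs root for the raw ICMP socket, and it assumes the failure surfaces as EPERM from sendto(), which Python raises as PermissionError.

```python
import ipaddress
import socket
import struct


def icmp_checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes = b"ping") -> bytes:
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


def sweep(subnet: str) -> int:
    """Send one echo request per host in subnet.

    Returns how many packets went out before the first PermissionError
    (or the total host count if none occurred). Replies are not read.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP) as sock:
        for seq, host in enumerate(ipaddress.ip_network(subnet).hosts()):
            packet = build_echo_request(0x1234, seq & 0xFFFF)
            try:
                sock.sendto(packet, (str(host), 0))
            except PermissionError as exc:
                print(f"send failed after {sent} packets: {exc}")
                break
            sent += 1
    return sent
```

Calling sweep("192.168.100.0/24") as root (with your own subnet substituted) several times in a row should show whether the count of packets sent before the error shrinks from run to run, as described above.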
Strangely, the iptable_nat module MUST be loaded for the bug to show up! If you rmmod it, the bug goes away. The bug occurs even if every iptables table (including -t nat) is completely empty (no rules). All that is required is for iptable_nat to be loaded.
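Since the bug depends on whether iptable_nat is loaded, it can help to record the module state alongside each test run. A small sketch, assuming the usual /proc/modules layout where the module name is the first whitespace-separated field on each line; the helper names are hypothetical:

```python
def module_loaded(name: str, modules_text: str) -> bool:
    """Return True if name appears as a module name in /proc/modules text."""
    # Compare only the first field, so a module that merely lists
    # `name` as a dependent does not count as a match.
    return any(line.split()[:1] == [name] for line in modules_text.splitlines())


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the kernel's list of currently loaded modules (Linux only)."""
    with open(path) as fh:
        return fh.read()
```

On a Linux box, module_loaded("iptable_nat", read_proc_modules()) tells you whether the precondition for the bug is met before you start a sweep.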
4.7.0 kernels don't have this problem: the pings go out and everything is fine no matter how fast you repeat the script.
I bisected the bug to:
I had to skip 7c9664351980aaa6a4b8837a314360b3a4ad382a because it wouldn't boot... it just panicked on every try. So I can't narrow it down any closer than to within 2 commits.
I played with all the sysctls that looked relevant, like: ratelimit, per_sec, max, etc. I modified everything I could find, but nothing made the problem go away. I *think* some had a modest effect on how many times I could run the script before the error popped up, but even when I took them to extreme values the bug never went away. I'm back to the Fedora defaults now, and can attach them on request.
Version-Release number of selected component (if applicable):
(4.8.8 and 4.8.10 also have the bug; one other guy who has experienced this on Ubuntu claims it's anything above 4.8.4.)
Steps to Reproduce:
1. Boot into 4.8.8+ kernel
2. modprobe iptable_nat
3. nmap -PE 192.168.100.0/24 (use your subnet)
or run my attached script, which makes it even clearer, after modifying the subnet. You might have to repeat the nmap run or the script a few times in rapid succession.
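The "repeat a few times in rapid succession" step can be scripted. The following is only a sketch: it assumes the failure shows up as the text "Operation not permitted" somewhere in the command's output, and the helper names count_eperm and repeat_scan are invented for this example.

```python
import subprocess

EPERM_TEXT = "Operation not permitted"


def count_eperm(output: str) -> int:
    """Count output lines that report an EPERM failure."""
    return sum(1 for line in output.splitlines() if EPERM_TEXT in line)


def repeat_scan(cmd: list, runs: int = 5) -> list:
    """Run cmd back to back, returning the EPERM line count of each run."""
    counts = []
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # nmap may report on either stream, so check both.
        counts.append(count_eperm(proc.stdout + proc.stderr))
    return counts
```

For example, repeat_scan(["nmap", "-PE", "192.168.100.0/24"]) (run as root, with your own subnet) returns one count per run; on an unaffected kernel every count should be zero.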
Actual results:
Starting Nmap 7.12 ( https://nmap.org ) at 2016-12-08 01:33 CST
sendto in send_ip_packet_sd: sendto(5, packet, 44, 0, 192.168.100.131, 16) => Operation not permitted
Offending packet: TCP 192.168.100.1:38593 > 192.168.100.131:143 S ttl=53 id=5409 iplen=44 seq=3581424515 win=1024 <mss 1460>
sendto in send_ip_packet_sd: sendto(5, packet, 44, 0, 192.168.100.2, 16) => Operation not permitted
Expected results:
no errors, normal nmap output
Additional info:
I've also posted this info to LKML with the same subject, CC'ing the committers; no replies yet.
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.
Bug remains in F24, which I just upgraded to. LKML guys have seen my bug request and are looking into it. I will report back.
LKML guys have come up with a patch and added it to 4.8-stable review, as per the following:
[PATCH 4.8 83/85] Revert "netfilter: nat: convert nat bysrc hash to rhashtable"
[PATCH 4.8 84/85] Revert "netfilter: move nat hlist_head to nf_conn"
They simply reverted the 2 commits I bisected to. They say a "real" fix has been done in 4.9.
I suppose this will roll around into F24's kernel errata eventually and then we can close this bz?
FYI, I found this ticket in my quest to resolve the same issues I was seeing with nmap, albeit on a different Linux 4.8 distribution. Cross-linking to http://unix.stackexchange.com/q/337082/83846, in case the details there are helpful for anyone else experiencing this. Looking forward to 4.9 becoming officially available on our respective distributions!
As of kernel 4.9.5-100.fc24.x86_64 this bug is fixed. Apparently the LKML guys have tweaked the 4.8 stable to fix this bug as well, but I have not tested that. Closing bug. Thanks!