Description of problem:
When removing "blackhole" or "unreachable" type routes using "ip route
del blackhole 192.168.0.0/16", the kernel oopses as shown below but keeps
running. After that, the "ip route list" command locks up and does not return.
Version-Release number of selected component (if applicable):
2.6.9-1.667 and 2.6.9-678_FC3
How reproducible:
Every time. Tried on 2 separate installs, on Intel i686 and Athlon.
Steps to Reproduce:
1. ip route add blackhole 192.168.0.0/16
2. ip route del blackhole 192.168.0.0/16
You should see the kernel oops on the console/in logs here.
3. ip route list
Will not return.
Actual results:
Kernel oops, and lockup of route commands.
Expected results:
Silent removal of the specified blackhole route; route commands continue to
work as normal.
Additional info:
Copy of an actual session including Oops attached.
Same results using the old "route" command as opposed to the "ip
Same results using the zebra daemon.
Created attachment 107176
Copy of the output of demonstration session.
Here I go through the steps that cause the bug, and record the kernel oops as
logged to syslog.
This is identical to what I see on the console in run level 1.
Line numbers from the source rpm of kernel 2.6.9-678_FC3:
I believe that the Oops is occurring in line 526 of
include/linux/list.h: "*pprev = next;" because pprev is null.
This was inlined at line 166 of net/ipv4/fib_semantics.c:
"hlist_del(&nh->nh_hash);" which is releasing the next hop hash lists.
I am guessing that a blackhole route manages to inject an incomplete
hash entry into the nexthops list with pprev set to null somehow.
I am waiting for a kernel to compile on a very slow machine to confirm this.
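
To illustrate the suspected failure mode, here is a minimal userspace sketch of a hash-list node. It is a simplified model written for this report, not the kernel source: the struct layout and the add/unlink pointer updates follow the hlist idea, while unlink_would_fault() and hlist_unlink() are hypothetical helper names. A node that was never linked keeps pprev == NULL, so an unchecked unlink would write through a null pointer at "*pprev = next".

```c
#include <assert.h>   /* only needed by the usage check below */
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a kernel-style hash-list node (simplified). */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;  /* address of the pointer that points at this node */
};

struct hlist_head {
    struct hlist_node *first;
};

/* Link n at the front of h; this is the only step that sets pprev. */
static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Hypothetical helper: true if unlinking n would run "*pprev = next"
 * with pprev == NULL, i.e. the node was never put on a list. */
static inline bool unlink_would_fault(const struct hlist_node *n)
{
    return n->pprev == NULL;
}

/* Unchecked unlink: writes through pprev without testing it first. */
static inline void hlist_unlink(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;  /* null dereference when the node was never linked */
    if (next)
        next->pprev = pprev;
}
```

With a zero-initialised node, unlink_would_fault() returns true; after hlist_add_head() it returns false and hlist_unlink() is safe to call.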
I think the cause of the problem is lines 742,743 of
net/ipv4/fib_semantics.c in fib_create_info():
Basically, if there is no nh_dev part of the next_hop structure, then the
nh_hash is never initialised, so it will have pprev set to null.
If this is a valid scenario, hlist_del() needs to check nh_dev and only
run __hlist_del() if it is non-null. Otherwise the continue should
become some sort of error, and the cause of an invalid nh_dev should be
tracked down. Or alternatively, the nh_hash needs to be initialized into a
no device type chain.
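
A rough userspace sketch of the first option (skip the unlink when there is no device) follows. Everything in it is an assumption made for illustration: dev_model and nexthop_model are hypothetical stand-ins for the kernel structures, and hash_nexthop() and unhash_nexthop() are made-up names. It is not the patch that went into any released kernel; it only shows the release path mirroring the nh_dev test that the creation path already applies.

```c
#include <assert.h>   /* only needed by the usage check below */
#include <stdbool.h>
#include <stddef.h>

struct hlist_node { struct hlist_node *next; struct hlist_node **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Hypothetical stand-ins for the device and nexthop structures. */
struct dev_model { int ifindex; };
struct nexthop_model {
    struct dev_model *nh_dev;   /* NULL for blackhole/unreachable routes */
    struct hlist_node nh_hash;
};

static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

static inline void hlist_unlink(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
}

/* Creation side (as described above): nexthops without a device are skipped. */
static inline void hash_nexthop(struct nexthop_model *nh, struct hlist_head *h)
{
    if (!nh->nh_dev)
        return;  /* nh_hash is left untouched, so pprev stays NULL */
    hlist_add_head(&nh->nh_hash, h);
}

/* Release side: apply the same test before unlinking.
 * Returns true if the nexthop was actually unlinked. */
static inline bool unhash_nexthop(struct nexthop_model *nh)
{
    if (!nh->nh_dev)
        return false;  /* never hashed, nothing to unlink */
    hlist_unlink(&nh->nh_hash);
    return true;
}
```

In this model a blackhole nexthop (nh_dev == NULL) passes through both functions without its nh_hash being read or written, while a nexthop with a device is linked and unlinked normally.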
I *think* this affects the stock 2.6.9 kernel as well. I am unable to
verify that though.
Fixed in kernel-2.6.9-1.681_FC3!
I had just figured out the patch as well!