Description of problem:
When removing "blackhole" or "unreachable" type routes using "ip route del blackhole 192.168.0.0/16", the kernel oopses as below but keeps running. After that, the "ip route list" command locks up and does not return.

Version-Release number of selected component (if applicable):
2.6.9-1.667 and 2.6.9-678_FC3

How reproducible:
Every time. Tried on 2 separate installs, on Intel i686 and Athlon hardware.

Steps to Reproduce:
1. ip route add blackhole 192.168.0.0/16
2. ip route del blackhole 192.168.0.0/16
   You should see the kernel oops on the console/in logs here.
3. ip route list
   Will not return.

Actual results:
Kernel oops, and lockup of route commands.

Expected results:
Silent removal of the specified blackhole route; route commands continue to work as normal.

Additional info:
Copy of an actual session including the oops attached.
Same results using the old "route" command as opposed to the "ip route" command.
Same results using the zebra daemon.
Created attachment 107176
Copy of the output of demonstration session.

Here I go through the steps that cause the bug, and record the kernel oops as logged to syslog. This is identical to what I see on the console in run level 1.
Line numbers are from the source RPM of kernel 2.6.9-678_FC3.

I believe that the oops is occurring at line 526 of include/linux/list.h, "*pprev = next;", because pprev is NULL. This was inlined at line 166 of net/ipv4/fib_semantics.c, "hlist_del(&nh->nh_hash);", which is releasing the next hop hash lists.

I am guessing that a blackhole route somehow manages to inject an incomplete hash entry into the nexthops list, with pprev set to NULL. I am waiting for a kernel to compile on a very slow machine to confirm this through printk...
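To make the suspected failure easier to follow, here is a minimal userspace sketch of the hlist unlink, assuming simplified copies of the kernel's hlist_node and hlist_head types. It is not the kernel source: hlist_del_unchecked and guarded_hlist_del are made-up names for illustration, and the real hlist_del() does a little more than this.

```c
#include <stddef.h>

/* Simplified userspace copies of the kernel's hlist types (illustration only). */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;  /* address of the pointer that points at this node */
};

struct hlist_head {
    struct hlist_node *first;
};

/* Insert n at the front of chain h; this is what gives pprev a valid value. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* A node that was never added to a chain still has pprev == NULL. */
static int hlist_unhashed(const struct hlist_node *n)
{
    return n->pprev == NULL;
}

/* Hypothetical name. Unlinks n with no checks, like the statement in the oops. */
static void hlist_del_unchecked(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;              /* NULL dereference if n was never hashed */
    if (next)
        next->pprev = pprev;
}

/* Hypothetical guarded variant: returns 1 if n was unlinked, 0 if it was never hashed. */
static int guarded_hlist_del(struct hlist_node *n)
{
    if (hlist_unhashed(n))
        return 0;
    hlist_del_unchecked(n);
    n->next = NULL;
    n->pprev = NULL;
    return 1;
}
```

Calling hlist_del_unchecked() on a zero-initialised node that was never passed to hlist_add_head() writes through a NULL pprev, which matches the guess above. The guarded variant simply skips such a node.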
I think the cause of the problem is lines 742-743 of net/ipv4/fib_semantics.c, in fib_create_info():

> if (!nh->nh_dev)
>         continue;

Basically, if the next hop structure has no nh_dev, its nh_hash is never initialised, so it is left with pprev set to NULL.

If this is a valid scenario, the code that calls hlist_del() needs to check nh_dev and only run __hlist_del() if it is non-null. Otherwise the continue should become some sort of error, and the cause of the invalid nh_dev tracked down. Alternatively, nh_hash needs to be initialised into a "no device" type chain.

I *think* this affects the stock 2.6.9 kernel as well, but I am unable to verify that.
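The first option can be sketched in userspace as below. This is only a model of the idea, not the patch that went into the kernel: struct model_nh is a hypothetical stand-in for the real next hop structure (nh_dev is a plain string here rather than a device pointer), and model_hash_nexthop and model_release_nexthop are invented names. The point is that the release path has to skip exactly the entries that the create path skipped.

```c
#include <stddef.h>

/* Simplified userspace copies of the kernel's hlist types (illustration only). */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

/* Hypothetical, stripped-down stand-in for the kernel's next hop structure. */
struct model_nh {
    const char *nh_dev;         /* NULL models a blackhole/unreachable route */
    struct hlist_node nh_hash;
};

/* Create path: next hops without a device are skipped and never hashed. */
static void model_hash_nexthop(struct model_nh *nh, struct hlist_head *chain)
{
    struct hlist_node *first = chain->first;

    if (!nh->nh_dev)
        return;                 /* mirrors the "continue" quoted above */

    nh->nh_hash.next = first;
    if (first)
        first->pprev = &nh->nh_hash.next;
    chain->first = &nh->nh_hash;
    nh->nh_hash.pprev = &chain->first;
}

/* Release path: apply the same nh_dev test before unlinking.
 * Returns 1 if the entry was unlinked, 0 if it was never hashed. */
static int model_release_nexthop(struct model_nh *nh)
{
    struct hlist_node *next = nh->nh_hash.next;
    struct hlist_node **pprev = nh->nh_hash.pprev;

    if (!nh->nh_dev)
        return 0;               /* never hashed, so there is nothing to unlink */

    *pprev = next;
    if (next)
        next->pprev = pprev;
    nh->nh_hash.next = NULL;
    nh->nh_hash.pprev = NULL;
    return 1;
}
```

With this guard, releasing a device-less entry is a no-op, while entries that have a device are still removed from their chain as before.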
Fixed in kernel-2.6.9-1.681_FC3! I had just figured out the patch as well!