Created attachment 499296 [details]
make libdhcp use rtnl_link_get_by_number() instead of rtnl_link_get()

Problem description:

One of our customers hit a problem that turned out to be a bug in the interaction between the "libdhcp" and "libnl" libraries.

In brief: the nic_get_links() function from "libdhcp" uses the rtnl_link_get() function from "libnl" to get the interface configuration from the kernel via netlink. In doing so, nic_get_links() assumes that network interface indexes are strictly consecutive, which is not true in general.

    static int nic_get_links(NLH_t nh, char *if_name, int if_index)
    {
    ...
        nitems = nl_cache_nitems(cache);
        while (i <= nitems) {
            if ((rlink = rtnl_link_get(cache, i)) == NULL) {
                eprintf(NIC_FATAL, "%s (%d): %s\n", ...
            }
    ...
            rtnl_link_put(rlink);
            i++;
    ...
    }

nic_get_links() calls rtnl_link_get(cache, i) expecting to get the i-th interface, but rtnl_link_get() does not return the i-th interface: it searches for the network interface whose index == i.

    struct rtnl_link *rtnl_link_get(struct nl_cache *cache, int ifindex)
    {
    ...
        nl_list_for_each_entry(link, &cache->c_items, ce_list) {
            if (link->l_index == ifindex) {
                nl_object_get((struct nl_object *) link);
                return link;
            }
        }

        return NULL;
    }

If the network interface indexes have "holes", this code does not work.

In our particular case the node could not boot from an iSCSI device, because "nash" inside the initrd image failed to configure the network interfaces due to this issue (the Parallels Virtuozzo Containers kernel virtualizes network interfaces => their indexes may be non-consecutive).
But this issue can easily be reproduced on a stock RHEL5 kernel, even on an already booted and fully operational node:

# Initially we see the network interface indexes are sequential:
[root@dhcp-10-30-20-94 ~]# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1c:42:42:44:47 brd ff:ff:ff:ff:ff:ff
    inet 10.30.20.94/16 brd 10.30.255.255 scope global eth0
    inet6 2a00:1b48:1003:30:21c:42ff:fe42:4447/64 scope global dynamic
       valid_lft 2592000sec preferred_lft 604800sec
    inet6 fe80::21c:42ff:fe42:4447/64 scope link
       valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

# Let's make them non-sequential:
[root@dhcp-10-30-20-94 ~]# vconfig add eth0 10
...
[root@dhcp-10-30-20-94 ~]# vconfig add eth0 11
...
[root@dhcp-10-30-20-94 ~]# vconfig rem eth0.10
...
[root@dhcp-10-30-20-94 ~]# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1c:42:42:44:47 brd ff:ff:ff:ff:ff:ff
    inet 10.30.20.94/16 brd 10.30.255.255 scope global eth0
    inet6 2a00:1b48:1003:30:21c:42ff:fe42:4447/64 scope global dynamic
       valid_lft 2591992sec preferred_lft 604792sec
    inet6 fe80::21c:42ff:fe42:4447/64 scope link
       valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
5: eth0.11@eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether 00:1c:42:42:44:47 brd ff:ff:ff:ff:ff:ff

# And here is the command sequence that fails:
[root@dhcp-10-30-20-94 ~]# nash
(running in test mode).
Red Hat nash version 5.1.19.6 starting
netname 00:1c:42:42:44:47 eth0.11
network --device eth0.11 --bootproto static --ip 10.0.0.2 --netmask 255.255.255.0
<<< press Ctrl-d here >>>
ERROR: Interface setup failed: pumpSetupInterface failed: get link - 19: No such device.

1) I ran this test on the following node:
   Red Hat Enterprise Linux Server release 5.5 (Tikanga)
   2.6.18-194.el5 x86_64 kernel
   nash-5.1.19.6-61

   But the issue was originally found on RHEL5.6 => nash-5.1.19.6-68.1, libnl-1.0-0.10.pre5.5 and libdhcp-1.20-10.el5 are also affected.
   ("nash" uses "libdhcp" and "libnl" but links them statically => it should be taken into account as well.)

2) The example above shows that "nash" is affected, but in fact any tool that uses "libdhcp"/"libnl" will behave the same way.

3) How to solve this: the simplest way seems to be to add another function to "libnl" that returns the configuration of the i-th interface, and to make the nic_get_links() function from "libdhcp" use it instead of rtnl_link_get().

Attaching patches with the implementation.

Hope that helps.

--
Best regards,
Konstantin Khorenko,
PVCfL/OpenVZ developer,
Parallels
Created attachment 499297 [details] add rtnl_get_link_by_number() function to libnl
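The attachment contents are not reproduced in this thread. As a rough illustration of the idea only, the sketch below models the link cache as a plain array and compares a lookup by kernel interface index (the matching rule rtnl_link_get() uses) with a lookup by position in the cache. All demo_* names are hypothetical and are not taken from the libnl or libdhcp sources or from the attached patches; the loop also assumes that the counter in nic_get_links() starts at 1, which the quoted excerpt does not show.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cached link; only the kernel ifindex matters. */
struct demo_link {
    int ifindex;
    const char *name;
};

/* Lookup by kernel interface index: same matching rule as rtnl_link_get(). */
static const struct demo_link *
demo_link_get_by_ifindex(const struct demo_link *links, size_t n, int ifindex)
{
    for (size_t k = 0; k < n; k++)
        if (links[k].ifindex == ifindex)
            return &links[k];
    return NULL;
}

/* Lookup by 1-based position in the cache, ignoring ifindex values. */
static const struct demo_link *
demo_link_get_by_number(const struct demo_link *links, size_t n, size_t number)
{
    if (number < 1 || number > n)
        return NULL;
    return &links[number - 1];
}

/* Cache contents after "vconfig rem eth0.10": ifindex 4 is missing. */
static const struct demo_link demo_cache[] = {
    { 1, "lo" }, { 2, "eth0" }, { 3, "sit0" }, { 5, "eth0.11" },
};
#define DEMO_NITEMS (sizeof(demo_cache) / sizeof(demo_cache[0]))

/*
 * Mimics the nic_get_links() loop (assumed to count from 1 to nitems)
 * and returns how many of those lookups come back NULL.
 */
static int demo_count_failed_lookups(int by_number)
{
    int failed = 0;

    for (size_t i = 1; i <= DEMO_NITEMS; i++) {
        const struct demo_link *l = by_number
            ? demo_link_get_by_number(demo_cache, DEMO_NITEMS, i)
            : demo_link_get_by_ifindex(demo_cache, DEMO_NITEMS, (int)i);
        if (l == NULL)
            failed++;
    }
    return failed;
}
```

With the four links from the reproducer, the index-based loop fails on i == 4 and never reaches eth0.11 (ifindex 5), while the position-based loop visits every cached link.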
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** This bug has been marked as a duplicate of bug 599633 ***