Description of problem:

RHEL3 (up to U5) has an issue where, if multiple instances of the "ifconfig" utility are started at the same time (or in quick succession), some calls fail and produce no output. As far as we can tell, this only happens on machines with four or more e1000-based NICs, at least two bonded interfaces configured, and multiple (>2) service IPs configured for use by Red Hat Cluster Manager. This problem should be addressed in the next update of the RHEL3 kernel. However, it may be beneficial to provide this workaround in addition to the kernel fix for several reasons (no need to upgrade the kernel, preference for iproute over ifconfig, etc.).

Background:

Cluster Manager on RHEL3 (1.2.x) parses the output of "ifconfig" to monitor cluster service IP addresses, and has done so since 2002, when Cluster Manager was first introduced with RHEL 2.1.

What is proposed:

The enhancement here is to incorporate an already-written, field-tested workaround into an update of Cluster Manager for RHEL3 as a configuration option.

How reproducible:

<0.1% of the calls to ifconfig. On a system with multiple service IP addresses checked at 10-second intervals, the problem occurs approximately 4-8 times per day. When it occurs, services are restarted on the same node, causing several seconds of downtime.

Additional info:

This workaround will be a user-configurable option from the command line and will not be present in the GUI (redhat-config-cluster). It will not be enabled by default. There are significant behavioral changes in how service IP addresses are handled, namely:

(a) No more virtual interfaces. Previously, a service IP address would end up as "eth0:0", "eth0:1", etc. in the output of "ifconfig". The iproute utilities, rather than assigning virtual devices (eth0:x), simply assign multiple IPs to the same NIC at once. Consequently, the service IP addresses will no longer show up in the output of the "ifconfig" utility; administrators should use "ip addr list" to view them instead.

(b) The workaround completely ignores user-specified broadcast and netmask addresses. Generally, this is fine, as the service IPs must match the broadcast/netmask addresses of existing system IPs before being brought up anyway. Bugs reported about this fact will be closed NOTABUG.

(c) The configuration change must only be made when ALL SERVICES are in the DISABLED state, but ALL NODES are online.
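To illustrate point (a), a rough sketch of the difference between the two approaches (the interface name "eth0" and the addresses here are placeholders, not values from this report):

```shell
# Old behavior: ifconfig creates a labeled virtual interface per service IP,
# which then appears as "eth0:0" in ifconfig output.
ifconfig eth0:0 10.0.0.50 netmask 255.255.255.0 up

# iproute behavior: additional addresses are attached directly to eth0
# (no eth0:x alias), so ifconfig will not list them.
ip addr add 10.0.0.50/24 dev eth0

# View all addresses on the NIC, including secondary/service IPs:
ip addr list dev eth0
```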
Workaround:

  cludb -p clusvcmgrd%use_netlink yes

The caveats above apply (all nodes up, but all services disabled).
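A possible sequence for applying the setting safely, given the caveat that all services must be disabled while all nodes remain online (the service name "my_service" is a placeholder; verify the clusvcadm flags against the clumanager version in use):

```shell
# 1. With every cluster node online, disable ALL cluster services
#    (repeat for each configured service):
clusvcadm -d my_service

# 2. Enable the netlink/iproute workaround in the cluster database:
cludb -p clusvcmgrd%use_netlink yes

# 3. Re-enable the services:
clusvcadm -e my_service
```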
Created attachment 117193 [details] Patch implementing RFE
Note: The original summary implies that this issue is related to Red Hat Cluster Manager. In fact, it is not related to Cluster Manager at all; it is merely exacerbated by Cluster Manager because the ifconfig utility is called in quick succession while checking the status of services.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-676.html
*** Bug 184039 has been marked as a duplicate of this bug. ***