Description of problem:
We have a 2-node cluster built on HP servers with iLO. The nodes are connected via a bond0 interface spanning two LAN switches; the iLO of the first node is connected to the first switch, the iLO of the second node to the other. During some tests we restarted both switches at the same time, so the network was down for about a minute, and due to our configuration each node then wanted to fence the other. After bond0 came back up, I could not connect anywhere from either node.

Version-Release number of selected component (if applicable):

How reproducible:
Every time (checked twice).

Steps to Reproduce:
1. Create a 2-node cluster with a bond interface connected to 2 switches and a service using a virtual IP and ext3 partitions.
2. Start the service on node1.
3. Restart both switches.

Actual results:
Unable to connect to/from the nodes.

tedse-ora1:root:/var/log> ifconfig bond0
bond0     Link encap:Ethernet  HWaddr 00:17:A4:3E:08:E4
          inet addr:10.4.1.1  Bcast:10.4.1.63  Mask:255.255.255.192
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:472458 errors:0 dropped:0 overruns:0 frame:0
          TX packets:440936 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:100841760 (96.1 MiB)  TX bytes:211038613 (201.2 MiB)
Wed Jan 17 13:39:27 CET 2007

tedse-ora1:root:/var/log> arp -a
tedse-ora2.ax4.com (10.4.1.2) at 00:17:A4:3E:10:14 [ether] on bond0
? (10.4.1.62) at <incomplete> on bond0
Wed Jan 17 13:39:35 CET 2007

tedse-ora2:root:~> arp -a
? (10.4.1.62) at <incomplete> on bond0
tedse-ora1.ax4.com (10.4.1.1) at 00:17:A4:3E:08:E4 [ether] on bond0
Wed Jan 17 11:38:00 CET 2007

tedse-ora2:root:~> ifconfig bond0
bond0     Link encap:Ethernet  HWaddr 00:17:A4:3E:10:14
          inet addr:10.4.1.2  Bcast:10.4.1.63  Mask:255.255.255.192
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:268745 errors:0 dropped:0 overruns:0 frame:0
          TX packets:271030 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23073816 (22.0 MiB)  TX bytes:101298717 (96.6 MiB)
Wed Jan 17 11:42:02 CET 2007

On a normally working node, arp -a looks like this (another pair of nodes, configured identically):

tefse-ora2:root:~> arp -a
? (10.3.1.60) at 00:00:5E:00:01:06 [ether] on bond0
? (10.3.1.62) at 00:00:5E:00:01:06 [ether] on bond0

Expected results:
Why can't those nodes restore their connections? At the same time we tested other 2-node clusters (without a virtual IP and started services) and they restored their connections.

Additional info:

TEDSE-ORA2: /var/log/messages

Jan 17 11:07:21 tedse-ora2 kernel: tg3: eth1: Link is down.
Jan 17 11:07:21 tedse-ora2 kernel: tg3: eth0: Link is down.
Jan 17 11:07:21 tedse-ora2 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 17 11:07:21 tedse-ora2 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 17 11:07:21 tedse-ora2 kernel: bonding: bond0: now running without any active interface !
Jan 17 11:07:40 tedse-ora2 kernel: CMAN: removing node tedse-ora1 from the cluster : Missed too many heartbeats
Jan 17 11:07:40 tedse-ora2 fenced[3134]: tedse-ora1 not a cluster member after 0 sec post_fail_delay
Jan 17 11:07:40 tedse-ora2 fenced[3134]: fencing node "tedse-ora1"
Jan 17 11:07:44 tedse-ora2 fenced[3134]: agent "fence_ilo" reports: Connect failed: connect: No route to host; No route to host at /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
Jan 17 11:07:44 tedse-ora2 fence_manual: Node tedse-ora1 needs to be reset before recovery can procede.  Waiting for tedse-ora1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n tedse-ora1)
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jan 17 11:09:11 tedse-ora2 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 17 11:09:11 tedse-ora2 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 17 11:09:11 tedse-ora2 kernel: bonding: bond0: making interface eth0 the new active one.

(ifdown eth0 and ifup eth0)

Jan 17 11:43:17 tedse-ora2 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 17 11:43:17 tedse-ora2 kernel: bonding: bond0: making interface eth1 the new active one.

TEDSE-ORA1: /var/log/messages

Jan 17 12:07:04 tedse-ora1 clurgmgrd: [5687]: <info> Executing /opt/oracle.init status
Jan 17 12:07:21 tedse-ora1 kernel: tg3: eth1: Link is down.
Jan 17 12:07:21 tedse-ora1 kernel: tg3: eth0: Link is down.
Jan 17 12:07:21 tedse-ora1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 17 12:07:21 tedse-ora1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 17 12:07:21 tedse-ora1 kernel: bonding: bond0: now running without any active interface !
Jan 17 12:07:24 tedse-ora1 clurgmgrd: [5687]: <warning> Link for bond0: Not detected
Jan 17 12:07:24 tedse-ora1 clurgmgrd: [5687]: <warning> No link on bond0...
Jan 17 12:07:24 tedse-ora1 clurgmgrd[5687]: <notice> status on ip "10.4.1.10" returned 1 (generic error)
Jan 17 12:07:24 tedse-ora1 clurgmgrd[5687]: <notice> Stopping service tedsv-ora
Jan 17 12:07:36 tedse-ora1 clurgmgrd: [5687]: <info> Executing /opt/oracle.init stop
Jan 17 12:07:36 tedse-ora1 su(pam_unix)[4255]: session opened for user oracle by (uid=0)
Jan 17 12:07:36 tedse-ora1 su(pam_unix)[4255]: session closed for user oracle
Jan 17 12:07:36 tedse-ora1 su(pam_unix)[4283]: session opened for user oracle by (uid=0)
Jan 17 12:07:46 tedse-ora1 kernel: CMAN: removing node tedse-ora2 from the cluster : Missed too many heartbeats
Jan 17 12:07:46 tedse-ora1 fenced[3118]: tedse-ora2 not a cluster member after 0 sec post_fail_delay
Jan 17 12:07:46 tedse-ora1 fenced[3118]: fencing node "tedse-ora2"
Jan 17 12:07:49 tedse-ora1 fenced[3118]: agent "fence_ilo" reports: Connect failed: connect: No route to host; No route to host at /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
Jan 17 12:07:49 tedse-ora1 fence_manual: Node tedse-ora2 needs to be reset before recovery can procede.  Waiting for tedse-ora2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n tedse-ora2)
Jan 17 12:07:52 tedse-ora1 su(pam_unix)[4283]: session closed for user oracle
Jan 17 12:07:52 tedse-ora1 clurgmgrd: [5687]: <info> Removing IPv4 address 10.4.1.10 from bond0
Jan 17 12:08:03 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u01
Jan 17 12:08:03 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u02
Jan 17 12:08:04 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u04
Jan 17 12:08:04 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u05
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jan 17 12:09:11 tedse-ora1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 17 12:09:11 tedse-ora1 kernel: bonding: bond0: making interface eth0 the new active one.
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jan 17 12:09:11 tedse-ora1 kernel: bonding: bond0: link status definitely up for interface eth1.
We have both fence_ilo and fence_manual configured, so a node first tries to fence via iLO, and when that is impossible (no network available) it falls back to fence_manual. It doesn't help either way, because the only way to connect to the nodes is via the iLO interface...
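For context, the fence_ilo-then-fence_manual fallback described above corresponds to two ordered fence methods per node in cluster.conf. A sketch only, since the actual cluster.conf is not attached to this report; the iLO hostname, login and password values are placeholders:

```xml
<!-- Sketch: fenced tries method "1" (fence_ilo) first and falls back to
     method "2" (fence_manual) if it fails. Placeholder credentials. -->
<clusternode name="tedse-ora1" votes="1">
  <fence>
    <method name="1">
      <device name="ora1-ilo"/>
    </method>
    <method name="2">
      <device name="human" nodename="tedse-ora1"/>
    </method>
  </fence>
</clusternode>
<!-- tedse-ora2 would be configured analogously -->
<fencedevices>
  <fencedevice agent="fence_ilo" name="ora1-ilo" hostname="ilo-address" login="admin" passwd="password"/>
  <fencedevice agent="fence_manual" name="human"/>
</fencedevices>
```

As noted above, when fence_ilo fails because the iLO is unreachable over the down network, the manual fallback still blocks recovery until fence_ack_manual is run.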
Update: after running ifdown eth0; ifup eth0 (one of the bonded interfaces) on the nodes, they are visible on the network again and work normally...
One more thing: by mistake, those nodes are in 2 different time zones.
Please clarify what you expected the behavior to be.
(In reply to comment #4)
> Please clarify what you expected the behavior to be.

I expect that after the switches come back online, the nodes can access the network and be accessed from the network via the bond interface. If that worked, the cluster software could perform whatever action is needed, e.g. fence the other node.

Additional comment: We have run some more tests. In comment #2 I wrote that after an ifdown/ifup of eth0, which is part of bond0, a node can reconnect to the network, work normally, and successfully fence the other node (I did it myself using fence_ilo). That is only partly true, because after ifup eth0 the port on the switch is still down and all traffic goes via eth1. When I then ran ifdown eth1, the node lost the network and was fenced by the other node. I think the problem may lie with the bond and the virtual IP added by the cluster service, because we did not notice this problem on the other 2-node clusters that have no virtual IP.
This does not appear to be a documentation bug. Recommend changing the component from rh-cs-en (the cluster administration documentation component) to a more appropriate one. Need information from Engineering to determine the most appropriate component and action.
Hi, any ideas on how to avoid this problem? (We would like to put this system into production, so an answer to this bug is becoming critical...) Thanks
Hello? Any ideas on how to avoid this problem? (We are going to put this system into production, so an answer to this bug is critical...) Thanks
I don't fully understand the kernel's requirements on bonding configurations in multiple-switch environments; it appears that there needs to be an inter-switch link in order for the bonding driver to operate correctly in this topology.

What other differences are there between the "working" cluster and the "non-working" cluster? That is, are they on different switch hardware?
(In reply to comment #10)
> I don't fully understand the kernel's requirements on bonding configurations in
> multiple switch environments; it appears that there needs to be an interswitch
> link in order for the bonding driver to operate correctly in this topology.

We wanted to have two fully redundant paths, and it worked until one of my co-workers restarted both switches at the same time...

> What other differences are there between the "working" cluster and the
> "non-working" cluster? That is, are they on different switch hardware?

No. On one pair of switches (the ones that were restarted) we have two 2-node clusters and a 3-node cluster. One of the 2-node clusters just stopped working as a cluster, but I was able to connect to both nodes. The 3-node cluster stopped working as a cluster and, because of lost quorum, was waiting for my action, but I was able to connect to all its nodes. The last 2-node cluster, the one with the virtual IP, was inaccessible from the network. All these systems are configured almost the same way; the only difference is in cluster.conf.
Could you please attach the output files from running `sysreport` on tedse-ora1 and tedse-ora2? I would also like the contents of /proc/net/bonding/bond0 on both systems when the network is no longer working after a switch restart.

I would like to understand why your network stops working after the switches have restarted; you should not need to ifdown/ifup the bond members (eth0 and eth1). Does the network (arp/ping) ever recover (maybe after 30 seconds or a few minutes) after rebooting the switches, OR is an ifdown/ifup of eth0 and eth1 always required to make the system function correctly?
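The /proc/net/bonding/bond0 file requested above reports the currently active slave and the MII link state of each bond member, which is exactly what is in question here. A minimal sketch of pulling those fields out with awk; the sample file contents below are hypothetical, not taken from these nodes:

```shell
# Hypothetical sample of /proc/net/bonding/bond0 (not from this cluster);
# on a live system you would run the awk directly on /proc/net/bonding/bond0.
cat <<'EOF' > /tmp/bond0.sample
Ethernet Channel Bonding Driver: v2.6.3
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: down
EOF

# Print the active slave and each slave's MII status. The first "MII Status"
# line belongs to the bond itself, so we only report it once a slave is seen.
awk -F': ' '/Currently Active Slave/ {print "active:", $2}
            /Slave Interface/ {slave=$2}
            /MII Status/ && slave {print slave":", $2; slave=""}' /tmp/bond0.sample
```

On the failing nodes this would show whether the bonding driver believes the links are up even though ARP no longer resolves.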
We have found the problem. These servers have 3 NICs:

02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
07:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703 Gigabit Ethernet (rev 10)

When the tg3 and bonding modules are loaded manually, the NICs are named eth0, eth1, eth2, but when the network modules are auto-loaded via modprobe.conf during boot they come up as eth1, eth2 and eth0. Because the second and third interfaces are connected to the same switch, we lose connectivity when it reboots. In the bond0 config we specified the names eth0 and eth1, and as you can see, those are different interfaces than we thought. Now I'm looking for a way to name them properly, as we intended...
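One common way to make the names stick on RHEL 4 is to bind each device name to its NIC's MAC address with HWADDR in the ifcfg files, so initscripts renames the devices consistently regardless of module load order. A sketch only; the MAC addresses and the role of eth2 below are placeholders and must be replaced with the real per-port values from `ifconfig -a`:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0  (first BCM5704 port; placeholder MAC)
DEVICE=eth0
HWADDR=00:17:A4:XX:XX:01
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1  (second BCM5704 port; placeholder MAC)
DEVICE=eth1
HWADDR=00:17:A4:XX:XX:02
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth2  (the BCM5703 port; placeholder MAC)
DEVICE=eth2
HWADDR=00:17:A4:XX:XX:03
ONBOOT=yes
BOOTPROTO=none
```

With HWADDR set, whichever order modprobe.conf brings the tg3 ports up in, the two switch-redundant BCM5704 ports should always end up as the bond0 slaves.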
Do you still need assistance with this? Getting devices named correctly can be tricky at times. Please see:

http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/ref-guide/s1-networkscripts-interfaces.html

for more info, or let me know if you are still having problems. Thanks.
If this is resolved, I would like to close out this issue. If not, please let me know what I can do to help resolve it. Thanks.
You can close this bug, as we have solved our problems in this case. Thanks for the support.