Bug 222990 - Problems with bond in a cluster environment after a major network failure
Summary: Problems with bond in a cluster environment after a major network failure
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Andy Gospodarek
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-01-17 12:54 UTC by Tomasz Jaszowski
Modified: 2014-06-29 22:58 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-18 14:18:29 UTC
Target Upstream Version:
Embargoed:



Description Tomasz Jaszowski 2007-01-17 12:54:20 UTC
Description of problem:
We have a 2-node cluster built on HP servers with iLO. The nodes are connected
through a bond0 interface attached to two LAN switches; the iLO of the first node
is connected to the first switch, and the iLO of the second node to the second
switch.
During some tests we restarted both switches at the same time, so the network was
down for about a minute, and because of our configuration each node then tried to
fence the other.

After bond0 comes back up, I can't connect anywhere from either node.
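
For reference, a rough sketch of the kind of bonding setup in use, in the standard
RHEL 4 locations. The IP and netmask match the ifconfig output below; the bonding
mode and monitoring options are placeholders, not necessarily the exact settings
on these servers:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.4.1.1
NETMASK=255.255.255.192
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0  (ifcfg-eth1 is analogous)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none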

Version-Release number of selected component (if applicable):


How reproducible:
every time (checked twice)

Steps to Reproduce:
1: create a 2-node cluster with a bonded interface connected to two switches and a
service using a virtual IP and ext3 partitions
2: start the service on node1
3: restart both switches
  
Actual results:
unable to connect to or from either node

tedse-ora1:root:/var/log> ifconfig bond0
bond0     Link encap:Ethernet  HWaddr 00:17:A4:3E:08:E4
          inet addr:10.4.1.1  Bcast:10.4.1.63  Mask:255.255.255.192
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:472458 errors:0 dropped:0 overruns:0 frame:0
          TX packets:440936 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:100841760 (96.1 MiB)  TX bytes:211038613 (201.2 MiB)

Wed Jan 17 13:39:27 CET 2007
tedse-ora1:root:/var/log> arp -a
tedse-ora2.ax4.com (10.4.1.2) at 00:17:A4:3E:10:14 [ether] on bond0
? (10.4.1.62) at <incomplete> on bond0
Wed Jan 17 13:39:35 CET 2007


tedse-ora2:root:~> arp -a
? (10.4.1.62) at <incomplete> on bond0
tedse-ora1.ax4.com (10.4.1.1) at 00:17:A4:3E:08:E4 [ether] on bond0
Wed Jan 17 11:38:00 CET 2007
tedse-ora2:root:~> ifconfig bond0
bond0     Link encap:Ethernet  HWaddr 00:17:A4:3E:10:14
          inet addr:10.4.1.2  Bcast:10.4.1.63  Mask:255.255.255.192
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:268745 errors:0 dropped:0 overruns:0 frame:0
          TX packets:271030 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23073816 (22.0 MiB)  TX bytes:101298717 (96.6 MiB)

Wed Jan 17 11:42:02 CET 2007

On a normally working node (another pair of nodes, configured identically), arp -a looks like this:

tefse-ora2:root:~> arp -a
? (10.3.1.60) at 00:00:5E:00:01:06 [ether] on bond0
? (10.3.1.62) at 00:00:5E:00:01:06 [ether] on bond0

Expected results:
Why can't these nodes restore their connections? At the same time we tested other
2-node clusters (without a virtual IP or running services) and they restored their
connectivity.


Additional info:
TEDSE-ORA2: /var/log/messages

Jan 17 11:07:21 tedse-ora2 kernel: tg3: eth1: Link is down.
Jan 17 11:07:21 tedse-ora2 kernel: tg3: eth0: Link is down.
Jan 17 11:07:21 tedse-ora2 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 17 11:07:21 tedse-ora2 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 17 11:07:21 tedse-ora2 kernel: bonding: bond0: now running without any active interface !
Jan 17 11:07:40 tedse-ora2 kernel: CMAN: removing node tedse-ora1 from the cluster : Missed too many heartbeats
Jan 17 11:07:40 tedse-ora2 fenced[3134]: tedse-ora1 not a cluster member after 0 sec post_fail_delay
Jan 17 11:07:40 tedse-ora2 fenced[3134]: fencing node "tedse-ora1"
Jan 17 11:07:44 tedse-ora2 fenced[3134]: agent "fence_ilo" reports: Connect failed: connect: No route to host; No route to host at /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
Jan 17 11:07:44 tedse-ora2 fence_manual: Node tedse-ora1 needs to be reset before recovery can procede.  Waiting for tedse-ora1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n tedse-ora1)
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Jan 17 11:09:11 tedse-ora2 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jan 17 11:09:11 tedse-ora2 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 17 11:09:11 tedse-ora2 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 17 11:09:11 tedse-ora2 kernel: bonding: bond0: making interface eth0 the new active one.


ifdown eth0 and ifup eth0

Jan 17 11:43:17 tedse-ora2 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 17 11:43:17 tedse-ora2 kernel: bonding: bond0: making interface eth1 the new active one.


TEDSE-ORA1: /var/log/messages

Jan 17 12:07:04 tedse-ora1 clurgmgrd: [5687]: <info> Executing /opt/oracle.init status
Jan 17 12:07:21 tedse-ora1 kernel: tg3: eth1: Link is down.
Jan 17 12:07:21 tedse-ora1 kernel: tg3: eth0: Link is down.
Jan 17 12:07:21 tedse-ora1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 17 12:07:21 tedse-ora1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 17 12:07:21 tedse-ora1 kernel: bonding: bond0: now running without any active interface !
Jan 17 12:07:24 tedse-ora1 clurgmgrd: [5687]: <warning> Link for bond0: Not detected
Jan 17 12:07:24 tedse-ora1 clurgmgrd: [5687]: <warning> No link on bond0...
Jan 17 12:07:24 tedse-ora1 clurgmgrd[5687]: <notice> status on ip "10.4.1.10" returned 1 (generic error)
Jan 17 12:07:24 tedse-ora1 clurgmgrd[5687]: <notice> Stopping service tedsv-ora

Jan 17 12:07:36 tedse-ora1 clurgmgrd: [5687]: <info> Executing /opt/oracle.init stop
Jan 17 12:07:36 tedse-ora1 su(pam_unix)[4255]: session opened for user oracle by (uid=0)
Jan 17 12:07:36 tedse-ora1 su(pam_unix)[4255]: session closed for user oracle
Jan 17 12:07:36 tedse-ora1 su(pam_unix)[4283]: session opened for user oracle by (uid=0)
Jan 17 12:07:46 tedse-ora1 kernel: CMAN: removing node tedse-ora2 from the cluster : Missed too many heartbeats
Jan 17 12:07:46 tedse-ora1 fenced[3118]: tedse-ora2 not a cluster member after 0 sec post_fail_delay
Jan 17 12:07:46 tedse-ora1 fenced[3118]: fencing node "tedse-ora2"
Jan 17 12:07:49 tedse-ora1 fenced[3118]: agent "fence_ilo" reports: Connect failed: connect: No route to host; No route to host at /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
Jan 17 12:07:49 tedse-ora1 fence_manual: Node tedse-ora2 needs to be reset before recovery can procede.  Waiting for tedse-ora2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n tedse-ora2)

Jan 17 12:07:52 tedse-ora1 su(pam_unix)[4283]: session closed for user oracle
Jan 17 12:07:52 tedse-ora1 clurgmgrd: [5687]: <info> Removing IPv4 address 10.4.1.10 from bond0
Jan 17 12:08:03 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u01
Jan 17 12:08:03 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u02
Jan 17 12:08:04 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u04
Jan 17 12:08:04 tedse-ora1 clurgmgrd: [5687]: <info> unmounting /opt/oracle/u05
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jan 17 12:09:11 tedse-ora1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 17 12:09:11 tedse-ora1 kernel: bonding: bond0: making interface eth0 the new active one.
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jan 17 12:09:11 tedse-ora1 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jan 17 12:09:11 tedse-ora1 kernel: bonding: bond0: link status definitely up for interface eth1.

Comment 1 Tomasz Jaszowski 2007-01-17 12:59:33 UTC
We now have fence_ilo and fence_manual configured, so a node first tries to fence
via iLO, and when that is impossible (no network available) it falls back to
fence_manual. It doesn't matter much, because the only way to connect to the nodes
is via the iLO interface...

Comment 2 Tomasz Jaszowski 2007-01-17 14:30:12 UTC
Update:

After ifdown eth0; ifup eth0 (one of the bonded interfaces) on the nodes, they
become visible on the network again and work normally...

Comment 3 Tomasz Jaszowski 2007-01-17 14:33:11 UTC
And one more thing: those nodes are, by mistake, in two different time zones.

Comment 4 Paul Kennedy 2007-01-17 16:40:11 UTC
Please clarify what you expected the behavior to be.

Comment 5 Tomasz Jaszowski 2007-01-18 10:18:11 UTC
(In reply to comment #4)
> Please clarify what you expected the behavior to be.

I expect that after the switches come back online the nodes can access the network
and can be reached from the network via the bond interface... if so, the cluster
software can then perform any action, e.g. fence the other node...

Additional comment:

We have run some more tests... In comment #2 I wrote that after ifdown/ifup of
eth0, which is part of bond0, the node can reconnect to the network and work, and
that it was possible to successfully fence the other node (I did it myself using
fence_ilo). That is only partly true, because after ifup eth0 the port on the
switch is still down and all traffic goes via eth1. When I then did ifdown eth1,
the node lost the network and was fenced by the other node...

I think this may be a problem with the bond and the virtual IP added by the
cluster service, because we did not notice this problem on the other 2-node
clusters which do not have a virtual IP.
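
A minimal way to test this theory by hand, independent of the cluster software
(a sketch; it assumes the service IP 10.4.1.10/26 from the logs above and that
10.4.1.62 is the gateway seen in the arp output):

ip addr show dev bond0               # see which addresses are currently on the bond
ip addr add 10.4.1.10/26 dev bond0   # add the virtual IP by hand, roughly as the cluster ip resource does
arping -c 3 -I bond0 10.4.1.62       # probe the gateway to check whether ARP resolves
ip addr del 10.4.1.10/26 dev bond0   # clean up afterwards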

Comment 6 Paul Kennedy 2007-01-18 14:13:05 UTC
This does not appear to be a documentation bug. I recommend moving it from the
rh-cs-en component (the cluster administration documentation component) to a more
appropriate component. We need input from Engineering to determine the most
appropriate component and action.

Comment 8 Tomasz Jaszowski 2007-01-22 12:16:27 UTC
Hi,

 Any ideas how to avoid this problem? (We would like to put this system into
production, so an answer to this bug is becoming critical...)

Thanks

Comment 9 Tomasz Jaszowski 2007-01-24 13:00:23 UTC
Hello?

 Any ideas how to avoid this problem? (We are about to put this system into
production, so an answer to this bug is critical...)

Thanks

Comment 10 Lon Hohberger 2007-01-24 20:14:58 UTC
I don't fully understand the kernel's requirements on bonding configurations in
multiple switch environments; it appears that there needs to be an interswitch
link in order for the bonding driver to operate correctly in this topology.

What other differences are there between the "working" cluster and the
"non-working" cluster?  That is, are they on different switch hardware?

Comment 11 Tomasz Jaszowski 2007-01-25 21:16:06 UTC
(In reply to comment #10)
> I don't fully understand the kernel's requirements on bonding configurations in
> multiple switch environments; it appears that there needs to be an interswitch
> link in order for the bonding driver to operate correctly in this topology.

We wanted to have two fully redundant paths, and it worked until one of my
co-workers restarted both switches at the same time...

> 
> What other differences are there between the "working" cluster and the
> "non-working" cluster?  That is, are they on different switch hardware?

No. On one pair of switches (the ones that were restarted) we have two 2-node
clusters and a 3-node cluster. One of the 2-node clusters simply stopped working
as a cluster, but I was able to connect to both nodes. The 3-node cluster stopped
working as a cluster and, because it lost quorum, was waiting for my action, but I
was able to connect to all of its nodes. The last 2-node cluster, the one with the
virtual IP, was unreachable from the network. All of these systems are configured
almost the same way; the only difference is in cluster.conf.
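
For the dual-switch topology discussed in the last two comments, the bonding
driver's link monitoring can be tuned. The options below are only a sketch with
assumed values, not taken from this cluster: updelay keeps a slave out of service
for a while after its link returns, which helps when a freshly rebooted switch
port is not yet forwarding, and ARP monitoring validates the path to a target
host instead of only the local link state.

# /etc/modprobe.conf, MII monitoring with a delay before re-enslaving a returning link
options bond0 mode=active-backup miimon=100 updelay=30000

# or, alternatively, ARP-based monitoring (10.4.1.62 is assumed to be the gateway)
options bond0 mode=active-backup arp_interval=1000 arp_ip_target=10.4.1.62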

Comment 12 Andy Gospodarek 2007-02-08 22:07:41 UTC
Could you please attach the output files from running the command `sysreport` on
tedse-ora1 and tedse-ora2?  I would also like the contents of
/proc/net/bonding/bond0 on both systems when the network is no longer working
after a switch restart.

I would like to understand why your network stops working after the switches have
restarted.  You should not need to ifdown/ifup the bond members (eth0 and eth1).

Does the network (arp/ping) ever work (maybe after 30 sec or a few min?) after
rebooting the switches OR is ifdown/ifup eth0 and ifdown/ifup eth1 always
required to make the system function correctly?
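
For reference, the requested information could be collected roughly like this (a
sketch; sysreport prints the location of the archive it creates):

sysreport                                            # gathers system configuration into a support archive
cat /proc/net/bonding/bond0                          # bonding mode, MII status, and currently active slave
cat /proc/net/bonding/bond0 > /tmp/bond0-broken.txt  # capture the state while the network is still broken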

Comment 13 Tomasz Jaszowski 2007-02-19 11:34:34 UTC
We have found the problem.

These servers have three NICs:

02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
07:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703 Gigabit
Ethernet (rev 10)

When the tg3 and bonding modules are loaded manually, the NICs are named eth0,
eth1 and eth2, but when the network modules are loaded automatically via
modprobe.conf during startup they come up as eth1, eth2 and eth0.

Because the second and third interfaces are connected to the same switch, we lose
connectivity during its reboot... In the bond0 config we specified the names eth0
and eth1, and as you can see those are different interfaces than we thought.

Now I'm looking for a way to name them consistently, the way we intended...
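
One way to pin the names on RHEL 4 is the HWADDR directive in the ifcfg files,
which binds each device name to a NIC's MAC address regardless of module load
order (this is the approach described in the reference guide linked in comment 14
below). A sketch, with placeholder MAC addresses rather than the real ones from
these servers:

# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=00:00:00:00:00:01     # replace with the MAC of the NIC that should become eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=00:00:00:00:00:02     # replace with the MAC of the second bonded NIC
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none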

Comment 14 Andy Gospodarek 2007-03-23 14:18:27 UTC
Do you still need assistance with this?

Getting devices named correctly can be tricky at times.  Please see:

http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/ref-guide/s1-networkscripts-interfaces.html

for more info or let me know if you are still having problems.  Thanks.

Comment 15 Andy Gospodarek 2007-05-03 15:31:23 UTC
If this is resolved, I would like to close out this issue.  If not, please let
me know what I can do to help resolve it.  Thanks.

Comment 16 Tomasz Jaszowski 2007-07-18 06:46:05 UTC
You can close this bug, as we have solved our problem in this case.

Thanks for the support.

