Bug 1417143
Summary: | Failover to inactive port of team interface delays by around 20 seconds. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Fani Orestiadou <forestia> |
Component: | libteam | Assignee: | Xin Long <lxin> |
Status: | CLOSED NOTABUG | QA Contact: | Network QE <network-qe> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.2 | CC: | aloughla, lxin |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-03-09 15:27:08 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1420851 |
Description
Fani Orestiadou
2017-01-27 09:51:56 UTC
Comment from Xin Long:

The 20-second failover time is expected. Note that you did not set linkwatch.send_always = true for arp_ping, so only the active port sends ARP requests: the active port checks that it is alive via the ARP replies it receives, while the backup port relies on the active port's ARP requests being forwarded back to it by the switch. So when you disable the active port:

    1st 5 s timeout: missed++ (missed = 1); with your "missed_max": 1, no action is taken.
    2nd 5 s timeout: missed++ (missed = 2); both ports go down, since missed > missed_max.
                     As no ports are alive, force_send is set to true on both ports.
    3rd 5 s timeout: both ports send ARP requests.
    4th 5 s timeout: the backup port receives an ARP reply, becomes the active port, and
                     clears the old active port's force_send. From then on, only the new
                     active port keeps sending ARP requests, as before.

After that the failover process is finished. If you want it to converge faster, please try setting linkwatch.send_always = true. Thanks.

Comment from Fani Orestiadou:

Hello,

Thank you for your reply. We have in fact tried setting the send_always option to true, but we then hit different issues on the switch side.

1. If hwaddr_policy is set to same_all, the MAC address flaps in the switch's MAC address table:

    MAC Address Table:
    First time:
    9457-a559-20e6 3335 Learned GE1/0/18 Y
    Second time:
    9457-a559-20e6 3335 Learned XGE1/0/25 Y

    [root@srv-as1-01 ~]# ping -i0.1 -W0.1 10.254.51.11
    PING 10.254.51.11 (10.254.51.11) 56(84) bytes of data.
    64 bytes from 10.254.51.11: icmp_seq=37 ttl=61 time=0.525 ms
    64 bytes from 10.254.51.11: icmp_seq=38 ttl=61 time=0.495 ms
    64 bytes from 10.254.51.11: icmp_seq=39 ttl=61 time=0.503 ms <---
    64 bytes from 10.254.51.11: icmp_seq=83 ttl=61 time=0.501 ms <---
    64 bytes from 10.254.51.11: icmp_seq=84 ttl=61 time=0.505 ms

   (Note the gap between icmp_seq 39 and 83.)

2. If hwaddr_policy is set to only_active, frequent changes in the ARP table of the (L3) switch pause the traffic flow for 5 seconds each time ("interval": 5000):

    First time:
    10.254.35.11 2880-23b3-9896 3335 GE1/0/18 20 D
    Second time:
    10.254.35.11 9457-a559-20e6 3335 XGE1/0/25 20 D
    Third time:
    10.254.35.11 2880-23b3-9896 3335 GE1/0/18 20 D

    [root@srv-as1-01 ~]# ping 10.254.51.11
    PING 10.254.51.11 (10.254.51.11) 56(84) bytes of data.
    64 bytes from 10.254.51.11: icmp_seq=3 ttl=61 time=0.551 ms
    64 bytes from 10.254.51.11: icmp_seq=8 ttl=61 time=0.552 ms
    64 bytes from 10.254.51.11: icmp_seq=13 ttl=61 time=0.561 ms

In my opinion this is also a bug related to the send_always option, unless I am missing something here.

Thank you

Comment from Xin Long:

(In reply to Fani Orestiadou from comment #3)

> 1. If hwaddr_policy is set to same_all, there is MAC address flapping on the switch [...]

I think it can't be avoided :(

> 2. If hwaddr_policy is set to only_active, frequent changes of the ARP table of the switch (L3 switch) cause a pause of traffic flow [...]

Can you try these two ways to work around it:

1. remove "source_host": "10.254.35.11" :)

or

2. make "link_watch" per port, like:

    "eth2": {
        "link_watch": {

   and then set a different "source_host" address in each port's link_watch.

Comment from Fani Orestiadou:

Hello,

OK, I tested removing the source_host, and now there is no delay during failover and no issue on the switch side. However, there is still a problem when the customer enables the validate_active option: even though the ARP probes are sent and answered, the team still increases the missed counter.
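As an aside, the four-timeout (20 s) failover sequence explained at the top of this thread can be reproduced with a small simulation. This is a toy model of the accounting as described in the comments, not actual libteam code; the port names, the probe/evaluation ordering, and the assumption that force_send takes effect one interval after it is set are all inferred from the explanation above.

```python
# Toy model of the arp_ping missed-counter accounting -- NOT libteam code.
# Replies to probes sent in one interval are evaluated at the next tick.

def failover_delay_s(send_always: bool, interval_s: int = 5,
                     missed_max: int = 1) -> int:
    """Seconds until the backup takes over after the active link is cut."""
    ports = ["eno3", "ens2f2"]            # eno3 is active; its link is cut
    active = "eno3"
    missed = {p: 0 for p in ports}
    up = {p: True for p in ports}
    force_send = {p: False for p in ports}
    # Who probed during the interval that just ended (state at t = 0).
    probed_prev = {p: p == active or send_always for p in ports}

    for tick in range(1, 50):
        t = tick * interval_s
        # Probes for the interval starting now, using current flags.
        probed_now = {p: p == active or send_always or force_send[p]
                      for p in ports}
        # Evaluate what each port heard during the previous interval.
        for p in ports:
            if p == "eno3":
                heard = False             # link cut: replies never arrive
            else:
                # The backup validates via replies to its own probes; the
                # active port's forwarded requests are gone with the link.
                heard = probed_prev[p]
            if heard:
                missed[p], up[p] = 0, True
            else:
                missed[p] += 1
                if missed[p] > missed_max:
                    up[p] = False
        # Failover: active is down and some other port is alive.
        if not up[active] and any(up[p] for p in ports if p != active):
            return t
        # No port alive: force every port to start probing for itself.
        if not any(up.values()):
            force_send = {p: True for p in ports}
        probed_prev = probed_now
    return -1


print(failover_delay_s(send_always=False))   # 20 -- the reported delay
print(failover_delay_s(send_always=True))    # 10 -- two intervals
```

With the reporter's original settings ("interval": 5000, "missed_max": 1) the model gives a 20-second failover without send_always and 10 seconds with it, matching the explanation above.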
So with the below configuration:

    TEAM_CONFIG='{"runner": {"name": "activebackup", "hwaddr_policy": "only_active"},
                  "link_watch": {"validate_active": true, "name": "arp_ping",
                                 "interval": 5000, "missed_max": 2,
                                 "target_host": "10.254.35.1", "send_always": true}}'

testing the failover gives (every entry also showed "interval": 5000 and "missed_max": 2):

    Tue Feb 7 14:59:09 UTC 2017:  eno3 missed=0, ens2f2 missed=0, active_port="eno3"
    Tue Feb 7 14:59:15 UTC 2017:  eno3 missed=1, ens2f2 missed=0, active_port="eno3"
    Tue Feb 7 14:59:20 UTC 2017:  eno3 missed=2, ens2f2 missed=0, active_port="eno3"
    Tue Feb 7 14:59:25 UTC 2017:  eno3 missed=3, ens2f2 missed=0, active_port="ens2f2"  <---- failover occurs

Now the missed counter wrongly increases for the new active port as well:

    Tue Feb 7 14:59:30 UTC 2017:  eno3 missed=4, ens2f2 missed=1 <----, active_port="ens2f2"
    Tue Feb 7 14:59:35 UTC 2017:  eno3 missed=5, ens2f2 missed=2 <----, active_port="ens2f2"
    Tue Feb 7 14:59:41 UTC 2017:  eno3 missed=6, ens2f2 missed=3 <----, active_port=""

A tcpdump on the new active port shows the probes being sent and answered:

    # tcpdump -nlpev -i ens2f2 arp
    14:59:40.342416 28:80:23:b3:98:96 > Broadcast, ethertype ARP (0x0806), length 42:
        Ethernet (len 6), IPv4 (len 4), Request who-has 10.254.35.1 (Broadcast) tell 0.0.0.0, length 28
    14:59:40.342906 00:00:5e:00:01:23 > Broadcast, ethertype ARP (0x0806), length 60:
        Ethernet (len 6), IPv4 (len 4), Reply 10.254.35.1 is-at 00:00:5e:00:01:23, length 46

In the case where both validate_active and validate_inactive are enabled, failover does not work at all.

Any idea why there is such behaviour?

Thank you!

Comment from Xin Long:

(In reply to Fani Orestiadou from comment #5)

> It seems that even if the ARP probes are sent and received, still the team increases the missed counter. [...]

I couldn't reproduce this issue in my environment. Please make sure target_host 10.254.35.1 is still reachable when the issue happens. Can you also give the team1 interface's and all the ports' MAC addresses when it happens? And have you tried with both validate_active and validate_inactive set to false?

Comment from Xin Long:

Fani and I have solved this issue in the customer's environment, so I am just closing the bug. Thanks Fani, nice collaboration!
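For reference, the per-port "link_watch" layout suggested in the thread (each port probing with its own "source_host") would look roughly like the following teamd configuration sketch. This is an illustration only: the second source address 10.254.35.12 is a hypothetical placeholder, as only 10.254.35.11 appears in this report.

```json
{
    "device": "team1",
    "runner": { "name": "activebackup", "hwaddr_policy": "only_active" },
    "ports": {
        "eno3": {
            "link_watch": {
                "name": "arp_ping",
                "interval": 5000,
                "missed_max": 2,
                "send_always": true,
                "source_host": "10.254.35.11",
                "target_host": "10.254.35.1"
            }
        },
        "ens2f2": {
            "link_watch": {
                "name": "arp_ping",
                "interval": 5000,
                "missed_max": 2,
                "send_always": true,
                "source_host": "10.254.35.12",
                "target_host": "10.254.35.1"
            }
        }
    }
}
```

Giving each port its own source address keeps the switch from re-learning the same IP-to-MAC mapping on alternating ports at every probe interval.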