Bug 1830592
| Summary: | [3.11] OVS flows don't seem to be getting updated with the info on ETCD | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Paul Gozart <pgozart> |
| Component: | Networking | Assignee: | Juan Luis de Sousa-Valadas <jdesousa> |
| Networking sub component: | ovn-kubernetes | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aconstan, mfojtik |
| Version: | 3.11.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-27 22:57:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hi Paul,

Can you please get a core dump of the SDN pod on the node where the flows aren't getting updated, while this is happening? To get it, ssh into that node and run:

    # docker cp $(docker ps --filter label=io.kubernetes.container.name=sdn -q):/usr/bin/openshift /usr/bin/openshift
    # gcore -o sdn.core $(pgrep -f 'openshift start network')

This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1824203, which is likely to be fixed in the next release.

(In reply to Juan Luis de Sousa-Valadas from comment #4)

Hi Juan,

I talked to the customer today and he said the issue seems to be resolved after upgrading to the latest 3.11.z. This bug can be closed.

Thanks,
Paul

Hi Paul,

Thanks for the update. As they use egress IPs, someone should track BZ#1824243 and make sure they update as soon as there is an errata for it. It's really hard to reproduce and only happens under very specific conditions, but it's important that they get this update, because if it does happen it's fairly severe and they'll see the symptoms you just reported again. My expectation is that it will be released either in the next z-stream or the following one.
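For anyone collecting the core dump requested in comment #4, a quick sanity check of the result before attaching it might look like the sketch below. It assumes gdb is available on the node and relies on gcore appending the PID to the output file name; delve would give better Go-level detail if it happens to be installed.

    # gcore writes the dump as sdn.core.<pid>; confirm it exists and that gdb can
    # load it against the openshift binary copied out of the SDN container above.
    ls -lh sdn.core.*
    gdb -batch -ex 'info threads' /usr/bin/openshift sdn.core.*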
OVS flows don't seem to be getting updated with the information reflected in etcd. Below are outputs that provide the necessary information.

The hostsubnet for node p01osl0300 (host IP 172.26.230.24) lists 172.26.230.123 as one of its egress IPs:

    [xgk9kosa@p01apl881 ~]$ oc get hostsubnets | grep 172.26.230.123
    p01osl0300 p01osl0300 172.26.230.24 10.55.4.0/23 [172.26.230.0/23] [172.26.230.158, 172.26.230.165, 172.26.230.101, 172.26.230.123]

However, the address is actually configured as a secondary address on p01osl0303 (172.26.230.27), not on p01osl0300 (172.26.230.24):

    [root@p01osl0303 ~]# ip a s eth0
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 00:50:56:92:a6:cf brd ff:ff:ff:ff:ff:ff
        inet 172.26.230.27/23 brd 172.26.231.255 scope global noprefixroute eth0
           valid_lft forever preferred_lft forever
        inet 172.26.230.148/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet 172.26.231.233/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet 172.26.230.180/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet 172.26.230.123/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::250:56ff:fe92:a6cf/64 scope link
           valid_lft forever preferred_lft forever

    [root@p01osl0300 ~]# ip a s eth0
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 00:50:56:92:b9:e8 brd ff:ff:ff:ff:ff:ff
        inet 172.26.230.24/23 brd 172.26.231.255 scope global noprefixroute eth0
           valid_lft forever preferred_lft forever
        inet 172.26.230.97/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet 172.26.230.158/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet 172.26.231.230/23 brd 172.26.231.255 scope global secondary eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::250:56ff:fe92:b9e8/64 scope link
           valid_lft forever preferred_lft forever

The OVS flows in table 100 do not agree on which node hosts the egress IP: one OVS pod sends the marked traffic to 172.26.230.27, while another sends it to 172.26.230.24:

    [xgk9kosa@p01apl881 frestdta]$ oc exec -n openshift-sdn ovs-f542b -- ovs-ofctl -O OpenFlow13 dump-flows br0 table=100 | grep -i 0xE71A34
    cookie=0x0, duration=31130.268s, table=100, n_packets=163329, n_bytes=25918040, priority=100,ip,reg0=0xe71a34 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.26.230.27->tun_dst,output:1

    [xgk9kosa@p01apl881 frestdta]$ oc exec -n openshift-sdn ovs-c9crb -- ovs-ofctl -O OpenFlow13 dump-flows br0 table=100 | grep -i 0xE71A34
    cookie=0x0, duration=27515.680s, table=100, n_packets=20702, n_bytes=1532308, priority=100,ip,reg0=0xe71a34 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.26.230.24->tun_dst,output:1

From the provided packet traces we were able to conclude that the node which could connect via egress IP 172.26.230.123 sends its traffic to node 172.26.230.27, while the node (pod) that fails to connect sends its traffic to node 172.26.230.24.
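To see at a glance how many nodes have which tun_dst programmed, the same dump-flows command can be run against every OVS pod. The loop below is only a sketch: it assumes the OVS daemonset pods in openshift-sdn carry the app=ovs label (the default in a 3.11 install) and reuses the reg0 mark 0xe71a34 from the outputs above.

    # Compare the tun_dst programmed in table 100 for the same reg0 mark on every node.
    # Assumes the OVS pods are labeled app=ovs; ${pod##*/} strips the "pod/" prefix
    # that "-o name" adds.
    for pod in $(oc -n openshift-sdn get pods -l app=ovs -o name); do
      echo "== ${pod##*/}"
      oc -n openshift-sdn exec "${pod##*/}" -- \
        ovs-ofctl -O OpenFlow13 dump-flows br0 table=100 | grep -i 0xe71a34
    done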
// To node 172.26.230.27: the good packet trace shows that when we try to reach the egress IP (172.26.230.123), the source node 172.26.230.37 sends the traffic to node 172.26.230.27 and, surprisingly, we get a response too:

    $ tshark -r good-pod-host.cap -Y "icmp and ip.addr==172.26.230.123" -T fields -e frame.time -e ip.src -e ip.dst -e _ws.col.Info
    May 2, 2020 12:38:36.946902000 IST 172.26.230.37,10.52.9.151 172.26.230.27,172.26.230.123 Echo (ping) request id=0x8610, seq=0/0, ttl=64
    May 2, 2020 12:38:36.947693000 IST 172.26.230.27,172.26.230.123 172.26.230.37,10.52.9.151 Echo (ping) reply id=0x8610, seq=0/0, ttl=64 (request in 1767)

// To node 172.26.230.24: the bad packet trace shows that when we try to reach the egress IP (172.26.230.123), the source node 172.26.230.34 sends the traffic to the correct egress node 172.26.230.24, but there is no response, which results in the issue:

    $ tshark -r bad-pod-host.cap -Y "icmp and ip.addr==172.26.230.123" -T fields -e frame.time -e ip.src -e ip.dst -e _ws.col.Info
    May 2, 2020 12:30:15.932049000 IST 172.26.230.34,10.52.2.121 172.26.230.24,172.26.230.123 Echo (ping) request id=0x3601, seq=0/0, ttl=64
    May 2, 2020 12:30:16.931987000 IST 172.26.230.34,10.52.2.121 172.26.230.24,172.26.230.123 Echo (ping) request id=0x3601, seq=1/256, ttl=64

From this we concluded two things:

[1] The egress IP address 172.26.230.123 is physically present on node 172.26.230.27, as that node is the one responding to the requests.
[2] The OVS flows on some of the nodes have 172.26.230.24 marked as the egress node, while others have 172.26.230.27. This is likely an OVS flow corruption.

We then logged in to nodes 172.26.230.24 and 172.26.230.27 and checked where the egress IP 172.26.230.123 was actually configured. This confirmed point [1]: the address is attached to node 172.26.230.27 and is absent from node 172.26.230.24 (p01osl0300):

    $ cat sosreport-p01osl0300-02644530-2020-05-01-zzgxyim/sos_commands/networking/ip_-o_addr
    1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
    1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
    2: eth0 inet 172.26.230.24/23 brd 172.26.231.255 scope global noprefixroute eth0\ valid_lft forever preferred_lft forever
    2: eth0 inet 172.26.230.97/23 brd 172.26.231.255 scope global secondary eth0\ valid_lft forever preferred_lft forever
    2: eth0 inet 172.26.230.158/23 brd 172.26.231.255 scope global secondary eth0\ valid_lft forever preferred_lft forever
    2: eth0 inet 172.26.231.230/23 brd 172.26.231.255 scope global secondary eth0\ valid_lft forever preferred_lft forever
    2: eth0 inet6 fe80::250:56ff:fe92:b9e8/64 scope link \ valid_lft forever preferred_lft forever
    3: docker0 inet 172.17.0.1/16 scope global docker0\ valid_lft forever preferred_lft forever

However, the etcd database was pointing this egress IP at node 172.26.230.24:

    NAME         HOST         HOST IP         SUBNET         EGRESS CIDRS        EGRESS IPS
    p01osl0300   p01osl0300   172.26.230.24   10.55.4.0/23   [172.26.230.0/23]   [172.26.230.158, 172.26.230.165, 172.26.230.101, 172.26.230.123]

On every node where OVS and SDN had been restarted, we observed that the new rules were generated according to the database information above, marking node 172.26.230.24 as the egress node. We therefore restarted all of the OVS and SDN pods, which populated the proper rules. The egress IP 172.26.230.123 was still present on node 172.26.230.27, so we removed it from there manually:

    # ip addr del 172.26.230.123/23 dev eth0

Much more data is attached to case 02644530 if needed.
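For completeness, a quick re-check of the end state after the restarts and the manual cleanup could look like the sketch below. It simply reuses the node names and egress IP from above; openshift-sdn is expected to add the egress IP back as a secondary address on the node named in the hostsubnet.

    # From a master: the hostsubnet should still assign the egress IP to p01osl0300 (172.26.230.24)
    oc get hostsubnets | grep 172.26.230.123

    # On the old node (172.26.230.27): the stale secondary address should be gone
    ip -o addr show dev eth0 | grep 172.26.230.123 || echo "172.26.230.123 not configured here"

    # On the correct egress node (172.26.230.24): the SDN should have (re)added the address
    ip -o addr show dev eth0 | grep 172.26.230.123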