This update introduces a deterministic relationship between IP and MAC addresses dynamically allocated by OVN. As a result, a pod is always reachable even if it gets a new IP address from OVN.
In OpenShift we've seen a problem where, when pods are being created and destroyed at a high rate, you eventually end up with the following scenario:
- pod A is talking to pod B, which has, say, IP 10.0.1.5 and
MAC bb:bb:bb:bb:bb:bb
- pod A ends up with an entry 10.0.1.5 -> bb:bb:bb:bb:bb:bb in
its ARP cache
- pod B exits / is destroyed
- Around 255 other pods on pod B's node are created/destroyed in a
short amount of time, and the IP address assignment range wraps
around back to the beginning again.
- pod C is created and gets assigned IP 10.0.1.5 and
MAC cc:cc:cc:cc:cc:cc
- pod A tries to talk to pod C, finds that it already has an ARP
cache entry for 10.0.1.5, and so tries to send packets to
IP 10.0.1.5, MAC bb:bb:bb:bb:bb:bb
- These packets go nowhere because nobody currently has that MAC
- pod A's attempt to talk to pod C eventually times out. Things
start failing
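The sequence above can be replayed as a small toy model. This is only an illustrative sketch: the `arp_cache` dict, the `live_macs` set, and the `can_deliver` helper are hypothetical stand-ins for pod A's neighbor table and the set of MACs currently attached to the network, not anything taken from OpenShift or OVN.

```python
def can_deliver(arp_cache: dict, live_macs: set, dst_ip: str) -> bool:
    """True if the MAC that the sender has cached for dst_ip belongs to a live pod."""
    return arp_cache.get(dst_ip) in live_macs


def replay_scenario() -> list:
    results = []
    arp_cache = {}                       # pod A's ARP cache: IP -> MAC
    live_macs = {"bb:bb:bb:bb:bb:bb"}    # pod B is running

    # pod A talks to pod B and caches its MAC.
    arp_cache["10.0.1.5"] = "bb:bb:bb:bb:bb:bb"
    results.append(can_deliver(arp_cache, live_macs, "10.0.1.5"))

    # pod B is destroyed; the IP range wraps and pod C gets the same IP
    # with a different MAC. pod A's cache entry is not refreshed.
    live_macs.discard("bb:bb:bb:bb:bb:bb")
    live_macs.add("cc:cc:cc:cc:cc:cc")
    results.append(can_deliver(arp_cache, live_macs, "10.0.1.5"))
    return results


print(replay_scenario())  # [True, False]
```

The second result is the failure described above: pod A still sends to bb:bb:bb:bb:bb:bb, and nobody owns that MAC any more.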
(This is not a problem in a VM-based world because of a combination of (a) VMs come and go less quickly than containers, so other VMs are less likely to still have stale ARP cache mappings when an IP gets reused again; and (b) VMs, like bare metal hosts, tend to have startup scripts that send out gratuitous ARPs when they bring up their network connection, so anyone who did have a stale ARP cache entry would get fixed.)
In OpenShift SDN, our fix for this was to just assign pods deterministic MAC addresses that were based on their IPs; specifically they get 0a:58:ww:xx:yy:zz, where ww:xx:yy:zz is the IP converted to hex. (The code for this comes from CNI and is used by some other plugins as well. I don't know who chose the prefix "0a:58" or why.)
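As a minimal sketch of that mapping (not the actual CNI code, which is written in Go), the MAC can be built by prepending the two-byte prefix to the four bytes of the IPv4 address. The function name `mac_from_ipv4` is made up for this example.

```python
import ipaddress


def mac_from_ipv4(ip: str, prefix: bytes = b"\x0a\x58") -> str:
    """Return a MAC of the form 0a:58:ww:xx:yy:zz for the given IPv4 address."""
    packed = ipaddress.IPv4Address(ip).packed  # 4 bytes, network byte order
    return ":".join(f"{b:02x}" for b in prefix + packed)


print(mac_from_ipv4("10.0.1.5"))  # 0a:58:0a:00:01:05
```

Because the MAC depends only on the IP, any pod that is later assigned 10.0.1.5 gets the same MAC, so a stale ARP cache entry for that IP still points at the right place.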
With ovn-kubernetes, we will need to either
1. also have deterministic IP-to-MAC mappings, OR
2. send out ARP announcements whenever a pod is created
The latter would be possible, but is less efficient if lots of pods are being created, especially if they are attached to logical switches that are spread across multiple hosts.
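For reference, here is a rough sketch of what one such ARP announcement looks like on the wire, following the usual gratuitous ARP layout (a broadcast ARP request whose sender and target IP are both the announcing host's address). It only builds the frame bytes; how ovn-kubernetes would actually inject such a frame is not covered here, and the function name `arp_announcement_frame` is hypothetical.

```python
import ipaddress
import struct


def arp_announcement_frame(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying an ARP announcement for (ip, mac)."""
    sha = bytes.fromhex(mac.replace(":", ""))      # sender hardware address
    spa = ipaddress.IPv4Address(ip).packed         # sender protocol address

    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth = b"\xff" * 6 + sha + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Target hardware address is zero; target IP equals the sender IP.
    arp += sha + spa + b"\x00" * 6 + spa
    return eth + arp


frame = arp_announcement_frame("cc:cc:cc:cc:cc:cc", "10.0.1.5")
print(len(frame))  # 42
```

Every pod creation would produce one such broadcast, which has to be delivered to every port on the logical switch. That is the overhead the deterministic mapping avoids.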
We don't handle IPv6 yet, and I'm not sure what the situation is there; in theory the kernel automatically handles the "announcement" part, so there might not be a problem. Unless the announcements get sent out before OVN is ready to forward them to other ports, which might be the case. Also, even if the announcements do get sent out, and do work, it would still be more efficient to *not* forward them, if they were known to be unnecessary.
Comment 2 Daniel Alvarez Sanchez
2018-10-29 12:28:43 UTC
No, it's the pods that are caching old MAC addresses in this case, not OVN. Though the bug you pointed out might cause additional OVN-level problems on top of the pod-level problems too I guess. (We haven't actually gotten to the point in OVN testing where we've encountered this issue yet.)
Comment 4 Daniel Alvarez Sanchez
2018-10-29 13:38:41 UTC
Yeah, got it now, thanks Dan! It can perhaps cause additional OVN problems, as you point out. Something to keep in mind now is that if we generate the MAC address deterministically based on the IP address, then stale MAC_Binding entries have to be removed when updating/upgrading OVS :)
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:0014