Description of problem:
In a customer environment, we found that during (usually massive) egress IP migrations (IPs moved from one hostsubnet to another, with or without egress CIDRs involved), some of the hostsubnet changes are not reflected on the node: IPs are not added to or removed from the interfaces, OVS and iptables are not updated, etc.
After examining the coredumps, I found a deadlock in both of them. I'll post the complete coredump analysis in subsequent internal comments, but the quick summary is:
- A goroutine (either the one processing events from netnamespaces informer or hostsubnets informer) holds a lock on the egress IP tracker but is waiting on a lock on the egress vxlan monitor.
- The VXLAN monitor poll goroutine is holding the lock on the egress vxlan monitor but is waiting on a write to the "updates" channel (which is an unbuffered channel, so writes block until the receiver reads).
- The goroutine in charge of reading from the "updates" channel is blocked waiting to acquire the lock on the egress IP tracker, so it cannot read from the "updates" channel and we are in a deadlock.
- Once in this deadlock, the goroutine processing a hostsubnet change event either blocks waiting to acquire the lock on the egress IP tracker or is itself the goroutine holding the egress IP tracker lock while waiting on the egress vxlan monitor lock (the one from the first point of this list). Either way, no further hostsubnet changes are processed, so the SDN does not update the node with egress IP changes.
- A possible side effect is that other nodes may be mistakenly considered offline, because the "Ping" goroutines launched by the vxlan monitor poll can also block waiting on the egress IP tracker lock; this effect has not been confirmed by the customer.
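The cycle described in the list above can be sketched in miniature as follows. This is only an illustrative model, not the openshift-sdn code: the names simulateCycle, tracker, monitor and updates are hypothetical stand-ins for the egress IP tracker lock, the egress vxlan monitor lock and the unbuffered "updates" channel, and the two signalling channels exist only to force the interleaving seen in the coredumps.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// simulateCycle starts three goroutines that mimic the reported cycle and
// reports whether they were still stuck when the timeout expired.
// All names are hypothetical; this is not the real SDN code.
func simulateCycle(timeout time.Duration) bool {
	var tracker, monitor sync.Mutex // stand-ins for the two locks
	updates := make(chan struct{})  // unbuffered, as in the report

	// These channels only force the problematic ordering.
	pollHoldsMonitor := make(chan struct{})
	handlerHoldsTracker := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(3)

	// Poll goroutine: holds the monitor lock while writing to "updates".
	go func() {
		defer wg.Done()
		monitor.Lock()
		defer monitor.Unlock()
		close(pollHoldsMonitor)
		updates <- struct{}{} // blocks until someone reads
	}()

	// Informer handler: holds the tracker lock, then wants the monitor lock.
	go func() {
		defer wg.Done()
		<-pollHoldsMonitor
		tracker.Lock()
		defer tracker.Unlock()
		close(handlerHoldsTracker)
		monitor.Lock() // blocks: the poll goroutine holds it
		monitor.Unlock()
	}()

	// Updates reader: needs the tracker lock before it reads "updates".
	go func() {
		defer wg.Done()
		<-handlerHoldsTracker
		tracker.Lock() // blocks: the handler holds it
		defer tracker.Unlock()
		<-updates
	}()

	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()

	select {
	case <-done:
		return false // every goroutine finished
	case <-time.After(timeout):
		return true // still stuck: the cycle is closed
	}
}

func main() {
	fmt.Println("deadlocked:", simulateCycle(200*time.Millisecond))
}
```

The timeout is there only so that the sketch terminates; in the real process nothing breaks the cycle, which would match the hostsubnet changes never being applied.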
Please bear with me while I upload the full coredump analysis; I expect it will make all of this easier to follow.
Version-Release number of selected component (if applicable):
3.11.188 (the changes between it and 3.11.200 would not make a difference).
How reproducible:
Not consistently. Moving many egress IPs from one hostsubnet to another may make this more likely, but there is no clear pattern. Still working to get a more consistent reproducer.
Steps to Reproduce:
Actual results:
Egress IP changes not applied on node.
Expected results:
Egress IP changes applied on node.
(The edits to the problem description only fixed minor typos.)
Assigning to 4.5 and adding a 3.11 clone to track the backport (when ready).
Pablo, I looked at the code and I think it's safe to make the channel buffered; however, I'm concerned about two things.
When they make these massive egress IP migrations, how many namespaces and how many nodes are we talking about? oc get netnamespace,hostsubnet. I'm asking so that I can size the buffer accordingly.
The most common massive egress IP migration scenario would be when one of the nodes is updated or lost and its IPs are moved to another node (either because it is deemed down or because its egress CIDR has been removed to force the IPs to move to other nodes).
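To illustrate why the buffer size matters: a buffered channel only postpones blocking until the buffer is full, so an undersized buffer could still leave the writer stuck while holding its lock. A minimal sketch, assuming nobody is reading from the channel (the name sendsBeforeBlocking is hypothetical and this is not the SDN code):

```go
package main

import "fmt"

// sendsBeforeBlocking counts how many sends succeed on a channel with the
// given capacity while no receiver is reading from it.
func sendsBeforeBlocking(capacity, attempts int) int {
	updates := make(chan struct{}, capacity)
	sent := 0
	for i := 0; i < attempts; i++ {
		select {
		case updates <- struct{}{}:
			sent++
		default:
			// A plain send would block at this point.
			return sent
		}
	}
	return sent
}

func main() {
	for _, c := range []int{0, 1, 8} {
		fmt.Printf("capacity %d: %d sends before blocking\n",
			c, sendsBeforeBlocking(c, 100))
	}
}
```

With no reader, the number of sends that go through equals the capacity, which is why the expected number of netnamespaces and hostsubnets is relevant for sizing.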
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.