Bug 1747532
Summary: incorrect service iptables rules on one node
Product: OpenShift Container Platform
Component: Networking (sub component: openshift-sdn)
Reporter: Dan Winship <danw>
Assignee: Dan Winship <danw>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
CC: aos-bugs, bbennett, cdc, dcbw, deads, eparis, knewcome, rhowe
Version: 4.2.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
  Cause: Upstream had an error in the DeltaFIFO code.
  Consequence: Updates were dropped, causing us to miss events, so the corresponding iptables rules weren't present.
  Fix: Rebase to a later upstream version.
  Result: Events are no longer dropped.
Clones: 1812063 (view as bug list)
Bug Blocks: 1812063
Type: Bug
Last Closed: 2019-11-08 11:27:51 UTC
Description
Dan Winship
2019-08-30 17:49:45 UTC
Are there metrics we can use to detect this? If iptables itself is failing to run, we can catch that with a metric. However, if the change is simply ignored, then we're out of luck.

It's not that iptables is failing to run; if iptables were temporarily failing to run, then everything would be fixed once it started running again (since we resync the entire table when we resync). And if it were permanently failing to run, then we'd see a set of rules that had been correct at some point in the past but was no longer correct, whereas what we actually see is a set of rules that are mostly correct for the present, but randomly missing a few things.

It appears that the proxy's internal state does not reflect all of the notifications it should have received, either because it didn't receive all of the notifications that it should have, or because it updated its internal state incorrectly at some point.

I did a lot of debugging for a similar issue. It came up when we were working on disaster recovery, and the SDN watch failed to react to an endpoint update for default/kubernetes. The basic gist is this: the sdn watches Endpoints. As with all watches, it gets an Update(old, new) change. As updates come in, it maintains a list of services / endpoints that need to be changed, which will be applied the next time the proxy sync loop runs. The goal is to minimize real changes. As a shortcut, if it sees Update(A, A), it short-circuits and throws the update away. (A sketch of this pattern follows the thread below.)

Now, to be fair, this is "guaranteed" not to happen by the informer code. However, it does happen in times of etcd disruption. I did a bunch of debugging and never could pin down the issue; I put it down to guarantees being forfeit when etcd is recovered. However, this seems to show that perhaps this bug *does* exist in normal cases.

We should consider getting a change upstream in the EndpointChangeTracker that compares New against the current known state, rather than the Old as generated by the informer.

Based on the logs, that's not what happened in this case though. We got an Add for the Endpoints, and then did not log any further events for it, but then somehow ended up with no Endpoints.

I don't see the Add for the Service. Did we miss that, or did you just elide it from the logs?

I just didn't mention it:

```
I0830 01:44:22.612458    2535 service.go:332] Adding new service port "openshift-ingress/router-default:http" at 172.30.2.238:80/TCP
I0830 01:44:22.612483    2535 service.go:332] Adding new service port "openshift-ingress/router-default:https" at 172.30.2.238:443/TCP
I0830 01:44:22.618573    2535 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.4:80 10.131.0.6:80]
I0830 01:44:22.618634    2535 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:https to [10.128.2.4:443 10.131.0.6:443]
I0830 01:44:22.774292    2535 proxier.go:1464] Opened local port "nodePort for openshift-ingress/router-default:http" (:31392/tcp)
I0830 01:44:22.774453    2535 proxier.go:1464] Opened local port "nodePort for openshift-ingress/router-default:https" (:30152/tcp)
I0830 01:44:22.806218    2535 healthcheck.go:151] Opening healthcheck "openshift-ingress/router-default" on port 31765
```

oh... huh, there's an "Adding new service port" from vendor/k8s.io/kubernetes/pkg/proxy/service.go, but no "sdn proxy: add svc ..." from pkg/network/proxy/proxy.go... So yeah, we didn't get an Add for the Service in this case. (Though we got an Update for it I guess?)
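To make the accumulate-then-sync mechanism described earlier in this thread concrete, here is a minimal Go sketch. It is illustrative only: the names (`Tracker`, `pending`, `Drain`) are assumptions, not the actual openshift-sdn or kube-proxy EndpointChangeTracker code. It shows how watch events merely queue which objects changed for the next sync loop, and why a short-circuited Update(A, A) silently loses a change if the informer's "guarantee" ever fails.

```go
package main

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// Tracker accumulates Endpoints changes between proxy sync loops.
// Illustrative sketch only; not the real kube-proxy EndpointChangeTracker.
type Tracker struct {
	mu      sync.Mutex
	pending map[types.NamespacedName]*v1.Endpoints // changes queued for the next sync
}

func NewTracker() *Tracker {
	return &Tracker{pending: map[types.NamespacedName]*v1.Endpoints{}}
}

// Update records a change from the watch; a nil newEp records a deletion.
func (t *Tracker) Update(oldEp, newEp *v1.Endpoints) {
	// The shortcut discussed above: if old and new look identical, throw
	// the update away. This is only safe if the informer never delivers a
	// bogus Update(A, A) for an object that really changed.
	if oldEp != nil && newEp != nil && oldEp.ResourceVersion == newEp.ResourceVersion {
		return
	}
	ep := newEp
	if ep == nil {
		ep = oldEp
	}
	name := types.NamespacedName{Namespace: ep.Namespace, Name: ep.Name}
	t.mu.Lock()
	t.pending[name] = newEp // nil value marks a deletion
	t.mu.Unlock()
}

// Drain is called by the proxy sync loop: it returns the queued changes
// and resets the queue, so rule regeneration happens in one batch.
func (t *Tracker) Drain() map[types.NamespacedName]*v1.Endpoints {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := t.pending
	t.pending = map[types.NamespacedName]*v1.Endpoints{}
	return out
}
```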
> (Though we got an Update for it I guess?)

Yeah, that's related to what's been on my mind: when we see Update(A1, A2), we should compare A2 against A_cached, not A1. Of course, the informers should "never" have this happen. But clearly it does.
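A hedged sketch of that idea, assuming the tracker keeps its own record of the last state it acted on (the `known` map here is an assumption, not an actual EndpointChangeTracker field): compare the incoming A2 against A_cached instead of against the informer-supplied A1, so a bogus Update(A, A) can no longer mask a real change.

```go
package main

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/types"
)

// cacheTracker diffs incoming objects against its own record of the last
// state it applied, instead of trusting the informer-supplied old object.
// Illustrative sketch; names are assumptions.
type cacheTracker struct {
	mu      sync.Mutex
	known   map[types.NamespacedName]*v1.Endpoints // A_cached: last state we applied
	pending map[types.NamespacedName]*v1.Endpoints // queued for the next sync loop
}

// OnUpdate deliberately ignores the informer's old object (A1).
func (t *cacheTracker) OnUpdate(_, cur *v1.Endpoints) {
	name := types.NamespacedName{Namespace: cur.Namespace, Name: cur.Name}

	t.mu.Lock()
	defer t.mu.Unlock()

	// Compare A2 (cur) against A_cached, not against A1. Even if the
	// informer erroneously delivers Update(A, A), a real change relative
	// to what we last applied is still detected.
	if cached, ok := t.known[name]; ok && apiequality.Semantic.DeepEqual(cached, cur) {
		return // genuinely unchanged relative to applied state
	}
	t.known[name] = cur
	t.pending[name] = cur
}
```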
If you suspect that your informer isn't working, you could add debugging like https://github.com/openshift/origin/pull/21851 to the informers in question. That output would be enough to represent a minimal reproducer of a bad notification. We are fairly confident that informers aren't dropping notifications, since every controller and operator is built on them, but that output could provide proof.

Ah, that's a good suggestion.

Eric, any hints as to how to reproduce this?

Should be fixed by the DeltaFIFO fix from upstream.
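For reference, the kind of informer debugging suggested above can be as simple as registering an extra, log-only event handler on the informer in question and comparing its output against what the proxy's own handler recorded. This is a minimal sketch in the spirit of that PR, not its actual code; the function name `addEventDebugging` is made up.

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog"
)

// addEventDebugging registers a second handler on the Endpoints informer
// that does nothing but log every delivered notification, so deliveries
// can be compared against what the proxy's own handler saw.
func addEventDebugging(factory informers.SharedInformerFactory) {
	factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ep := obj.(*v1.Endpoints)
			klog.Infof("DEBUG informer Add %s/%s rv=%s", ep.Namespace, ep.Name, ep.ResourceVersion)
		},
		UpdateFunc: func(old, cur interface{}) {
			oldEp, curEp := old.(*v1.Endpoints), cur.(*v1.Endpoints)
			klog.Infof("DEBUG informer Update %s/%s rv %s -> %s",
				curEp.Namespace, curEp.Name, oldEp.ResourceVersion, curEp.ResourceVersion)
		},
		DeleteFunc: func(obj interface{}) {
			// obj may be a cache.DeletedFinalStateUnknown tombstone,
			// hence the checked type assertion.
			if ep, ok := obj.(*v1.Endpoints); ok {
				klog.Infof("DEBUG informer Delete %s/%s", ep.Namespace, ep.Name)
			}
		},
	})
}
```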