Investigation of the semi-rare errors from the cmd tests (about 1-2% of all failures) on GCP with the symptom:

  Unable to connect to the server: dial tcp 10.0.0.2:6443: i/o timeout

https://search.ci.openshift.org/?search=dial+tcp+10.0.0.2%3A6443%3A+i%2Fo+timeout&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/553/pull-ci-openshift-kubernetes-master-e2e-gcp/1357762933496483840

10.0.0.2 is the internal API load balancer from GCP fronting the kube-apiservers, so this issue is either in the LB or in the kube-apiserver (there is nothing in between).

To eliminate the LB, we turned on health check logging for the api-internal TCP load balancer, which logs to the GCP logging endpoint. This led to the discovery that the kube-apiservers on GCP are failing their /readyz check very frequently - one failure every 10-15s. The failure looks like:

{
  "insertId": "w74ydeb7g",
  "jsonPayload": {
    "healthCheckProbeResult": {
      "previousHealthState": "HEALTHY",
      "previousDetailedHealthState": "HEALTHY",
      "healthCheckProtocol": "HTTPS",
      "targetIp": "10.0.0.2",
      "responseLatency": "2.001107s",
      "healthState": "UNHEALTHY",
      "probeRequest": "/readyz",
      "detailedHealthState": "TIMEOUT",
      "probeSourceIp": "35.191.11.243",
      "probeResultText": "HTTP response: , Error: Timeout waiting for connect",
      "probeCompletionTimestamp": "2021-02-05T21:22:55.905374767Z",
      "targetPort": 6443,
      "ipAddress": "10.0.0.5"
    }
  },
  "resource": {
    "type": "gce_instance_group",
    "labels": {
      "location": "us-east1-d",
      "project_id": "openshift-gce-devel-ci",
      "instance_group_name": "ci-ln-cpp7klk-f76d1-8wqdr-master-us-east1-d",
      "instance_group_id": "7037050892581872689"
    }
  },
  "timestamp": "2021-02-05T21:22:57.386000690Z",
  "severity": "INFO",
  "logName": "projects/openshift-gce-devel-ci/logs/compute.googleapis.com%2Fhealthchecks",
  "receiveTimestamp": "2021-02-05T21:22:59.196131192Z"
}

The errors from the test cluster were all TIMEOUT. The logs are attached to the bug, but roughly we make 3 checks every second; about half of them are failing with a timeout (2s) and the rest finish healthy in about 10ms.

We need to investigate several areas:

1. Are other clouds also encountering these errors? Generally GCP has the best networking performance, so it's likely that AWS or Azure is experiencing this problem worse if their health check tunings are tight. We need to know why the Google health check is failing to connect within 2s - it may be that prioritization of the ready endpoint is wrong (which would be a global problem).

2. Has the /readyz endpoint or the kube-apiserver regressed p99 latency on a single new connection to the point of a timeout? In general even a moderately highly loaded apiserver should have a p99 SSL connection latency of 300-400ms, and ready should respond in <50ms. So total p99 (maybe even p999) for readiness checks should be < 500ms, but 2s is WAY outside the allowed bounds. We need to know why latency is so high on this call that we would time out 50% of the time.
Created attachment 1755305 [details] Health check logs on GCP cluster
The bimodal distribution (super fast, and super slow) implies something like p&f is holding some /ready requests, or the control plane is saturated in some other way (new connections?).
Tried to simulate this from a pod on a worker in a 4.7 AWS cluster:

$ while true; do curl https://api-int.build01.ci.devcluster.openshift.com:6443/readyz -k -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n"; sleep 1; done

and was seeing roughly 20ms on average, with no failures. My test scenario may not be truly apples to apples though, and there may be subtleties of how health checking works on GCP.
Tested on a 4.6 GCP cluster. Results are more variable than AWS but are all within the 2s timeout:

200 1.729996s
200 1.345735s
200 1.531996s
200 0.147908s
200 0.369398s
200 0.309155s
200 1.160747s
200 0.151819s
200 0.050943s
200 0.037192s
200 0.084660s
200 0.690691s
200 0.035556s
200 0.173668s
200 0.049101s
200 0.102726s
200 0.741536s
After a while it smoothed out:

200 0.024646s
200 0.029601s
200 0.048048s
200 0.024754s
200 0.026038s
200 0.042908s
200 0.026675s
200 0.027466s
200 0.034512s
200 0.048435s
200 0.032679s

So this looks more significant in 4.7 than 4.6 on GCP, and 4.6 AWS (in the one cluster I checked) did seem to be faster. However, we need to measure this from the cloud consoles where possible to get the numbers the LBs themselves see.
To turn on health check logging on GCP, find the api-internal health check created by the installer and set Logging to Enabled. Then go to the cloud console logs viewer in the right project and filter by the log query:

resource.type="gce_instance_group"
jsonPayload.healthCheckProbeResult.detailedHealthState != ""

which shows all health check records (further filtering is available via the query language if multiple clusters have this logging enabled).
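If it's easier to do from the CLI, something like the following should also work (a sketch; the health check name is whatever the installer created for api-internal, shown here as a placeholder):

# Enable logging on the api-internal health check (name is a placeholder):
gcloud compute health-checks update https <api-internal-health-check-name> --enable-logging

# Pull the probe records back out of Cloud Logging with the same filter as above:
gcloud logging read \
  'resource.type="gce_instance_group" AND jsonPayload.healthCheckProbeResult.detailedHealthState != ""' \
  --limit=100 --format=json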
Relevant 4.7 GCP internal-API-LB config [1] and docs on those settings [2].

> ...but roughly we make 3 checks every second...

Config has 'check_interval_sec = 2' [1], so... I'm not entirely clear on why the logs are banging away 6x faster. Terraform docs are not super helpful [3]. Google docs [4] say:

  How often (in seconds) to send a health check. The default value is 5 seconds.

[1]: https://github.com/openshift/installer/blob/c0489117068cb00c5222bb0762a87605f41ebe04/data/data/gcp/network/lb-private.tf#L7-L20
[2]: https://github.com/openshift/installer/blob/c0489117068cb00c5222bb0762a87605f41ebe04/docs/dev/kube-apiserver-health-check.md
[3]: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_health_check
[4]: https://cloud.google.com/compute/docs/reference/rest/v1/healthChecks
I have checked

kubectl exec -it -n openshift-kube-apiserver kube-apiserver-ip-10-0-172-126.us-west-1.compute.internal -c kube-apiserver -- /bin/bash -c 'while true; do curl https://localhost:6443/readyz -k -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n"; done'

on a random 4.7 cluster and found roughly 15ms latency, no throttling, no p&f. With that I have doubts this is a general problem. Will try on GCP soon.
Checked on a 4.7 GCP cluster, same behaviour:

$ kubectl exec -it -n openshift-kube-apiserver kube-apiserver-ci-ln-r5yzrpb-f76d1-w4j9h-master-0 -c kube-apiserver -- /bin/bash -c 'while true; do curl https://localhost:6443/readyz -k -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n"; done'

200 0.018925s
200 0.016981s
200 0.017841s
200 0.017474s
200 0.016498s
200 0.021132s
200 0.018844s
200 0.017204s
200 0.022064s
200 0.020833s
200 0.021037s
200 0.018644s
200 0.017278s
200 0.016655s
200 0.019013s
200 0.019125s
200 0.018999s
200 0.016962s
200 0.016327s
200 0.017600s

With that I would expect that the Golang apiserver process is fine.
The intervals are fine and were chosen because every LB checker in the cloud provider needs to get successes, so this is not a "frequency of check" problem. There is no reason why /readyz should ever take more than 10ms to respond, and should always be successful.
One unresolved issue we faced previously is how iptables rewrites TCP source/destination on GCP due to how their load balancers route traffic. This was due to a node (and a pod) communicating with a master via both the VIP and the real IP. The masters rewrite the incoming VIPs to the real IP, and this causes TCP connection resets.

This shouldn't affect traffic coming from the VIP itself to the master unless iptables rules have been changed to rewrite responses to always egress as the VIP (or some other rule I haven't thought of). In any case, it is probably worth taking a packet capture and looking for TCP connection resets.
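For example, a rough capture on a master, run from a toolbox or debug shell (a sketch; 6443 is the apiserver port from this bug, interface choice is up to whoever runs it):

# Capture API traffic for later inspection; look for resets in Wireshark with
# the display filter tcp.flags.reset == 1:
tcpdump -ni any 'tcp port 6443' -w /tmp/api-6443.pcap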
Created attachment 1755999 [details] health-check-probe-connect-timeout-errors.json
Created attachment 1756011 [details] health-check-probe-connect-timeout-errors.png
Created attachment 1756020 [details] all-health-probes.json
Created attachment 1756021 [details] readyz.log
I used the following query to retrieve all 'Timeout waiting for connect' errors for a 24h window:

https://console.cloud.google.com/logs/query;query=resource.type%3D%22gce_instance_group%22%0AjsonPayload.healthCheckProbeResult.healthState!%3D%22HEALTHY%22%0AjsonPayload.healthCheckProbeResult.probeResultText%3D%22HTTP%20response:%20,%20Error:%20Timeout%20waiting%20for%20connect%22%0A;timeRange=PT24H?project=openshift-gce-devel

> resource.type="gce_instance_group"
> jsonPayload.healthCheckProbeResult.healthState!="HEALTHY"
> jsonPayload.healthCheckProbeResult.probeResultText="HTTP response: , Error: Timeout waiting for connect"

attachments:
- https://bugzilla.redhat.com/attachment.cgi?id=1755999
- https://bugzilla.redhat.com/attachment.cgi?id=1756011

There are a total of 1527 of these errors (unhealthy probes), and it looks like they follow a unique pattern:

> $ cat downloaded-logs-20210209-131119.json | jq -r '.[] | "group_name:" + .resource.labels.instance_group_name + " location:" + .resource.labels.location + " ipAddress:" + .jsonPayload.healthCheckProbeResult.ipAddress + " probeSourceIp:" + .jsonPayload.healthCheckProbeResult.probeSourceIp' | sort | uniq
>
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.10.91

All these health probes target master '10.0.0.4' and originate from `35.191.10.91`.

So I also downloaded all (healthy + unhealthy) health probes for the 24h window, and I see the following unique addresses for 'probeSourceIp':

> $ cat downloaded-logs-20210209-133634.json | jq -r '.[] | .jsonPayload.healthCheckProbeResult.probeSourceIp' | sort | uniq
> 130.211.0.207
> 130.211.0.213
> 130.211.0.23
> 35.191.10.117
> 35.191.10.91
> 35.191.11.53
> 35.191.1.235
> 35.191.9.217
> 35.191.9.223

attachment: https://bugzilla.redhat.com/attachment.cgi?id=1756020

This is the overall distribution of health probes:

> $ cat downloaded-logs-20210209-133634.json | jq -r '.[] | "group_name:" + .resource.labels.instance_group_name + " location:" + .resource.labels.location + " ipAddress:" + .jsonPayload.healthCheckProbeResult.ipAddress + " probeSourceIp:" + .jsonPayload.healthCheckProbeResult.probeSourceIp' | sort | uniq
> group_name:akashem47-wffmk-master-us-central1-a location:us-central1-a ipAddress:10.0.0.3 probeSourceIp:130.211.0.207
> group_name:akashem47-wffmk-master-us-central1-a location:us-central1-a ipAddress:10.0.0.3 probeSourceIp:130.211.0.213
> group_name:akashem47-wffmk-master-us-central1-a location:us-central1-a ipAddress:10.0.0.3 probeSourceIp:130.211.0.23
> group_name:akashem47-wffmk-master-us-central1-b location:us-central1-b ipAddress:10.0.0.5 probeSourceIp:35.191.10.117
> group_name:akashem47-wffmk-master-us-central1-b location:us-central1-b ipAddress:10.0.0.5 probeSourceIp:35.191.9.217
> group_name:akashem47-wffmk-master-us-central1-b location:us-central1-b ipAddress:10.0.0.5 probeSourceIp:35.191.9.223
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.10.91
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.11.53
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.1.235

Now, let's investigate whether these unhealthy health probes are landing in the apiserver.
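Side note on the data above: the same jq fields can be aggregated to get a per-target/per-prober count of the timeouts (a sketch reusing the downloaded file and field names shown above):

# Count timeout failures per (target ipAddress, probeSourceIp) pair:
$ cat downloaded-logs-20210209-131119.json \
    | jq -r '.[] | .jsonPayload.healthCheckProbeResult.ipAddress + " <- " + .jsonPayload.healthCheckProbeResult.probeSourceIp' \
    | sort | uniq -c | sort -rn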
These are the steps:

- enabled http logging for the apiservers:

> oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/logLevel", "value": "Debug" } ]'

- grep the '/readyz' requests for the instance '10.0.0.4'. Health probes from GCP have the user agent "GoogleHC/1.0":

> $ oc -n openshift-kube-apiserver logs kube-apiserver-akashem47-wffmk-master-2.c.openshift-gce-devel.internal -c kube-apiserver | grep 'readyz' | grep 'userAgent="GoogleHC/1.0"' > readyz.txt

Here is a log entry for a `/readyz` health probe from GCP:

> I0209 18:56:00.009136 18 httplog.go:94] "HTTP" verb="GET" URI="/readyz" latency="4.014562ms" userAgent="GoogleHC/1.0" srcIP="35.191.1.235:32768" resp=200

attachment: https://bugzilla.redhat.com/attachment.cgi?id=1756021

After grepping `readyz.txt`, I found that this apiserver is responding to health probe requests from these `probeSourceIp` addresses:

> 35.191.11.53
> 35.191.1.235
> 35.191.10.91 (this is the `probeSourceIp` with all the unhealthy probes)

Summary:
- no record of the unhealthy probes in the apiserver http log
- all these unhealthy probes have a unique pattern - they are all coming from the same source IP and targeting the same master IP.

Based on the above, I am inclined to rule the apiserver out of this issue.

Follow up:
- Can we take this data to the GCP team? They may have some insights.
- For now, assign this to the cloud team to figure out where the health probe requests are being dropped.
Setting the target release to "--" for now; once we have some update from GCP support we can revisit it if need be.
Did we verify whether this happens in 4.6 as well? This would be an OSD upgrade blocker if so.

Some additional symptoms of this showing up in CI are anything like:

  read tcp 172.16.34.79:51058->35.237.1.31:6443: read: connection reset by peer

where it's likely the LB took that endpoint out of rotation at that time.
GCP customer support is looking into the case and has the following question:

> I see you have several firewall rules set up for akashem47, however you have a rule called akashem47-wffmk-api that grants 0.0.0.0 for 6443, which means traffic never uses the akashem47-wffmk-health-checks rule since they have the same priority. All traffic for 6443 is allowed using this rule no matter the source address. I'm not sure if this was expected behavior.

I will try to talk to the installer/GCP folks here and find out if this is expected.
(In reply to Clayton Coleman from comment #20)
> Did we verify whether this happens in 4.6 as well? This would be an OSD
> upgrade blocker if so.
>
> Some additional symptoms of this showing up in CI is anything like:
>
> read tcp 172.16.34.79:51058->35.237.1.31:6443: read: connection reset by
> peer
>
> where it's likely the LB took that endpoint out of rotation at that time

Connection reset by peer is TCP. I talked about this here: https://bugzilla.redhat.com/show_bug.cgi?id=1925698#c11

TCP dumps from the affected host would help isolate the issue. The likely culprit is iptables NAT'ing traffic.
Here's where we're at:

The reproducer cluster set up by Abu is no longer having the timeout issues from the LB health checks. They abruptly stopped on 11 FEB. We should re-enable the logging in CI to see if we're still hitting it there, to rule out any changes to the reproducer cluster.

Also, some of the timeout issues from the CI jobs seem to be on-cluster components hitting the internal API (eg, ci-cluster-xyz:some-daemonset -> ci-cluster-xyz:api). This matches the pattern that we discovered previously, mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1925698#c11

I have written this elsewhere, but I can't find/remember where. Basically, it goes like this:

The VIP is 10.0.0.2. master-a has an IP of 10.0.0.3. node-b has an IP of 10.0.1.10.

Some pod (possibly a network pod, IIRC) is connecting to a master on the private IP, 10.0.0.3, via node 10.0.1.10. The master's TCP stack sees this as 10.0.1.10:55555 -> 10.0.0.3:6443. The node's TCP stack sees this as 10.0.1.10:55555 -> 10.0.0.3:6443, as you would expect.

Next, some other pod from the same node (10.0.1.10) connects to the master VIP, as it should. The master's TCP stack sees this as 10.0.1.10:55555 -> 10.0.0.3:6443. The node sees this as 10.0.1.10:55555 -> 10.0.0.2:6443.

The node assumes it's getting 2 independent TCP sessions, and it's not wrong. The master sees a duplicate TCP session and the traffic fails to connect. This is because there is an iptables NAT rule on the masters to translate the VIP into the private IP, and the reverse for the response.

Previously, we found this wasn't happening all too frequently. If something else has changed to make this problem worse, it could explain the increase in TCP timeouts from CI.

Here's an example of a recent CI run using the query Clayton shared originally:

Feb 17 04:45:09.200 E clusteroperator/network changed Degraded to True: ApplyOperatorConfig: Error while updating operator configuration: could not apply (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: could not retrieve existing (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: Get "https://api-int.ci-op-lxnf2n4x-f3edd.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-network-operator/configmaps/applied-cluster": dial tcp 10.0.0.2:6443: i/o timeout
Feb 17 04:45:10.402 E clusteroperator/network changed Degraded to True: ApplyOperatorConfig: Error while updating operator configuration: could not apply (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: could not retrieve existing (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: Get "https://api-int.ci-op-lxnf2n4x-f3edd.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-network-operator/configmaps/applied-cluster": dial tcp 10.0.0.2:6443: i/o timeout

We can see that the tcp 10.0.0.2:6443 (VIP) timeout is originating on the cluster itself (we're checking the log of an operator status). This particular connection would be susceptible to the scenario I described, but I can't say that's what's happening without TCP dumps. There might be some other factor at play (such as the LB health checks) causing these instead.
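To make the collision concrete, here is the scenario above written out as the tuples each side sees (the addresses and port 55555 are the illustrative values from this comment, not captured data):

# node view:
#   pod A (direct):  10.0.1.10:55555 -> 10.0.0.3:6443
#   pod B (via VIP): 10.0.1.10:55555 -> 10.0.0.2:6443
# master view, after the VIP is DNAT'ed to the real IP:
#   both flows:      10.0.1.10:55555 -> 10.0.0.3:6443   <- duplicate 5-tuple
#
# On a master, any colliding entries would show up in conntrack, e.g.:
grep 'sport=55555' /proc/net/nf_conntrack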
Created attachment 1757874 [details]
pcap from master with failing health checks

Attaching a PCAP from a master on a cluster with the HC actively failing.

hc-tcp-no-connect.pcap
I limited the PCAP to just a single host. I'm pretty sure this is a load balancer health check host.

So, we can see the same HC host reuses a handful of source ports. From time to time, it hits port X, and when viewed in Wireshark, we get the indication that this port is already in use. We can see that no successful communication takes place when that source port is attempted. The same host uses another source port, and then it works fine. When it comes back to that source port, it fails again.

While this was happening, I observed the states in conntrack:

sh-4.4# cat /proc/net/ip_conntrack or /proc/net/nf_conntrack | grep '35\.' | grep 6443
cat: /proc/net/ip_conntrack: No such file or directory
cat: or: No such file or directory
ipv4 2 tcp 6 1 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=48210 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=48210 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 0 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=48996 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=48996 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 8 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=37860 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=37860 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 4 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 5 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=49104 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=49104 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 3 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=48814 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=48814 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 9 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=40288 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=40288 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 4 CLOSE src=35.191.1.235 dst=10.0.0.2 sport=36004 dport=6443 src=10.0.0.5 dst=35.191.1.235 sport=6443 dport=36004 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 117 SYN_SENT src=35.191.1.235 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.1.235 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 8 CLOSE src=35.191.1.235 dst=10.0.0.2 sport=46248 dport=6443 src=10.0.0.5 dst=35.191.1.235 sport=6443 dport=46248 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 7 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 6 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=41688 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=41688 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

We can see there is one [UNREPLIED] connection. There are no other active connections for the same source.ip/source.port combo; there is one closed connection for a different source.ip but the same source.port:

ipv4 2 tcp 6 4 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

These are NAT connections.
The following NAT rules are in effect that appear to influence these packets:

-A gcp-vips -d 10.0.0.2/32 -j REDIRECT
-A gcp-vips -d 35.184.123.49/32 -j REDIRECT
-A gcp-vips -d 35.188.173.243/32 -j REDIRECT
-A gcp-vips-local -d 35.184.123.49/32 -j REDIRECT
-A gcp-vips-local -d 35.188.173.243/32 -j REDIRECT
-A gcp-vips-local -d 10.0.0.2/32 -j REDIRECT

In this case, 10.0.0.2 is the VIP, and the local iptables DNATs the VIP to the private IP address 10.0.0.5. We're not specifying which IP to use here with the NAT - maybe sometimes conntrack gets confused?

Approximately 40 minutes after I captured the pcap and the conntrack output above, the issue seemed to resolve itself. No actions were taken on the host that should have changed this, at least not by myself or anyone else looking at it. Perhaps some automated process did, but TBD.

IMO, this explains the periodic timeouts we're observing in the GCP HC logs. We're seeing SYN packets arrive, but our host is not replying. So the problem is most likely on our end.
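For reference, the rules above can be dumped on a master with something like the following (a sketch; chain names are the ones shown above):

# List the NAT rules for the VIP chains; REDIRECT rewrites the destination to a
# local address on the master (10.0.0.5 here):
iptables -t nat -S gcp-vips
iptables -t nat -S gcp-vips-local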
Interesting observation: even though this is a different cluster than the original reproducer cluster, the problematic host/port combo seems to be the same.

Current cluster:

ipv4 2 tcp 6 117 SYN_SENT src=35.191.1.235 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.1.235 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

Previous reproducer cluster, described in the GCP support case:

> Here is a log entry for a `/readyz` health check probe from GCP
> I0209 18:56:00.009136 18 httplog.go:94] "HTTP" verb="GET" URI="/readyz" latency="4.014562ms" userAgent="GoogleHC/1.0" srcIP="35.191.1.235:32768" resp=200

Also interesting, I no longer see probes from that host in nf_conntrack now that the problem has subsided:

sh-4.4# cat /proc/net/ip_conntrack or /proc/net/nf_conntrack | grep '35\.' | grep 6443
cat: /proc/net/ip_conntrack: No such file or directory
cat: or: No such file or directory
ipv4 2 tcp 6 6 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=48660 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=48660 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 0 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 8 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=34370 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=34370 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 2 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=44848 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=44848 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 6 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 8 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=42358 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=42358 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 4 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=35702 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=35702 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 0 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=42480 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=42480 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 0 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=42304 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=42304 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 8 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 62 TIME_WAIT src=169.254.169.254 dst=35.184.123.49 sport=46443 dport=6080 src=10.0.0.5 dst=169.254.169.254 sport=6080 dport=46443 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 4 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=35312 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=35312 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

Possibly a coincidence, or possibly something related to 35.191.1.235:32768 specifically.
Okay, I looked closely at the problem nf_conntrack output above and noticed this:

ipv4 2 tcp 6 4 CLOSE src=35.191.1.235 dst=10.0.0.2 sport=36004 dport=6443 src=10.0.0.5 dst=35.191.1.235 sport=6443 dport=36004 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 117 SYN_SENT src=35.191.1.235 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.1.235 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

You can divide the ip/port groupings of each entry into two halves, "inbound" (the original tuple) and "outbound" (the reply tuple). Looking closely at the first connection, the inbound destination is the VIP (dst=10.0.0.2) and the reply comes back from the real IP (src=10.0.0.5). Looking at the broken connection, the inbound destination is the VIP, and the reply tuple also carries the VIP (src=10.0.0.2). This is the only connection that has this.

This suggests the connection could have been established prior to the application of the iptables rules, and we're possibly racing with conntrack.
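A quick way to spot entries with that broken pattern, following the same /proc/net/nf_conntrack inspection used above (a sketch; assumes 10.0.0.2 is the VIP):

# The reply half of a healthy entry answers from the master's real IP; any entry
# whose reply tuple still carries the VIP as its source is the broken case:
grep 'dport=6443' /proc/net/nf_conntrack | grep ' src=10.0.0.2 '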
Not a perfect match, but the upstream kernel has some fixes in a somewhat related area. This thread has the background: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836816 Also, old issue that seems similar but not the same: https://github.com/projectcalico/felix/issues/1490 There seems to be a common intersection of conntrack + NAT and some broken connections.
I spun up a new cluster, enabled logging in GCP on the internal API, and witnessed many timeout drops. Deleting the stale flows via conntrack on all masters immediately resolved the issue:

sh-4.4# conntrack -D -r 10.0.0.2
tcp 6 119 SYN_SENT src=35.191.11.225 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.11.225 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 119 SYN_SENT src=35.191.2.141 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.2.141 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 110 SYN_SENT src=35.191.2.131 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.2.131 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.4 (conntrack-tools): 3 flow entries have been deleted.

We can add a workaround to whatever is adding the iptables VIP rules, potentially a systemd unit that fires after the iptables rules (we might want to add a small sleep so we don't race) and runs 'conntrack -D -r 10.0.0.2'. This should be safe; there should never be return flows with a 10.0.0.2 source - we never want those.

The ultimate fix may be a kernel patch, see the comment above. It might even be fixed upstream already, TBD. Unfortunately, sufficient tooling to trace the kernel is not available in the debug image at this time.
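As an immediate mitigation on an affected cluster, the same flush can be run against every master from outside (a sketch; assumes 10.0.0.2 is the internal API VIP and that 'oc debug node/...' access is available):

# Delete any conntrack flow whose reply source is still the VIP, on each master:
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  oc debug "$node" -- chroot /host conntrack -D -r 10.0.0.2
done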
sh-4.4# uname -a Linux ci-ln-sq4nzkk-f76d1-5dfd9-master-1 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
[1] softens one test that was failing on this. Once this issue is fixed, we should revert the softening.

[1]: https://github.com/openshift/origin/pull/25906
Oops, origin#25906 was actually about bug 1928839. Please ignore comment 31.
Fix for the machine-config-operator:

- Simplify the current implementation (a pod, a systemd unit file, and a bash script) to a standalone pod
- Flush stale conntrack entries

https://github.com/openshift/machine-config-operator/pull/2422/
Moving to ON_QA because the PR merged.
Moving back to POST. This bug links two PRs [1,2], and only [2] has merged. If [1] is no longer needed, it should be closed and unlinked from the bug, after which this bug will be MODIFIED. ART's sweeper needs to move it from MODIFIED to ON_QA so it gets attached to an errata.

[1]: https://github.com/openshift/machine-config-operator/pull/2422
[2]: https://github.com/openshift/machine-config-operator/pull/2477
Now links only [1], which has landed. Moving to MODIFIED, and ART's bot should be around shortly to move to ON_QA.

[1]: https://coreos.slack.com/archives/CDCP2LA9L/p1619297552433100
First, I tried to verify as follows: for my on-hand env of version 4.8.0-fc.0, I enabled Logs for the api-internal health check by following comment 6, and queried jsonPayload.healthCheckProbeResult.healthState!="" by following comment 17, but it does not show any entries. I don't know why.

Given I don't know of another verification approach, I then tried to verify per comment 0 by searching for "dial tcp 10.0.0.2:6443: i/o timeout" within 7 days (the PR merged a month ago already) in 4.8 non-upgrade jobs:

w3m -dump -cols 200 "https://search.ci.openshift.org/?search=dial+tcp+10.0.0.2%3A6443%3A+i%2Fo+timeout&maxAge=168h&context=1&type=junit&name=4%5C.8&excludeName=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job" > bug_1925698_search_result.txt

See http://file.rdu.redhat.com/~xxia/bug_1925698_search_result.txt - it still shows quite a few error occurrences, and ALL of the matching jobs are GCP jobs, even though my search does not restrict to GCP jobs. So moving to ASSIGNED for Dev to double check, thanks!
This was already verified, as you can see in the following comments:

https://github.com/openshift/machine-config-operator/pull/2477#issuecomment-808750066
https://github.com/openshift/machine-config-operator/pull/2477#issuecomment-808796084
(In reply to Xingxing Xia from comment #40)

The CI results show that we're seeing a considerably higher success rate, even if it's not perfect. Do you mind if we move this to VERIFIED so we can get the changes back to 4.7 to improve things for customers there? Hopefully that will help us continue to narrow in on any last areas to improve after that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438