Bug 1925698

Summary: On GCP, load balancers report kube-apiserver fails its /readyz check 50% of the time, causing load balancer backend churn and disruptions to apiservers
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: kube-apiserver    Assignee: Antonio Ojea <aojeagar>
Status: CLOSED ERRATA QA Contact: Xingxing Xia <xxia>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.7    CC: akashem, aojeagar, aos-bugs, dcbw, kewang, mfojtik, mgugino, miabbott, sdodson, sttts, wking, wlewis, xxia
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: tag-ci
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Google Cloud load balancer health checkers leave stale conntrack entries on the hosts.
Consequence: The stale conntrack entries cause network interruptions to apiserver traffic that uses the GCP load balancers.
Fix: Do not allow health check traffic to loop through the host.
Result: No network disruption against the apiserver.
Story Points: ---
Clone Of:
: 1930457 (view as bug list) Environment:
Last Closed: 2021-07-27 22:41:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1930457, 1949348    
Attachments:
  Description                                      Flags
  Health check logs on GCP cluster                 none
  health-check-probe-connect-timeout-errors.json   none
  health-check-probe-connect-timeout-errors.png    none
  all-health-probes.json                           none
  readyz.log                                       none
  pcap from master with failing health checks      none

Description Clayton Coleman 2021-02-05 22:02:58 UTC
Investigation of the semi-rare errors from the cmd tests (about 1-2% of all failures) on GCP with the symptom:

Unable to connect to the server: dial tcp 10.0.0.2:6443: i/o timeout

https://search.ci.openshift.org/?search=dial+tcp+10.0.0.2%3A6443%3A+i%2Fo+timeout&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/553/pull-ci-openshift-kubernetes-master-e2e-gcp/1357762933496483840

10.0.0.2 is the internal API load balancer from GCP fronting the kube-apiservers, so this issue is either in the LB or in the kube-apiserver (there is nothing in between). To eliminate the LB, we turned on health check logging for the api-internal TCP load balancer, which logs to the GCP logging endpoint.

That led to the discovery that the kube-apiservers on GCP are failing their /readyz check very frequently - one failure every 10-15s.

The failure looks like:

{
  "insertId": "w74ydeb7g",
  "jsonPayload": {
    "healthCheckProbeResult": {
      "previousHealthState": "HEALTHY",
      "previousDetailedHealthState": "HEALTHY",
      "healthCheckProtocol": "HTTPS",
      "targetIp": "10.0.0.2",
      "responseLatency": "2.001107s",
      "healthState": "UNHEALTHY",
      "probeRequest": "/readyz",
      "detailedHealthState": "TIMEOUT",
      "probeSourceIp": "35.191.11.243",
      "probeResultText": "HTTP response: , Error: Timeout waiting for connect",
      "probeCompletionTimestamp": "2021-02-05T21:22:55.905374767Z",
      "targetPort": 6443,
      "ipAddress": "10.0.0.5"
    }
  },
  "resource": {
    "type": "gce_instance_group",
    "labels": {
      "location": "us-east1-d",
      "project_id": "openshift-gce-devel-ci",
      "instance_group_name": "ci-ln-cpp7klk-f76d1-8wqdr-master-us-east1-d",
      "instance_group_id": "7037050892581872689"
    }
  },
  "timestamp": "2021-02-05T21:22:57.386000690Z",
  "severity": "INFO",
  "logName": "projects/openshift-gce-devel-ci/logs/compute.googleapis.com%2Fhealthchecks",
  "receiveTimestamp": "2021-02-05T21:22:59.196131192Z"
}

The errors from the test cluster were all TIMEOUT.

The logs are attached to the bug. Roughly, we make 3 checks every second; about half of them fail with a timeout (2s) and the rest finish healthy in about 10ms.

We need to investigate several areas:

1. Are other clouds also encountering these errors?

Generally GCP has the best networking performance, so it's likely that AWS or Azure would experience this problem even more severely if their health check tunings are tight.  We need to know why the Google health check is taking >2s - it may be that prioritization of the ready endpoint is wrong (which would be a global problem).

2. Has the /readyz endpoint or the kube-apiserver regressed p99 latency on a single new connection and timeout?

In general, even a moderately-to-heavily loaded apiserver should have a p99 SSL connection latency of 300-400ms, and /readyz should respond in <50ms.  So the total p99 (maybe even p999) for readiness checks should be < 500ms, but 2s is way outside the allowed bounds.  We need to know why latency is so high on this call that we would time out 50% of the time.
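
(For anyone measuring by hand: a rough curl loop that splits connection time from handler time helps distinguish a slow TCP/TLS connect from a slow /readyz handler. Sketch only; the api-int URL is a placeholder for whichever cluster is under test.)

$ while true; do curl -k -m 2 -o /dev/null \
    -sw "code=%{http_code} connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n" \
    https://api-int.<cluster-domain>:6443/readyz; sleep 1; done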

Comment 1 Clayton Coleman 2021-02-05 22:05:31 UTC
Created attachment 1755305 [details]
Health check logs on GCP cluster

Comment 2 Clayton Coleman 2021-02-05 22:34:43 UTC
The bimodal distribution (super fast, and super slow) implies something like p&f is holding some /ready requests, or the control plane is saturated in some other way (new connections?).

Comment 3 Clayton Coleman 2021-02-05 22:56:56 UTC
Tried to simulate this from a pod on a worker in a 4.7 AWS cluster:

$ while true; do curl https://api-int.build01.ci.devcluster.openshift.com:6443/readyz -k -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n"; sleep 1; done

and was seeing roughly 20ms on average, with no failures.  My test scenario may not be truly apples to apples, though, and there may be subtleties in how health checking works on GCP.

Comment 4 Clayton Coleman 2021-02-05 23:05:15 UTC
Tested on a 4.6 GCP cluster. Results are more variable than AWS but are all within the 2s timeout:

200 1.729996s
200 1.345735s
200 1.531996s
200 0.147908s
200 0.369398s
200 0.309155s
200 1.160747s
200 0.151819s
200 0.050943s
200 0.037192s
200 0.084660s
200 0.690691s
200 0.035556s
200 0.173668s
200 0.049101s
200 0.102726s
200 0.741536s

Comment 5 Clayton Coleman 2021-02-05 23:07:07 UTC
After a while it smoothed out: 

200 0.024646s
200 0.029601s
200 0.048048s
200 0.024754s
200 0.026038s
200 0.042908s
200 0.026675s
200 0.027466s
200 0.034512s
200 0.048435s
200 0.032679s

So this looks more significant in 4.7 than in 4.6 on GCP, and 4.6 AWS (in the one cluster I checked) did seem to be faster.  However, we need to measure this from the cloud consoles where possible to get the numbers the LBs themselves see.

Comment 6 Clayton Coleman 2021-02-05 23:10:13 UTC
To turn on health check logging on GCP, find the api-internal health check created by the installer and set Logging to Enabled. Then go to the cloud console logs viewer in the right project and filter by the log query: 

resource.type="gce_instance_group"
jsonPayload.healthCheckProbeResult.detailedHealthState != ""

This shows all health check records (further filtering is available via the query language if multiple clusters have this logging enabled).
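
(The same records can also be pulled from the CLI with gcloud logging read - a sketch using the filter above and the CI project from the log sample; adjust the project and freshness window as needed:)

$ gcloud logging read \
    'resource.type="gce_instance_group" AND jsonPayload.healthCheckProbeResult.detailedHealthState != ""' \
    --project openshift-gce-devel-ci --freshness 1h --limit 50 --format json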

Comment 7 W. Trevor King 2021-02-06 02:06:24 UTC
Relevant 4.7 GCP internal-API-LB config [1] and docs on those settings [2].

> ...but roughly we make 3 checks every second...

Config has 'check_interval_sec  = 2' [1], so... I'm not entirely clear on why the logs are banging away 6x faster.  Terraform docs are not super helpful [3].  Google docs [4] say:

  How often (in seconds) to send a health check. The default value is 5 seconds.

[1]: https://github.com/openshift/installer/blob/c0489117068cb00c5222bb0762a87605f41ebe04/data/data/gcp/network/lb-private.tf#L7-L20
[2]: https://github.com/openshift/installer/blob/c0489117068cb00c5222bb0762a87605f41ebe04/docs/dev/kube-apiserver-health-check.md
[3]: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_health_check
[4]: https://cloud.google.com/compute/docs/reference/rest/v1/healthChecks

Comment 8 Stefan Schimanski 2021-02-08 10:04:56 UTC
I have checked 

  kubectl exec -it -n openshift-kube-apiserver kube-apiserver-ip-10-0-172-126.us-west-1.compute.internal -c kube-apiserver -- /bin/bash -c 'while true; do curl https://localhost:6443/readyz -k -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n"; done'

on a random 4.7 cluster and found roughly 15 ms latency, no throttling, no p&f. With that, I have doubts this is a general problem. Will try on GCP soon.

Comment 9 Stefan Schimanski 2021-02-08 10:58:39 UTC
Checked on 4.7 GCP cluster, same behaviour:

$ kubectl exec -it -n openshift-kube-apiserver kube-apiserver-ci-ln-r5yzrpb-f76d1-w4j9h-master-0 -c kube-apiserver -- /bin/bash -c 'while true; do curl https://localhost:6443/readyz -k -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n"; done'
200 0.018925s
200 0.016981s
200 0.017841s
200 0.017474s
200 0.016498s
200 0.021132s
200 0.018844s
200 0.017204s
200 0.022064s
200 0.020833s
200 0.021037s
200 0.018644s
200 0.017278s
200 0.016655s
200 0.019013s
200 0.019125s
200 0.018999s
200 0.016962s
200 0.016327s
200 0.017600s

With that, I would expect that the Golang apiserver process itself is fine.

Comment 10 Clayton Coleman 2021-02-08 14:38:35 UTC
The intervals are fine and were chosen because every LB checker in the cloud provider needs to get successes, so this is not a "frequency of check" problem.  There is no reason why /readyz should ever take more than 10ms to respond, and it should always succeed.

Comment 11 Michael Gugino 2021-02-08 17:57:45 UTC
One unresolved issue we faced previously is how iptables rewrites TCP source/destination on GCP due to how their load balancers route traffic.  This was due to a node (and a pod) communicating with a master via both the VIP and the real IP.  The masters rewrite the incoming VIPs to the real IP, and this causes TCP connection resets.

This shouldn't affect traffic coming from the VIP itself to the master unless iptables rules have been changed to rewrite responses to always egress as the VIP (or some other rule I haven't thought of).  In any case, it is probably worth taking a packet capture and looking for TCP connection resets.

Comment 12 Abu Kashem 2021-02-09 18:27:57 UTC
Created attachment 1755999 [details]
health-check-probe-connect-timeout-errors.json

Comment 13 Abu Kashem 2021-02-09 18:30:34 UTC
Created attachment 1756011 [details]
health-check-probe-connect-timeout-errors.png

Comment 15 Abu Kashem 2021-02-09 19:17:20 UTC
Created attachment 1756020 [details]
all-health-probes.json

Comment 16 Abu Kashem 2021-02-09 19:17:54 UTC
Created attachment 1756021 [details]
readyz.log

Comment 17 Abu Kashem 2021-02-09 19:20:50 UTC
I used the following query to retrieve all 'Timeout waiting for connect' errors for a 24h window.
https://console.cloud.google.com/logs/query;query=resource.type%3D%22gce_instance_group%22%0AjsonPayload.healthCheckProbeResult.healthState!%3D%22HEALTHY%22%0AjsonPayload.healthCheckProbeResult.probeResultText%3D%22HTTP%20response:%20,%20Error:%20Timeout%20waiting%20for%20connect%22%0A;timeRange=PT24H?project=openshift-gce-devel

>resource.type="gce_instance_group"
>jsonPayload.healthCheckProbeResult.healthState!="HEALTHY"
>jsonPayload.healthCheckProbeResult.probeResultText="HTTP response: , Error: Timeout waiting for connect"

attachment:
- https://bugzilla.redhat.com/attachment.cgi?id=1755999
- https://bugzilla.redhat.com/attachment.cgi?id=1756011

There are a total of 1527 of these errors (unhealthy probes), and it looks like they all follow a single path:
> $ cat downloaded-logs-20210209-131119.json | jq -r '.[] | "group_name:" + .resource.labels.instance_group_name + " location:" + .resource.labels.location + " ipAddress:" + .jsonPayload.healthCheckProbeResult.ipAddress + " probeSourceIp:" + .jsonPayload.healthCheckProbeResult.probeSourceIp' | sort | uniq
>
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.10.91

All these health probes target master '10.0.0.4' and originate from `35.191.10.91`.

So I also downloaded all (healthy + unhealthy) health probes for the 24h window, and I see the following unique addresses for 'probeSourceIp'
> $ cat downloaded-logs-20210209-133634.json | jq -r '.[] | .jsonPayload.healthCheckProbeResult.probeSourceIp' | sort | uniq
> 130.211.0.207
> 130.211.0.213
> 130.211.0.23
> 35.191.10.117
> 35.191.10.91
> 35.191.11.53
> 35.191.1.235
> 35.191.9.217
> 35.191.9.223

attachment: https://bugzilla.redhat.com/attachment.cgi?id=1756020

This is the overall distribution of health probes:
> $ cat downloaded-logs-20210209-133634.json | jq -r '.[] | "group_name:" + .resource.labels.instance_group_name + " location:" + .resource.labels.location + " ipAddress:" + .jsonPayload.healthCheckProbeResult.ipAddress + " probeSourceIp:" + .jsonPayload.healthCheckProbeResult.probeSourceIp' | sort | uniq
> group_name:akashem47-wffmk-master-us-central1-a location:us-central1-a ipAddress:10.0.0.3 probeSourceIp:130.211.0.207
> group_name:akashem47-wffmk-master-us-central1-a location:us-central1-a ipAddress:10.0.0.3 probeSourceIp:130.211.0.213
> group_name:akashem47-wffmk-master-us-central1-a location:us-central1-a ipAddress:10.0.0.3 probeSourceIp:130.211.0.23
> group_name:akashem47-wffmk-master-us-central1-b location:us-central1-b ipAddress:10.0.0.5 probeSourceIp:35.191.10.117
> group_name:akashem47-wffmk-master-us-central1-b location:us-central1-b ipAddress:10.0.0.5 probeSourceIp:35.191.9.217
> group_name:akashem47-wffmk-master-us-central1-b location:us-central1-b ipAddress:10.0.0.5 probeSourceIp:35.191.9.223
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.10.91
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.11.53
> group_name:akashem47-wffmk-master-us-central1-c location:us-central1-c ipAddress:10.0.0.4 probeSourceIp:35.191.1.235

Now, let's investigate whether these unhealthy health probes are landing in the apiserver. These are the steps:
- enable HTTP logging for the apiservers:
> oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/logLevel", "value": "Debug" } ]'

- grep the '/readyz' requests for the instance '10.0.0.4'; health probes from GCP have the user agent "GoogleHC/1.0":
> $ oc -n openshift-kube-apiserver logs kube-apiserver-akashem47-wffmk-master-2.c.openshift-gce-devel.internal -c kube-apiserver | grep 'readyz' | grep 'userAgent="GoogleHC/1.0"' > readyz.txt

Here is a log entry for a `/readyz` health probe from GCP
> I0209 18:56:00.009136      18 httplog.go:94] "HTTP" verb="GET" URI="/readyz" latency="4.014562ms" userAgent="GoogleHC/1.0" srcIP="35.191.1.235:32768" resp=200

attachment: https://bugzilla.redhat.com/attachment.cgi?id=1756021

After grepping `readyz.txt`, I found that the apiserver is responding to health probe requests from these `probeSourceIp` addresses (a one-liner to extract them is sketched below the list):
> 35.191.11.53
> 35.191.1.235
> 35.191.10.91 (this is the `probeSourceIp` with all the unhealthy probes)
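
(Sketch of one way to produce that list: pull the srcIP field out of the httplog lines and dedupe.)

$ sed -n 's/.*srcIP="\([0-9.]*\):.*/\1/p' readyz.txt | sort -u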

Summary:
- no record of the unhealthy probes in the apiserver http log
- all these unhealthy probes follow a single pattern - they all come from the same source IP and target the same master IP.
Based on the above, I am inclined to rule the apiserver out of this issue.

Follow up:
- Can we take this data to the GCP team? They may have some insights.
- For now, assign this to the cloud team to figure out where the health probe requests are being dropped.

Comment 19 Abu Kashem 2021-02-10 16:24:08 UTC
Setting the target release to --- for now; once we have some update from GCP support we can revisit it if need be.

Comment 20 Clayton Coleman 2021-02-10 22:49:58 UTC
Did we verify whether this happens in 4.6 as well?  This would be an OSD upgrade blocker if so.

Some additional symptoms of this showing up in CI are errors like:

  read tcp 172.16.34.79:51058->35.237.1.31:6443: read: connection reset by peer

where it's likely the LB took that endpoint out of rotation at that time

Comment 21 Abu Kashem 2021-02-11 15:15:06 UTC
So the GCP customer support is looking into the case and has the following question:

> I see you have several firewall rules set up for akashem47, however you have a rule called akashem47-wffmk-api that grants 0.0.0.0 for 6443, which means traffic never uses the akashem47-wffmk-health-checks rule since they have the same priority. All traffic for 6443 is allowed using this rule no matter the source address. I'm not sure if this was expected behavior.

I will try to talk to the installer/GCP folks here and find out if this is expected.

Comment 22 Michael Gugino 2021-02-11 19:44:47 UTC
(In reply to Clayton Coleman from comment #20)
> Did we verify whether this happens in 4.6 as well?  This would be an OSD
> upgrade blocker if so.
> 
> Some additional symptoms of this showing up in CI is anything like:
> 
>   read tcp 172.16.34.79:51058->35.237.1.31:6443: read: connection reset by
> peer
> 
> where it's likely the LB took that endpoint out of rotation at that time

Connection reset by peer is TCP.  I talked about this here: https://bugzilla.redhat.com/show_bug.cgi?id=1925698#c11

TCP dumps from the affected host would help isolate the issue.  The likely culprit is iptables NAT'ing traffic.

Comment 23 Michael Gugino 2021-02-18 02:23:59 UTC
Here's where we're at:

The reproducer cluster set up by Abu is no longer hitting the timeout issues from the LB health checks.  They abruptly stopped on 11 Feb.

We should re-enable the logging in CI to see if we're still hitting it there to rule out any changes to the reproducer cluster.

Also, some of the timeout issues from the CI jobs seem to be on-cluster components hitting the internal API (eg, ci-cluster-xyz:some-daemonset -> ci-cluster-xyz:api).  This matches the pattern that we discovered previously, mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1925698#c11

I have written this elsewhere, but I can't find/remember where.  Basically, it goes like this:

VIP is 10.0.0.2.  master-a has an ip of 10.0.0.3.  node-b has an ip of 10.0.1.10.

Some pod (possibly a network pod, IIRC) is connecting to a master on the private IP, 10.0.0.3, via node 10.0.1.10.  The master's TCP stack sees this as 10.0.1.10:55555 -> 10.0.0.3:6443.  The node's TCP stack sees this as 10.0.1.10:55555 -> 10.0.0.3:6443, as you would expect.

Next, some other pod from the same node (10.0.1.10) connects to the master VIP, as it should.  The master's TCP stack sees this as 10.0.1.10:55555 -> 10.0.0.3:6443.  The node sees this as 10.0.1.10:55555 -> 10.0.0.2:6443.  The node assumes it has 2 independent TCP sessions, and it's not wrong.  The master sees a duplicate TCP session and the traffic fails to connect.  This is because there is an iptables NAT rule on the masters to translate the VIP into the private IP, and the reverse for the response.

Previously, we found this wasn't happening all that frequently.  If something else has changed to make this problem worse, it could explain the increase in TCP timeouts from CI.
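
(A rough way to look for this on a master, using the hypothetical addresses above: list the tracked API connections and check whether the same client ip:port shows up under two different original destinations, the VIP and the real IP. Sketch:)

sh-4.4# conntrack -L -p tcp 2>/dev/null | grep 'dport=6443' | grep '10.0.1.10'

If the collision is happening, the same src/sport pair appears once with dst=10.0.0.2 and once with dst=10.0.0.3.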

Here's an example of a recent CI run using the query Clayton shared originally:

Feb 17 04:45:09.200 E clusteroperator/network changed Degraded to True: ApplyOperatorConfig: Error while updating operator configuration: could not apply (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: could not retrieve existing (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: Get "https://api-int.ci-op-lxnf2n4x-f3edd.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-network-operator/configmaps/applied-cluster": dial tcp 10.0.0.2:6443: i/o timeout
Feb 17 04:45:10.402 E clusteroperator/network changed Degraded to True: ApplyOperatorConfig: Error while updating operator configuration: could not apply (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: could not retrieve existing (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster: Get "https://api-int.ci-op-lxnf2n4x-f3edd.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-network-operator/configmaps/applied-cluster": dial tcp 10.0.0.2:6443: i/o timeout

We can see that the tcp 10.0.0.2:6443 (VIP) timeout originates on the cluster itself (we're looking at the log of an operator's status).  This particular connection would be susceptible to the scenario I described, but I can't say that's what's happening without TCP dumps.  There might be some other factor at play (such as the LB health checks) causing these instead.

Comment 24 Michael Gugino 2021-02-18 18:12:09 UTC
Created attachment 1757874 [details]
pcap from master with failing health checks

Attaching a PCAP from a master on a cluster with active health check failures: hc-tcp-no-connect.pcap

Comment 25 Michael Gugino 2021-02-18 18:20:48 UTC
I limited the PCAP to just a single host.  I'm pretty sure this is a load balancer health check host.

So, we can see the same HC host reuses a handful of source ports.  From time to time it hits port X, and when viewed in Wireshark we get the indication that this port is already in use.  We can see that no successful communication takes place when that source port is attempted.  The same host uses another source port, and then it works fine.  When it comes back to that source port, it fails again.

While this was happening, I observed the states in conntrack:

sh-4.4# cat /proc/net/ip_conntrack or /proc/net/nf_conntrack | grep '35\.' | grep 6443
cat: /proc/net/ip_conntrack: No such file or directory
cat: or: No such file or directory
ipv4     2 tcp      6 1 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=48210 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=48210 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 0 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=48996 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=48996 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 8 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=37860 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=37860 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 4 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 5 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=49104 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=49104 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 3 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=48814 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=48814 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 9 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=40288 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=40288 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 4 CLOSE src=35.191.1.235 dst=10.0.0.2 sport=36004 dport=6443 src=10.0.0.5 dst=35.191.1.235 sport=6443 dport=36004 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 117 SYN_SENT src=35.191.1.235 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.1.235 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 8 CLOSE src=35.191.1.235 dst=10.0.0.2 sport=46248 dport=6443 src=10.0.0.5 dst=35.191.1.235 sport=6443 dport=46248 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 7 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 6 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=41688 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=41688 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2


We can see there is one [UNREPLIED] connection.  There are no other active connections for the same source IP/source port combo; there is one closed connection for a different source IP but the same source port: ipv4     2 tcp      6 4 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

These are NATed connections.  The following NAT rules are in effect and appear to influence these packets:

-A gcp-vips -d 10.0.0.2/32 -j REDIRECT
-A gcp-vips -d 35.184.123.49/32 -j REDIRECT
-A gcp-vips -d 35.188.173.243/32 -j REDIRECT
-A gcp-vips-local -d 35.184.123.49/32 -j REDIRECT
-A gcp-vips-local -d 35.188.173.243/32 -j REDIRECT
-A gcp-vips-local -d 10.0.0.2/32 -j REDIRECT


In this case, 10.0.0.2 is the VIP, and the local iptables REDIRECT rule DNATs the VIP to the private IP address 10.0.0.5.  We're not specifying which IP to use for the NAT here, so maybe conntrack sometimes gets confused?
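
(To confirm which rule the probe traffic is hitting and what the REDIRECT translated it to, a quick sketch is to watch the per-rule packet counters and the conntrack reply tuples:)

sh-4.4# iptables -t nat -L gcp-vips -n -v
sh-4.4# conntrack -L -p tcp 2>/dev/null | grep 'dport=6443'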


Approximately 40 minutes after I captured the pcap and the conntrack output above, the issue seemed to resolve itself.  No actions that should have changed this were taken on the host, at least not by me or anyone else looking at it.  Perhaps some automated process did something, but TBD.

IMO, this explains the periodic timeouts we're observing in the GCP HC logs.  We're seeing SYN packets arrive, but our host is not replying.  So the problem is most likely on our end.

Comment 26 Michael Gugino 2021-02-18 18:37:22 UTC
Interesting observation: even though this is a different cluster from the original reproducer cluster, the problematic host/port combo seems to be the same.  Current cluster:

ipv4     2 tcp      6 117 SYN_SENT src=35.191.1.235 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.1.235 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

Previous reproducer cluster described in the GCP support case:

Here is a log entry for a `/readyz` health check probe from GCP
> I0209 18:56:00.009136      18 httplog.go:94] "HTTP" verb="GET" URI="/readyz" latency="4.014562ms" userAgent="GoogleHC/1.0" srcIP="35.191.1.235:32768" resp=200


Also interesting, I no longer see probes from that host in nf_conntrack now that the problem has subsided:

sh-4.4# cat /proc/net/ip_conntrack or /proc/net/nf_conntrack | grep '35\.' | grep 6443
cat: /proc/net/ip_conntrack: No such file or directory
cat: or: No such file or directory
ipv4     2 tcp      6 6 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=48660 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=48660 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 0 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 8 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=34370 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=34370 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 2 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=44848 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=44848 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 6 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 8 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=42358 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=42358 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 4 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=35702 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=35702 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 0 CLOSE src=35.191.10.91 dst=10.0.0.2 sport=42480 dport=6443 src=10.0.0.5 dst=35.191.10.91 sport=6443 dport=42480 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 0 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=42304 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=42304 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 8 CLOSE src=35.191.1.225 dst=10.0.0.2 sport=32768 dport=6443 src=10.0.0.5 dst=35.191.1.225 sport=6443 dport=32768 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 62 TIME_WAIT src=169.254.169.254 dst=35.184.123.49 sport=46443 dport=6080 src=10.0.0.5 dst=169.254.169.254 sport=6080 dport=46443 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 4 CLOSE src=35.191.11.49 dst=10.0.0.2 sport=35312 dport=6443 src=10.0.0.5 dst=35.191.11.49 sport=6443 dport=35312 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2


Possibly a coincidence, or possibly something related to 35.191.1.235:32768 specifically.

Comment 27 Michael Gugino 2021-02-18 19:05:23 UTC
Okay, I looked closely at the problematic nf_conntrack entries above and noticed this:

ipv4     2 tcp      6 4 CLOSE src=35.191.1.235 dst=10.0.0.2 sport=36004 dport=6443 src=10.0.0.5 dst=35.191.1.235 sport=6443 dport=36004 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 117 SYN_SENT src=35.191.1.235 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.1.235 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

You can divide the IP/port groupings on each connection into two halves, "inbound" (the original tuple) and "outbound" (the reply tuple).  If we look closely at the first connection, the inbound destination is the VIP, while the outbound half carries the real IP (10.0.0.5) as its source.  Looking at the broken connection, the inbound destination is the VIP, and the outbound half also carries the VIP as its source.  This is the only connection that has this.

It seems like this connection could have been established before the iptables rules were applied, so we're possibly racing with conntrack.
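
(A quick way to spot these entries without eyeballing the whole table, as a sketch: list only the flows whose reply tuple still carries the VIP as its source, which is exactly what the broken entry looks like.)

sh-4.4# conntrack -L -p tcp -r 10.0.0.2 2>/dev/null | grep 'dport=6443'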

Comment 28 Michael Gugino 2021-02-18 21:14:21 UTC
Not a perfect match, but the upstream kernel has some fixes in a somewhat related area.  This thread has the background: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836816

Also, old issue that seems similar but not the same: https://github.com/projectcalico/felix/issues/1490

There seems to be a common intersection of conntrack + NAT and some broken connections.

Comment 29 Michael Gugino 2021-02-18 22:45:36 UTC
I spun up a new cluster, enabled logging in GCP on the internal API LB, and witnessed many timeout drops.

Deleting the stale flows via conntrack on all masters immediately resolved the issue:

sh-4.4# conntrack -D -r 10.0.0.2
tcp      6 119 SYN_SENT src=35.191.11.225 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.11.225 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 119 SYN_SENT src=35.191.2.141 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.2.141 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 110 SYN_SENT src=35.191.2.131 dst=10.0.0.2 sport=32768 dport=6443 [UNREPLIED] src=10.0.0.2 dst=35.191.2.131 sport=6443 dport=32768 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.4 (conntrack-tools): 3 flow entries have been deleted.


We can add a workaround to whatever is adding the iptables VIP rules, potentially a systemd unit that fires after the iptables rules are in place (we might want to add a small sleep so we don't race) and runs

'conntrack -D -r 10.0.0.2'

This should be safe: there should be no return flows with a 10.0.0.2 source; we never want those.

The ultimate fix may be a kernel patch; see the comment above.  It might even be fixed upstream already, TBD.  Unfortunately, sufficient tooling to trace the kernel is not available in the debug image at this time.
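
(A hypothetical sketch of what such a workaround unit might look like. The unit name, paths, ordering target, and hard-coded VIP are all assumptions; in practice it would need to be ordered after whatever actually installs the gcp-vips rules on the node.)

# /etc/systemd/system/flush-stale-vip-conntrack.service (hypothetical)
[Unit]
Description=Flush stale conntrack entries for the GCP API VIP
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# small delay to avoid racing the iptables rule setup, per the comment above
ExecStartPre=/usr/bin/sleep 10
# "-" ignores the exit status in case there is nothing to delete
ExecStart=-/usr/sbin/conntrack -D -r 10.0.0.2

[Install]
WantedBy=multi-user.target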

Comment 30 Michael Gugino 2021-02-18 22:52:53 UTC
sh-4.4# uname -a
Linux ci-ln-sq4nzkk-f76d1-5dfd9-master-1 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Wed Dec 16 03:30:52 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 31 W. Trevor King 2021-02-19 00:03:19 UTC
[1] softens one test that was failing on this.  Once this issue is fixed, we should revert the softening.

[1]: https://github.com/openshift/origin/pull/25906

Comment 32 W. Trevor King 2021-02-19 01:18:28 UTC
Oops, origin#25906 was actually about bug 1928839.  Please ignore comment 31.

Comment 33 Antonio Ojea 2021-03-01 14:01:29 UTC
Fix for the machine-config-operator:
- Simplify the current implementation (a pod, a systemd unit file, and a bash script) into a standalone pod
- Flush stale conntrack entries

https://github.com/openshift/machine-config-operator/pull/2422/

Comment 35 Dan Williams 2021-04-22 18:51:11 UTC
Moving to ON_QA because the PR merged.

Comment 36 W. Trevor King 2021-04-22 19:34:47 UTC
Moving back to POST.  This bug links two PRs [1,2], and only [2] has merged.  If [1] is no longer needed, it should be closed and unlinked from the bug, after which this bug will be MODIFIED.  ART's sweeper then needs to move it from MODIFIED to ON_QA so it gets attached to an errata.

[1]: https://github.com/openshift/machine-config-operator/pull/2422
[2]: https://github.com/openshift/machine-config-operator/pull/2477

Comment 38 W. Trevor King 2021-04-24 21:10:34 UTC
Now links only [1], which has landed.  Moving to MODIFIED, and ART's bot should be around shortly to move to ON_QA.

[1]: https://coreos.slack.com/archives/CDCP2LA9L/p1619297552433100

Comment 40 Xingxing Xia 2021-04-25 11:16:21 UTC
First, I tried to verify as follows:
For my on-hand env of version 4.8.0-fc.0, I enabled Logs for the api-internal health check by following comment 6, and queried jsonPayload.healthCheckProbeResult.healthState!="" by following comment 17. But it does not show any entries, and I don't know why.

Since I don't know another way to verify, I then tried to verify via the comment 0 search for "dial tcp 10.0.0.2:6443: i/o timeout" within the last 7 days (the PR merged a month ago already) in 4.8 non-upgrade jobs:
w3m -dump -cols 200 "https://search.ci.openshift.org/?search=dial+tcp+10.0.0.2%3A6443%3A+i%2Fo+timeout&maxAge=168h&context=1&type=junit&name=4%5C.8&excludeName=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job" > bug_1925698_search_result.txt . See http://file.rdu.redhat.com/~xxia/bug_1925698_search_result.txt ; it still shows quite a few error occurrences, and ALL of the jobs are GCP jobs, even though my search does not specify GCP jobs. So moving to ASSIGNED for Dev to double-check, thanks!

Comment 42 Scott Dodson 2021-05-24 16:54:47 UTC
(In reply to Xingxing Xia from comment #40)

The CI results show that we're seeing a considerably higher success rate, even if it's not perfect. Do you mind if we move this to VERIFIED so we can get the changes back to 4.7 to improve things for customers there? Hopefully that will help us continue to narrow in on any remaining areas to improve after that.

Comment 46 errata-xmlrpc 2021-07-27 22:41:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438