I debugged the cluster and found out that:

- The cluster-apiserver-operator connects to openshift-apiserver through kube-apiserver. I exec'ed into the operator pods and tried:

  curl -k -i -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Host: kubernetes.default.svc.cluster.local" https://a.b.c.d:6443/apis/user.openshift.io/v1/users

  Roughly 30% of the calls fail, namely the requests to a.b.c.d=10.0.0.4 (that's master-0); the other instances are fine. For those we see:

  HTTP/1.1 503 Service Unavailable
  Audit-Id: 36f7f163-8d7e-4e14-82ae-b6383d16661b
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Fri, 17 Apr 2020 15:04:51 GMT
  Content-Length: 64

  Error trying to reach service: 'net/http: TLS handshake timeout'

- The message "Error trying to reach service" comes from apimachinery/pkg/util/proxy/transport.go and is used by the aggregator, i.e. the aggregator cannot reach openshift-apiserver. We randomly select an openshift-apiserver endpoint IP, so probably one of them is not reachable.

- The openshift-apiserver pods don't show any trace of an error, so the requests probably never reach their target.

- From inside the master-0 kube-apiserver logs I see:

  E0417 14:59:11.519427       1 controller.go:114] loading OpenAPI spec for "v1.user.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]

  So the OpenAPI download fails as well.

- I logged into this pod:

  kubectl exec -n openshift-kube-apiserver -it kube-apiserver-xxia-autbug-w6cn2-master-0 /bin/bash

  and checked the openshift-apiserver endpoints manually:

  kubectl get endpoints -n openshift-apiserver
  api   10.128.0.10:8443,10.129.0.8:8443,10.130.0.10:8443   35h

  Surprisingly, none of them is reachable at all:

  curl -i -k https://10.130.0.10:8443

  just blocks and eventually (after a minute?) returns:

  HTTP/1.1 503 Service Unavailable
  Server: squid/4.9
  Mime-Version: 1.0
  Date: Fri, 17 Apr 2020 15:19:20 GMT
  Content-Type: text/html;charset=utf-8
  Content-Length: 3588
  X-Squid-Error: ERR_CONNECT_FAIL 110
  Vary: Accept-Language
  Content-Language: en

  curl: (56) Received HTTP code 503 from proxy after CONNECT

  Which proxy? There shouldn't be a proxy in between.

- I double-checked the same from the openshift-apiserver-operator, just to verify that the curl is supposed to work:

  kubectl exec -n openshift-apiserver-operator -it openshift-apiserver-operator-68858b89cb-5kqcp /bin/bash
  curl -i -k https://10.130.0.10:8443

  HTTP/1.1 403 Forbidden
  Audit-Id: fa4c7d47-6cd7-4267-8aaa-2724b6af652f
  Cache-Control: no-store
  Content-Type: application/json
  X-Content-Type-Options: nosniff
  Date: Fri, 17 Apr 2020 15:16:04 GMT
  Content-Length: 233

  {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {
    },
    "status": "Failure",
    "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
    "reason": "Forbidden",
    "details": {
    },
    "code": 403
  }

  This is what is expected.
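For anyone reproducing this, here is a minimal sketch (assuming the same namespaces/ports as above, and that openshift-apiserver serves /healthz) that enumerates the openshift-apiserver endpoints and probes each one from inside a pod, so you can see which endpoint is the broken one:

  # run from inside the operator or kube-apiserver pod (token path assumed)
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  for ep in $(kubectl get endpoints -n openshift-apiserver api \
      -o jsonpath='{.subsets[*].addresses[*].ip}'); do
    # 200/403 means the endpoint is reachable; 000 means a connect/TLS timeout
    code=$(curl -sk -o /dev/null -w '%{http_code}' --connect-timeout 5 \
      -H "Authorization: Bearer ${TOKEN}" "https://${ep}:8443/healthz")
    echo "${ep} -> HTTP ${code}"
  done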
Moving this to SDN team as there is clearly something wrong with networking.
Moved to 4.5 to develop the fix, and then we can consider the backport from there. The problem seems to be between the squid proxy and the target. Dane, can you take a quick look and see if anything strange catches your eye? Thanks.
As Stefan mentions, traffic between endpoints on the pod, service or machine networks should not be proxied. These networks are automatically added to NO_PROXY from the install-config ConfigMap when the cluster-wide egress proxy feature is enabled. Verify these networks are present in proxy.status.noProxy. For example:

$ oc get cm/cluster-config-v1 -n kube-system -o yaml
apiVersion: v1
data:
  install-config: |
    <SNIP>
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.0.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16
    <SNIP>

If the pod, service and machine networks differ from your install-config, then you must update the configmap and force a reconciliation of the proxy object, or update proxy.spec.noProxy with the appropriate network addresses.

$ oc get proxy/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
<SNIP>
status:
  noProxy: <THE_NETWORKS_FROM_ABOVE>,<OTHER_SYSTEM_GENERATED_NOPROXIES>,<USER_PROVIDED_NOPROXIES>

Configuring cluster-wide egress proxy for Azure is covered in detail at https://docs.openshift.com/container-platform/4.3/installing/installing_azure/installing-azure-private.html#installation-configure-proxy_installing-azure-private
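A quick sketch of that verification (the jsonpath expressions are an assumption on my part; adjust them if your output differs):

  # what the cluster actually excludes from proxying
  oc get proxy/cluster -o jsonpath='{.status.noProxy}{"\n"}'

  # the networks the installer recorded; every cidr/serviceNetwork entry here
  # should appear in the noProxy list above
  oc get cm/cluster-config-v1 -n kube-system -o jsonpath='{.data.install-config}' \
    | grep -A1 -E 'clusterNetwork:|machineNetwork:|serviceNetwork:'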
I don't currently have an env matching the comment 0 matrix "4_4/ipi-on-azure/versioned-installer-ovn-customer_vpc-http_proxy", but I do have one for the matrix 4_5/upi-on-aws/versioned-installer-http_proxy-ovn-ci. Against this env, I checked as in comment 7: proxy.spec.noProxy includes the networks from the install-config "networking" section. Then I checked as in comment 4:

$ oc rsh -n openshift-kube-apiserver -it kube-apiserver-ip-10-0-52-18.us-east-2.compute.internal
[root@ip-10-0-52-18 /]# env | grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.wzheng-share.qe.devcluster.openshift.com,etcd-0.wzheng-share.qe.devcluster.openshift.com,etcd-1.wzheng-share.qe.devcluster.openshift.com,etcd-2.wzheng-share.qe.devcluster.openshift.com,localhost,test.no-proxy.com
HTTPS_PROXY=http://<proxy user>:<proxy pass>@ec2-3-***.amazonaws.com:3128
HTTP_PROXY=http://<proxy user>:<proxy pass>@ec2-3-***.amazonaws.com:3128

# Though 10.128.0.0/14 is included in NO_PROXY, why does the curl below still go through the proxy server?
[root@ip-10-0-52-18 /]# curl -v -k https://10.128.0.10:8443
* About to connect() to proxy ec2-3-*** port 3128 (#0)
*   Trying 10.0.11.102...
* Connected to ec2-3-*** (10.0.11.102) port 3128 (#0)
* Establish HTTP proxy tunnel to 10.128.0.10:8443
* Proxy auth using Basic with user '<proxy user>'
> CONNECT 10.128.0.10:8443 HTTP/1.1
> Host: 10.128.0.10:8443
...
< HTTP/1.1 503 Service Unavailable
< Server: squid/4.9
< Mime-Version: 1.0
< Date: Sat, 09 May 2020 03:40:41 GMT
...
* Received HTTP code 503 from proxy after CONNECT

This part is expected, because curl does not support CIDRs in NO_PROXY; see https://curl.haxx.se/docs/manual.html: "A comma-separated list of host names that shouldn't go through any proxy is set in ... NO_PROXY". After appending the IP itself with ` export NO_PROXY="$NO_PROXY,10.128.0.10" `, the subsequent ` curl -i -k https://10.128.0.10:8443 ` worked without the 503 issue. That said, this env does not show the "openshift-apiserver False" issue of this bug.
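Since curl only honors literal host entries in NO_PROXY, here is a hedged sketch of the same workaround, generalized to all openshift-apiserver endpoint IPs (it assumes a shell that has both oc and the proxy environment variables available):

  EP_IPS=$(oc get endpoints -n openshift-apiserver api \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  export NO_PROXY="${NO_PROXY},$(echo ${EP_IPS} | tr ' ' ',')"
  for ip in ${EP_IPS}; do
    # a 403 from system:anonymous means the endpoint was reached without the proxy
    curl -sk -o /dev/null -w "${ip} -> HTTP %{http_code}\n" --connect-timeout 5 "https://${ip}:8443/"
  done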
Now that it's clear the calls to apiserver are not being proxied, I'm reassigning to the SDN team.
This morning I checked the env again; `oc get co openshift-apiserver` is fine now:

openshift-apiserver   4.5.0-0.nightly-2020-06-01-165039   True   False   False   82m

> the master-0 KAS no route to the master-1 OAS pod
> the master-1 KAS no route to the master-0 OAS pod

But this "no route to host" issue from comment 13 still exists, and the OAS-O logs show "no route to host" and thus keep switching between False and True:

oc logs -n openshift-apiserver-operator openshift-apiserver-operator-7b598687b-9pfjm | grep "clusteroperator/openshift-apiserver changed: Available changed from"
I0603 00:53:36.975214       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"9e404cb5-5bac-47a3-a382-af659698de50", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.47:8443/apis/apps.openshift.io/v1: Get https://10.129.0.47:8443/apis/apps.openshift.io/v1: dial tcp 10.129.0.47:8443: connect: no route to host")
I0603 00:53:39.166852       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"9e404cb5-5bac-47a3-a382-af659698de50", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from False to True ("")
I0603 00:53:39.174551       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"9e404cb5-5bac-47a3-a382-af659698de50", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from False to True ("")
I0603 01:17:06.974338       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"9e404cb5-5bac-47a3-a382-af659698de50", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.47:8443/apis/apps.openshift.io/v1: Get https://10.129.0.47:8443/apis/apps.openshift.io/v1: dial tcp 10.129.0.47:8443: connect: no route to host")
I0603 01:17:09.220269       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"9e404cb5-5bac-47a3-a382-af659698de50", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from False to True ("")
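A small sketch to quantify the flapping, based on the grep pattern above (pointing oc logs at the deployment rather than a specific pod is an assumption; substitute the pod name if needed):

  oc logs -n openshift-apiserver-operator deployment/openshift-apiserver-operator \
    | grep -o 'Available changed from [A-Za-z]* to [A-Za-z]*' \
    | sort | uniq -c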
Checked more and found the openshift-apiserver-operator is on master-0 (the 10.129.0.47 in comment 16 is the IP of the openshift-apiserver pod on master-1):

[xxia@pres 2020-06-03 11:08:57 CST my]$ oc get po -n openshift-apiserver-operator -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP           NODE                          NOMINATED NODE   READINESS GATES
openshift-apiserver-operator-7b598687b-9pfjm   1/1     Running   2          26h   10.130.0.2   hongli-pl039-hmf46-master-0   <none>           <none>
Note: as comment 13 shows, comment 13's env does not have http(s)_proxy yet still hit the issue. Today I hit it again, this time in a upi-on-azure http_proxy env. Debugging turned up more clues about the network: communication from a pod on master-1 to any pod on master-2 fails with "Unable to connect to the server", while other communication, e.g. from a pod on master-1 to any pod on master-0, does not fail:

# check master-2 pods
$ oc get po -A -o wide | grep master-2 | grep -v "10\.0\.0" | grep -v Completed
openshift-apiserver              apiserver-86b47c6dcf-r6nvf           1/1   Running   0   20h    10.129.0.9    qe-jiazha-up3-06040541-master-2   <none>   <none>
openshift-controller-manager     controller-manager-bxhnp             1/1   Running   0   119m   10.129.0.22   qe-jiazha-up3-06040541-master-2   <none>   <none>
...
openshift-multus                 multus-admission-controller-gdzfj    2/2   Running   0   20h    10.129.0.7    qe-jiazha-up3-06040541-master-2   <none>   <none>

# check master-0 pods
[xxia@pres 2020-06-05 16:18:14 CST my]$ oc get po -A -o wide | grep master-0 | grep -v "10\.0\.0" | grep -v Completed
openshift-apiserver              apiserver-86b47c6dcf-tfspw           1/1   Running   0   21h    10.130.0.17   qe-jiazha-up3-06040541-master-0   <none>   <none>
...
openshift-controller-manager     controller-manager-jw8p6             1/1   Running   0   150m   10.130.0.34   qe-jiazha-up3-06040541-master-0   <none>   <none>
...
openshift-multus                 multus-admission-controller-2k8hj    2/2   Running   0   21h    10.130.0.9    qe-jiazha-up3-06040541-master-0   <none>   <none>

# ssh to master-1; communication with any of the above pods on master-2 fails with "Unable to connect to the server"
[core@qe-jiazha-up3-06040541-master-1 ~]$ oc get --insecure-skip-tls-verify --raw "/" --server https://10.129.0.9:8443/
Unable to connect to the server: net/http: TLS handshake timeout
[core@qe-jiazha-up3-06040541-master-1 ~]$ oc get --insecure-skip-tls-verify --raw "/" --server https://10.129.0.22:8443/
Unable to connect to the server: net/http: TLS handshake timeout
[core@qe-jiazha-up3-06040541-master-1 ~]$ oc get --insecure-skip-tls-verify --raw "/" --server https://10.129.0.7:8443/
Unable to connect to the server: net/http: TLS handshake timeout

# However, communication from master-1 with any of the above pods on master-0 does not fail with "Unable to connect to the server"
[core@qe-jiazha-up3-06040541-master-1 ~]$ oc get --insecure-skip-tls-verify --raw "/" --server https://10.130.0.17:8443/
Error from server (Forbidden): forbidden: User "system:anonymous" cannot get path "/"
[core@qe-jiazha-up3-06040541-master-1 ~]$ oc get --insecure-skip-tls-verify --raw "/" --server https://10.130.0.34:8443/
Error from server (Forbidden): forbidden: User "system:anonymous" cannot get path "/"
[core@qe-jiazha-up3-06040541-master-1 ~]$ oc get --insecure-skip-tls-verify --raw "/" --server https://10.130.0.9:8443/
error: You must be logged in to the server (the server has asked for the client to provide credentials)

Given that clue, I checked the logs of the above pods on master-2; all of them show many:

I0605 08:17:56.860067       1 log.go:172] http: TLS handshake error from 10.128.0.1:47786: EOF
I0605 08:18:44.713085       1 log.go:172] http: TLS handshake error from 10.128.0.1:48376: EOF

while the pods on master-0 and master-1 don't have such logs. 10.128.0.1 seems related to the "cidr: 10.128.0.0/14" above.
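To make the broken master-to-master pairs obvious, here is a hedged sketch of a pairwise probe under the same assumptions (run it from each master in turn; a 403/401 means the pod is reachable, a timeout means that path is broken):

  for ip in $(oc get pods -n openshift-apiserver -o jsonpath='{.items[*].status.podIP}'); do
    if code=$(curl -sk -o /dev/null -w '%{http_code}' --connect-timeout 5 "https://${ip}:8443/"); then
      echo "${ip} -> HTTP ${code}"         # 403 (anonymous) means reachable
    else
      echo "${ip} -> connect/TLS timeout"  # matches the handshake-timeout / no-route failures
    fi
  done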
I debugged more on the cluster that reproduced this issue, kubeconfig here: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/96986/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Found that all https services CANNOT be accessed from a master0 hostnetwork pod to a master1 container pod, but http works well, and all https services can be accessed from a master0 hostnetwork pod to a master1 container pod. See:

### there is one test pod I created on master-1 ###
$ oc get pod hello-pod2 -o wide
NAME         READY   STATUS    RESTARTS   AGE    IP            NODE                              NOMINATED NODE   READINESS GATES
hello-pod2   1/1     Running   0          6m4s   10.130.0.28   qe-yapei68sh2-06080632-master-1   <none>           <none>

### try to access the above test pod over https on port 8443 from a master-0 hostnetwork pod ###
$ oc exec multus-7c2m9 -n openshift-multus -- curl --connect-timeout 5 https://10.130.0.28:8443 -k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received
command terminated with exit code 28

### try to access the above test pod over http on port 8080 from a master-0 hostnetwork pod ###
$ oc exec multus-7c2m9 -n openshift-multus -- curl --connect-timeout 5 http://10.130.0.28:8080 -k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    28  100    28    0     0  10241      0                                0
Hello-OpenShift-1 http-8080

### try to access https 8443 from another pod, which is on master-2 ###
$ oc exec multus-rf4qw -n openshift-multus -- curl https://10.130.0.28:8443 -k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    29  100    29    0     0    829      0 --:--:-- --:--:-- --:--:--   852
Hello-OpenShift-1 https-8443
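Since small http requests succeed while the larger TLS handshake packets time out, this smells like an MTU/fragmentation problem on that one path. A hedged diagnostic sketch (pod name and IP are the ones above; the payload sizes are only illustrative, and ping may not be present in every image):

  # DF bit set ("-M do"): if the 1300-byte probe succeeds but the 1400-byte one
  # does not, something on the master-0 -> master-1 path is eating big packets
  oc exec -n openshift-multus multus-7c2m9 -- ping -c 3 -M do -s 1300 10.130.0.28
  oc exec -n openshift-multus multus-7c2m9 -- ping -c 3 -M do -s 1400 10.130.0.28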
> Found only all https services CANNOT be accessed from master0 hostnetwork
> pod to master1 container pod. but http works well. and all https services
> can be accessed from master0 hostnetwork pod to master1 container pod

Sorry, typo there: all https services can be accessed from a master2 hostnetwork pod to a master1 container pod.
Though this was hit frequently over the last two weeks, comment 39 didn't reproduce it, so I'm removing the keyword unless it is hit again.
Since we have been unable to reproduce this for the past three days, the severity has been lowered. We will continue to investigate this and we can consider a backport once the real issue is understood.
(In reply to Anurag saxena from comment #44)
> Apparently on above debug cluster, kubeapiserver on master2 is continuously complaining timeouts

Yeah, it shows timeouts because, as the comments above found, communication from one master (here master-2, as you pointed out) to pods on another master fails. Note that this affects all pods on the other master, including the openshift-apiserver pod there; since the openshift-apiserver pods host the v1.xxx.openshift.io resources, the request returns 503 whenever it queries the openshift-apiserver endpoint on that other master.
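For reference, a sketch to list exactly which API groups are served by openshift-apiserver, i.e. the ones that return 503 whenever the aggregator picks the unreachable endpoint (the jsonpath filter is my assumption; adjust if it does not match your version):

  oc get apiservices \
    -o jsonpath='{range .items[?(@.spec.service.namespace=="openshift-apiserver")]}{.metadata.name}{"\n"}{end}'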
This is a testblocker for now, especially blocking OVN hybrid Windows clusters on Azure. These clusters seem to end up with a degraded apiserver within 12 hours, blocking further testing.
@mcambria, `ip route show cache` on one of the 6 nodes (that node is one of the masters) says:

# oc debug node/reliab453ovn2-kv2xm-master-0 -- chroot /host ip route show cache
Starting pod/reliab453ovn2-kv2xm-master-0-debug ...
To use host binaries, run `chroot /host`
10.0.0.8 dev eth0
    cache expires 317sec mtu 1400
Removing debug pod ...

Kubeconfig: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/103439/artifact/workdir/install-dir/auth/kubeconfig
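A hedged sketch to check every node for those clamped cache entries and flush them (the flush is the same action the KB workaround performs; the cache repopulates on demand):

  for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${node} =="
    # only flush when a clamped (mtu 14xx) cached route is present
    oc debug "node/${node}" -- chroot /host sh -c \
      'ip route show cache | grep "mtu 14" && ip route flush cache' 2>/dev/null
  done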
*** Bug 1861359 has been marked as a duplicate of this bug. ***
Created attachment 1712185 [details] corrected second packets
*** Bug 1840112 has been marked as a duplicate of this bug. ***
Knowledge Base article https://access.redhat.com/solutions/5252831 describes the workaround. But the article doesn't describe how to apply it to the nodes. It would be best to use a daemonset to make sure it runs on all nodes. You can just have the daemonset run a bash loop forever as they are doing at https://github.com/Azure/ARO-RP/blob/master/pkg/routefix/routefix.go#L31. The daemonset they use is at https://github.com/Azure/ARO-RP/blob/master/pkg/routefix/routefix.go#L147, but it is embedded in a go program.
Here is how to get the image to use. First get the name of the network-operator pod:

$ oc get pods --namespace openshift-network-operator -o wide
NAME                               READY   STATUS    RESTARTS   AGE    IP         NODE                         NOMINATED NODE   READINESS GATES
network-operator-8c7746884-2mm7p   1/1     Running   0          3d1h   10.0.0.6   qe-anurag54-hmprt-master-0   <none>           <none>
$

Describe this pod looking for Image:

$ oc describe pod --namespace openshift-network-operator network-operator-8c7746884-2mm7p | grep Image
    Image:          quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c11ebce7a9c619e0585c10b3a4cbc6f81c3c82670677587fa3e18525e1dc276
    Image ID:       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c11ebce7a9c619e0585c10b3a4cbc6f81c3c82670677587fa3e18525e1dc276
$

Use this image in the daemonset (also attached):

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: cachefix
  namespace: openshift-network-operator
  annotations:
    kubernetes.io/description: |
      This daemonset will flush route cache entries created with mtu of 1450.
      See https://bugzilla.redhat.com/show_bug.cgi?id=1825219
    release.openshift.io/version: "{{.ReleaseVersion}}"
spec:
  selector:
    matchLabels:
      app: cachefix
  template:
    metadata:
      labels:
        app: cachefix
        component: network
        type: infra
        openshift.io/component: network
        kubernetes.io/os: "linux"
    spec:
      hostNetwork: true
      priorityClassName: "system-cluster-critical"
      containers:
      #
      - name: cachefix
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c11ebce7a9c619e0585c10b3a4cbc6f81c3c82670677587fa3e18525e1dc276
        command:
        - /bin/bash
        - -c
        - |
          set -xe
          echo "I$(date "+%m%d %H:%M:%S.%N") - cachefix - start cachefix ${K8S_NODE}"
          for ((;;))
          do
            if ip route show cache | grep -q 'mtu 14'; then
              ip route show cache
              ip route flush cache
            fi
            sleep 60
          done
        lifecycle:
          preStop:
            exec:
              command: ["/bin/bash", "-c", "echo cachefix done"]
        securityContext:
          privileged: true
        env:
        - name: K8S_NODE
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      nodeSelector:
        beta.kubernetes.io/os: "linux"
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
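A minimal usage sketch, assuming you save the manifest above as cachefix-daemonset.yaml (the file name is my choice; note the templated release.openshift.io/version annotation needs to be removed or given a literal value before applying):

  oc apply -f cachefix-daemonset.yaml
  oc get daemonset -n openshift-network-operator cachefix
  # expect one running pod per node
  oc get pods -n openshift-network-operator -l app=cachefix -o wide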
Created attachment 1717284 [details] daemonset to clear route cache entries
The workaround Mike provided should get QE unblocked while we wait for Azure to work out what is really wrong, and for the kernel change that works around the issue.
*** Bug 1886141 has been marked as a duplicate of this bug. ***
*** Bug 1890341 has been marked as a duplicate of this bug. ***
*** Bug 1899349 has been marked as a duplicate of this bug. ***
Please ignore the comment above. Wrong ticket. Sorry.
*** Bug 1940706 has been marked as a duplicate of this bug. ***
*** Bug 1921797 has been marked as a duplicate of this bug. ***
Pulling this back until we get https://github.com/openshift/cluster-network-operator/pull/1107 merged too.
Hi Michael,

I have one cluster with version 4.8.0-0.nightly-2021-06-03-221810 on Azure that has been running for 2 days, and this issue has not happened, but I cannot confirm that the fixed PR is working well. Do you have a better way to verify that the fixed PR fixes this issue?

      volumeMounts:
      - mountPath: /etc/pki/tls/metrics-certs
        name: sdn-metrics-certs
        readOnly: true
    - command:
      - /bin/bash
      - -c
      - |
        set -xe
        touch /var/run/add_iptables.sh
        chmod 0755 /var/run/add_iptables.sh
        cat <<'EOF' > /var/run/add_iptables.sh
        #!/bin/sh
        if [ -z "$3" ]
        then
        echo "Called with host address missing, ignore"
        exit 0
        fi
        echo "Adding ICMP drop rule for '$3' "
        if iptables -C CHECK_ICMP_SOURCE -p icmp -s $3 -j ICMP_ACTION
        then
        echo "iptables already set for $3"
        else
        iptables -A CHECK_ICMP_SOURCE -p icmp -s $3 -j ICMP_ACTION
        fi
        EOF
        echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
        iptables -X CHECK_ICMP_SOURCE || true
        iptables -N CHECK_ICMP_SOURCE || true
        iptables -F CHECK_ICMP_SOURCE
        iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE || true
        iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE
        iptables -N ICMP_ACTION || true
        iptables -F ICMP_ACTION
        iptables -A ICMP_ACTION -j LOG
        iptables -A ICMP_ACTION -j DROP
        oc observe pods -n openshift-sdn -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
      env:
      - name: K8S_NODE
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dd70a5200b6de5bc872b2424701a81031bc212453b6d8b4d11e04995054ca952
      imagePullPolicy: IfNotPresent
      lifecycle:
        preStop:
          exec:
            command:
            - /bin/bash
            - -c
            - echo drop-icmp done
      name: drop-icmp
      resources:
        requests:
          cpu: 5m
          memory: 20Mi
      securityContext:
        privileged: true
(In reply to zhaozhanqi from comment #180)
> Do you have a better way to provide the fixed PR can fix this issue?

No. The issue takes 2 to 14 days to even show up. The best I can suggest is to check the iptables counters to see if any of the `ICMP_ACTION -j DROP` rules are non-zero.
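A sketch of that check across all sdn pods (the drop-icmp container name is taken from the manifest in the comment above; -x just prints exact counters):

  for pod in $(oc get pods -n openshift-sdn -l app=sdn -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${pod} =="
    # non-zero pkts on the DROP rule means the workaround actually intercepted
    # an ICMP fragmentation-needed packet on that node
    oc exec -n openshift-sdn "${pod}" -c drop-icmp -- iptables -t filter -nvxL ICMP_ACTION
  done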
I see that master-2 has caught packets (non-zero counters), in pod sdn-knq8n:

:ICMP_ACTION - [0:0]
[0:0] -A CHECK_ICMP_SOURCE -s 10.0.0.7/32 -p icmp -j ICMP_ACTION
[0:0] -A CHECK_ICMP_SOURCE -s 10.0.0.8/32 -p icmp -j ICMP_ACTION
[11:6336] -A CHECK_ICMP_SOURCE -s 10.0.0.6/32 -p icmp -j ICMP_ACTION
[0:0] -A CHECK_ICMP_SOURCE -s 10.0.32.5/32 -p icmp -j ICMP_ACTION
[0:0] -A CHECK_ICMP_SOURCE -s 10.0.32.4/32 -p icmp -j ICMP_ACTION
[0:0] -A CHECK_ICMP_SOURCE -s 10.0.32.6/32 -p icmp -j ICMP_ACTION
[11:6336] -A ICMP_ACTION -j LOG
[11:6336] -A ICMP_ACTION -j DROP

And the operators are all working well:

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-08-161629   True        False         False      17h
baremetal                                  4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
cloud-credential                           4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
cluster-autoscaler                         4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
config-operator                            4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
console                                    4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
dns                                        4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
etcd                                       4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
image-registry                             4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
ingress                                    4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
insights                                   4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
kube-apiserver                             4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
kube-controller-manager                    4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
kube-scheduler                             4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
machine-api                                4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
machine-approver                           4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
machine-config                             4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
marketplace                                4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
monitoring                                 4.8.0-0.nightly-2021-06-08-161629   True        False         False      3h42m
network                                    4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
node-tuning                                4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
openshift-apiserver                        4.8.0-0.nightly-2021-06-08-161629   True        False         False      29h
openshift-controller-manager               4.8.0-0.nightly-2021-06-08-161629   True        False         False      4d23h
openshift-samples                          4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
service-ca                                 4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h
storage                                    4.8.0-0.nightly-2021-06-08-161629   True        False         False      5d23h

Moving this bug to 'verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
This bug shipped with some linked pull requests shipping in 4.8.2. It is important to track that product change. If you see similar issues in 4.8.2 or later releases, please open a new bug, which may link its own product changing PRs, and ship in some subsequent release.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days