openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster

Cluster History:
================
The cluster was upgraded from 4.10.0-0.nightly-ppc64le-2021-11-17-063238 to 4.10.0-0.nightly-ppc64le-2022-01-13-022003.
The cluster is 97 days old (counted from the day of creation).
Before the upgrade, Memory% had exceeded 100% on the worker-1 node. The upgrade operation was successful after rebooting the worker-1 node.

After Upgrade:
==============
All co's are good. All nodes are in Ready state.

NAME                               STATUS   ROLES    AGE   VERSION
rdr-ap-longev-410-mon01-master-0   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-master-1   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-master-2   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-worker-0   Ready    worker   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-worker-1   Ready    worker   97d   v1.23.0+50f645e

Issue seen:
===========
ip-reconciler-* pods are in Error state.

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE          NAME                           READY   STATUS   RESTARTS   AGE   IP         NODE                               NOMINATED NODE   READINESS GATES
openshift-multus   ip-reconciler-27375270-2sdrp   0/1     Error    0          40h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus   ip-reconciler-27375270-7h2mj   0/1     Error    0          40h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus   ip-reconciler-27375270-8gjts   0/1     Error    0          41h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus   ip-reconciler-27375270-c2tg8   0/1     Error    0          40h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus   ip-reconciler-27375270-fwpjf   0/1     Error    0          41h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus   ip-reconciler-27375270-j25s5   0/1     Error    0          40h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus   ip-reconciler-27375270-pjfdx   0/1     Error    0          40h   10.*.*.*   rdr-ap-longev-410-mon01-worker-0   <none>           <none>

# oc logs pod/ip-reconciler-27375270-2sdrp -n openshift-multus
I0118 14:31:22.335494       1 request.go:655] Throttling request took 1.079150668s, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
2022-01-18T14:31:32Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-18T14:31:32Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

Described the pod in Error state:
=================================
# oc describe po ip-reconciler-27375270-2sdrp -n openshift-multus
Name:                 ip-reconciler-27375270-2sdrp
Namespace:            openshift-multus
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 rdr-ap-longev-410-mon01-worker-0/193.*.*.*
Start Time:           Tue, 18 Jan 2022 14:31:17 +0000
Labels:               controller-uid=0b2202e8-57a9-48f7-b8f5-05c0b3fc89ec
                      job-name=ip-reconciler-27375270
Annotations:          k8s.v1.cni.cncf.io/network-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.*.*.*" ], "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.*.*.*" ], "default": true, "dns": {} }]
Status:               Failed
IP:                   10.*.*.*
IPs:
  IP:           10.*.*.*
Controlled By:  Job/ip-reconciler-27375270
Containers:
  whereabouts:
    Container ID:  cri-o://5f96ee2124cb80623b01119928e491a4119ecd12db7cebb23f10942698e64283
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a01c5d18877255a6f2266d3188817bfdf36f234463b576a5c527877fbc711901
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a01c5d18877255a6f2266d3188817bfdf36f234463b576a5c527877fbc711901
    Port:          <none>
    Host Port:     <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 18 Jan 2022 14:31:20 +0000
      Finished:     Tue, 18 Jan 2022 14:31:32 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zpvq5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/cni/net.d
    HostPathType:
  kube-api-access-zpvq5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Jobs:
=====
# oc get job -A
NAMESPACE                              NAME                        COMPLETIONS   DURATION   AGE
openshift-cluster-version              version--47p2h              1/1           15s        63d
openshift-cluster-version              version--pm7tv              1/1           10s        6d13h
openshift-image-registry               image-pruner-27329760       0/1           33d        33d
openshift-image-registry               image-pruner-27374400       1/1           6s         2d7h
openshift-image-registry               image-pruner-27375840       1/1           6s         31h
openshift-image-registry               image-pruner-27377280       1/1           6s         7h35m
openshift-multus                       ip-reconciler-27375270      0/1           41h        41h
openshift-operator-lifecycle-manager   collect-profiles-27377700   1/1           12s        35m
openshift-operator-lifecycle-manager   collect-profiles-27377715   1/1           10s        20m
openshift-operator-lifecycle-manager   collect-profiles-27377730   1/1           12s        5m48s
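For anyone triaging: the failure is in listing the Whereabouts IP pools, so a quick sanity check is to run the same kind of list from outside the pod and see whether it completes promptly. This is only a hedged sketch assuming the standard Whereabouts CRD names (ippools.whereabouts.cni.cncf.io and overlappingrangeipreservations.whereabouts.cni.cncf.io); verify the names with the first command.

$ oc get crd | grep whereabouts
$ time oc get ippools.whereabouts.cni.cncf.io -A
$ time oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A

If these list calls are also slow or time out, the problem is on the API path rather than in the reconciler itself.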
This is likely a network or load-balancer problem. You can prove otherwise by looking into the kube-apiserver audit log to find the request that times out. Moving to multus until there is proof that the request actually arrives at the apiserver.
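A minimal sketch of that audit-log check, assuming the default OpenShift audit log location (kube-apiserver/audit.log on each master, readable via oc adm node-logs) and that the failing list is against the Whereabouts ippools resource; <master-node> and the timestamp are placeholders to adjust:

$ oc get nodes -l node-role.kubernetes.io/master=
$ oc adm node-logs <master-node> --path=kube-apiserver/audit.log | grep ippools | grep '2022-01-18T14:31'

If no matching audit event shows up on any master around the failure time, the request most likely never reached the apiserver, which would point back at the network/LB theory.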
Could this be related to https://github.com/k8snetworkplumbingwg/whereabouts/issues/172 ?
Please find below the must-gather tar from the above cluster where the issue is seen: https://drive.google.com/file/d/1AZDEWbp2fUUGo52zmmL4blluRx0tplvs/view?usp=sharing
Thank you Stefan for the recommendations. It looks like this is related to the etcd load balancer, but I'm not sure what to do next.

On master-0 I'm hitting this error (added a couple months ago): https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/monitor/monitor.go#L174

Here are some logs from /var/log/etcd/etcd-health-probe.log:

{"level":"error","ts":"2022-01-19T17:00:18.975Z","caller":"monitor/monitor.go:175","msg":"failed to create new health monitor","error":"context deadline exceeded","stacktrace":"github.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:175\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.NewMonitorCommand.func1.1\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:87\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.NewMonitorCommand.func1\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra.1/command.go:860\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra.1/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra.1/command.go:902\nmain.main\n\tgithub.com/openshift/cluster-etcd-operator/cmd/cluster-etcd-operator/main.go:45\nruntime.main\n\truntime/proc.go:255"}

About 3 minutes later in the same file I see "error":"slow request":

{"level":"info","ts":"2022-01-19T17:03:15.497Z","caller":"monitor/monitor.go:179","msg":"health monitor is starting","pod":"etcd-master-0.sonig-f38d.ibm.com","static-pod-version":7}
{"level":"info","ts":"2022-01-19T17:03:16.001Z","caller":"health/log.go:41","msg":"possible service disruption detected","addresses":["https://9.47.89.104:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2022-01-19T17:03:15.501Z","error":"slow request"}

which is from port 2379.

Then, looking in the /var/log/kube-apiserver logs, I searched for request timeouts, which returned one result in termination.log:

[root@master-0 kube-apiserver]# grep -rnw ./ -e "Timeout: request"
./termination.log:21:E0119 17:03:03.087992      16 available_controller.go:546] v1.security.openshift.io failed with: Timeout: request did not complete within requested timeout - context deadline exceeded

Here's the timeline:
1. At 17:00:18.975Z etcd reports "msg":"failed to create new health monitor","error":"context deadline exceeded"
2. At 17:03:03.087992 kube-apiserver reports the request timeout
3. At 17:03:15.497Z etcd reports "msg":"health monitor is starting"
4. At 17:03:16.001Z, on the next line in the etcd log, there is "msg":"possible service disruption detected","addresses":["https://9.47.89.104:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2022-01-19T17:03:15.501Z","error":"slow request"

There are a lot of repeating logs at the bottom, and unfortunately I cannot go any earlier in any of the logs to investigate; these are the earliest logs I have. I will attach the beginning of the kube-apiserver termination.log because there are a lot more errors, but it just starts repeating at the bottom.

From master-0 I started toolbox thinking nmap was available, but only did this:

⬢[root@toolbox ~]# cat /etc/services | grep 2379
etcd-client     2379/tcp                # etcd client communication

I still do not know why etcd could not create a health monitor. I'm not sure how to proceed or further debug this. Thanks
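Not something run above, but a hedged way to check etcd health directly, assuming the standard openshift-etcd static pods where the etcdctl container already has the client certificates and endpoints configured; <etcd-pod> is a placeholder:

$ oc get pods -n openshift-etcd | grep ^etcd-
$ oc rsh -n openshift-etcd -c etcdctl <etcd-pod> etcdctl endpoint health --cluster
$ oc rsh -n openshift-etcd -c etcdctl <etcd-pod> etcdctl endpoint status --cluster -w table

Slow or unhealthy endpoints here would line up with the QuorumReadSingleTarget "slow request" probe errors on port 2379.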
Created attachment 1852620 [details] kube-apiserver termination log
Reproducing the issue upon clean deployment of SNO. Version: 4.10.0-fc.2 with IPv6.

$ oc get pod -A | grep -v Run | grep -v Comple
NAMESPACE          NAME                           READY   STATUS   RESTARTS   AGE
openshift-multus   ip-reconciler-27385455-24pp8   0/1     Error    0          11m
openshift-multus   ip-reconciler-27385455-bcllq   0/1     Error    0          8m36s
openshift-multus   ip-reconciler-27385455-bqb8p   0/1     Error    0          8m53s
openshift-multus   ip-reconciler-27385455-dp46g   0/1     Error    0          9m25s
openshift-multus   ip-reconciler-27385455-kmknm   0/1     Error    0          9m9s
openshift-multus   ip-reconciler-27385455-krpdk   0/1     Error    0          8m20s
openshift-multus   ip-reconciler-27385455-xmn7g   0/1     Error    0          9m41s

If I delete the pods in Error state, they don't attempt to restart. If I reboot the SNO, the issue reproduces.
My assessment is twofold:

1. The error encountered by the Whereabouts IPAM IP reconciliation process is a negative side effect of some other event on the cluster -- be it API server load or a network problem. This failure of the IP reconciler is a symptom, not a cause.

2. This does bring to light that the reconciler could use improvements so that it does not cause so much commotion when a problem does occur: it should be able to handle this cleanly, and if it doesn't complete a run within a short period, it's OK for it to run later. Miguel Duarte has put forward changes upstream, and has also implemented a downstream CNO patch which should mitigate this noisiness.
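For reference on where such mitigations land: the reconciler runs from the ip-reconciler CronJob in openshift-multus, and the knobs that determine how much noise a failed run leaves behind are ordinary CronJob/Job fields, not anything specific to this fix. A hedged way to see the current settings:

$ oc get cronjob ip-reconciler -n openshift-multus -o yaml | grep -E 'schedule:|concurrencyPolicy|successfulJobsHistoryLimit|failedJobsHistoryLimit|backoffLimit|restartPolicy'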
Tested and verified when upgrading from 4.10.0-0.nightly-2022-01-24-020644 to 4.10.0-0.nightly-2022-01-27-104747.

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-020644   True        True          10m     Working towards 4.10.0-0.nightly-2022-01-27-104747: 96 of 769 done (12% complete)

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-020644   True        True          13m     Working towards 4.10.0-0.nightly-2022-01-27-104747: 116 of 769 done (15% complete)

[weliang@weliang ~]$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES

[weliang@weliang ~]$ oc get co | grep -v "True.*False.*False"
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-27-104747   True        False         28m     Cluster version is 4.10.0-0.nightly-2022-01-27-104747

[weliang@weliang ~]$
Met the same issue after restarting all master/worker machines:
1. Create an IPI cluster on AWS.
2. Restart all master and worker EC2 instances.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-26-234447   True        False         14h     Cluster version is 4.10.0-0.nightly-2022-01-26-234447

$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE          NAME                           READY   STATUS   RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
openshift-multus   ip-reconciler-27388905-gh7c5   0/1     Error    0          4h24m   10.129.2.11   ip-10-0-156-72.us-east-2.compute.internal   <none>           <none>

$ oc get co | grep -v "True.*False.*False"
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

$ oc logs -n openshift-multus ip-reconciler-27388905-gh7c5
I0128 01:46:45.330141       1 request.go:655] Throttling request took 1.119493351s, request: GET:https://172.30.0.1:443/apis/policy/v1?timeout=32s
I0128 01:46:55.330278       1 request.go:655] Throttling request took 11.118673701s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-28T01:46:55Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T01:46:55Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

must-gather logs: https://drive.google.com/file/d/1eo12aqGQlapbEQYAsfs31vhKSiLiKf_L/view?usp=sharing
The issue is NOT reproduced with 4.10.0-fc.3 on the Power platform.

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.3   True        False         3h48m   Cluster version is 4.10.0-fc.3

# oc get co | grep -v "True.*False.*False"
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES

# oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-47qkj                          1/1     Running   0          4h8m
multus-55b22                          1/1     Running   0          3h56m
multus-66zws                          1/1     Running   0          3h57m
multus-additional-cni-plugins-524bj   1/1     Running   0          3h56m
multus-additional-cni-plugins-5q8z7   1/1     Running   0          3h57m
multus-additional-cni-plugins-t5lsx   1/1     Running   0          4h8m
multus-additional-cni-plugins-wswbb   1/1     Running   0          4h8m
multus-additional-cni-plugins-xkqjl   1/1     Running   0          4h8m
multus-admission-controller-2crrp     2/2     Running   0          4h8m
multus-admission-controller-klgll     2/2     Running   0          4h8m
multus-admission-controller-r8h4p     2/2     Running   0          4h8m
multus-b86hs                          1/1     Running   0          4h8m
multus-qr2sf                          1/1     Running   0          4h8m
network-metrics-daemon-4sw5p          2/2     Running   0          3h56m
network-metrics-daemon-5fnkb          2/2     Running   0          4h8m
network-metrics-daemon-5qrsq          2/2     Running   0          3h57m
network-metrics-daemon-6l5f9          2/2     Running   0          4h8m
network-metrics-daemon-lmkjk          2/2     Running   0          4h8m
Likely needs the fixes related to https://bugzilla.redhat.com/show_bug.cgi?id=2048575 as well, which switch to using the api-int load balancer.
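One hedged way to confirm which endpoint a given build's reconciler actually talks to is to compare the apiserver URL in the failing pod's own log with the server in the Whereabouts kubeconfig on the node. The whereabouts.d/whereabouts.kubeconfig path below follows the usual Whereabouts layout under the host CNI config directory from the pod spec (/etc/kubernetes/cni/net.d), but verify it exists on your nodes; <ip-reconciler-pod> and <node> are placeholders.

$ oc logs -n openshift-multus <ip-reconciler-pod> | grep -o 'https://[^/]*' | sort -u
$ oc debug node/<node> -- chroot /host grep server: /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig

A service IP such as 172.30.0.1 means the pod goes through the cluster service network, while api-int.<cluster-domain>:6443 means it goes through the internal load balancer referenced in that bug.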
(In reply to Alisha from comment #14)
> The issue is NOT reproduced with 4.10.0-fc.3 on the Power platform.

Issue did not reproduce after upgrading the above cluster (4.10.0-fc.3) to 4.10.0-rc.0
======================================================================================
# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         55m     Cluster version is 4.10.0-rc.0

All pods are healthy. No ip-reconciler-* pods are found on the cluster.

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE   IP   NODE

Also, manually rebooted both the master and worker nodes after the upgrade.
===========================================================================
The issue is not seen. The cluster is healthy.
Hi Yunfei, could you double-check whether you can reproduce the problem on your side? Otherwise, I will close this bug. Thanks!
The issues in comment 13 are not encountered every time; they reproduce approximately once every three trials.
Issue is seen with the direct installation of 4.10.0-rc.1 on ppc64le:

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         32h     Cluster version is 4.10.0-rc.1

# oc get pods --all-namespaces -o=wide | grep -vi running | grep -vi completed
NAMESPACE          NAME                           READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
openshift-multus   ip-reconciler-27404790-fz6c5

# oc get job -A
NAMESPACE                              NAME                        COMPLETIONS   DURATION   AGE
openshift-image-registry               image-pruner-27404640       1/1           6s         16h
openshift-multus                       ip-reconciler-27404790      0/1           14h        14h
openshift-operator-lifecycle-manager   collect-profiles-27405600   1/1           8s         33m
openshift-operator-lifecycle-manager   collect-profiles-27405615   1/1           12s        18m
openshift-operator-lifecycle-manager   collect-profiles-27405630   1/1           11s        3m34s

# oc describe job ip-reconciler-27404790 -n openshift-multus
Name:                 ip-reconciler-27404790
Namespace:            openshift-multus
Selector:             controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
Labels:               controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
                      job-name=ip-reconciler-27404790
Annotations:          batch.kubernetes.io/job-tracking:
Controlled By:        CronJob/ip-reconciler
Parallelism:          1
Completions:          1
Completion Mode:      NonIndexed
Start Time:           Mon, 07 Feb 2022 21:30:00 -0500
Pods Statuses:        0 Active / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
                    job-name=ip-reconciler-27404790
  Service Account:  multus
  Containers:
   whereabouts:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:254efd7dbb69e80ca2899aed5fbbc0bf334c90950090ed7b59ab13fa45089e06
    Port:       <none>
    Host Port:  <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
  Volumes:
   cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/cni/net.d
    HostPathType:
Priority Class Name:  system-cluster-critical
Events:               <none>

# oc get co | grep -v "True.*False.*False"
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
I've got the very same issue with rc.1 on Azure OpenShift:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         46h     Cluster version is 4.10.0-rc.1

$ oc get job -A
NAMESPACE                              NAME                                                              COMPLETIONS   DURATION   AGE
openshift-cluster-version              version-4.10.0-rc.1-wpszt                                         1/1           8s         46h
openshift-image-registry               image-pruner-27406080                                             1/1           4s         8h
openshift-marketplace                  33e0b5bfab49b1802c8543c13593c4ab6e56c087bf00159a8bd5859141091af   1/1           8s         6d13h
openshift-multus                       ip-reconciler-27401085                                            0/1           3d20h      3d20h
openshift-operator-lifecycle-manager   collect-profiles-27406605                                         1/1           9s         14m

$ oc describe job ip-reconciler-27401085 -n openshift-multus
Name:                 ip-reconciler-27401085
Namespace:            openshift-multus
Selector:             controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
Labels:               controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
                      job-name=ip-reconciler-27401085
Annotations:          batch.kubernetes.io/job-tracking:
Controlled By:        CronJob/ip-reconciler
Parallelism:          1
Completions:          1
Completion Mode:      NonIndexed
Start Time:           Sat, 05 Feb 2022 13:45:00 +0100
Pods Statuses:        0 Active / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
                    job-name=ip-reconciler-27401085
  Service Account:  multus
  Containers:
   whereabouts:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a6ac1fb6f73f65a36a15c8ca3cdab182def90289e74d7bd1692d44cb8a3b7e0
    Port:       <none>
    Host Port:  <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
  Volumes:
   cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/cni/net.d
    HostPathType:
Priority Class Name:  system-cluster-critical
Events:               <none>
(In reply to Alisha from comment #20)
> Issue is seen with the direct installation of 4.10.0-rc.1 on ppc64le

Link to the data gathered from the above cluster using the oc adm must-gather command: https://drive.google.com/file/d/18gKBqwIAZBUVonX7hiW_T3MTG_vk2YxD/view?usp=sharing
Reproduced this (4/4) issue on AWS, same as comment 13.

+dosmith +@weliang will this issue be fixed in 4.10.0? Is there any workaround?

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-26-230022   True        False         106m    Cluster version is 4.10.0-0.nightly-2022-02-26-230022

$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE          NAME                           READY   STATUS   RESTARTS   AGE   IP          NODE                                      NOMINATED NODE   READINESS GATES
openshift-multus   ip-reconciler-27433815-cs5bn   0/1     Error    0          31m   10.0.77.1   ip-10-0-77-1.us-east-2.compute.internal   <none>           <none>

$ oc get co | grep -v "True.*False.*False"
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

$ oc logs -n openshift-multus ip-reconciler-27433815-cs5bn
I0228 06:16:56.500105       1 request.go:665] Waited for 1.166754462s due to client-side throttling, not priority and fairness, request: GET:https://api-int.yunjiang-28u2.qe.devcluster.openshift.com:6443/apis/security.internal.openshift.io/v1?timeout=32s
I0228 06:17:06.500159       1 request.go:665] Waited for 11.16662233s due to client-side throttling, not priority and fairness, request: GET:https://api-int.yunjiang-28u2.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1?timeout=32s
2022-02-28T06:17:06Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-28T06:17:06Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
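On the workaround question: nothing official is stated in this bug, but since the failed Job is just the periodic reconciler and a later run can succeed, one hedged option is to delete the failed Job (its Error pods go with it) and, if you want immediate confirmation, trigger a one-off run from the CronJob. The job name below is taken from this comment's output, and the manual job name is arbitrary:

$ oc delete job -n openshift-multus ip-reconciler-27433815
$ oc create job -n openshift-multus --from=cronjob/ip-reconciler ip-reconciler-manual-check
$ oc logs -n openshift-multus -l job-name=ip-reconciler-manual-check --tail=20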
yunjiang, I cannot reproduce the issue in the latest 4.11 build, so I will close this bug now. Please feel free to reopen it if you still see the problem in 4.11.

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         15m     Cluster version is 4.11.0-0.nightly-2022-05-25-193227

[weliang@weliang ~]$ oc get pod -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-4pcgq                          1/1     Running   0          26m
multus-additional-cni-plugins-4cs5h   1/1     Running   0          26m
multus-additional-cni-plugins-69bbg   1/1     Running   0          26m
multus-additional-cni-plugins-gzvgg   1/1     Running   0          34m
multus-additional-cni-plugins-jg8rx   1/1     Running   0          34m
multus-additional-cni-plugins-sx9vw   1/1     Running   0          34m
multus-additional-cni-plugins-xjv7h   1/1     Running   0          26m
multus-admission-controller-kdcrt     2/2     Running   0          33m
multus-admission-controller-p9v95     2/2     Running   0          34m
multus-admission-controller-wpdvm     2/2     Running   0          33m
multus-c5dnq                          1/1     Running   0          34m
multus-gl5kk                          1/1     Running   0          26m
multus-s8zzb                          1/1     Running   0          34m
multus-wk6dr                          1/1     Running   0          26m
multus-z5tjb                          1/1     Running   0          34m
network-metrics-daemon-bnz6g          2/2     Running   0          26m
network-metrics-daemon-czhgp          2/2     Running   0          34m
network-metrics-daemon-dlt2g          2/2     Running   0          26m
network-metrics-daemon-hqk5p          2/2     Running   0          26m
network-metrics-daemon-j524z          2/2     Running   0          34m
network-metrics-daemon-pj4nm          2/2     Running   0          34m
[weliang@weliang ~]$