Bug 2042796 - openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster [NEEDINFO]
Summary: openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: ppc64le
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Miguel Duarte Barroso
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-20 07:16 UTC by Alisha
Modified: 2022-05-31 14:14 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-31 14:14:06 UTC
Target Upstream Version:
yunjiang: needinfo? (dosmith)


Attachments
kube-apiserver termination log (8.07 KB, text/plain)
2022-01-21 21:51 UTC, Mick Tarsel


Links
Github openshift/cluster-network-operator pull 1290 (open): Bug 2042796: whereabouts, reconciler: disable retries on failure. Last updated: 2022-01-25 19:08:12 UTC

Description Alisha 2022-01-20 07:16:30 UTC

Comment 1 Alisha 2022-01-20 07:36:51 UTC
openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster


Cluster History : 
=================
The cluster was upgraded from 4.10.0-0.nightly-ppc64le-2021-11-17-063238 to 4.10.0-0.nightly-ppc64le-2022-01-13-022003
The cluster is 97 days old (counted from the day of creation).
Before the upgrade, memory usage on the worker-1 node had exceeded 100%. The upgrade succeeded after rebooting worker-1.


After Upgrade : 
===============
All cluster operators (co) are healthy.
All nodes are in Ready state.
NAME                               STATUS   ROLES    AGE   VERSION
rdr-ap-longev-410-mon01-master-0   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-master-1   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-master-2   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-worker-0   Ready    worker   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-worker-1   Ready    worker   97d   v1.23.0+50f645e


Issue seen : 
============
ip-reconciler-* pods are in Error state.

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                              READY   STATUS      RESTARTS         AGE     IP                NODE                               NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27375270-2sdrp                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-7h2mj                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-8gjts                                      0/1     Error       0                41h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-c2tg8                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-fwpjf                                      0/1     Error       0                41h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-j25s5                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-pjfdx                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>



# oc logs pod/ip-reconciler-27375270-2sdrp -n openshift-multus
I0118 14:31:22.335494       1 request.go:655] Throttling request took 1.079150668s, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
2022-01-18T14:31:32Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-18T14:31:32Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded


Described the pod in error state : 
==================================

# oc describe po ip-reconciler-27375270-2sdrp -n openshift-multus
Name:                 ip-reconciler-27375270-2sdrp
Namespace:            openshift-multus
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 rdr-ap-longev-410-mon01-worker-0/193.*.*.*
Start Time:           Tue, 18 Jan 2022 14:31:17 +0000
Labels:               controller-uid=0b2202e8-57a9-48f7-b8f5-05c0b3fc89ec
                      job-name=ip-reconciler-27375270
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.*.*.*"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.*.*.*"
                            ],
                            "default": true,
                            "dns": {}
                        }]
Status:               Failed
IP:                   10.*.*.*
IPs:
  IP:           10.*.*.*
Controlled By:  Job/ip-reconciler-27375270
Containers:
  whereabouts:
    Container ID:  cri-o://5f96ee2124cb80623b01119928e491a4119ecd12db7cebb23f10942698e64283
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a01c5d18877255a6f2266d3188817bfdf36f234463b576a5c527877fbc711901
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a01c5d18877255a6f2266d3188817bfdf36f234463b576a5c527877fbc711901
    Port:          <none>
    Host Port:     <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 18 Jan 2022 14:31:20 +0000
      Finished:     Tue, 18 Jan 2022 14:31:32 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zpvq5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/cni/net.d
    HostPathType:
  kube-api-access-zpvq5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>


Jobs :
======

# oc get job -A
NAMESPACE                              NAME                        COMPLETIONS   DURATION   AGE
openshift-cluster-version              version--47p2h              1/1           15s        63d
openshift-cluster-version              version--pm7tv              1/1           10s        6d13h
openshift-image-registry               image-pruner-27329760       0/1           33d        33d
openshift-image-registry               image-pruner-27374400       1/1           6s         2d7h
openshift-image-registry               image-pruner-27375840       1/1           6s         31h
openshift-image-registry               image-pruner-27377280       1/1           6s         7h35m
openshift-multus                       ip-reconciler-27375270      0/1           41h        41h
openshift-operator-lifecycle-manager   collect-profiles-27377700   1/1           12s        35m
openshift-operator-lifecycle-manager   collect-profiles-27377715   1/1           10s        20m
openshift-operator-lifecycle-manager   collect-profiles-27377730   1/1           12s        5m48s

Comment 2 Stefan Schimanski 2022-01-20 08:34:13 UTC
This is likely a network or load-balancer problem. You can prove otherwise by checking the kube-apiserver audit log for the request that times out. Moving to multus until there is proof that the request actually reaches the apiserver.
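
A quick way to pull those audit logs from the CLI would be something along these lines (the grep pattern is only an example, aimed at the whereabouts ippools list requests):

$ oc adm node-logs --role=master --path=kube-apiserver/audit.log | grep ippools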

Comment 3 Manoj Kumar 2022-01-21 13:51:50 UTC
Could this be related to https://github.com/k8snetworkplumbingwg/whereabouts/issues/172?

Comment 4 Alisha 2022-01-21 15:04:15 UTC
Please find below the must-gather tar from the above cluster where the issue is seen : 

Link : https://drive.google.com/file/d/1AZDEWbp2fUUGo52zmmL4blluRx0tplvs/view?usp=sharing

Comment 5 Mick Tarsel 2022-01-21 21:50:39 UTC
Thank you Stefan for the recommendations. It looks like this is related to the etcd load balancer, but I'm not sure what to do next.

On master-0 I’m hitting this error (added a couple months ago):
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/monitor/monitor.go#L174

Here are some logs from /var/log/etcd/etcd-health-probe.log:

{"level":"error","ts":"2022-01-19T17:00:18.975Z","caller":"monitor/monitor.go:175","msg":"failed to create new health monitor","error":"context deadline exceeded","stacktrace":"github.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:175\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.NewMonitorCommand.func1.1\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:87\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.NewMonitorCommand.func1\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra.1/command.go:860\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra.1/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra.1/command.go:902\nmain.main\n\tgithub.com/openshift/cluster-etcd-operator/cmd/cluster-etcd-operator/main.go:45\nruntime.main\n\truntime/proc.go:255"}


About 3 minutes later in the same file, I see "error":"slow request":

{"level":"info","ts":"2022-01-19T17:03:15.497Z","caller":"monitor/monitor.go:179","msg":"health monitor is starting","pod":"etcd-master-0.sonig-f38d.ibm.com","static-pod-version":7}
{"level":"info","ts":"2022-01-19T17:03:16.001Z","caller":"health/log.go:41","msg":"possible service disruption detected","addresses":["https://9.47.89.104:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2022-01-19T17:03:15.501Z","error":"slow request"}

That is from port 2379, the etcd client port.

Looking in the /var/log/kube-apiserver logs, I then searched for request timeouts, which returned one result in termination.log:

[root@master-0 kube-apiserver]# grep -rnw ./ -e "Timeout: request"
./termination.log:21:E0119 17:03:03.087992      16 available_controller.go:546] v1.security.openshift.io failed with: Timeout: request did not complete within requested timeout - context deadline exceeded

Here's the timeline:
1. At T17:00:18.975Z Etcd reports "msg":"failed to create new health monitor","error":"context deadline exceeded"
2. At T17:03:03.087992  Kube-apiserver reports request timeout
3. At T17:03:15.497Z Etcd reports "msg":"health monitor is starting"
4. At T17:03:16.001Z in the next line in Etcd there is "msg":"possible service disruption detected","addresses":["https://9.47.89.104:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2022-01-19T17:03:15.501Z","error":"slow request"}

There are a lot of repeating logs at the bottom, and unfortunately I cannot go any earlier in any of the logs to investigate; these are the earliest logs I have.

I will attach the beginning of the kube-apiserver termination.log, because there are many more errors before it just starts repeating at the bottom.

From master-0 I started toolbox, thinking nmap would be available, but it was not, so I just did this:

⬢[root@toolbox ~]# cat /etc/services | grep 2379
etcd-client     2379/tcp                # etcd client communication
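
A bash-only TCP check against the etcd client port could look like this from the toolbox shell (no nmap needed; the address is taken from the etcd log above):

# timeout 3 bash -c '</dev/tcp/9.47.89.104/2379' && echo "port 2379 reachable" || echo "port 2379 not reachable"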


I still do not know why Etcd could not create a health monitor. I’m not sure how to proceed or further debug this. Thanks

Comment 6 Mick Tarsel 2022-01-21 21:51:48 UTC
Created attachment 1852620 [details]
kube-apiserver termination log

Comment 7 Alexander Chuzhoy 2022-01-25 16:48:59 UTC
Reproduced the issue on a clean deployment of SNO version 4.10.0-fc.2 with IPv6:


oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE                                          NAME                                                                         READY   STATUS      RESTARTS       AGE
openshift-multus                                   ip-reconciler-27385455-24pp8                                                 0/1     Error       0              11m
openshift-multus                                   ip-reconciler-27385455-bcllq                                                 0/1     Error       0              8m36s
openshift-multus                                   ip-reconciler-27385455-bqb8p                                                 0/1     Error       0              8m53s
openshift-multus                                   ip-reconciler-27385455-dp46g                                                 0/1     Error       0              9m25s
openshift-multus                                   ip-reconciler-27385455-kmknm                                                 0/1     Error       0              9m9s
openshift-multus                                   ip-reconciler-27385455-krpdk                                                 0/1     Error       0              8m20s
openshift-multus                                   ip-reconciler-27385455-xmn7g                                                 0/1     Error       0              9m41s


If I delete the pods in Error state, they don't attempt to restart.
If I reboot the SNO, the issue reproduces.
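
For reference, the errored pods can be removed in one go (this only cleans up the failed pods, it does not address the root cause):

$ oc delete pods -n openshift-multus --field-selector=status.phase=Failed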

Comment 8 Douglas Smith 2022-01-25 18:57:22 UTC
My assessment is twofold:

1. The Whereabouts IPAM IP reconciliation process is negatively impacted by some other event on the cluster, be it API server load or a network problem. This failure of the IP reconciler is a symptom, not a cause.

2. This does bring to light that the reconciler could be improved so that it does not cause so much commotion when a problem occurs: it should handle this cleanly, and if a run does not complete properly in a short period, it is OK for it to run later. Miguel Duarte has put forward changes upstream, and has also implemented a downstream CNO patch which should mitigate this noisiness.
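
Once that CNO patch is in place, the effect should be visible on the CronJob's job template, presumably via its backoffLimit; a quick way to check after updating:

$ oc get cronjob ip-reconciler -n openshift-multus -o jsonpath='{.spec.jobTemplate.spec.backoffLimit}{"\n"}'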

Comment 12 Weibin Liang 2022-01-27 20:24:02 UTC
Tested and verified when upgrading from 4.10.0-0.nightly-2022-01-24-020644 to 4.10.0-0.nightly-2022-01-27-104747

[weliang@weliang ~]$ oc get clusterversion  
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-020644   True        True          10m     Working towards 4.10.0-0.nightly-2022-01-27-104747: 96 of 769 done (12% complete)
[weliang@weliang ~]$ oc get clusterversion      
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-020644   True        True          13m     Working towards 4.10.0-0.nightly-2022-01-27-104747: 116 of 769 done (15% complete)




[weliang@weliang ~]$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                        READY   STATUS      RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
[weliang@weliang ~]$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-27-104747   True        False         28m     Cluster version is 4.10.0-0.nightly-2022-01-27-104747
[weliang@weliang ~]$

Comment 13 Yunfei Jiang 2022-01-28 06:28:47 UTC
Met the same issue after restarting all master/worker machines:
1. create an IPI cluster on AWS
2. restart all master and worker ec2 instances

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-26-234447   True        False         14h     Cluster version is 4.10.0-0.nightly-2022-01-26-234447

$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                        READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27388905-gh7c5                                                0/1     Error       0               4h24m   10.129.2.11    ip-10-0-156-72.us-east-2.compute.internal    <none>           <none>

$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

$ oc logs -n openshift-multus ip-reconciler-27388905-gh7c5
I0128 01:46:45.330141       1 request.go:655] Throttling request took 1.119493351s, request: GET:https://172.30.0.1:443/apis/policy/v1?timeout=32s
I0128 01:46:55.330278       1 request.go:655] Throttling request took 11.118673701s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-28T01:46:55Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T01:46:55Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

must gather logs https://drive.google.com/file/d/1eo12aqGQlapbEQYAsfs31vhKSiLiKf_L/view?usp=sharing

Comment 14 Alisha 2022-01-28 13:55:18 UTC
The issue is NOT reproduced with 4.10.0-fc.3 on Power platform. 

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.3   True        False         3h48m   Cluster version is 4.10.0-fc.3

# oc get co | grep -v "True.*False.*False"
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                   READY   STATUS      RESTARTS        AGE     IP                NODE                                    NOMINATED NODE   READINESS GATES

# oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-47qkj                          1/1     Running   0          4h8m
multus-55b22                          1/1     Running   0          3h56m
multus-66zws                          1/1     Running   0          3h57m
multus-additional-cni-plugins-524bj   1/1     Running   0          3h56m
multus-additional-cni-plugins-5q8z7   1/1     Running   0          3h57m
multus-additional-cni-plugins-t5lsx   1/1     Running   0          4h8m
multus-additional-cni-plugins-wswbb   1/1     Running   0          4h8m
multus-additional-cni-plugins-xkqjl   1/1     Running   0          4h8m
multus-admission-controller-2crrp     2/2     Running   0          4h8m
multus-admission-controller-klgll     2/2     Running   0          4h8m
multus-admission-controller-r8h4p     2/2     Running   0          4h8m
multus-b86hs                          1/1     Running   0          4h8m
multus-qr2sf                          1/1     Running   0          4h8m
network-metrics-daemon-4sw5p          2/2     Running   0          3h56m
network-metrics-daemon-5fnkb          2/2     Running   0          4h8m
network-metrics-daemon-5qrsq          2/2     Running   0          3h57m
network-metrics-daemon-6l5f9          2/2     Running   0          4h8m
network-metrics-daemon-lmkjk          2/2     Running   0          4h8m

Comment 16 Douglas Smith 2022-02-01 15:21:17 UTC
Likely needs fixes related to https://bugzilla.redhat.com/show_bug.cgi?id=2048575 as well, which changes the reconciler to use the api-int load balancer.
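
The throttling lines in the reconciler pod logs show which endpoint was actually used (the 172.30.0.1 service VIP in comment 13 vs. api-int in comment 24 below), so a quick check against any failed pod is (pod name is just the example from comment 13):

$ oc logs -n openshift-multus ip-reconciler-27388905-gh7c5 | grep 'request: GET'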

Comment 17 Alisha 2022-02-04 19:06:32 UTC
(In reply to Alisha from comment #14)
> The issue is NOT reproduced with 4.10.0-fc.3 on Power platform. 


Issue did not reproduce after upgrading the above cluster (4.10.0-fc.3) to 4.10.0-rc.0
========================================================================================

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         55m     Cluster version is 4.10.0-rc.0

All pods are healthy. No ip-reconciler-* pods are found on the cluster.

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                   READY   STATUS      RESTARTS       AGE     IP                NODE


Also, manually rebooted both the master and worker nodes after the upgrade.
=========================================================================

The issue is not seen. The cluster is healthy.

Comment 18 Weibin Liang 2022-02-07 15:19:26 UTC
Hi Yunfei, could you double-check whether you can reproduce the problem on your side?
Otherwise, I will close this bug. Thanks!

Comment 19 Yunfei Jiang 2022-02-08 02:49:29 UTC
The issue in comment 13 is not encountered every time; it reproduces approximately once every three trials.
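
Since the reproduction is intermittent, one way to catch it right after a reboot is to watch for failed pods in the namespace:

$ oc get pods -n openshift-multus --field-selector=status.phase=Failed -w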

Comment 20 Alisha 2022-02-08 16:39:46 UTC
Issue is seen with the direct installation of 4.10.0-rc.1 on ppc64le : 

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         32h     Cluster version is 4.10.0-rc.1

# oc get pods --all-namespaces -o=wide | grep -vi running | grep -vi completed
NAMESPACE                                          NAME                                                                       READY   STATUS      RESTARTS      AGE     IP                NODE                                        NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27404790-fz6c5  


# oc get job -A
NAMESPACE                              NAME                        COMPLETIONS   DURATION   AGE
openshift-image-registry               image-pruner-27404640       1/1           6s         16h
openshift-multus                       ip-reconciler-27404790      0/1           14h        14h
openshift-operator-lifecycle-manager   collect-profiles-27405600   1/1           8s         33m
openshift-operator-lifecycle-manager   collect-profiles-27405615   1/1           12s        18m
openshift-operator-lifecycle-manager   collect-profiles-27405630   1/1           11s        3m34s


# oc describe job ip-reconciler-27404790 -n openshift-multus
Name:             ip-reconciler-27404790
Namespace:        openshift-multus
Selector:         controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
Labels:           controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
                  job-name=ip-reconciler-27404790
Annotations:      batch.kubernetes.io/job-tracking:
Controlled By:    CronJob/ip-reconciler
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Mon, 07 Feb 2022 21:30:00 -0500
Pods Statuses:    0 Active / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
                    job-name=ip-reconciler-27404790
  Service Account:  multus
  Containers:
   whereabouts:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:254efd7dbb69e80ca2899aed5fbbc0bf334c90950090ed7b59ab13fa45089e06
    Port:       <none>
    Host Port:  <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
  Volumes:
   cni-net-dir:
    Type:               HostPath (bare host directory volume)
    Path:               /etc/kubernetes/cni/net.d
    HostPathType:
  Priority Class Name:  system-cluster-critical
Events:                 <none>


# oc get co | grep -v "True.*False.*False"
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

Comment 21 Leandro Rebosio 2022-02-09 09:02:59 UTC
I've got the very same issue with rc.1 on Azure OpenShift:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         46h     Cluster version is 4.10.0-rc.1


$ oc get job -A
NAMESPACE                              NAME                                                              COMPLETIONS   DURATION   AGE
openshift-cluster-version              version-4.10.0-rc.1-wpszt                                         1/1           8s         46h
openshift-image-registry               image-pruner-27406080                                             1/1           4s         8h
openshift-marketplace                  33e0b5bfab49b1802c8543c13593c4ab6e56c087bf00159a8bd5859141091af   1/1           8s         6d13h
openshift-multus                       ip-reconciler-27401085                                            0/1           3d20h      3d20h
openshift-operator-lifecycle-manager   collect-profiles-27406605                                         1/1           9s         14m


$ oc describe job ip-reconciler-27401085 -n openshift-multus
Name:             ip-reconciler-27401085
Namespace:        openshift-multus
Selector:         controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
Labels:           controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
                  job-name=ip-reconciler-27401085
Annotations:      batch.kubernetes.io/job-tracking:
Controlled By:    CronJob/ip-reconciler
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Sat, 05 Feb 2022 13:45:00 +0100
Pods Statuses:    0 Active / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
                    job-name=ip-reconciler-27401085
  Service Account:  multus
  Containers:
   whereabouts:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a6ac1fb6f73f65a36a15c8ca3cdab182def90289e74d7bd1692d44cb8a3b7e0
    Port:       <none>
    Host Port:  <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
  Volumes:
   cni-net-dir:
    Type:               HostPath (bare host directory volume)
    Path:               /etc/kubernetes/cni/net.d
    HostPathType:
  Priority Class Name:  system-cluster-critical
Events:                 <none>

Comment 22 Alisha 2022-02-09 15:44:04 UTC
(In reply to Alisha from comment #20)
> Issue is seen with the direct installation of 4.10.0-rc.1 on ppc64le : 


Link to the data gathered from the above cluster using the oc adm must-gather command:
https://drive.google.com/file/d/18gKBqwIAZBUVonX7hiW_T3MTG_vk2YxD/view?usp=sharing

Comment 24 Yunfei Jiang 2022-02-28 06:59:08 UTC
Reproduced this issue on AWS (4/4 attempts), same as comment 13.

+dosmith +@weliang will this issue be fixed in 4.10.0? Is there any workaround?
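
As a stopgap (per comment 7, deleted Error pods do not come back), the failed job and its pods can simply be removed; the job name here is taken from the failing pod shown below:

$ oc delete job -n openshift-multus ip-reconciler-27433815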


$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-26-230022   True        False         106m    Cluster version is 4.10.0-0.nightly-2022-02-26-230022

$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                      READY   STATUS      RESTARTS      AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27433815-cs5bn                                              0/1     Error       0             31m     10.0.77.1     ip-10-0-77-1.us-east-2.compute.internal     <none>           <none>

$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

$ oc logs -n openshift-multus  ip-reconciler-27433815-cs5bn
I0228 06:16:56.500105       1 request.go:665] Waited for 1.166754462s due to client-side throttling, not priority and fairness, request: GET:https://api-int.yunjiang-28u2.qe.devcluster.openshift.com:6443/apis/security.internal.openshift.io/v1?timeout=32s
I0228 06:17:06.500159       1 request.go:665] Waited for 11.16662233s due to client-side throttling, not priority and fairness, request: GET:https://api-int.yunjiang-28u2.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1?timeout=32s
2022-02-28T06:17:06Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-28T06:17:06Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

Comment 27 Weibin Liang 2022-05-31 14:14:06 UTC
yunjiang: Cannot reproduce the issue in the latest v4.11 build, so I will close this bug now.
Please feel free to reopen it if you still see the problem in v4.11.

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         15m     Cluster version is 4.11.0-0.nightly-2022-05-25-193227
[weliang@weliang ~]$ oc get pod -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-4pcgq                          1/1     Running   0          26m
multus-additional-cni-plugins-4cs5h   1/1     Running   0          26m
multus-additional-cni-plugins-69bbg   1/1     Running   0          26m
multus-additional-cni-plugins-gzvgg   1/1     Running   0          34m
multus-additional-cni-plugins-jg8rx   1/1     Running   0          34m
multus-additional-cni-plugins-sx9vw   1/1     Running   0          34m
multus-additional-cni-plugins-xjv7h   1/1     Running   0          26m
multus-admission-controller-kdcrt     2/2     Running   0          33m
multus-admission-controller-p9v95     2/2     Running   0          34m
multus-admission-controller-wpdvm     2/2     Running   0          33m
multus-c5dnq                          1/1     Running   0          34m
multus-gl5kk                          1/1     Running   0          26m
multus-s8zzb                          1/1     Running   0          34m
multus-wk6dr                          1/1     Running   0          26m
multus-z5tjb                          1/1     Running   0          34m
network-metrics-daemon-bnz6g          2/2     Running   0          26m
network-metrics-daemon-czhgp          2/2     Running   0          34m
network-metrics-daemon-dlt2g          2/2     Running   0          26m
network-metrics-daemon-hqk5p          2/2     Running   0          26m
network-metrics-daemon-j524z          2/2     Running   0          34m
network-metrics-daemon-pj4nm          2/2     Running   0          34m
[weliang@weliang ~]$

