Bug 2042796 - openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster [NEEDINFO]
Summary: openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: ppc64le
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Miguel Duarte Barroso
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-20 07:16 UTC by Alisha
Modified: 2022-05-31 14:14 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-31 14:14:06 UTC
Target Upstream Version:
yunjiang: needinfo? (dosmith)


Attachments
kube-apiserver termination log (8.07 KB, text/plain)
2022-01-21 21:51 UTC, Mick Tarsel


Links
Github openshift/cluster-network-operator pull 1290 (open): Bug 2042796: whereabouts, reconciler: disable retries on failure. Last updated: 2022-01-25 19:08:12 UTC

Description Alisha 2022-01-20 07:16:30 UTC

Comment 1 Alisha 2022-01-20 07:36:51 UTC
openshift-multus/ip-reconciler* pods fail with "context deadline exceeded" on OCP 4.10 cluster


Cluster History : 
=================
The cluster was upgraded from 4.10.0-0.nightly-ppc64le-2021-11-17-063238 to 4.10.0-0.nightly-ppc64le-2022-01-13-022003
The cluster is 97 days old (counted from the day of creation).
Before the upgrade, memory usage on the worker-1 node had exceeded 100%. The upgrade succeeded after rebooting worker-1.


After Upgrade : 
===============
All cluster operators (co) are healthy.
All nodes are in Ready state.
NAME                               STATUS   ROLES    AGE   VERSION
rdr-ap-longev-410-mon01-master-0   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-master-1   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-master-2   Ready    master   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-worker-0   Ready    worker   97d   v1.23.0+50f645e
rdr-ap-longev-410-mon01-worker-1   Ready    worker   97d   v1.23.0+50f645e


Issue seen : 
============
ip-reconciler-* pods are in Error state.

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                              READY   STATUS      RESTARTS         AGE     IP                NODE                               NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27375270-2sdrp                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-7h2mj                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-8gjts                                      0/1     Error       0                41h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-c2tg8                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-fwpjf                                      0/1     Error       0                41h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-j25s5                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>
openshift-multus                                   ip-reconciler-27375270-pjfdx                                      0/1     Error       0                40h     10.*.*.*       rdr-ap-longev-410-mon01-worker-0   <none>           <none>



# oc logs pod/ip-reconciler-27375270-2sdrp -n openshift-multus
I0118 14:31:22.335494       1 request.go:655] Throttling request took 1.079150668s, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
2022-01-18T14:31:32Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-18T14:31:32Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded


Described the pod in error state : 
==================================

# oc describe po ip-reconciler-27375270-2sdrp -n openshift-multus
Name:                 ip-reconciler-27375270-2sdrp
Namespace:            openshift-multus
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 rdr-ap-longev-410-mon01-worker-0/193.*.*.*
Start Time:           Tue, 18 Jan 2022 14:31:17 +0000
Labels:               controller-uid=0b2202e8-57a9-48f7-b8f5-05c0b3fc89ec
                      job-name=ip-reconciler-27375270
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.*.*.*"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.*.*.*"
                            ],
                            "default": true,
                            "dns": {}
                        }]
Status:               Failed
IP:                   10.*.*.*
IPs:
  IP:           10.*.*.*
Controlled By:  Job/ip-reconciler-27375270
Containers:
  whereabouts:
    Container ID:  cri-o://5f96ee2124cb80623b01119928e491a4119ecd12db7cebb23f10942698e64283
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a01c5d18877255a6f2266d3188817bfdf36f234463b576a5c527877fbc711901
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a01c5d18877255a6f2266d3188817bfdf36f234463b576a5c527877fbc711901
    Port:          <none>
    Host Port:     <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 18 Jan 2022 14:31:20 +0000
      Finished:     Tue, 18 Jan 2022 14:31:32 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zpvq5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/cni/net.d
    HostPathType:
  kube-api-access-zpvq5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>


Jobs :
======

# oc get job -A
NAMESPACE                              NAME                        COMPLETIONS   DURATION   AGE
openshift-cluster-version              version--47p2h              1/1           15s        63d
openshift-cluster-version              version--pm7tv              1/1           10s        6d13h
openshift-image-registry               image-pruner-27329760       0/1           33d        33d
openshift-image-registry               image-pruner-27374400       1/1           6s         2d7h
openshift-image-registry               image-pruner-27375840       1/1           6s         31h
openshift-image-registry               image-pruner-27377280       1/1           6s         7h35m
openshift-multus                       ip-reconciler-27375270      0/1           41h        41h
openshift-operator-lifecycle-manager   collect-profiles-27377700   1/1           12s        35m
openshift-operator-lifecycle-manager   collect-profiles-27377715   1/1           10s        20m
openshift-operator-lifecycle-manager   collect-profiles-27377730   1/1           12s        5m48s

Comment 2 Stefan Schimanski 2022-01-20 08:34:13 UTC
This is likely a network or load-balancer problem. You can prove otherwise by checking the kube-apiserver audit log for the request that times out. Moving to multus until there is proof that the request actually reaches the apiserver.
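
A quick way to pull those audit logs from the CLI would be something along these lines (the grep pattern is only an example, aimed at the whereabouts ippools list requests):

$ oc adm node-logs --role=master --path=kube-apiserver/audit.log | grep ippools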

Comment 3 Manoj Kumar 2022-01-21 13:51:50 UTC
Could this be related to https://github.com/k8snetworkplumbingwg/whereabouts/issues/172?

Comment 4 Alisha 2022-01-21 15:04:15 UTC
Please find below the must-gather tar from the above cluster where the issue is seen : 

Link : https://drive.google.com/file/d/1AZDEWbp2fUUGo52zmmL4blluRx0tplvs/view?usp=sharing

Comment 5 Mick Tarsel 2022-01-21 21:50:39 UTC
Thank you Stefan for the recommendations. It looks like this is related to the etcd load balancer, but I'm not sure what to do next.

On master-0 I’m hitting this error (added a couple months ago):
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/monitor/monitor.go#L174

Here are some logs from /var/log/etcd/etcd-health-probe.log:

{"level":"error","ts":"2022-01-19T17:00:18.975Z","caller":"monitor/monitor.go:175","msg":"failed to create new health monitor","error":"context deadline exceeded","stacktrace":"github.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.(*monitorOpts).Run\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:175\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.NewMonitorCommand.func1.1\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:87\ngithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor.NewMonitorCommand.func1\n\tgithub.com/openshift/cluster-etcd-operator/pkg/cmd/monitor/monitor.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra.1/command.go:860\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra.1/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra.1/command.go:902\nmain.main\n\tgithub.com/openshift/cluster-etcd-operator/cmd/cluster-etcd-operator/main.go:45\nruntime.main\n\truntime/proc.go:255"}


About 3 minutes later in the same file, I see "error":"slow request":

{"level":"info","ts":"2022-01-19T17:03:15.497Z","caller":"monitor/monitor.go:179","msg":"health monitor is starting","pod":"etcd-master-0.sonig-f38d.ibm.com","static-pod-version":7}
{"level":"info","ts":"2022-01-19T17:03:16.001Z","caller":"health/log.go:41","msg":"possible service disruption detected","addresses":["https://9.47.89.104:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2022-01-19T17:03:15.501Z","error":"slow request"}

That is from port 2379, the etcd client port.

Looking in the /var/log/kube-apiserver logs, I then searched for request timeouts, which returned one result in termination.log:

[root@master-0 kube-apiserver]# grep -rnw ./ -e "Timeout: request"
./termination.log:21:E0119 17:03:03.087992      16 available_controller.go:546] v1.security.openshift.io failed with: Timeout: request did not complete within requested timeout - context deadline exceeded

Here's the timeline:
1. At T17:00:18.975Z Etcd reports "msg":"failed to create new health monitor","error":"context deadline exceeded"
2. At T17:03:03.087992  Kube-apiserver reports request timeout
3. At T17:03:15.497Z Etcd reports "msg":"health monitor is starting"
4. At T17:03:16.001Z in the next line in Etcd there is "msg":"possible service disruption detected","addresses":["https://9.47.89.104:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2022-01-19T17:03:15.501Z","error":"slow request"}

There are a lot of repeating logs at the bottom, and unfortunately I cannot go any earlier in any of the logs to investigate; these are the earliest logs I have.

I will attach the beginning of the kube-apiserver termination.log, because there are many more errors before it just starts repeating at the bottom.

From master-0 I started toolbox, thinking nmap would be available, but it was not, so I just did this:

⬢[root@toolbox ~]# cat /etc/services | grep 2379
etcd-client     2379/tcp                # etcd client communication
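
A bash-only TCP check against the etcd client port could look like this from the toolbox shell (no nmap needed; the address is taken from the etcd log above):

# timeout 3 bash -c '</dev/tcp/9.47.89.104/2379' && echo "port 2379 reachable" || echo "port 2379 not reachable"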


I still do not know why Etcd could not create a health monitor. I’m not sure how to proceed or further debug this. Thanks

Comment 6 Mick Tarsel 2022-01-21 21:51:48 UTC
Created attachment 1852620 [details]
kube-apiserver termination log

Comment 7 Alexander Chuzhoy 2022-01-25 16:48:59 UTC
Reproduced the issue on a clean deployment of SNO version 4.10.0-fc.2 with IPv6:


oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE                                          NAME                                                                         READY   STATUS      RESTARTS       AGE
openshift-multus                                   ip-reconciler-27385455-24pp8                                                 0/1     Error       0              11m
openshift-multus                                   ip-reconciler-27385455-bcllq                                                 0/1     Error       0              8m36s
openshift-multus                                   ip-reconciler-27385455-bqb8p                                                 0/1     Error       0              8m53s
openshift-multus                                   ip-reconciler-27385455-dp46g                                                 0/1     Error       0              9m25s
openshift-multus                                   ip-reconciler-27385455-kmknm                                                 0/1     Error       0              9m9s
openshift-multus                                   ip-reconciler-27385455-krpdk                                                 0/1     Error       0              8m20s
openshift-multus                                   ip-reconciler-27385455-xmn7g                                                 0/1     Error       0              9m41s


If I delete the pods in Error state, they don't attempt to restart.
If I reboot the SNO, the issue reproduces.
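
For reference, the errored pods can be removed in one go (this only cleans up the failed pods, it does not address the root cause):

$ oc delete pods -n openshift-multus --field-selector=status.phase=Failed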

Comment 8 Douglas Smith 2022-01-25 18:57:22 UTC
My assessment is twofold:

1. The Whereabouts IPAM IP reconciliation process is negatively impacted by some other event on the cluster, be it API server load or a network problem. This failure of the IP reconciler is a symptom, not a cause.

2. This does bring to light that the reconciler could be improved so that it does not cause so much commotion when a problem occurs: it should handle this cleanly, and if a run does not complete properly in a short period, it is OK for it to run later. Miguel Duarte has put forward changes upstream, and has also implemented a downstream CNO patch which should mitigate this noisiness.
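
Once that CNO patch is in place, the effect should be visible on the CronJob's job template, presumably via its backoffLimit; a quick way to check after updating:

$ oc get cronjob ip-reconciler -n openshift-multus -o jsonpath='{.spec.jobTemplate.spec.backoffLimit}{"\n"}'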

Comment 12 Weibin Liang 2022-01-27 20:24:02 UTC
Tested and verified when upgrading from 4.10.0-0.nightly-2022-01-24-020644 to 4.10.0-0.nightly-2022-01-27-104747

[weliang@weliang ~]$ oc get clusterversion  
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-020644   True        True          10m     Working towards 4.10.0-0.nightly-2022-01-27-104747: 96 of 769 done (12% complete)
[weliang@weliang ~]$ oc get clusterversion      
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-020644   True        True          13m     Working towards 4.10.0-0.nightly-2022-01-27-104747: 116 of 769 done (15% complete)




[weliang@weliang ~]$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                        READY   STATUS      RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
[weliang@weliang ~]$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-27-104747   True        False         28m     Cluster version is 4.10.0-0.nightly-2022-01-27-104747
[weliang@weliang ~]$

Comment 13 Yunfei Jiang 2022-01-28 06:28:47 UTC
Met the same issue after restarting all master/worker machines:
1. create an IPI cluster on AWS
2. restart all master and worker ec2 instances

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-26-234447   True        False         14h     Cluster version is 4.10.0-0.nightly-2022-01-26-234447

$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                        READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27388905-gh7c5                                                0/1     Error       0               4h24m   10.129.2.11    ip-10-0-156-72.us-east-2.compute.internal    <none>           <none>

$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

$ oc logs -n openshift-multus ip-reconciler-27388905-gh7c5
I0128 01:46:45.330141       1 request.go:655] Throttling request took 1.119493351s, request: GET:https://172.30.0.1:443/apis/policy/v1?timeout=32s
I0128 01:46:55.330278       1 request.go:655] Throttling request took 11.118673701s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-28T01:46:55Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T01:46:55Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

must gather logs https://drive.google.com/file/d/1eo12aqGQlapbEQYAsfs31vhKSiLiKf_L/view?usp=sharing

Comment 14 Alisha 2022-01-28 13:55:18 UTC
The issue is NOT reproduced with 4.10.0-fc.3 on Power platform. 

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.3   True        False         3h48m   Cluster version is 4.10.0-fc.3

# oc get co | grep -v "True.*False.*False"
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                   READY   STATUS      RESTARTS        AGE     IP                NODE                                    NOMINATED NODE   READINESS GATES

# oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-47qkj                          1/1     Running   0          4h8m
multus-55b22                          1/1     Running   0          3h56m
multus-66zws                          1/1     Running   0          3h57m
multus-additional-cni-plugins-524bj   1/1     Running   0          3h56m
multus-additional-cni-plugins-5q8z7   1/1     Running   0          3h57m
multus-additional-cni-plugins-t5lsx   1/1     Running   0          4h8m
multus-additional-cni-plugins-wswbb   1/1     Running   0          4h8m
multus-additional-cni-plugins-xkqjl   1/1     Running   0          4h8m
multus-admission-controller-2crrp     2/2     Running   0          4h8m
multus-admission-controller-klgll     2/2     Running   0          4h8m
multus-admission-controller-r8h4p     2/2     Running   0          4h8m
multus-b86hs                          1/1     Running   0          4h8m
multus-qr2sf                          1/1     Running   0          4h8m
network-metrics-daemon-4sw5p          2/2     Running   0          3h56m
network-metrics-daemon-5fnkb          2/2     Running   0          4h8m
network-metrics-daemon-5qrsq          2/2     Running   0          3h57m
network-metrics-daemon-6l5f9          2/2     Running   0          4h8m
network-metrics-daemon-lmkjk          2/2     Running   0          4h8m

Comment 16 Douglas Smith 2022-02-01 15:21:17 UTC
Likely needs fixes related to https://bugzilla.redhat.com/show_bug.cgi?id=2048575 as well, which changes the reconciler to use the api-int load balancer.
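
The throttling lines in the reconciler pod logs show which endpoint was actually used (the 172.30.0.1 service VIP in comment 13 vs. api-int in comment 24 below), so a quick check against any failed pod is (pod name is just the example from comment 13):

$ oc logs -n openshift-multus ip-reconciler-27388905-gh7c5 | grep 'request: GET'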

Comment 17 Alisha 2022-02-04 19:06:32 UTC
(In reply to Alisha from comment #14)
> The issue is NOT reproduced with 4.10.0-fc.3 on Power platform. 


Issue did not reproduce after upgrading the above cluster (4.10.0-fc.3) to 4.10.0-rc.0
========================================================================================

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         55m     Cluster version is 4.10.0-rc.0

All pods are healthy. No ip-reconciler-* pods are found on the cluster.

# oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                   READY   STATUS      RESTARTS       AGE     IP                NODE


Also, manually rebooted both the master and worker nodes after the upgrade.
=========================================================================

The issue is not seen. The cluster is healthy.

Comment 18 Weibin Liang 2022-02-07 15:19:26 UTC
Hi Yunfei, could you double-check whether you can reproduce the problem on your side?
Otherwise, I will close this bug. Thanks!

Comment 19 Yunfei Jiang 2022-02-08 02:49:29 UTC
The issue in comment 13 is not encountered every time; it reproduces approximately once every three trials.
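
Since the reproduction is intermittent, one way to catch it right after a reboot is to watch for failed pods in the namespace:

$ oc get pods -n openshift-multus --field-selector=status.phase=Failed -w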

Comment 20 Alisha 2022-02-08 16:39:46 UTC
Issue is seen with the direct installation of 4.10.0-rc.1 on ppc64le : 

# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         32h     Cluster version is 4.10.0-rc.1

# oc get pods --all-namespaces -o=wide | grep -vi running | grep -vi completed
NAMESPACE                                          NAME                                                                       READY   STATUS      RESTARTS      AGE     IP                NODE                                        NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27404790-fz6c5  


# oc get job -A
NAMESPACE                              NAME                        COMPLETIONS   DURATION   AGE
openshift-image-registry               image-pruner-27404640       1/1           6s         16h
openshift-multus                       ip-reconciler-27404790      0/1           14h        14h
openshift-operator-lifecycle-manager   collect-profiles-27405600   1/1           8s         33m
openshift-operator-lifecycle-manager   collect-profiles-27405615   1/1           12s        18m
openshift-operator-lifecycle-manager   collect-profiles-27405630   1/1           11s        3m34s


# oc describe job ip-reconciler-27404790 -n openshift-multus
Name:             ip-reconciler-27404790
Namespace:        openshift-multus
Selector:         controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
Labels:           controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
                  job-name=ip-reconciler-27404790
Annotations:      batch.kubernetes.io/job-tracking:
Controlled By:    CronJob/ip-reconciler
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Mon, 07 Feb 2022 21:30:00 -0500
Pods Statuses:    0 Active / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=64b3632a-e313-4fe6-a68c-dacabcfa88f9
                    job-name=ip-reconciler-27404790
  Service Account:  multus
  Containers:
   whereabouts:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:254efd7dbb69e80ca2899aed5fbbc0bf334c90950090ed7b59ab13fa45089e06
    Port:       <none>
    Host Port:  <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
  Volumes:
   cni-net-dir:
    Type:               HostPath (bare host directory volume)
    Path:               /etc/kubernetes/cni/net.d
    HostPathType:
  Priority Class Name:  system-cluster-critical
Events:                 <none>


# oc get co | grep -v "True.*False.*False"
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

Comment 21 Leandro Rebosio 2022-02-09 09:02:59 UTC
I've got the very same issue with rc.1 on Azure OpenShift:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.1   True        False         46h     Cluster version is 4.10.0-rc.1


$ oc get job -A
NAMESPACE                              NAME                                                              COMPLETIONS   DURATION   AGE
openshift-cluster-version              version-4.10.0-rc.1-wpszt                                         1/1           8s         46h
openshift-image-registry               image-pruner-27406080                                             1/1           4s         8h
openshift-marketplace                  33e0b5bfab49b1802c8543c13593c4ab6e56c087bf00159a8bd5859141091af   1/1           8s         6d13h
openshift-multus                       ip-reconciler-27401085                                            0/1           3d20h      3d20h
openshift-operator-lifecycle-manager   collect-profiles-27406605                                         1/1           9s         14m


$ oc describe job ip-reconciler-27401085 -n openshift-multus
Name:             ip-reconciler-27401085
Namespace:        openshift-multus
Selector:         controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
Labels:           controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
                  job-name=ip-reconciler-27401085
Annotations:      batch.kubernetes.io/job-tracking:
Controlled By:    CronJob/ip-reconciler
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Sat, 05 Feb 2022 13:45:00 +0100
Pods Statuses:    0 Active / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           controller-uid=5627aa89-13c7-42ea-9c21-85b7039e0b2e
                    job-name=ip-reconciler-27401085
  Service Account:  multus
  Containers:
   whereabouts:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a6ac1fb6f73f65a36a15c8ca3cdab182def90289e74d7bd1692d44cb8a3b7e0
    Port:       <none>
    Host Port:  <none>
    Command:
      /ip-reconciler
      -log-level=verbose
    Requests:
      cpu:        25m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
  Volumes:
   cni-net-dir:
    Type:               HostPath (bare host directory volume)
    Path:               /etc/kubernetes/cni/net.d
    HostPathType:
  Priority Class Name:  system-cluster-critical
Events:                 <none>

Comment 22 Alisha 2022-02-09 15:44:04 UTC
(In reply to Alisha from comment #20)
> Issue is seen with the direct installation of 4.10.0-rc.1 on ppc64le : 


Link to the data gathered from the above cluster using the oc adm must-gather command:
https://drive.google.com/file/d/18gKBqwIAZBUVonX7hiW_T3MTG_vk2YxD/view?usp=sharing

Comment 24 Yunfei Jiang 2022-02-28 06:59:08 UTC
Reproduced this issue on AWS (4/4 attempts), same as comment 13.

+dosmith +@weliang will this issue be fixed in 4.10.0? Is there any workaround?
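
As a stopgap (per comment 7, deleted Error pods do not come back), the failed job and its pods can simply be removed; the job name here is taken from the failing pod shown below:

$ oc delete job -n openshift-multus ip-reconciler-27433815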


$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-26-230022   True        False         106m    Cluster version is 4.10.0-0.nightly-2022-02-26-230022

$ oc get pods --all-namespaces -o=wide | grep -v "Running\|Completed"
NAMESPACE                                          NAME                                                                      READY   STATUS      RESTARTS      AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
openshift-multus                                   ip-reconciler-27433815-cs5bn                                              0/1     Error       0             31m     10.0.77.1     ip-10-0-77-1.us-east-2.compute.internal     <none>           <none>

$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

$ oc logs -n openshift-multus  ip-reconciler-27433815-cs5bn
I0228 06:16:56.500105       1 request.go:665] Waited for 1.166754462s due to client-side throttling, not priority and fairness, request: GET:https://api-int.yunjiang-28u2.qe.devcluster.openshift.com:6443/apis/security.internal.openshift.io/v1?timeout=32s
I0228 06:17:06.500159       1 request.go:665] Waited for 11.16662233s due to client-side throttling, not priority and fairness, request: GET:https://api-int.yunjiang-28u2.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1?timeout=32s
2022-02-28T06:17:06Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-28T06:17:06Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

Comment 27 Weibin Liang 2022-05-31 14:14:06 UTC
yunjiang: Cannot reproduce the issue in the latest v4.11 build, so I will close this bug now.
Please feel free to reopen it if you still see the problem in v4.11.

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         15m     Cluster version is 4.11.0-0.nightly-2022-05-25-193227
[weliang@weliang ~]$ oc get pod -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-4pcgq                          1/1     Running   0          26m
multus-additional-cni-plugins-4cs5h   1/1     Running   0          26m
multus-additional-cni-plugins-69bbg   1/1     Running   0          26m
multus-additional-cni-plugins-gzvgg   1/1     Running   0          34m
multus-additional-cni-plugins-jg8rx   1/1     Running   0          34m
multus-additional-cni-plugins-sx9vw   1/1     Running   0          34m
multus-additional-cni-plugins-xjv7h   1/1     Running   0          26m
multus-admission-controller-kdcrt     2/2     Running   0          33m
multus-admission-controller-p9v95     2/2     Running   0          34m
multus-admission-controller-wpdvm     2/2     Running   0          33m
multus-c5dnq                          1/1     Running   0          34m
multus-gl5kk                          1/1     Running   0          26m
multus-s8zzb                          1/1     Running   0          34m
multus-wk6dr                          1/1     Running   0          26m
multus-z5tjb                          1/1     Running   0          34m
network-metrics-daemon-bnz6g          2/2     Running   0          26m
network-metrics-daemon-czhgp          2/2     Running   0          34m
network-metrics-daemon-dlt2g          2/2     Running   0          26m
network-metrics-daemon-hqk5p          2/2     Running   0          26m
network-metrics-daemon-j524z          2/2     Running   0          34m
network-metrics-daemon-pj4nm          2/2     Running   0          34m
[weliang@weliang ~]$

