Bug 1852802
| Field | Value |
|---|---|
| Summary | Unable to update OCP4.5 in disconnected env: cluster operator openshift-apiserver is degraded |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | multus |
| Version | 4.5 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Shelly Miron <smiron> |
| Assignee | Douglas Smith <dosmith> |
| QA Contact | Weibin Liang <weliang> |
| CC | aos-bugs, athomas, augol, bbennett, beth.white, bparees, dhansen, dmellado, eparis, jhou, lmohanty, mfojtik, omichael, scuppett, stbenjam, sttts, xxia, zzhao |
| Keywords | Reopened, TestBlocker, Upgrades |
| Target Milestone | --- |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Linux |
| Doc Type | If docs needed, set a value |
| Bug Blocks | 1862865, 1867718 (view as bug list) |
| Type | Bug |
| Last Closed | 2020-08-10 15:25:51 UTC |
Description
Shelly Miron
2020-07-01 10:48:49 UTC
Created attachment 1699474 [details]
openshift apiserver error msg
Created attachment 1699475 [details]
oc describe clusterversion
The MCO has a huge list of failures related to reaching the API:

2020-07-01T06:55:55.585755596Z E0701 06:55:55.585646 4995 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:56:24.98593869Z I0701 06:56:24.985814 4995 trace.go:116] Trace[919889828]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-07-01 06:55:54.984097113 +0000 UTC m=+762.181157899) (total time: 30.001675941s):
2020-07-01T06:56:24.98593869Z Trace[919889828]: [30.001675941s] [30.001675941s] END
2020-07-01T06:56:24.98593869Z E0701 06:56:24.985841 4995 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:57:01.074614704Z I0701 06:57:01.074458 4995 trace.go:116] Trace[1465987202]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2020-07-01 06:56:31.073558086 +0000 UTC m=+798.270618844) (total time: 30.000861779s):
2020-07-01T06:57:01.074614704Z Trace[1465987202]: [30.000861779s] [30.000861779s] END
2020-07-01T06:57:01.074614704Z E0701 06:57:01.074507 4995 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:57:51.896917749Z I0701 06:57:51.896831 4995 trace.go:116] Trace[1980435746]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-07-01 06:57:21.895861641 +0000 UTC m=+849.092922432) (total time: 30.000934291s):
2020-07-01T06:57:51.896917749Z Trace[1980435746]: [30.000934291s] [30.000934291s] END
2020-07-01T06:57:51.897039438Z E0701 06:57:51.897021 4995 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:58:30.44373067Z I0701 06:58:30.443647 4995 trace.go:116] Trace[1059014376]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2020-07-01 06:58:00.442651145 +0000 UTC m=+887.639711898) (total time: 30.000952587s):
2020-07-01T06:58:30.44373067Z Trace[1059014376]: [30.000952587s] [30.000952587s] END
2020-07-01T06:58:30.44373067Z E0701 06:58:30.443678 4995 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:58:54.623398605Z I0701 06:58:54.623263 4995 trace.go:116] Trace[2050729718]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-07-01 06:58:24.622346502 +0000 UTC m=+911.819407282) (total time: 30.000866922s):
2020-07-01T06:58:54.623398605Z Trace[2050729718]: [30.000866922s] [30.000866922s] END

Preliminary findings:
1) openshift-apiserver is reporting degraded because not all of its pods could be scheduled
2) pods could not be scheduled because not all master nodes are available
3) not all master nodes are available because of issues contacting the k8s apiserver (see MCO errors in comment 3)
4) MCO + Networking are also reporting degraded
5) the k8s apiserver itself is reporting available but degraded:

conditions:
- lastTransitionTime: "2020-07-01T09:32:15Z"
  message: |-
    InstallerPodContainerWaitingDegraded: Pod "installer-9-master-0-2" on node "master-0-2" container "installer" is waiting for 13m31.141901586s because ""
    InstallerPodNetworkingDegraded: Pod "installer-9-master-0-2" on node "master-0-2" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-9-master-0-2_openshift-kube-apiserver_e80c4c50-23e5-4332-ba9f-ab705ec3df67_0(b99028b166da8edf9beaa3d25b38251bb5b5b574c0e7a98d31a6d392eb42a054): Multus: [openshift-kube-apiserver/installer-9-master-0-2]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
  reason: InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox
  status: "True"
  type: Degraded
- lastTransitionTime: "2020-07-01T09:24:38Z"
  message: 'NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 10'
  reason: NodeInstaller
  status: "True"
  type: Progressing
- lastTransitionTime: "2020-06-30T14:38:08Z"
  message: 'StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8; 0 nodes have achieved new revision 10'
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2020-06-30T14:36:05Z"
  reason: AsExpected
  status: "True"
  type: Upgradeable

Seeing as rc.5 to rc.6 is updatable, I'm moving this to ON_QA to indicate it's being re-tested. If it works, we can close this one.

Per comment 6, the cause is network. Checked the must-gather (via cm/cluster-config-v1 in namespaces/kube-system/core/configmaps.yaml); it is a baremetal disconnected OVN env. Moving to the Networking component. BTW, I already triggered four baremetal disconnected envs, with and without OVN, for reproducing later.
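As a side note, all of the i/o timeouts above are against the in-cluster apiserver service VIP (172.30.0.1:443). A quick way to confirm whether that VIP is reachable from an affected master is sketched below; this is only a hedged example, assuming curl is available in the host namespace and using master-0-2 from this bug as the sample node:

$ oc debug node/master-0-2
sh-4.4# chroot /host
# A fast HTTP response (even 403) proves the service network is reachable from this node;
# a hang/timeout reproduces the failure mode in the MCO logs above.
sh-4.4# curl -k -sS --max-time 10 https://172.30.0.1:443/version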
After retesting - updating from rc.5 to rc.6 without the force flag - this is what happened:

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.5   True        True          11h     Unable to apply 4.5.0-rc.6: the image may not be safe to use

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.5   True        False         False      13h
cloud-credential                           4.5.0-rc.5   True        False         False      14h
cluster-autoscaler                         4.5.0-rc.5   True        False         False      13h
config-operator                            4.5.0-rc.5   True        False         False      13h
console                                    4.5.0-rc.5   True        False         False      13h
csi-snapshot-controller                    4.5.0-rc.5   True        False         False      13h
dns                                        4.5.0-rc.5   True        False         False      13h
etcd                                       4.5.0-rc.5   True        False         False      13h
image-registry                             4.5.0-rc.5   True        False         False      13h
ingress                                    4.5.0-rc.5   True        False         False      13h
insights                                   4.5.0-rc.5   True        False         False      13h
kube-apiserver                             4.5.0-rc.5   True        False         False      13h
kube-controller-manager                    4.5.0-rc.5   True        False         False      13h
kube-scheduler                             4.5.0-rc.5   True        False         False      13h
kube-storage-version-migrator              4.5.0-rc.5   True        False         False      13h
machine-api                                4.5.0-rc.5   True        False         False      13h
machine-approver                           4.5.0-rc.5   True        False         False      13h
machine-config                             4.5.0-rc.5   True        False         False      13h
marketplace                                4.5.0-rc.5   True        False         False      13h
monitoring                                 4.5.0-rc.5   True        False         False      13h
network                                    4.5.0-rc.5   True        False         False      13h
node-tuning                                4.5.0-rc.5   True        False         False      13h
openshift-apiserver                        4.5.0-rc.5   True        False         False      72m
openshift-controller-manager               4.5.0-rc.5   True        False         False      13h
openshift-samples                          4.5.0-rc.5   True        False         False      13h
operator-lifecycle-manager                 4.5.0-rc.5   True        False         False      13h
operator-lifecycle-manager-catalog         4.5.0-rc.5   True        False         False      13h
operator-lifecycle-manager-packageserver   4.5.0-rc.5   True        False         False      13h
service-ca                                 4.5.0-rc.5   True        False         False      13h
storage                                    4.5.0-rc.5   True        False         False      13h

But when updating with the force flag, the update succeeded.

InstallerPodNetworkingDegraded: Pod "installer-11-master-0-2" on node "master-0-2" observed degraded networking: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-11-master-0-2_openshift-kube-apiserver_569a19e5-fe46-4e34-9f5e-0ae67b259786_0(c4275101c2593ab24480e17d6b7d36b2b4001a16974d073633f948ffda0cbf11): Multus: [openshift-kube-apiserver/installer-11-master-0-2]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
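For reference, the forced update path used above roughly corresponds to the following oc invocation; this is a minimal sketch, where the registry host and digest are placeholders for the mirrored rc.6 release image, not the exact values used in this environment:

$ oc adm upgrade \
    --to-image=<mirror-registry>/ocp4/openshift-release@sha256:<rc.6-digest> \
    --allow-explicit-upgrade \
    --force
# --force bypasses release image verification (which is why the CVO otherwise reports
#   "the image may not be safe to use" in a disconnected env),
# --allow-explicit-upgrade permits moving to an image that is not in the recommended update graph.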
Sounds like the networking pod failed to get created due to a node/CRI-O issue?
Networking itself reports:
status:
conditions:
- lastTransitionTime: "2020-07-09T14:50:38Z"
message: DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making
progress - last change 2020-07-09T14:39:36Z
reason: RolloutHung
status: "True"
type: Degraded
- lastTransitionTime: "2020-07-09T11:29:21Z"
status: "True"
type: Upgradeable
- lastTransitionTime: "2020-07-09T14:38:30Z"
message: |-
DaemonSet "openshift-multus/multus-admission-controller" is not available (awaiting 1 nodes)
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
reason: Deploying
status: "True"
type: Progressing
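The RolloutHung condition above can usually be narrowed down to the specific node the DaemonSets are waiting on. A hedged sketch of the standard queries (nothing here is specific to this bug beyond the namespaces already mentioned):

$ oc -n openshift-ovn-kubernetes get daemonset ovnkube-node
$ oc -n openshift-ovn-kubernetes get pods -o wide | grep -v Running
$ oc -n openshift-multus get pods -o wide | grep -v Running
# The node hosting the non-Running ovnkube-node / multus pods is the one blocking the rollout.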
networking pod ovnkube-node-lqphr is showing:
- containerID: cri-o://d6fde6e77032e51c11a18e3e27440b684dea8256fb4fb80a9b44f63c0227a81f
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f3c2711b2f0e762862981c97143e2871b39af1bcde90fdbd5d7147b4a91b764
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f3c2711b2f0e762862981c97143e2871b39af1bcde90fdbd5d7147b4a91b764
lastState:
terminated:
containerID: cri-o://d6fde6e77032e51c11a18e3e27440b684dea8256fb4fb80a9b44f63c0227a81f
exitCode: 1
finishedAt: "2020-07-12T14:02:44Z"
message: |
+ [[ -f /env/master-0-2 ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
Unable to connect to the server: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host
+ db_ip=
reason: Error
startedAt: "2020-07-12T14:02:44Z"
name: ovnkube-node
So I guess I agree that this seems DNS related.
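The failing lookup of api-int.ocp-edge-cluster-0.qe.lab.redhat.com against 192.168.123.1:53 can be reproduced from the node itself. A minimal sketch, assuming getent and curl are available in the host namespace (treat that as an assumption):

$ oc debug node/master-0-2
sh-4.4# chroot /host
sh-4.4# cat /etc/resolv.conf
# shows which nameserver the node points at (192.168.123.1 per the error above)
sh-4.4# getent hosts api-int.ocp-edge-cluster-0.qe.lab.redhat.com
# empty output mirrors the "no such host" failure seen by ovnkube-node
sh-4.4# curl -k -sS --max-time 10 https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/healthz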
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

I think we accidentally forgot to pull this BZ from the errata. Re-opening.

The DNS Operator is available but indicates a progressing condition:
status:
conditions:
- lastTransitionTime: "2020-07-09T14:39:52Z"
message: All desired DNS DaemonSets available and operand Namespace exists
reason: AsExpected
status: "False"
type: Degraded
- lastTransitionTime: "2020-07-09T14:38:30Z"
message: At least 1 DNS DaemonSet is progressing.
reason: Reconciling
status: "True"
type: Progressing
- lastTransitionTime: "2020-07-09T11:35:36Z"
message: At least 1 DNS DaemonSet available
reason: AsExpected
status: "True"
type: Available
# One of the dns daemonset pods ("dns-default-4dbgg") is unavailable:
status:
currentNumberScheduled: 5
desiredNumberScheduled: 5
numberAvailable: 4
numberMisscheduled: 0
numberReady: 4
numberUnavailable: 1
observedGeneration: 2
updatedNumberScheduled: 5
# None of the containers in pod "dns-default-4dbgg" are ready:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-07-09T13:32:46Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2020-07-09T14:39:05Z"
message: 'containers with unready status: [dns kube-rbac-proxy dns-node-resolver]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2020-07-09T14:40:24Z"
message: 'containers with unready status: [dns kube-rbac-proxy dns-node-resolver]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2020-07-09T13:32:46Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c09633512a460fda547cd079565554ab79cbfbe767c827bba075f05b47e71d4a
imageID: ""
lastState: {}
name: dns
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b93a3f13057991466caf3ba6517493015299a856c6b752bd49b7d4c294312177
imageID: ""
lastState: {}
name: dns-node-resolver
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0b9be905dc8404760427a4bfbb9274545b2fb03774d85cd8ee5d93f847c69293
imageID: ""
lastState: {}
name: kube-rbac-proxy
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
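To see why those containers are stuck in ContainerCreating, the pod's events are usually the quickest signal; a hedged sketch using the pod name from this bug:

$ oc -n openshift-dns get pods -o wide
$ oc -n openshift-dns describe pod dns-default-4dbgg
# the Events section at the bottom surfaces the sandbox / CNI creation errors
$ oc -n openshift-dns get daemonset dns-default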
192.168.123.114 is the InternalIP address of node "master-0-2" where pod "dns-default-4dbgg" was scheduled. The node conditions are as expected:
conditions:
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
Events indicate an issue creating the pod network sandbox for pod "dns-default-4dbgg":
message: '(combined from similar events): Failed to create pod sandbox: rpc error:
code = Unknown desc = failed to create pod network sandbox k8s_dns-default-4dbgg_openshift-dns_df0adbd5-dc00-4367-b02e-07c62a925a4b_0(f771c552839c5276e622b6f0980a84f0ae496a90c39bab1b1157f7dc8d357a6d):
Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile:
timed out waiting for the condition'
CRI-O logs indicate the same error for dns pod "dns-default-4dbgg":
Jul 11 05:41:37.152475 master-0-2 crio[1821]: 2020-07-11T05:41:37Z [error] Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition
Jul 11 05:41:37.154347 master-0-2 crio[1821]: time="2020-07-11 05:41:37.154247983Z" level=error msg="Error deleting network: Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"
Jul 11 05:41:37.154347 master-0-2 crio[1821]: time="2020-07-11 05:41:37.154332566Z" level=error msg="Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"
Jul 11 05:41:37.154557 master-0-2 crio[1821]: time="2020-07-11 05:41:37.154451137Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_dns-default-4dbgg_openshift-dns_df0adbd5-dc00-4367-b02e-07c62a925a4b_0(9064208bb220d12adb8a12c24492db4aea36419f66f0f6b932a065925429ffb2): Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition" id=e6548c60-b32b-41dd-a4a7-ebde016067e7 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
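Those CRI-O messages come from the node journal; for reference, a hedged sketch of how to collect them (either form should work on 4.x, but treat the exact flags as an assumption):

$ oc adm node-logs master-0-2 -u crio | grep -i ReadinessIndicatorFile
# or, equivalently, from a debug shell on the node:
$ oc debug node/master-0-2 -- chroot /host journalctl -u crio --no-pager | grep -i ReadinessIndicatorFile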
This BZ appears to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1805444. Maybe the fix in BZ 1805444 needs to be ported to OVN? Reassigning to the SDN team for confirmation and further investigation.
Assigned to Doug to see if it's the same as (or similar to) the other issue that Dane found.

A `PollImmediate error waiting for ReadinessIndicatorFile` means that (in the context of ovn-kubernetes in OCP) the file `/var/run/multus/cni/net.d/10-ovn-kubernetes.conf` was not found by Multus CNI. This is the "readiness indicator file"; its absence indicates that the default network (in this case, ovn-kubernetes) is not ready, and that there may be some failure in the process that writes that CNI configuration file to disk. Without it, we can't be certain that OVN is ready to handle network traffic from workloads, so Multus waits for this readiness indication from the default network's CNI configuration file.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

(In reply to Eric Paris from comment #20)
> I think we accidentally forgot to pull this BZ from the errata. Re-opening.

Don't do that. Once it's been shipped in an errata, it can never be removed or shipped again. Bugs shipped by errata are intended to be immutable. It needs to be cloned to proceed. As the ET comment indicates:

> If the solution does not work for you, open a new bug report.

I've cloned it as https://bugzilla.redhat.com/show_bug.cgi?id=1867718

I'll return the bug to CLOSED ERRATA, although it is clearly not actually fixed.
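Building on the ReadinessIndicatorFile explanation above, a quick check on the affected node is whether ovn-kubernetes ever wrote its CNI config for Multus to pick up. This is only a sketch; the 00-multus.conf path is my assumption about where the Multus configuration normally lives on OCP 4.x and may differ by version:

$ oc debug node/master-0-2 -- chroot /host ls -l /var/run/multus/cni/net.d/
# expected to contain 10-ovn-kubernetes.conf once ovn-kubernetes is ready on this node
$ oc debug node/master-0-2 -- chroot /host cat /etc/kubernetes/cni/net.d/00-multus.conf
# the readiness indicator file that Multus polls for is referenced in this config (assumed path/field)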