Bug 1852802
| Field | Value |
|---|---|
| Summary | Unable to update OCP4.5 in disconnected env: cluster operator openshift-apiserver is degraded |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | multus |
| Version | 4.5 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Shelly Miron <smiron> |
| Assignee | Douglas Smith <dosmith> |
| QA Contact | Weibin Liang <weliang> |
| CC | aos-bugs, athomas, augol, bbennett, beth.white, bparees, dhansen, dmellado, eparis, jhou, lmohanty, mfojtik, omichael, scuppett, stbenjam, sttts, xxia, zzhao |
| Keywords | Reopened, TestBlocker, Upgrades |
| Target Milestone | --- |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Linux |
| Doc Type | If docs needed, set a value |
| Bug Blocks | 1862865, 1867718 (view as bug list) |
| Type | Bug |
| Last Closed | 2020-08-10 15:25:51 UTC |
Description
Shelly Miron
2020-07-01 10:48:49 UTC
Created attachment 1699474 [details]
openshift apiserver error msg
Created attachment 1699475 [details]
oc describe clusterversion
The MCO has a huge list of failures related to reaching the API:

2020-07-01T06:55:55.585755596Z E0701 06:55:55.585646 4995 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:56:24.98593869Z I0701 06:56:24.985814 4995 trace.go:116] Trace[919889828]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-07-01 06:55:54.984097113 +0000 UTC m=+762.181157899) (total time: 30.001675941s):
2020-07-01T06:56:24.98593869Z Trace[919889828]: [30.001675941s] [30.001675941s] END
2020-07-01T06:56:24.98593869Z E0701 06:56:24.985841 4995 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:57:01.074614704Z I0701 06:57:01.074458 4995 trace.go:116] Trace[1465987202]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2020-07-01 06:56:31.073558086 +0000 UTC m=+798.270618844) (total time: 30.000861779s):
2020-07-01T06:57:01.074614704Z Trace[1465987202]: [30.000861779s] [30.000861779s] END
2020-07-01T06:57:01.074614704Z E0701 06:57:01.074507 4995 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:57:51.896917749Z I0701 06:57:51.896831 4995 trace.go:116] Trace[1980435746]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-07-01 06:57:21.895861641 +0000 UTC m=+849.092922432) (total time: 30.000934291s):
2020-07-01T06:57:51.896917749Z Trace[1980435746]: [30.000934291s] [30.000934291s] END
2020-07-01T06:57:51.897039438Z E0701 06:57:51.897021 4995 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:58:30.44373067Z I0701 06:58:30.443647 4995 trace.go:116] Trace[1059014376]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2020-07-01 06:58:00.442651145 +0000 UTC m=+887.639711898) (total time: 30.000952587s):
2020-07-01T06:58:30.44373067Z Trace[1059014376]: [30.000952587s] [30.000952587s] END
2020-07-01T06:58:30.44373067Z E0701 06:58:30.443678 4995 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-07-01T06:58:54.623398605Z I0701 06:58:54.623263 4995 trace.go:116] Trace[2050729718]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-07-01 06:58:24.622346502 +0000 UTC m=+911.819407282) (total time: 30.000866922s):
2020-07-01T06:58:54.623398605Z Trace[2050729718]: [30.000866922s] [30.000866922s] END

Preliminary findings:
1) openshift-apiserver is reporting degraded because not all of its pods could be scheduled
2) pods could not be scheduled because not all master nodes are available
3) not all master nodes are available because of issues contacting the k8s apiserver (see MCO errors in comment 3)
4) MCO + Networking are also reporting degraded
5) the k8s apiserver itself is reporting available but degraded:

conditions:
- lastTransitionTime: "2020-07-01T09:32:15Z"
  message: |-
    InstallerPodContainerWaitingDegraded: Pod "installer-9-master-0-2" on node "master-0-2" container "installer" is waiting for 13m31.141901586s because ""
    InstallerPodNetworkingDegraded: Pod "installer-9-master-0-2" on node "master-0-2" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-9-master-0-2_openshift-kube-apiserver_e80c4c50-23e5-4332-ba9f-ab705ec3df67_0(b99028b166da8edf9beaa3d25b38251bb5b5b574c0e7a98d31a6d392eb42a054): Multus: [openshift-kube-apiserver/installer-9-master-0-2]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
  reason: InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox
  status: "True"
  type: Degraded
- lastTransitionTime: "2020-07-01T09:24:38Z"
  message: 'NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 10'
  reason: NodeInstaller
  status: "True"
  type: Progressing
- lastTransitionTime: "2020-06-30T14:38:08Z"
  message: 'StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8; 0 nodes have achieved new revision 10'
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2020-06-30T14:36:05Z"
  reason: AsExpected
  status: "True"
  type: Upgradeable

Seeing as rc.5 to rc.6 is updatable, I'm moving this to ON_QA to indicate it's being re-tested. If it works, we can close this one.

Per comment 6, the cause is network. Checked the must-gather (via cm/cluster-config-v1 in namespaces/kube-system/core/configmaps.yaml); it is a baremetal disconnected OVN env. Moving to the Networking component. BTW, I already triggered four baremetal disconnected envs, with and without OVN, for reproducing later.
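As a side note, all of the i/o timeouts above are against the in-cluster apiserver service VIP (172.30.0.1:443). A quick way to confirm whether that VIP is reachable from an affected master is sketched below; this is only a hedged example, assuming curl is available in the host namespace and using master-0-2 from this bug as the sample node:

$ oc debug node/master-0-2
sh-4.4# chroot /host
# A fast HTTP response (even 403) proves the service network is reachable from this node;
# a hang/timeout reproduces the failure mode in the MCO logs above.
sh-4.4# curl -k -sS --max-time 10 https://172.30.0.1:443/version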
After retesting - updating from rc.5 to rc.6 without the force flag - this is what happened:

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.5   True        True          11h     Unable to apply 4.5.0-rc.6: the image may not be safe to use

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.5   True        False         False      13h
cloud-credential                           4.5.0-rc.5   True        False         False      14h
cluster-autoscaler                         4.5.0-rc.5   True        False         False      13h
config-operator                            4.5.0-rc.5   True        False         False      13h
console                                    4.5.0-rc.5   True        False         False      13h
csi-snapshot-controller                    4.5.0-rc.5   True        False         False      13h
dns                                        4.5.0-rc.5   True        False         False      13h
etcd                                       4.5.0-rc.5   True        False         False      13h
image-registry                             4.5.0-rc.5   True        False         False      13h
ingress                                    4.5.0-rc.5   True        False         False      13h
insights                                   4.5.0-rc.5   True        False         False      13h
kube-apiserver                             4.5.0-rc.5   True        False         False      13h
kube-controller-manager                    4.5.0-rc.5   True        False         False      13h
kube-scheduler                             4.5.0-rc.5   True        False         False      13h
kube-storage-version-migrator              4.5.0-rc.5   True        False         False      13h
machine-api                                4.5.0-rc.5   True        False         False      13h
machine-approver                           4.5.0-rc.5   True        False         False      13h
machine-config                             4.5.0-rc.5   True        False         False      13h
marketplace                                4.5.0-rc.5   True        False         False      13h
monitoring                                 4.5.0-rc.5   True        False         False      13h
network                                    4.5.0-rc.5   True        False         False      13h
node-tuning                                4.5.0-rc.5   True        False         False      13h
openshift-apiserver                        4.5.0-rc.5   True        False         False      72m
openshift-controller-manager               4.5.0-rc.5   True        False         False      13h
openshift-samples                          4.5.0-rc.5   True        False         False      13h
operator-lifecycle-manager                 4.5.0-rc.5   True        False         False      13h
operator-lifecycle-manager-catalog         4.5.0-rc.5   True        False         False      13h
operator-lifecycle-manager-packageserver   4.5.0-rc.5   True        False         False      13h
service-ca                                 4.5.0-rc.5   True        False         False      13h
storage                                    4.5.0-rc.5   True        False         False      13h

But when updating with the force flag, the update succeeded.

InstallerPodNetworkingDegraded: Pod "installer-11-master-0-2" on node "master-0-2" observed degraded networking: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-11-master-0-2_openshift-kube-apiserver_569a19e5-fe46-4e34-9f5e-0ae67b259786_0(c4275101c2593ab24480e17d6b7d36b2b4001a16974d073633f948ffda0cbf11): Multus: [openshift-kube-apiserver/installer-11-master-0-2]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
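For reference, the forced update path used above roughly corresponds to the following oc invocation; this is a minimal sketch, where the registry host and digest are placeholders for the mirrored rc.6 release image, not the exact values used in this environment:

$ oc adm upgrade \
    --to-image=<mirror-registry>/ocp4/openshift-release@sha256:<rc.6-digest> \
    --allow-explicit-upgrade \
    --force
# --force bypasses release image verification (which is why the CVO otherwise reports
#   "the image may not be safe to use" in a disconnected env),
# --allow-explicit-upgrade permits moving to an image that is not in the recommended update graph.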
Sounds like the networking pod failed to get created due to a node/CRI-O issue?
Networking itself reports:
status:
conditions:
- lastTransitionTime: "2020-07-09T14:50:38Z"
message: DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making
progress - last change 2020-07-09T14:39:36Z
reason: RolloutHung
status: "True"
type: Degraded
- lastTransitionTime: "2020-07-09T11:29:21Z"
status: "True"
type: Upgradeable
- lastTransitionTime: "2020-07-09T14:38:30Z"
message: |-
DaemonSet "openshift-multus/multus-admission-controller" is not available (awaiting 1 nodes)
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
reason: Deploying
status: "True"
type: Progressing
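The RolloutHung condition above can usually be narrowed down to the specific node the DaemonSets are waiting on. A hedged sketch of the standard queries (nothing here is specific to this bug beyond the namespaces already mentioned):

$ oc -n openshift-ovn-kubernetes get daemonset ovnkube-node
$ oc -n openshift-ovn-kubernetes get pods -o wide | grep -v Running
$ oc -n openshift-multus get pods -o wide | grep -v Running
# The node hosting the non-Running ovnkube-node / multus pods is the one blocking the rollout.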
networking pod ovnkube-node-lqphr is showing:
- containerID: cri-o://d6fde6e77032e51c11a18e3e27440b684dea8256fb4fb80a9b44f63c0227a81f
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f3c2711b2f0e762862981c97143e2871b39af1bcde90fdbd5d7147b4a91b764
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f3c2711b2f0e762862981c97143e2871b39af1bcde90fdbd5d7147b4a91b764
lastState:
terminated:
containerID: cri-o://d6fde6e77032e51c11a18e3e27440b684dea8256fb4fb80a9b44f63c0227a81f
exitCode: 1
finishedAt: "2020-07-12T14:02:44Z"
message: |
+ [[ -f /env/master-0-2 ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
Unable to connect to the server: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host
+ db_ip=
reason: Error
startedAt: "2020-07-12T14:02:44Z"
name: ovnkube-node
So I guess I agree that this seems DNS related.
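The failing lookup of api-int.ocp-edge-cluster-0.qe.lab.redhat.com against 192.168.123.1:53 can be reproduced from the node itself. A minimal sketch, assuming getent and curl are available in the host namespace (treat that as an assumption):

$ oc debug node/master-0-2
sh-4.4# chroot /host
sh-4.4# cat /etc/resolv.conf
# shows which nameserver the node points at (192.168.123.1 per the error above)
sh-4.4# getent hosts api-int.ocp-edge-cluster-0.qe.lab.redhat.com
# empty output mirrors the "no such host" failure seen by ovnkube-node
sh-4.4# curl -k -sS --max-time 10 https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/healthz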
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

I think we accidentally forgot to pull this BZ from the errata. Re-opening.

The DNS Operator is available but indicates a progressing condition:
status:
conditions:
- lastTransitionTime: "2020-07-09T14:39:52Z"
message: All desired DNS DaemonSets available and operand Namespace exists
reason: AsExpected
status: "False"
type: Degraded
- lastTransitionTime: "2020-07-09T14:38:30Z"
message: At least 1 DNS DaemonSet is progressing.
reason: Reconciling
status: "True"
type: Progressing
- lastTransitionTime: "2020-07-09T11:35:36Z"
message: At least 1 DNS DaemonSet available
reason: AsExpected
status: "True"
type: Available
# One of the dns daemonset pods ("dns-default-4dbgg") is unavailable:
status:
currentNumberScheduled: 5
desiredNumberScheduled: 5
numberAvailable: 4
numberMisscheduled: 0
numberReady: 4
numberUnavailable: 1
observedGeneration: 2
updatedNumberScheduled: 5
# None of the containers in pod "dns-default-4dbgg" are ready:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-07-09T13:32:46Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2020-07-09T14:39:05Z"
message: 'containers with unready status: [dns kube-rbac-proxy dns-node-resolver]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2020-07-09T14:40:24Z"
message: 'containers with unready status: [dns kube-rbac-proxy dns-node-resolver]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2020-07-09T13:32:46Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c09633512a460fda547cd079565554ab79cbfbe767c827bba075f05b47e71d4a
imageID: ""
lastState: {}
name: dns
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b93a3f13057991466caf3ba6517493015299a856c6b752bd49b7d4c294312177
imageID: ""
lastState: {}
name: dns-node-resolver
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0b9be905dc8404760427a4bfbb9274545b2fb03774d85cd8ee5d93f847c69293
imageID: ""
lastState: {}
name: kube-rbac-proxy
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
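To see why those containers are stuck in ContainerCreating, the pod's events are usually the quickest signal; a hedged sketch using the pod name from this bug:

$ oc -n openshift-dns get pods -o wide
$ oc -n openshift-dns describe pod dns-default-4dbgg
# the Events section at the bottom surfaces the sandbox / CNI creation errors
$ oc -n openshift-dns get daemonset dns-default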
192.168.123.114 is the InternalIP address of node "master-0-2" where pod "dns-default-4dbgg" was scheduled. The node conditions are as expected:
conditions:
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2020-07-12T14:05:56Z"
lastTransitionTime: "2020-07-09T14:40:16Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
Events indicate an issue creating the pod network sandbox for pod "dns-default-4dbgg":
message: '(combined from similar events): Failed to create pod sandbox: rpc error:
code = Unknown desc = failed to create pod network sandbox k8s_dns-default-4dbgg_openshift-dns_df0adbd5-dc00-4367-b02e-07c62a925a4b_0(f771c552839c5276e622b6f0980a84f0ae496a90c39bab1b1157f7dc8d357a6d):
Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile:
timed out waiting for the condition'
CRI-O logs indicate the same error for dns pod "dns-default-4dbgg":
Jul 11 05:41:37.152475 master-0-2 crio[1821]: 2020-07-11T05:41:37Z [error] Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition
Jul 11 05:41:37.154347 master-0-2 crio[1821]: time="2020-07-11 05:41:37.154247983Z" level=error msg="Error deleting network: Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"
Jul 11 05:41:37.154347 master-0-2 crio[1821]: time="2020-07-11 05:41:37.154332566Z" level=error msg="Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"
Jul 11 05:41:37.154557 master-0-2 crio[1821]: time="2020-07-11 05:41:37.154451137Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_dns-default-4dbgg_openshift-dns_df0adbd5-dc00-4367-b02e-07c62a925a4b_0(9064208bb220d12adb8a12c24492db4aea36419f66f0f6b932a065925429ffb2): Multus: [openshift-dns/dns-default-4dbgg]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition" id=e6548c60-b32b-41dd-a4a7-ebde016067e7 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
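Those CRI-O messages come from the node journal; for reference, a hedged sketch of how to collect them (either form should work on 4.x, but treat the exact flags as an assumption):

$ oc adm node-logs master-0-2 -u crio | grep -i ReadinessIndicatorFile
# or, equivalently, from a debug shell on the node:
$ oc debug node/master-0-2 -- chroot /host journalctl -u crio --no-pager | grep -i ReadinessIndicatorFile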
This BZ appears to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1805444. Maybe the fix in BZ 1805444 needs to be ported to OVN? Reassigning to the SDN team for confirmation and further investigation.
Assigned to Doug to see if it's the same as (or similar to) the other issue that Dane found.

A `PollImmediate error waiting for ReadinessIndicatorFile` means that (in the context of ovn-kubernetes in OCP) the file `/var/run/multus/cni/net.d/10-ovn-kubernetes.conf` was not found by Multus CNI. This is the "readiness indicator file"; its absence indicates that the default network (in this case, ovn-kubernetes) is not ready, and that there may be some failure in the process that writes that CNI configuration file to disk. Without it, we can't be certain that OVN is ready to handle network traffic from workloads, so Multus waits for this readiness indication from the default network's CNI configuration file.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

(In reply to Eric Paris from comment #20)
> I think we accidentally forgot to pull this BZ from the errata. Re-opening.

Don't do that. Once it's been shipped in an errata, it can never be removed or shipped again. Bugs shipped by errata are intended to be immutable. It needs to be cloned to proceed. As the ET comment indicates:

> If the solution does not work for you, open a new bug report.

I've cloned it as https://bugzilla.redhat.com/show_bug.cgi?id=1867718

I'll return the bug to CLOSED ERRATA, although it is clearly not actually fixed.
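Building on the ReadinessIndicatorFile explanation above, a quick check on the affected node is whether ovn-kubernetes ever wrote its CNI config for Multus to pick up. This is only a sketch; the 00-multus.conf path is my assumption about where the Multus configuration normally lives on OCP 4.x and may differ by version:

$ oc debug node/master-0-2 -- chroot /host ls -l /var/run/multus/cni/net.d/
# expected to contain 10-ovn-kubernetes.conf once ovn-kubernetes is ready on this node
$ oc debug node/master-0-2 -- chroot /host cat /etc/kubernetes/cni/net.d/00-multus.conf
# the readiness indicator file that Multus polls for is referenced in this config (assumed path/field)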