Created attachment 1739437 [details]
Various debugging logs

Description of problem:
Following the instructions at https://docs.openshift.com/container-platform/4.6/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html on s390x leaves the network and authentication cluster operators stuck in a degraded state.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-s390x-2020-12-10-094353

How reproducible:
3 out of 5 clusters failed to install with OVN. The latest two failures were on 4.7.0-0.nightly-s390x-2020-12-10-094353, on both KVM and z/VM.

Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/4.6/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html on an s390x cluster
2. Reboot the nodes as shown in step 5
3. Log in as the system:admin user, since the authentication operator is failing

Actual results:
2 operators stuck in a progressing and degraded state (waited over 24 hours, no change)
88 pods stuck in ContainerCreating state

Expected results:
Cluster should recover and use OVN-Kubernetes networking

Additional info:
oc adm must-gather is failing. Other debugging logs are attached.
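For anyone following along, the cluster state summarized above can be re-checked with standard oc commands along these lines (just a sketch; these exact invocations are not part of the attached logs):

oc get clusteroperators
oc get pods -A -o wide | grep -v -e Running -e Completed    # lists the pods stuck in ContainerCreating
oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'    # which CNI the cluster reports
oc adm must-gather    # fails on this cluster, hence the individual logs attached instead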
After chatting with the bug creator: he was able to install OVN a week ago, but earlier this week he saw a degradation with installation on z/VM and KVM and is now encountering this bug. Therefore, I'm marking this bug as "Blocker+" for the moment. If the Networking team deems it not to be a blocker, please feel free to change the Blocker flag.
Hi Peng,

FYI: we assigned this to you since it concerns an SDN -> OVN migration. Feel free to dispatch it back to anyone else in case it turns out to be unrelated to the migration procedure. Also, once we have a better picture of what is causing the issue, we can assess whether it's a blocker or not.

/Alex
Actually, taking a quick look at the attachment: could you please describe which pod is failing in which networking namespace? Is it ovn-kubernetes? From the attachment it looks like multus is crash-looping.

Could you run:

oc get pod -A -owide
oc get co

and could you also get all logs for all pods in openshift-ovn-kubernetes?

/Alex
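If it saves some typing, a loop along these lines should dump every container log in the namespace (just a sketch using standard oc flags; pod names will differ per cluster):

for p in $(oc get pods -n openshift-ovn-kubernetes -o name); do
  echo "===== $p ====="
  oc logs -n openshift-ovn-kubernetes --all-containers=true "$p"
done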
Created attachment 1739680 [details] oc get pods -A -o wide
❯ oc get pods -n openshift-ovn-kubernetes
No resources found in openshift-ovn-kubernetes namespace.

❯ oc get pods -n openshift-sdn
No resources found in openshift-sdn namespace.

❯ oc get co
NAME                                        VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                              4.7.0-0.nightly-s390x-2020-12-10-094353   False       True          True       24h
baremetal                                   4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
cloud-credential                            4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
cluster-autoscaler                          4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
config-operator                             4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
console                                     4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
csi-snapshot-controller                     4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      2d23h
dns                                         4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d
etcd                                        4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
image-registry                              4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      4d22h
ingress                                     4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d7h
insights                                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
kube-apiserver                              4.7.0-0.nightly-s390x-2020-12-10-094353   True        True          False      5d1h
kube-controller-manager                     4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
kube-scheduler                              4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
kube-storage-version-migrator               4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d
machine-api                                 4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
machine-approver                            4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
machine-config                              4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d1h
marketplace                                 4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d
monitoring                                  4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      2d23h
network                                     4.7.0-0.nightly-s390x-2020-12-10-094353   True        True          True       5d1h
node-tuning                                 4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
openshift-apiserver                         4.7.0-0.nightly-s390x-2020-12-10-094353   False       False         False      24h
openshift-controller-manager                4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d10h
openshift-samples                           4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
operator-lifecycle-manager                  4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
operator-lifecycle-manager-catalog          4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
operator-lifecycle-manager-packageserver    4.7.0-0.nightly-s390x-2020-12-10-094353   False       True          False      24h
service-ca                                  4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
storage                                     4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
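To see why network and authentication report Degraded, the operator conditions can be dumped with standard oc commands roughly like the following (a sketch, nothing cluster-specific):

oc describe clusteroperator network authentication
oc get clusteroperator network -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'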
My apologies, there were no pods showing in the namespace because I was trying to roll back to OpenShift SDN. A fresh migration to OVN-Kubernetes shows:

❯ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-ks54v   6/6     Running            0          15m
ovnkube-master-kt7zw   6/6     Running            1          15m
ovnkube-master-mqbv4   6/6     Running            2          15m
ovnkube-node-29fk8     3/3     Running            0          15m
ovnkube-node-2tksm     2/3     CrashLoopBackOff   5          15m
ovnkube-node-4b9wc     3/3     Running            3          15m
ovnkube-node-dfwf6     3/3     Running            4          15m
ovnkube-node-mt5mp     2/3     CrashLoopBackOff   5          15m
ovs-node-2ssqr         1/1     Running            0          15m
ovs-node-d5bnp         1/1     Running            0          15m
ovs-node-hctzw         1/1     Running            0          15m
ovs-node-jlbkz         1/1     Running            0          15m
ovs-node-q2nt6         1/1     Running            0          15m

Logs to follow.
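For the two CrashLoopBackOff pods, the previous container's output is usually the useful part; something along these lines should capture it (the container name is assumed from the pod spec, so adjust as needed):

oc describe pod -n openshift-ovn-kubernetes ovnkube-node-2tksm
oc logs -n openshift-ovn-kubernetes ovnkube-node-2tksm -c ovnkube-node --previous    # container name is an assumption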
Created attachment 1739698 [details]
oc logs -n openshift-ovn-kubernetes --all-containers=true pod/ovnkube-node-mt5mp

Logs for the failed ovnkube pod.
Tried the OVN migration on a fresh z/VM install of OCP 4.7.0-0.nightly-s390x-2020-12-15-081322 and got the same issues, with the ovnkube pod logs showing the same errors as the attached KVM logs:

WARN|Bridge 'br-local' not found for network 'locnet
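Since the warning is about the br-local bridge missing for the localnet network, one thing worth checking on an affected node is what OVS actually has configured. A sketch using standard ovs-vsctl commands run through an ovs-node pod (the pod name here is just an example from the earlier listing):

oc exec -n openshift-ovn-kubernetes ovs-node-2ssqr -- ovs-vsctl list-br
oc exec -n openshift-ovn-kubernetes ovs-node-2ssqr -- ovs-vsctl get Open_vSwitch . external_ids:ovn-bridge-mappings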
Followed the updated OVN documentation steps and am still getting degraded network, openshift-apiserver, and authentication operators. Hitting an issue at step 10c.

# oc get pod -n openshift-machine-config-operator
NAME                                         READY   STATUS              RESTARTS   AGE
machine-config-controller-7685b58b68-bv95p   0/1     ContainerCreating   2          42h
machine-config-daemon-4tdmt                  2/2     Running             0          41h
machine-config-daemon-75vp8                  2/2     Running             0          42h
machine-config-daemon-fxkt7                  2/2     Running             0          42h
machine-config-daemon-gclhz                  2/2     Running             0          41h
machine-config-daemon-q72vl                  2/2     Running             0          42h
machine-config-operator-5ccbfcbdfd-b7r4b     0/1     ContainerCreating   1          42h
machine-config-server-b965g                  1/1     Running             0          42h
machine-config-server-bqfzj                  1/1     Running             0          42h
machine-config-server-r5gfq                  1/1     Running             0          42h

However, even as system:admin I cannot read the logs from the two pods that are stuck in ContainerCreating:

[root@ospamgr3 ovn-debug]# oc logs pod/machine-config-controller-7685b58b68-bv95p -n openshift-machine-config-operator
unable to retrieve container logs for cri-o://0ad931954727ba5d5e0def37a4c32e63d8c2a3d776d022ae2a552f49f26939ee
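When oc logs cannot reach the container, the pod events and the node's CRI-O view are usually the next stop. Roughly along these lines; <master-node> is only a placeholder for whichever node hosts the stuck pod:

oc describe pod -n openshift-machine-config-operator machine-config-controller-7685b58b68-bv95p
oc debug node/<master-node> -- chroot /host crictl ps -a | grep machine-config    # <master-node> is a placeholder
oc debug node/<master-node> -- chroot /host journalctl -u crio --since "1 hour ago"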
Created attachment 1744353 [details] oc describe co network
Created attachment 1744354 [details] oc describe co openshift-apiserver
Created attachment 1744357 [details] oc describe pods -n openshift-ovn-kubernetes
Re-assigning this bug to the Networking team to get their input on Comment 12 as the creator followed the updated documentation and observed the bug. Please re-assign if necessary.
ovn-kube node seems to be crashing with:

    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   hub.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:44 +0x7e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xde
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11f6c38]

goroutine 268 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x162
panic(0x13fff00, 0x2358ee0)
        /usr/lib/golang/src/runtime/panic.go:969 +0x16e
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1.1(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:45 +0x28
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0xc000744fa0, 0x1497540, 0x0, 0x0)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:211 +0x66
k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0xc000787f20, 0xc0001c6fa0, 0xc000787f20, 0x0)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:445 +0x2a
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x1dcd6500, 0x45d964b800, 0xc000744fa0, 0x0, 0x0)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:441 +0x48
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1(0xc000182a00, 0xc0004c7d40, 0xc0002bfa70)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:44 +0x7e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xde
Is this a dupe of Bug 1908231? Same backtrace.
The multi-arch bug triage team looked through the bugs, and we think this bug is similar to BZ 1909187 found on Power.
Not sure if it's the same as 1909187, as I don't have any CSRs:

# oc get csr --all-namespaces
No resources found
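For completeness, if pending CSRs ever do show up (as in the 1909187 case), they can be listed and approved with standard oc commands along these lines; just a sketch, and obviously not applicable while the list above is empty:

oc get csr
oc get csr -o name | xargs oc adm certificate approve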
@Tom it looks like a dupe of Bug 1908231. Could you test with the latest 4.7 build?
I tested with OCP 4.7.0-0.nightly-s390x-2020-12-21-160105, the latest build available on the public mirror, and it still failed.
Update: it looks like this issue is fixed in the new build (Server Version: 4.7.0-0.nightly-s390x-2021-01-05-214454). Successfully installed OVN on z-KVM. Will close the issue once I verify there is no issue on z/VM as well.
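In case anyone wants to double-check the same thing on their own cluster, the active network type and pod health can be verified with standard oc commands roughly like these (a sketch, not a transcript from this cluster):

oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'    # expected to report OVNKubernetes
oc get pods -n openshift-ovn-kubernetes
oc get co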
Issue fixed on z/VM cluster as well. Thanks for the help.
*** This bug has been marked as a duplicate of bug 1908231 ***