Bug 1908076 - OVN installation degrades auth and network operator on s390x
Summary: OVN installation degrades auth and network operator on s390x
Keywords:
Status: CLOSED DUPLICATE of bug 1908231
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: s390x
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Peng Liu
QA Contact: Anurag saxena
URL:
Whiteboard: multi-arch
Depends On:
Blocks: ocp-47-z-tracker
 
Reported: 2020-12-15 19:46 UTC by Tom Dale
Modified: 2021-01-07 14:17 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-07 14:17:00 UTC
Target Upstream Version:
Embargoed:


Attachments:
Various debugging logs (28.10 KB, text/plain), 2020-12-15 19:46 UTC, Tom Dale
oc get pods -A -o wide (65.99 KB, text/plain), 2020-12-16 15:17 UTC, Tom Dale
oc logs -n openshift-ovn-kubernetes --all-containers=true pod/ovnkube-node-mt5mp (45.38 KB, text/plain), 2020-12-16 16:35 UTC, Tom Dale
oc describe co network (12.39 KB, text/plain), 2021-01-04 17:12 UTC, Tom Dale
oc describe co openshift-apiserver (5.13 KB, text/plain), 2021-01-04 17:13 UTC, Tom Dale
oc describe pods -n openshift-ovn-kubernetes (219.67 KB, text/plain), 2021-01-04 17:14 UTC, Tom Dale


Links:
Red Hat Bugzilla 1909187 (medium, CLOSED): OVNKube pods on multiple nodes keep restarting and crash. Last updated 2023-09-15 00:56:26 UTC

Description Tom Dale 2020-12-15 19:46:03 UTC
Created attachment 1739437 [details]
Various debugging logs

Description of problem:
Following the instructions at https://docs.openshift.com/container-platform/4.6/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html on s390x leaves the network and authentication cluster operators stuck in a degraded state.
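
(For reference, the migration described in that document is driven by a pair of patches along these lines; this is a paraphrased sketch, and the linked doc has the full procedure:

oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"spec":{"networkType":"OVNKubernetes"}}'

followed by rebooting all nodes.)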

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-s390x-2020-12-10-094353


How reproducible:
3 out of 5 clusters failed to install with OVN. The latest two failed on 4.7.0-0.nightly-s390x-2020-12-10-094353, on both KVM and z/VM.


Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/4.6/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html on s390x cluster
2. Reboot nodes as shown in step 5
3. Log in as the system:admin user, since the authentication operator is failing

Actual results:
2 operators stuck in a Progressing and Degraded state (waited over 24 hours, no change)
88 pods stuck in the ContainerCreating state


Expected results:
Cluster should recover and use OVNkube networking

Additional info: 
oc adm must-gather is failing. Other debugging logs are attached.
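
When must-gather fails outright, a narrower collection will sometimes still succeed; a sketch, assuming the network components are the interesting part:

oc adm inspect clusteroperator/network
oc adm inspect ns/openshift-ovn-kubernetes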

Comment 1 Dan Li 2020-12-16 14:39:43 UTC
After chatting with the bug creator: the user stated that he was able to install OVN a week ago; however, earlier this week he saw the installation degrade on z/VM and KVM, and is now encountering this bug.

Therefore, I'm marking this bug as "Blocker+" at the moment. If the Networking team deems it not to be a blocker, please feel free to change the Blocker flag.

Comment 2 Alexander Constantinescu 2020-12-16 15:00:10 UTC
Hi Peng 

FYI: we assigned this to you since it concerns an SDN -> OVN migration. Feel free to dispatch it back to anyone else in case it turns out to be unrelated to the migration procedure.

Also, once we have a better picture of what is causing the issue, we can assess if it's a blocker or not.

/Alex

Comment 3 Alexander Constantinescu 2020-12-16 15:05:52 UTC
Actually, taking a quick look at the attachment: could you please provide a description of which pod is failing in which networking namespace?

Is it ovn-kubernetes? From the attachment it seems multus is crash-looping.

Could you do:

oc get pod -A -owide 
oc get co 

Could you please also get all logs for all pods in openshift-ovn-kubernetes?
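
One way to pull those in a single pass (a sketch that just loops over whatever pods exist in the namespace):

for pod in $(oc get pods -n openshift-ovn-kubernetes -o name); do
  oc logs -n openshift-ovn-kubernetes --all-containers=true "$pod"
done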

/Alex

Comment 4 Tom Dale 2020-12-16 15:17:50 UTC
Created attachment 1739680 [details]
oc get pods -A -o wide

Comment 5 Tom Dale 2020-12-16 15:20:06 UTC
❯ oc get pods -n openshift-ovn-kubernetes
No resources found in openshift-ovn-kubernetes namespace.

❯ oc get pods -n openshift-sdn
No resources found in openshift-sdn namespace.

❯ oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-s390x-2020-12-10-094353   False       True          True       24h
baremetal                                  4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
cloud-credential                           4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
cluster-autoscaler                         4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
config-operator                            4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
console                                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
csi-snapshot-controller                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      2d23h
dns                                        4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d
etcd                                       4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
image-registry                             4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      4d22h
ingress                                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d7h
insights                                   4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
kube-apiserver                             4.7.0-0.nightly-s390x-2020-12-10-094353   True        True          False      5d1h
kube-controller-manager                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
kube-scheduler                             4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
kube-storage-version-migrator              4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d
machine-api                                4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
machine-approver                           4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
machine-config                             4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d1h
marketplace                                4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d
monitoring                                 4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      2d23h
network                                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        True          True       5d1h
node-tuning                                4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
openshift-apiserver                        4.7.0-0.nightly-s390x-2020-12-10-094353   False       False         False      24h
openshift-controller-manager               4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      3d10h
openshift-samples                          4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
operator-lifecycle-manager                 4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-s390x-2020-12-10-094353   False       True          False      24h
service-ca                                 4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h
storage                                    4.7.0-0.nightly-s390x-2020-12-10-094353   True        False         False      5d1h

Comment 6 Tom Dale 2020-12-16 16:34:54 UTC
My apologies, there were no pods showing in the namespaces because I was trying to revert to OpenShiftSDN.

A fresh migration to OVN-Kubernetes shows:

❯ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-ks54v   6/6     Running            0          15m
ovnkube-master-kt7zw   6/6     Running            1          15m
ovnkube-master-mqbv4   6/6     Running            2          15m
ovnkube-node-29fk8     3/3     Running            0          15m
ovnkube-node-2tksm     2/3     CrashLoopBackOff   5          15m
ovnkube-node-4b9wc     3/3     Running            3          15m
ovnkube-node-dfwf6     3/3     Running            4          15m
ovnkube-node-mt5mp     2/3     CrashLoopBackOff   5          15m
ovs-node-2ssqr         1/1     Running            0          15m
ovs-node-d5bnp         1/1     Running            0          15m
ovs-node-hctzw         1/1     Running            0          15m
ovs-node-jlbkz         1/1     Running            0          15m
ovs-node-q2nt6         1/1     Running            0          15m

Logs to follow.

Comment 7 Tom Dale 2020-12-16 16:35:48 UTC
Created attachment 1739698 [details]
oc logs -n openshift-ovn-kubernetes --all-containers=true pod/ovnkube-node-mt5mp

Logs for the failed ovnkube pod.

Comment 8 Tom Dale 2020-12-16 18:51:49 UTC
Tried the OVN migration on a fresh OCP 4.7.0-0.nightly-s390x-2020-12-15-081322 z/VM install and got the same issues, with the ovnkube pod logs showing the same errors as the attached KVM logs:

WARN|Bridge 'br-local' not found for network 'locnet'

Comment 12 Tom Dale 2021-01-04 17:11:47 UTC
Followed the updated OVN migration documentation and am still getting degraded network, openshift-apiserver, and authentication operators.

Hitting the issue at step 10c.
# oc get pod -n openshift-machine-config-operator
NAME                                         READY   STATUS              RESTARTS   AGE
machine-config-controller-7685b58b68-bv95p   0/1     ContainerCreating   2          42h
machine-config-daemon-4tdmt                  2/2     Running             0          41h
machine-config-daemon-75vp8                  2/2     Running             0          42h
machine-config-daemon-fxkt7                  2/2     Running             0          42h
machine-config-daemon-gclhz                  2/2     Running             0          41h
machine-config-daemon-q72vl                  2/2     Running             0          42h
machine-config-operator-5ccbfcbdfd-b7r4b     0/1     ContainerCreating   1          42h
machine-config-server-b965g                  1/1     Running             0          42h
machine-config-server-bqfzj                  1/1     Running             0          42h
machine-config-server-r5gfq                  1/1     Running             0          42h

However, even as system:admin, I cannot read the logs from the two pods that are stuck in the ContainerCreating state:

[root@ospamgr3 ovn-debug]# oc logs pod/machine-config-controller-7685b58b68-bv95p -n openshift-machine-config-operator
unable to retrieve container logs for cri-o://0ad931954727ba5d5e0def37a4c32e63d8c2a3d776d022ae2a552f49f26939ee
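
A pod stuck in ContainerCreating usually has no container logs to retrieve yet, so the Events section of a describe is typically more informative, e.g.:

oc describe pod/machine-config-controller-7685b58b68-bv95p -n openshift-machine-config-operator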

Comment 13 Tom Dale 2021-01-04 17:12:57 UTC
Created attachment 1744353 [details]
oc describe co network

Comment 14 Tom Dale 2021-01-04 17:13:30 UTC
Created attachment 1744354 [details]
oc describe co openshift-apiserver

Comment 15 Tom Dale 2021-01-04 17:14:21 UTC
Created attachment 1744357 [details]
oc describe pods -n openshift-ovn-kubernetes

Comment 16 Dan Li 2021-01-04 17:52:02 UTC
Re-assigning this bug to the Networking team to get their input on Comment 12 as the creator followed the updated documentation and observed the bug. Please re-assign if necessary.

Comment 17 Dan Williams 2021-01-05 15:01:40 UTC
ovnkube-node seems to be crashing with:

    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   hub.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:44 +0x7e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xde
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11f6c38]

goroutine 268 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x162
panic(0x13fff00, 0x2358ee0)
  /usr/lib/golang/src/runtime/panic.go:969 +0x16e
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1.1(0x0, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:45 +0x28
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0xc000744fa0, 0x1497540, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:211 +0x66
k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0xc000787f20, 0xc0001c6fa0, 0xc000787f20, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:445 +0x2a
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x1dcd6500, 0x45d964b800, 0xc000744fa0, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:441 +0x48
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1(0xc000182a00, 0xc0004c7d40, 0xc0002bfa70)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:44 +0x7e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xde

Comment 18 Dan Williams 2021-01-05 15:04:08 UTC
Is this a dupe of Bug 1908231? Same backtrace.

Comment 19 Dan Li 2021-01-05 15:43:26 UTC
The multi-arch bug triage team looked through the bugs, and we think that this bug is similar to BZ 1909187, found on Power.

Comment 20 Tom Dale 2021-01-05 16:00:16 UTC
Not sure if it's the same as 1909187, as I don't have any CSRs:

# oc get csr --all-namespaces
No resources found
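
(For reference: if pending CSRs were the problem, as in BZ 1909187, they would show up here and could be approved with something like this sketch:

oc get csr -o name | xargs oc adm certificate approve

but with no CSRs at all, that does not appear to be the failure mode here.)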

Comment 21 Peng Liu 2021-01-06 12:31:37 UTC
@Tom it looks like a dupe of Bug 1908231. Could you test with the latest 4.7 build?

Comment 22 Tom Dale 2021-01-06 15:01:03 UTC
Tested and failed with OCP 4.7.0-0.nightly-s390x-2020-12-21-160105, the latest build available on the public mirror.

Comment 23 Tom Dale 2021-01-06 19:11:03 UTC
Update: it looks like this issue is fixed in the new build,
Server Version: 4.7.0-0.nightly-s390x-2021-01-05-214454.
Successfully installed OVN on z-KVM. Will close the issue once I verify there is no issue on z/VM as well.
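
A quick sanity check that the migration took, as a sketch:

oc get network.config/cluster -o jsonpath='{.status.networkType}'
oc get co network authentication openshift-apiserver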

Comment 24 Tom Dale 2021-01-07 14:17:00 UTC
Issue fixed on z/VM cluster as well. Thanks for the help.

Comment 25 Tom Dale 2021-01-07 14:17:31 UTC

*** This bug has been marked as a duplicate of bug 1908231 ***

