Description of the problem:

Multinode dualstack spoke cluster (NMState static networking) in an IPv4 hub env fails to complete installation. The spoke cluster's network cluster operator is stuck with one ovnkube-master pod in CrashLoopBackOff:

network   4.9.0-fc.0   True   True   True   43m   DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-77lz4 is in CrashLoopBackOff State...

Release version:

ACM: 2.4.0-DOWNSTREAM-2021-10-11-09-00-19
Hub OCP: 4.9.0-0.nightly-2021-10-08-232649
Spoke OCP: 4.9.0-fc.0
- IPv4 hub + IPv4 libvirt env
- Dualstack spoke

Steps to reproduce:

1. Deploy an IPv4 OCP 4.9 hub with the RHACM 2.4 downstream snapshot + Assisted Service
2. Deploy a dualstack multinode spoke cluster with NMState static IP networking

Actual results:

- Spoke ClusterDeployment (CD) CR never shows complete
- Spoke cluster network cluster operator is stuck in CrashLoopBackOff

Expected results:

- Installation succeeds

Additional info:
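For reference, a minimal sketch of the checks used to see where the install is stuck, assuming the spoke's resources live in a namespace named spoke-cluster (namespace and resource names are illustrative):

```bash
# On the hub: check the spoke ClusterDeployment and AgentClusterInstall status
# ("spoke-cluster" namespace/name is illustrative)
oc get clusterdeployment -n spoke-cluster
oc describe agentclusterinstall -n spoke-cluster spoke-cluster

# On the spoke, using its kubeconfig: check cluster operator health
oc --kubeconfig kubeconfig-spoke get clusteroperators
```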
I successfully deployed a spoke cluster with the same scenario, except in an IPv6 disconnected hub, so perhaps there was a difference with the OCP version used in the spoke. I will continue to try to narrow this down, but am leaving this open for tracking.
When you see the CrashLoopBackOff next time, can you please gather the logs from the specific pod that is failing? I have already observed a similar issue that is reported to ovn-kubernetes in https://bugzilla.redhat.com/show_bug.cgi?id=2011502. I suspect this one is also something in ovn-k8s, but cannot confirm without logs or access to the environment.
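For the next occurrence, a hedged sketch of the commands that would capture what we need (the pod name below is the one from the description and will differ per run):

```bash
# List the ovnkube-master pods and see which one is crash-looping
oc --kubeconfig kubeconfig-spoke get pods -n openshift-ovn-kubernetes -o wide

# Capture events and per-container state for the failing pod
oc --kubeconfig kubeconfig-spoke describe pod -n openshift-ovn-kubernetes ovnkube-master-77lz4

# Capture current and previous logs from all containers in the pod
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-77lz4 --all-containers
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-77lz4 --all-containers --previous
```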
Reproduced:

[kni@provisionhost-0-0 ~]$ oc --kubeconfig kubeconfig-spoke get co
NAME                                        VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                              4.9.0-rc.4   True        False         False      9h
baremetal                                   4.9.0-rc.4   True        True          False      9h      Applying metal3 resources
cloud-controller-manager                    4.9.0-rc.4   True        False         False      10h
cloud-credential                            4.9.0-rc.4   True        False         False      10h
cluster-autoscaler                          4.9.0-rc.4   True        False         False      10h
config-operator                             4.9.0-rc.4   True        False         False      10h
console                                     4.9.0-rc.4   True        False         False      9h
csi-snapshot-controller                     4.9.0-rc.4   True        False         False      10h
dns                                         4.9.0-rc.4   True        False         False      9h
etcd                                        4.9.0-rc.4   True        False         False      9h
image-registry                              4.9.0-rc.4   True        False         False      9h
ingress                                     4.9.0-rc.4   True        False         False      9h
insights                                    4.9.0-rc.4   True        False         False      9h
kube-apiserver                              4.9.0-rc.4   True        False         False      9h
kube-controller-manager                     4.9.0-rc.4   True        False         False      9h
kube-scheduler                              4.9.0-rc.4   True        False         False      9h
kube-storage-version-migrator               4.9.0-rc.4   True        False         False      10h
machine-api                                 4.9.0-rc.4   True        False         False      9h
machine-approver                            4.9.0-rc.4   True        False         False      9h
machine-config                              4.9.0-rc.4   True        False         False      9h
marketplace                                 4.9.0-rc.4   True        False         False      10h
monitoring                                  4.9.0-rc.4   True        False         False      9h
network                                     4.9.0-rc.4   True        True          True       10h     DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-nqvs4 is in CrashLoopBackOff State
node-tuning                                 4.9.0-rc.4   True        False         False      9h
openshift-apiserver                         4.9.0-rc.4   True        False         False      9h
openshift-controller-manager                4.9.0-rc.4   True        False         False      9h
openshift-samples                           4.9.0-rc.4   True        False         False      9h
operator-lifecycle-manager                  4.9.0-rc.4   True        False         False      9h
operator-lifecycle-manager-catalog          4.9.0-rc.4   True        False         False      10h
operator-lifecycle-manager-packageserver    4.9.0-rc.4   True        False         False      9h
service-ca                                  4.9.0-rc.4   True        False         False      10h
storage                                     4.9.0-rc.4   True        False         False      10h

[kni@provisionhost-0-0 ~]$ oc --kubeconfig kubeconfig-spoke get pod -n openshift-ovn-kubernetes|grep -v Run|grep -v Comple
NAME                   READY   STATUS             RESTARTS      AGE
ovnkube-master-nqvs4   5/6     CrashLoopBackOff   8 (16s ago)   11m
From the container logs I can see

```
F1014 15:05:42.710703 1 ovndbchecker.go:118] unable to turn on memory trimming for SB DB, stderr: 2021-10-14T15:05:42Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (Connection refused)
```

I'm not sure anything can be done from the Assisted Installer (AI) side.
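The failure suggests the southbound DB in the same pod never opened its control socket. A sketch of what I'd look at next, assuming the 4.9 ovnkube-master pod layout (the sbdb / ovn-dbchecker container names are assumptions and may differ by release):

```bash
# Logs of the southbound DB container that should own /var/run/ovn/ovnsb_db.ctl
# (container name "sbdb" is an assumption)
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-nqvs4 -c sbdb

# Previous logs of the crashing dbchecker container (name "ovn-dbchecker" is an assumption)
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-nqvs4 -c ovn-dbchecker --previous

# Check whether the SB DB control socket actually exists inside the pod
oc --kubeconfig kubeconfig-spoke exec -n openshift-ovn-kubernetes ovnkube-master-nqvs4 -c sbdb -- ls -l /var/run/ovn/
```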
The attached must-gather does not contain the directory `namespaces/openshift-ovn-kubernetes`, whereas above we have the output of `oc --kubeconfig kubeconfig-spoke get pod -n openshift-ovn-kubernetes`. The two do not match; in order to investigate further we need a must-gather from the cluster that actually failed and that matches the reported error messages. From the event filter I can see that there are problems with OVN-K8s, but with no *full* logs from the failing pod nothing more can be said.

Given that there is no immediate access to the affected cluster, I'm dropping the URGENT priority. There is nothing that can be done from the engineering side right now, and URGENT should be used for issues that need a developer's attention "right here, right now".
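A hedged sketch of how a matching must-gather could be collected directly from the failing spoke (the extra gather_network_logs invocation is an assumption about the default must-gather image; the default gather alone should already include the openshift-ovn-kubernetes namespace):

```bash
# Default must-gather from the spoke; this should include
# namespaces/openshift-ovn-kubernetes/ with the pod logs we are missing
oc --kubeconfig kubeconfig-spoke adm must-gather

# Optionally, the network-focused gather script for OVN database and log dumps
oc --kubeconfig kubeconfig-spoke adm must-gather -- gather_network_logs
```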
This is not reproduced consistently - we will try to reproduce it, keep an eye out, and grab the logs when it happens again.
Will continue to track this in bz2022144, which is already in the OpenShift networking product. *** This bug has been marked as a duplicate of bug 2022144 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days