Bug 2223780

Summary: Multus: connection issue to NooBaa DB after restarting all pods in the openshift-storage ns
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Oded <oviner>
Component: rook
Assignee: Blaine Gardner <brgardne>
Status: CLOSED DUPLICATE
QA Contact: Neha Berry <nberry>
Severity: high
Priority: unspecified
Version: 4.13
CC: brgardne, ebenahar, muagarwa, odf-bz-bot, tnielsen
Target Milestone: ---
Target Release: ---
Flags: brgardne: needinfo? (oviner)
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2023-08-15 15:09:53 UTC
Type: Bug
Regression: ---

Description Oded 2023-07-18 21:48:43 UTC
Description of problem (please be as detailed as possible and provide log snippets):
1. Installed a cluster with Multus.
2. After restarting all the pods in the openshift-storage namespace, I found a connection issue to the NooBaa DB.
3. Tested the same procedure on a cluster without Multus, and everything worked as expected [storagecluster moved to Ready state].
4. The noobaa-core-0 pod is in Running state although we got this error:

Jul-18 14:51:49.698 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry connect EHOSTUNREACH 10.128.2.30:5432
Jul-18 14:51:52.698 [Upgrade/20]    [L0] core.util.postgres_client:: _connect: called with { max: 10, host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: 'arrvkPEp/3MbXA==', database: 'nbcore', port: 5432 }
Jul-18 14:51:55.778 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: connect EHOSTUNREACH 10.128.2.30:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.128.2.30',
  port: 5432
}
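
Since the client resolves 'noobaa-db-pg-0.noobaa-db-pg' and then fails to reach 10.128.2.30, one quick sanity check (a sketch for reference, not part of the original report; the service name is taken from the host shown in the log) is whether the headless service still publishes a current endpoint:

$ oc get endpoints noobaa-db-pg -n openshift-storage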

Version of all relevant components (if applicable):
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-07-18-041822
Platform: BM

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install LSO 4.13
2. Install ODF 4.13 with Multus, using the following NetworkAttachmentDefinitions (a quick check that they were created follows the YAML):
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
  labels: {}
  annotations: {}
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "enp1s0f1", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.20.0/24" } }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: openshift-storage
  labels: {}
  annotations: {}
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "enp1s0f1", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.30.0/24" } }'

3. Verify the storagecluster is in Ready state

4. Verify the Ceph status is OK

5. Restart all pods in the openshift-storage namespace:
$ oc delete pods --all -n openshift-storage
pod "csi-addons-controller-manager-7998b997-d6d8m" deleted
pod "csi-cephfsplugin-fl749" deleted

6. Check the storagecluster status -> [stuck in Progressing state]
$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   19m   Progressing              2023-07-18T14:28:43Z   4.13.0


Excerpt from the storagecluster status conditions:
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2023-07-18T14:48:17Z
    Last Transition Time:  2023-07-18T14:39:32Z
    Message:               Waiting on Nooba instance to finish initialization
    Reason:                NoobaaInitializing
    Status:                True
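
The full condition list behind the excerpt above can be dumped with something like (a sketch; resource name as shown in step 6):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml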

7. Check the NooBaa pods in the openshift-storage namespace:
$ oc get pods -l app=noobaa
NAME                               READY   STATUS    RESTARTS   AGE
noobaa-core-0                      1/1     Running   0          9m15s
noobaa-db-pg-0                     1/1     Running   0          8m44s
noobaa-endpoint-69c754f649-hgjmv   1/1     Running   0          9m45s
noobaa-operator-897469f66-6ghkl    1/1     Running   0          9m45s
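
For reference, whether these pods carry Multus network annotations can be checked with something like the following (a sketch; the annotation key shown is the one Multus normally writes, e.g. k8s.v1.cni.cncf.io/network-status):

$ oc get pod noobaa-db-pg-0 -n openshift-storage -o yaml | grep -A 20 'k8s.v1.cni.cncf.io'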

8. Although the noobaa-core-0 pod is in Running state, there is a connection issue to the NooBaa DB:
$ oc logs noobaa-core-0 
Jul-18 14:51:49.698 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry connect EHOSTUNREACH 10.128.2.30:5432
Jul-18 14:51:52.698 [Upgrade/20]    [L0] core.util.postgres_client:: _connect: called with { max: 10, host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: 'arrvkPEp/3MbXA==', database: 'nbcore', port: 5432 }
Jul-18 14:51:55.778 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: connect EHOSTUNREACH 10.128.2.30:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.128.2.30',
  port: 5432
}
Jul-18 14:51:55.778 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry connect EHOSTUNREACH 10.128.2.30:5432
Jul-18 14:51:58.779 [Upgrade/20]    [L0] core.util.postgres_client:: _connect: called with { max: 10, host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: 'arrvkPEp/3MbXA==', database: 'nbcore', port: 5432 }
Jul-18 14:51:58.850 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: connect EHOSTUNREACH 10.128.2.30:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.128.2.30',
  port: 5432
}
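
Given that noobaa-core keeps retrying the same address, two checks worth running at this point (a sketch, not from the original report) are whether the DB pod's IP still matches 10.128.2.30 and whether its containers produced any logs:

$ oc get pod noobaa-db-pg-0 -n openshift-storage -o wide
$ oc logs noobaa-db-pg-0 -n openshift-storage --all-containers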



Actual results:
Connection issue to noobaa DB

Expected results:
The storagecluster moves to Ready state after all the pods are restarted, as it does when the same procedure is run on a cluster without Multus.

Additional info:
OCS must-gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2223780.tar.gz

Comment 4 Blaine Gardner 2023-07-19 21:33:11 UTC
Noobaa is having trouble reaching the noobaa-db-pg-0 pod at 10.128.2.30. I've never come across a log like this, but it seems like the noobaa-db-pg-0 pod might not have a running container. I don't see any container logs for the pod, and it has this error from the kubelet:

  Warning  Failed  21m (x27 over 3h25m)  kubelet  (combined from similar events): Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_initialize-database_noobaa-db-pg-0_openshift-storage_57124fca-66a5-434e-9874-db1410cf0e27_0 for id 836db3365cfd19b7463e4a131a948042a2a9d249938a99a4790a0a26c5d44bb1: name is reserved
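
(For reference, events like the one above can be pulled with something like the following; namespace and pod name as elsewhere in this bug:)

$ oc describe pod noobaa-db-pg-0 -n openshift-storage
$ oc get events -n openshift-storage --field-selector involvedObject.name=noobaa-db-pg-0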

I don't see anything that suggests the issue is Multus related. That noobaa pod doesn't have a Multus IP.

Does this resolve if you try deleting the noobaa-db-pg-0 pod again?

Can you repro this issue, or was this a one-time thing?

Comment 5 Oded 2023-07-25 07:58:47 UTC
Does this resolve if you try deleting the noobaa-db-pg-0 pod again? --> 
I tried to delete the "noobaa-db-pg-0" pod twice, once with the force flag and once without it.
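
(For reference, the non-forced and forced variants would look roughly like this; pod and namespace names as above:)

$ oc delete pod noobaa-db-pg-0 -n openshift-storage
$ oc delete pod noobaa-db-pg-0 -n openshift-storage --grace-period=0 --force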


Can you repro this issue, or was this a one-time thing? -->
I reproduced it 3 times.

Comment 6 Blaine Gardner 2023-07-25 14:27:02 UTC
From a chat thread, Eran suggested that this looks like it could be a CRI-O issue:

  I found this issue https://github.com/cri-o/cri-o/issues/6185
  https://access.redhat.com/solutions/6499541
  
  Peter Hunt was the engineer who handled that issue, so it might be worth reaching out to him to get his opinion.

  There are several suggestions there for narrowing down the cause.
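
(For reference, one generic way to pull CRI-O logs from the affected node for this kind of triage; <node-name> is a placeholder and the journalctl window is arbitrary:)

$ oc debug node/<node-name> -- chroot /host journalctl -u crio --since "1 hour ago"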

Comment 8 Blaine Gardner 2023-08-07 15:25:41 UTC
I'm coming back to this and realize that I missed or mis-interpreted Elad's comment here: https://bugzilla.redhat.com/show_bug.cgi?id=2223780#c3

> To clarify, the restart of all pods in the openshift-storage namespace is required post the NAD configuration, based on instructions we got from Dev.

I vaguely recall mentioning this as a means of speeding up QE test efforts early on, when we were failing to get NADs configured correctly, but I forget some of the context. Someone took my recommendation to restart the openshift-storage pods after a NAD update too broadly. This is absolutely *not* a recommendation for Multus once ODF is installing or after it is installed. To be safe, we should assume that it is never safe to restart ODF pods when Multus is configured. If this is recommended anywhere in the ODF documents, we should update the recommendation so that the entire node is rebooted instead in Multus cases.

It *is* safe to reboot pods related to the multus validation tool, and that is the only exception.
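
A minimal sketch of the node-level reboot mentioned above (the node name is a placeholder; drain options depend on the workloads running there):

$ oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
$ oc debug node/<node-name> -- chroot /host systemctl reboot
$ oc adm uncordon <node-name>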

Comment 9 Mudit Agarwal 2023-08-08 05:32:16 UTC
What are the next steps for this? Did we follow https://bugzilla.redhat.com/show_bug.cgi?id=2223780#c6?

Comment 10 Blaine Gardner 2023-08-09 20:02:56 UTC
I don't think there is a need to follow comment 6. I believe the next step is to remove this test. The following BZ is tracking the feature that will coincide with this test: https://bugzilla.redhat.com/show_bug.cgi?id=2167974

Comment 11 Mudit Agarwal 2023-08-11 15:29:17 UTC
Ok, not a blocker then. Keeping it open.

Comment 12 Travis Nielsen 2023-08-15 15:09:53 UTC
Discussed with Blaine, closing since the work item is really tracked with https://bugzilla.redhat.com/show_bug.cgi?id=2167974

*** This bug has been marked as a duplicate of bug 2167974 ***