Bug 2223780
| Summary: | Multus, Connection issue to Noobaa DB after reset all pods in openshift-storage ns | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | rook | Assignee: | Blaine Gardner <brgardne> |
| Status: | CLOSED DUPLICATE | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.13 | CC: | brgardne, ebenahar, muagarwa, odf-bz-bot, tnielsen |
| Target Milestone: | --- | Flags: | brgardne: needinfo? (oviner) |
| Target Release: | --- | | |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-15 15:09:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Noobaa is having trouble reaching the noobaa-db-pg-0 pod at 10.128.2.30. I've never come across a log like this, but it seems like the noobaa-db-pg-0 pod might not have a running container. I don't see any container logs for the pod, and it has this error from the kubelet:

```
Warning  Failed  21m (x27 over 3h25m)  kubelet  (combined from similar events): Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_initialize-database_noobaa-db-pg-0_openshift-storage_57124fca-66a5-434e-9874-db1410cf0e27_0 for id 836db3365cfd19b7463e4a131a948042a2a9d249938a99a4790a0a26c5d44bb1: name is reserved
```

I don't see anything that suggests the issue is Multus-related; that noobaa pod doesn't have a Multus IP. Does this resolve if you try deleting the noobaa-db-pg-0 pod again? Can you repro this issue, or was this a one-time thing?

> Does this resolve if you try deleting the noobaa-db-pg-0 pod again?

I tried to delete the "noobaa-db-pg-0" pod twice, with the force flag and without it.

> Can you repro this issue, or was this a one-time thing?

I reproduced it 3 times.

From a chat thread, Eran suggested that this looks like it could be a CRI-O issue:

I found this issue: https://github.com/cri-o/cri-o/issues/6185 (see also https://access.redhat.com/solutions/6499541). Peter Hunt was the engineer that handled that issue, so it might be worth reaching out to him and getting his opinion. There are several suggestions there to narrow down the cause.

I'm coming back to this and realize that I missed or misinterpreted Elad's comment here: https://bugzilla.redhat.com/show_bug.cgi?id=2223780#c3

> To clarify, the restart of all pods in the openshift-storage namespace is required post the NAD configuration, based on instructions we got from Dev.
I vaguely recall mentioning this as a means of speeding up QE test efforts early on, when we were failing to get NADs configured correctly, but I forget some of the context. Someone took my recommendation to restart the openshift-storage pods after a NAD update too broadly. This is absolutely *not* a recommendation for Multus once ODF is installing or installed. To be safe, we should always assume that it is never safe to restart ODF pods when Multus is configured. If this is recommended anywhere in the ODF documents, we should instead update the recommendation so that the entire node is rebooted in Multus cases. It *is* safe to restart pods related to the Multus validation tool, and that is the only exception.

What are the next steps for this? Did we follow https://bugzilla.redhat.com/show_bug.cgi?id=2223780#c6?

I don't think there is a need to follow comment 6. I believe the next step is to remove this test. This BZ is tracking the feature that will coincide with this test: https://bugzilla.redhat.com/show_bug.cgi?id=2167974

Ok, not a blocker then. Keeping it open.

Discussed with Blaine; closing, since the work item is really tracked with https://bugzilla.redhat.com/show_bug.cgi?id=2167974

*** This bug has been marked as a duplicate of bug 2167974 ***
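The recommendation above (reboot the whole node instead of restarting ODF pods when Multus is configured) maps onto the standard OpenShift node maintenance flow. A rough sketch, with the caveat that the node name is a placeholder, the drain flags depend on the workloads in the cluster, and this is not an officially documented ODF procedure:

```
# Sketch: reboot a node rather than deleting openshift-storage pods
# when Multus is configured. NODE is a placeholder for a real node name.
NODE=worker-0

# Mark the node unschedulable and evict its pods gracefully.
oc adm cordon "$NODE"
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Reboot the node itself instead of individual ODF pods.
oc debug "node/$NODE" -- chroot /host systemctl reboot

# Once the node is back, allow scheduling again.
oc adm uncordon "$NODE"
```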
Description of problem (please be detailed as possible and provide log snippets):

1. Installed a cluster with Multus.
2. After restarting all the pods in the openshift-storage namespace, I found a connection issue to the Noobaa DB.
3. Tested the same procedure on a cluster without Multus, and everything worked as expected [Storagecluster moved to Ready state].
4. The noobaa-core-0 pod is in Running state although we got this error:

```
Jul-18 14:51:49.698 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry connect EHOSTUNREACH 10.128.2.30:5432
Jul-18 14:51:52.698 [Upgrade/20] [L0] core.util.postgres_client:: _connect: called with { max: 10, host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: 'arrvkPEp/3MbXA==', database: 'nbcore', port: 5432 }
Jul-18 14:51:55.778 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: connect EHOSTUNREACH 10.128.2.30:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) { errno: -113, code: 'EHOSTUNREACH', syscall: 'connect', address: '10.128.2.30', port: 5432 }
```

Version of all relevant components (if applicable):
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-07-18-041822
Platform: BM

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue reproduce from the UI?
If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Install LSO 4.13.
2. Install ODF 4.13 with Multus:

```yaml
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "enp1s0f1",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: openshift-storage
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "enp1s0f1",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.30.0/24"
    }
  }'
```

3. Verify the storagecluster is in Ready state.
4. Verify Ceph status is OK.
5. Restart all pods in openshift-storage:

```
$ oc delete pods --all -n openshift-storage
pod "csi-addons-controller-manager-7998b997-d6d8m" deleted
pod "csi-cephfsplugin-fl749" deleted
```

6. Check the storagecluster status -> [stuck in Progressing state]:

```
$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   19m   Progressing              2023-07-18T14:28:43Z   4.13.0

Status:                True
Type:                  Available
Last Heartbeat Time:   2023-07-18T14:48:17Z
Last Transition Time:  2023-07-18T14:39:32Z
Message:               Waiting on Nooba instance to finish initialization
Reason:                NoobaaInitializing
Status:                True
```

7. Check the noobaa pods in the openshift-storage namespace:

```
$ oc get pods -l app=noobaa
NAME                               READY   STATUS    RESTARTS   AGE
noobaa-core-0                      1/1     Running   0          9m15s
noobaa-db-pg-0                     1/1     Running   0          8m44s
noobaa-endpoint-69c754f649-hgjmv   1/1     Running   0          9m45s
noobaa-operator-897469f66-6ghkl    1/1     Running   0          9m45s
```

8. Although the noobaa-core-0 pod is in Running state, there is a connection issue to the Noobaa DB.
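Given that the symptom is a pod IP becoming unreachable after the restart, it may help to confirm, between steps 2 and 5, that the NADs were actually admitted and to see which networks Multus wired into a given storage pod. A rough sketch of those checks (the pod name is taken from this report; `k8s.v1.cni.cncf.io/network-status` is the standard Multus annotation listing each attached interface and its assigned IPs):

```
# Confirm both NetworkAttachmentDefinitions exist in the namespace.
oc get network-attachment-definitions -n openshift-storage

# Inspect which networks Multus actually attached to a storage pod.
oc get pod noobaa-db-pg-0 -n openshift-storage \
  -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'
```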
```
$ oc logs noobaa-core-0
Jul-18 14:51:49.698 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry connect EHOSTUNREACH 10.128.2.30:5432
Jul-18 14:51:52.698 [Upgrade/20] [L0] core.util.postgres_client:: _connect: called with { max: 10, host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: 'arrvkPEp/3MbXA==', database: 'nbcore', port: 5432 }
Jul-18 14:51:55.778 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: connect EHOSTUNREACH 10.128.2.30:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) { errno: -113, code: 'EHOSTUNREACH', syscall: 'connect', address: '10.128.2.30', port: 5432 }
Jul-18 14:51:55.778 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry connect EHOSTUNREACH 10.128.2.30:5432
Jul-18 14:51:58.779 [Upgrade/20] [L0] core.util.postgres_client:: _connect: called with { max: 10, host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: 'arrvkPEp/3MbXA==', database: 'nbcore', port: 5432 }
Jul-18 14:51:58.850 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: connect EHOSTUNREACH 10.128.2.30:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) { errno: -113, code: 'EHOSTUNREACH', syscall: 'connect', address: '10.128.2.30', port: 5432 }
```

Actual results:
Connection issue to the Noobaa DB.

Expected results:
Tested the same procedure on a cluster without Multus, and everything worked as expected [Storagecluster moved to Ready state].

Additional info:
OCS MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2223780.tar.gz
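The EHOSTUNREACH errors above can also be checked directly, without waiting for the client's retry loop. A minimal sketch of a TCP probe (the pod name, namespace, and DB address are taken from the logs in this report; the probe uses bash's built-in `/dev/tcp` redirection, so it needs no extra tools inside the container image):

```shell
# probe_tcp: report whether a TCP endpoint accepts connections,
# using bash's /dev/tcp pseudo-device so it works in minimal images.
probe_tcp() {
  local host=$1 port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "unreachable"
  fi
}

# Run the probe from inside noobaa-core-0 against the DB address
# seen in the logs:
#   oc exec -n openshift-storage noobaa-core-0 -- bash -c \
#     "$(declare -f probe_tcp); probe_tcp 10.128.2.30 5432"
#
# Comparing that address with what the DB pod currently reports can
# show whether the client is retrying a stale pod IP after the restart:
#   oc get pod noobaa-db-pg-0 -n openshift-storage -o jsonpath='{.status.podIP}'
```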