Bug 2122785

Summary: redhat-operators CatalogSource is not present after a few reboots
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Shay Rozen <srozen>
Component: distribution Assignee: Boris Ranto <branto>
Status: CLOSED NOTABUG QA Contact: Petr Balogh <pbalogh>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.11 CC: aeyal, bniver, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov
Target Milestone: --- Flags: muagarwa: needinfo? (srozen)
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-10-26 06:16:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Shay Rozen 2022-08-30 19:07:34 UTC
Description of problem:
On SNO (not tested on an ODF cluster), the CatalogSource is sometimes missing after a reboot. A new automated test reboots the cluster 10 times and counts how many times the CatalogSource is not present afterwards. The failure is intermittent: in the last run it was missing after 2 of the 10 reboots.
This may be filed against the wrong component; please advise.

Version-Release number of selected component (if applicable):
OCP 4.11.0-0.nightly-2022-08-04-081314
Any ODF CatalogSource.

How reproducible:
Intermittent; it reproduces with repeated reboots (2 out of 10 in the last run).

Steps to Reproduce:
1. Install OCP and add a CatalogSource.
2. In my case I'm installing LVM, but I'm not sure it is related.
3. Reboot the cluster and check for the CatalogSource.
4. Repeat step 3.
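The reboot-and-check loop in steps 3 and 4 can be sketched roughly as below. This is a minimal sketch, not the ocs-ci implementation: `count_missing_after_reboots`, `reboot` and `is_present` are hypothetical names. The `reboot` hook is assumed to reboot the node and wait for the API server to come back, and `is_present` is assumed to query the cluster for the CatalogSource.

```python
from typing import Callable


def count_missing_after_reboots(
    reboot: Callable[[], None],
    is_present: Callable[[], bool],
    iterations: int = 10,
) -> int:
    """Reboot `iterations` times and count how often the CatalogSource is missing.

    Both hooks are hypothetical: `reboot` should reboot the node and wait for
    the cluster API to be reachable again; `is_present` should check for the
    CatalogSource (e.g. redhat-operators) after each reboot.
    """
    missing = 0
    for _ in range(iterations):
        reboot()
        if not is_present():
            missing += 1
    return missing


# Example with fake hooks: the check reports "missing" twice out of 10 runs.
results = iter([True, False, True, True, False, True, True, True, True, True])
missing = count_missing_after_reboots(lambda: None, lambda: next(results))
```

With these fake hooks the function returns 2, i.e. the "2 out of 10" case. One design question is how long to wait after each reboot before checking: checking too early may count a CatalogSource whose registry pod is still starting as missing.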

Actual results:
The test, which performs 10 reboots, reports:
ocs_ci.ocs.exceptions.CatalogSourceNotFoundAfterReboot: Catalogsource redhat-operators not found in 2 out of 10 reboots

Expected results:
CatalogSource should be present after all reboots.
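A presence check of this kind could look roughly like the following sketch. It is an assumption, not the test's actual code: it parses the JSON printed by `oc get catalogsource -n openshift-marketplace -o json` and looks for an item whose `metadata.name` matches. The function name `catalogsource_present` is hypothetical.

```python
import json


def catalogsource_present(list_json: str, name: str = "redhat-operators") -> bool:
    """Return True if a CatalogSource called `name` appears in a Kubernetes list.

    `list_json` is assumed to be the output of
    `oc get catalogsource -n openshift-marketplace -o json`, i.e. an object
    with an "items" array whose entries carry metadata.name.
    """
    data = json.loads(list_json)
    return any(
        item.get("metadata", {}).get("name") == name
        for item in data.get("items", [])
    )
```

If the test actually checks readiness rather than existence (for example `status.connectionState.lastObservedState == "READY"` on the CatalogSource), the predicate would need to inspect that field instead of only the name.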

Additional info:
The test gathers events related to redhat-operators:
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            4h22m       Warning   ResolutionFailed                             namespace/openshift-storage                                           constraints not satisfiable: no operators found from catalog redhat-operators in namespace openshift-marketplace referenced by subscription odf-lvm-operator, subscription odf-lvm-operator exists
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** openshift-marketplace                              4h16m       Warning   Unhealthy                                    pod/redhat-operators-scr7p                                            Startup probe failed: command timed out
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** openshift-marketplace                              4h12m       Warning   Unhealthy                                    pod/redhat-operators-scr7p                                            Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 6c5824f03e50a4cc4991171685e67bf9803448bf496963aaab5458bfde7d6eda is running failed: container process not found
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** openshift-marketplace                              4h11m       Warning   Unhealthy                                    pod/redhat-operators-rbstl                                            Startup probe failed: timeout: failed to connect service ":50051" within 1s
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** openshift-marketplace                              170m        Warning   Unhealthy                                    pod/redhat-operators-nwq42                                            Startup probe failed: timeout: failed to connect service ":50051" within 1s
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            158m        Warning   ResolutionFailed                             namespace/openshift-storage                                           [failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.178.95:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.105.221:50051: connect: connection refused"]
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            152m        Warning   ResolutionFailed                             namespace/openshift-storage                                           [failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.178.95:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.228.141:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.105.221:50051: connect: connection refused"]
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            148m        Warning   ResolutionFailed                             namespace/openshift-storage                                           [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.105.221:50051: connect: connection refused", failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.178.95:50051: connect: connection refused", failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.120.80:50051: connect: connection refused"]
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            148m        Warning   ResolutionFailed                             namespace/openshift-storage                                           failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.178.95:50051: connect: connection refused"
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            141m        Warning   ResolutionFailed                             namespace/openshift-storage                                           [failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.178.95:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.105.221:50051: connect: connection refused"]
[2022-08-30T17:35:56.214Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** openshift-marketplace                              141m        Warning   Unhealthy                                    pod/redhat-operators-nwq42                                            Startup probe failed: timeout: failed to connect service ":50051" within 1s
[2022-08-30T17:35:56.471Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            2m36s       Warning   ResolutionFailed                             namespace/openshift-storage                                           [failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.228.141:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.105.221:50051: connect: connection refused", failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.178.95:50051: connect: connection refused"]
[2022-08-30T17:35:56.471Z] 17:35:56 - MainThread - test_lvm_node_reboot_catalogsource - INFO  - ** default                                            2m9s        Warning   ResolutionFailed                             namespace/openshift-storage                                           constraints not satisfiable: no operators found from catalog redhat-operators in namespace openshift-marketplace referenced by subscription odf-lvm-operator, subscription odf-lvm-operator exists
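To see at a glance which catalog sources were unreachable in each ResolutionFailed event, the messages can be parsed with a regular expression. This is a hypothetical helper (`refused_catalog_sources` is not part of ocs-ci), and the pattern is assumed from the message format in the events above, so other OLM error wordings will not match.

```python
import re

# Each fragment looks like:
#   failed to populate resolver cache from source <name>/<namespace>: ...
#   dial tcp <ip>:<port>: connect: connection refused
# The tempered token (?:(?!failed to populate).)*? keeps a match from
# running into the next fragment of the same event message.
_SOURCE_RE = re.compile(
    r"failed to populate resolver cache from source "
    r"(?P<source>[\w.-]+)/(?P<namespace>[\w.-]+): "
    r"(?:(?!failed to populate).)*?"
    r"dial tcp (?P<addr>[\d.]+:\d+): connect: connection refused"
)


def refused_catalog_sources(event_message: str) -> dict[str, str]:
    """Map each catalog source in a ResolutionFailed message to the address that refused."""
    return {m.group("source"): m.group("addr") for m in _SOURCE_RE.finditer(event_message)}
```

Applied to the events above, this yields the redhat-operators, community-operators, certified-operators and redhat-marketplace sources, each refusing connections on port 50051 of its service IP.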