Bug 2032510

Summary: Operator installation failed with 'Unknown failure' on 500-node OpenShift environment
Product: OpenShift Container Platform
Reporter: Murali Krishnasamy <murali>
Component: OLM
Assignee: Per da Silva <pegoncal>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: medium
CC: pegoncal
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-25 16:06:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Murali Krishnasamy 2021-12-14 15:34:27 UTC
Description of problem:

After scaling the bare-metal environment to 500 worker nodes, operator installation via OperatorHub or the CLI fails with an `unknown failure` message. The InstallPlan is never created and the installation stays stuck forever.

However, I could install operators on a smaller cluster (3 masters + 10 workers) running the same OCP release, so adding more workers seems to be a problem for OLM.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-21-105053 - Gets into 'Unknown failure' without an install plan
4.10.0-0.nightly-2021-12-12-184227 - Gets into 'Upgrade Pending' without an install plan

How reproducible:
Always with 400+ nodes

Steps to Reproduce:
1. Deploy a cluster with a 4.10 nightly build and at least 10 workers
2. Install operators; this succeeds
3. Scale up to at least 400 workers and try to reinstall any operator or install a new operator
4. The installation gets stuck with 'Unknown failure'

Actual results:
Installation gets stuck forever
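
For reference, the stuck state described here (a Subscription that never gets an InstallPlan) could be checked with something like the sketch below. It is not part of the original report: it assumes the JSON from `oc get subscriptions.operators.coreos.com -A -o json` has already been loaded into a dict, and the function name and sample data are hypothetical.

```python
def subscriptions_without_install_plan(subscription_list):
    """Return (namespace, name, state) for each Subscription that has
    no status.installPlanRef, i.e. no InstallPlan was created for it."""
    stuck = []
    for item in subscription_list.get("items", []):
        status = item.get("status") or {}
        if not status.get("installPlanRef"):
            meta = item.get("metadata") or {}
            stuck.append(
                (meta.get("namespace"), meta.get("name"), status.get("state"))
            )
    return stuck


# Hypothetical sample data shaped like a Kubernetes List of Subscriptions.
sample = {
    "items": [
        {
            "metadata": {"namespace": "openshift-operators", "name": "example-a"},
            "status": {"state": "UpgradePending"},
        },
        {
            "metadata": {"namespace": "openshift-operators", "name": "example-b"},
            "status": {
                "state": "AtLatestKnown",
                "installPlanRef": {"name": "install-abcde"},
            },
        },
    ]
}

for namespace, name, state in subscriptions_without_install_plan(sample):
    print(f"{namespace}/{name}: state={state}, no InstallPlan")
```

On a live cluster the dict would come from `json.load` on the `oc` output instead of the sample above.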

Expected results:
Operator installation should go through as normal


Additional info:
Seems to be related to bz1860185

Comment 5 Per da Silva 2022-01-11 02:39:32 UTC
Hi Murali,

Sorry for the delay in getting back to you. The end of the year was a hectic rush to the finish line. I wanted to ask you whether you still have the cluster up and running.
If you could still share the provisioner host details, it would be super appreciated. In the meantime we'll look into the must-gather to see if there's anything we can learn.

Cheers,

Per

Comment 6 Murali Krishnasamy 2022-01-11 17:06:27 UTC
Hey Per, 
Nope, we don't have the environment currently; the machines were temporarily allocated from scalelab.
I will get a smaller (120-node) lab allocation starting next week and will let you know if this is reproducible again.
Thanks,
Murali

Comment 7 Per da Silva 2022-01-19 19:45:44 UTC
Hey Murali,

Just wanted to touch base with you again on this issue. Any developments on your side?

Cheers,

Per

Comment 8 Per da Silva 2022-01-19 21:34:59 UTC
Bumping this down to medium/medium. Doesn't seem like a blocker atm.

Comment 9 Murali Krishnasamy 2022-01-19 22:32:12 UTC
Hey Per, 
I have a 120-node cluster but am testing something else on 4.9 GA. I will deploy a 4.10 nightly build after this and try to reproduce the problem. I will DM you once I find something useful.
A 500-node cluster with 4.9.12 was working fine, and I was able to install operators throughout.
Thanks,
Murali

Comment 10 Per da Silva 2022-01-21 03:22:26 UTC
You're a legend, Murali! Thank you ^^

Comment 11 Murali Krishnasamy 2022-01-24 20:51:05 UTC
Hey Per,

I tried it on a 120-node cluster running the 4.10.0-fc.2 build, and I am able to install operators without any issue.
Not sure if the problem was specific to that particular nightly release or something else. Anyway, I will open a fresh bz if I see it again.

Thanks,
Murali

Comment 12 Per da Silva 2022-01-25 16:06:28 UTC
Awesome, thank you, Murali! I'll close as NOTABUG and hopefully won't hear back from you on this matter XD