Bug 2032510 - Operator installation failed with 'Unknown failure' on 500 node OpenShift environment
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.10
Hardware: All
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Per da Silva
QA Contact: Jian Zhang
Reported: 2021-12-14 15:34 UTC by Murali Krishnasamy
Modified: 2022-01-25 16:06 UTC (History)

Last Closed: 2022-01-25 16:06:28 UTC



Description Murali Krishnasamy 2021-12-14 15:34:27 UTC
Description of problem:

After scaling the baremetal environment to 500 worker nodes, operator installation via OperatorHub or the CLI fails with an `unknown failure` message. The InstallPlan is never created and the installation stays stuck forever.

However, I could install operators on a smaller cluster (3 masters + 10 workers) running the same OCP release, so adding more workers seems to be a problem for OLM.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-21-105053 - Gets into 'Unknown failure' without an install plan
4.10.0-0.nightly-2021-12-12-184227 - Gets into 'Upgrade Pending' without an install plan

How reproducible:
Always with 400+ nodes

Steps to Reproduce:
1. Deploy a cluster with a 4.10 nightly build and at least 10 workers
2. Install operators; this succeeds
3. Add more workers (at least 400) and try to re-install any operator or install a new operator
4. The installation gets stuck with 'Unknown failure'

Actual results:
Installation gets stuck forever
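
For anyone trying to confirm the same symptom, the following is a minimal sketch (not part of the original report) of how the stuck state could be checked from a Subscription object. It assumes the Subscription has been fetched as JSON, and that the `status.state` and `status.installPlanRef` fields are populated the way OLM normally populates them. The helper names `summarize_subscription` and `fetch_subscription` and the sample values are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict


def summarize_subscription(sub: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the fields of an OLM Subscription relevant to a stuck install.

    `sub` is the parsed JSON of a Subscription object. Missing fields are
    tolerated and reported as None rather than raising.
    """
    status = sub.get("status") or {}
    plan_ref = status.get("installPlanRef") or {}
    return {
        "name": (sub.get("metadata") or {}).get("name"),
        # e.g. "UpgradePending" or "AtLatestKnown"
        "state": status.get("state"),
        # Name of the referenced InstallPlan, or None if none was created
        "install_plan": plan_ref.get("name"),
        "missing_install_plan": plan_ref.get("name") is None,
        # (type, status, message) for each reported condition
        "conditions": [
            (c.get("type"), c.get("status"), c.get("message"))
            for c in status.get("conditions") or []
        ],
    }


def fetch_subscription(name: str, namespace: str) -> Dict[str, Any]:
    """Fetch a Subscription with the `oc` CLI (assumes `oc` is logged in)."""
    result = subprocess.run(
        ["oc", "get", "subscriptions.operators.coreos.com", name,
         "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Hypothetical sample mirroring the symptom described above:
    # state is 'UpgradePending' but no InstallPlan is referenced.
    sample = {
        "metadata": {"name": "example-operator"},
        "status": {"state": "UpgradePending", "conditions": []},
    }
    print(summarize_subscription(sample))
```

On a live cluster, `summarize_subscription(fetch_subscription(name, namespace))` would report whether an InstallPlan is referenced. A result with `missing_install_plan` set to True alongside a non-final state would match the behaviour described here.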

Expected results:
Operator installation should go through as normal


Additional info:
seems to be related to bz1860185

Comment 5 Per da Silva 2022-01-11 02:39:32 UTC
Hi Murali,

Sorry for the delay in getting back to you. The end of the year was a hectic rush to the finish line. I wanted to ask you whether you still have the cluster up and running.
If you could still share the provisioner host details, it would be super appreciated. In the meantime we'll look into the must-gather to see if there's anything we can learn.

Cheers,

Per

Comment 6 Murali Krishnasamy 2022-01-11 17:06:27 UTC
Hey Per, 
Nope, we don't have the environment currently; the machines were temporarily allocated from scalelab.
I will get a smaller (120-node) lab allocation next week and will let you know if this is reproducible again.
Thanks,
Murali

Comment 7 Per da Silva 2022-01-19 19:45:44 UTC
Hey Murali,

Just wanted to touch base with you again on this issue. Any developments on your side?

Cheers,

Per

Comment 8 Per da Silva 2022-01-19 21:34:59 UTC
Bumping this down to medium/medium. Doesn't seem like a blocker atm.

Comment 9 Murali Krishnasamy 2022-01-19 22:32:12 UTC
Hey Per, 
I have a 120-node cluster but am testing something else on 4.9 GA. I will deploy a 4.10 nightly build after this and try to reproduce the problem. I will DM you once I find something useful.
A 500-node cluster with 4.9.12 was working fine and was able to install operators throughout.
Thanks,
Murali

Comment 10 Per da Silva 2022-01-21 03:22:26 UTC
You're a legend, Murali! Thank you ^^

Comment 11 Murali Krishnasamy 2022-01-24 20:51:05 UTC
Hey Per,

I tried it on a 120-node cluster running the 4.10.0-fc.2 build, and I am able to install operators without any issue.
Not sure if the problem was only on that particular nightly release or something else. Anyway, I will open a fresh bz if I see it again.

Thanks,
Murali

Comment 12 Per da Silva 2022-01-25 16:06:28 UTC
Awesome, thank you, Murali! I'll close as NOTABUG and hopefully won't hear back from you on this matter XD

