Bug 1936975

Summary: VSphereProblemDetectorControllerDegraded: context canceled during upgrade to 4.7.0
Product: OpenShift Container Platform
Reporter: David Hernández Fernández <dahernan>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Storage sub component: Operators
QA Contact: Qin Ping <piqin>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aos-bugs, chuffman, hekumar, jsafrane, lmohanty, skolicha, wduan, wking
Version: 4.7
Keywords: Upgrades
Target Release: 4.7.z
Hardware: All
OS: All
Clone Of:
: 1939555 (view as bug list)
Last Closed: 2021-04-05 13:56:14 UTC
Type: Bug
Bug Depends On: 1939555
Attachments:
Description    Flags
mcodump        none

Comment 6 Sam Batschelet 2021-03-11 14:00:10 UTC
> The CSO status controller is repeatedly getting "context canceled" errors, though:

"context canceled" is a very generic error that refers to a client-side cancellation or timeout; it has nothing to do with etcd directly.

> Nearly 1/3 of the messages from within the etcd operator are this context canceled error. The only peculiar thing to me is why only the vsphere problem detector is reporting as degraded.

While I agree this chatter is distracting, it is not the root cause; etcd, the operand, is running fine.

> dns                                       4.6.18   True       False        False     6d
> machine-config                            4.6.18   True       False        False     5h3m
> network                                   4.6.18   True       False        False     8h57m

I think you want to understand this issue first, right? Why are MCO, network, and dns failing to upgrade? Perhaps the machine-config-daemon logs can tell us.

I am moving this to MCO because I am curious why, given that MCO is still at the old version in the middle of an upgrade, the operator is not Degraded or, at a minimum, Progressing.

Comment 7 David Hernández Fernández 2021-03-11 14:27:46 UTC
Created attachment 1762672 [details]
mcodump

Comment 8 David Hernández Fernández 2021-03-11 15:19:41 UTC
I don't think it's related to MCO. CVO is just following the upgrade order, and those operators (dns, MCO, network) are simply the next ones in the list once the storage operator issue is solved. The storage operator seemed to be failing because of an etcd storage spike, but now it is blocking the rest of the upgrade.

I think we're starting to run in circles because the root cause is not clear. In my humble opinion, we should aim for a workaround from storage/CVO instead.

Let me know if you need anything.

Comment 29 W. Trevor King 2021-03-24 15:46:08 UTC
I cleared blocker+ because the issue affects all existing 4.7.z releases [1]. While updating within 4.7.z does introduce some of the triggering problem-detector interruptions, the workaround of setting storage Unmanaged should help folks who need to resolve the Degraded condition before they can update to a 4.7.z with the fix.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1939555#c3

Comment 31 Qin Ping 2021-03-25 09:22:21 UTC
Verified with: 4.7.0-0.nightly-2021-03-25-013802

Comment 35 errata-xmlrpc 2021-04-05 13:56:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1005