Bug 1936975 - VSphereProblemDetectorControllerDegraded: context canceled during upgrade to 4.7.0
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: All
OS: All
Target Milestone: ---
Target Release: 4.7.z
Assignee: Hemant Kumar
QA Contact: Qin Ping
Depends On: 1939555
Reported: 2021-03-09 15:25 UTC by David Hernández Fernández
Modified: 2021-10-28 06:39 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1939555 (view as bug list)
Last Closed: 2021-04-05 13:56:14 UTC
Target Upstream Version:

Attachments
mcodump (2.09 MB, application/gzip)
2021-03-11 14:27 UTC, David Hernández Fernández

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift vsphere-problem-detector pull 35 | 0 | None | open | [4.7] Bug 1936975: Fix deadlock when enqueing functions into the pool | 2021-03-24 14:49:32 UTC
Red Hat Product Errata | RHSA-2021:1005 | 0 | None | None | None | 2021-04-05 13:56:32 UTC

Comment 6 Sam Batschelet 2021-03-11 14:00:10 UTC
> The CSO status controller is repeatedly getting "context canceled" errors, though:

"context canceled" is a very generic error referring to a client timeout; it has nothing to do with etcd directly.

> Nearly 1/3 of the messages from within the etcd operator are this context canceled error. The only peculiar thing to me is why only the vsphere problem detector is reporting as degraded.

While I agree this chatter is distracting, it is not the root cause; etcd, the operand, is running fine.

> dns                                       4.6.18   True       False        False     6d
> machine-config                            4.6.18   True       False        False     5h3m
> network                                   4.6.18   True       False        False     8h57m

I think you want to understand this issue first, right? Why are MCO, network, and dns failing to upgrade? Perhaps the machine-config-daemon logs can tell us.

I am moving this to MCO because I am curious why, given the old version of MCO in the context of the upgrade, the operator is not reporting Degraded, or at a minimum Progressing.

Comment 7 David Hernández Fernández 2021-03-11 14:27:46 UTC
Created attachment 1762672

Comment 8 David Hernández Fernández 2021-03-11 15:19:41 UTC
I don't think this is related to MCO. CVO is just following the upgrade order, and those operators (dns, mco, network) are simply the next ones in the list once the storage operator issue is solved. The storage operator initially seemed to be failing due to an etcd storage spike, and it is now blocking the rest of the upgrade.

I think we're starting to go in circles because the root cause is not clear, but in my humble opinion we should be aiming for a workaround from storage/CVO instead.

Let me know if you need anything.

Comment 29 W. Trevor King 2021-03-24 15:46:08 UTC
I cleared blocker+ because the issue affects all existing 4.7.z releases [1]. While updating within 4.7.z does introduce some of the triggering problem-detector interruptions, the workaround of setting storage to Unmanaged should help folks who need to resolve the Degraded condition before they can update to a 4.7.z with the fix.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1939555#c3

Comment 31 Qin Ping 2021-03-25 09:22:21 UTC
Verified with: 4.7.0-0.nightly-2021-03-25-013802

Comment 35 errata-xmlrpc 2021-04-05 13:56:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

