Bug 1936975

Summary:

VSphereProblemDetectorControllerDegraded: context canceled during upgrade to 4.7.0

Product:

OpenShift Container Platform

Reporter:

David Hernández Fernández <dahernan>

Component:

Storage

Assignee:

Hemant Kumar <hekumar>

Storage sub component:

Operators

QA Contact:

Qin Ping <piqin>

Status:

CLOSED ERRATA

Severity:

high

Priority:

high

CC:

aos-bugs, chuffman, hekumar, jsafrane, lmohanty, skolicha, wduan, wking

Version:

4.7

Keywords:

Upgrades

Target Milestone:

---

Target Release:

4.7.z

Hardware:

All

OS:

All

Clones:

1939555

Last Closed:

2021-04-05 13:56:14 UTC

Type:

Bug

Bug Depends On:

1939555

Attachments:

Description	Flags
mcodump	none

Comment 6 Sam Batschelet 2021-03-11 14:00:10 UTC

> The CSO status controller is repeatedly getting "context canceled" errors, though:

"context canceled" is a very generic error referring to a client timeout; it has nothing to do with etcd directly.

> Nearly 1/3 of the messages from within the etcd operator are this context canceled error. The only peculiar thing to me is why only the vsphere problem detector is reporting as degraded.

While I agree this chatter is distracting, it is not the root cause; etcd, the operand, is running fine.

> dns                                       4.6.18   True       False        False     6d
> machine-config                            4.6.18   True       False        False     5h3m
> network                                   4.6.18   True       False        False     8h57m

I think you want to understand this issue first: why are MCO, network, and dns failing to upgrade? Perhaps the machine-config-daemon logs can tell us.
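To make the question concrete, here is a minimal sketch that scans rows like the ones quoted above and lists the operators whose version has not yet reached the upgrade target. It assumes the usual column order of `oc get clusteroperators` (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE); the function name `lagging_operators` and the `SAMPLE` constant are hypothetical and exist only for illustration.

```python
from typing import List

# Rows copied from the quoted cluster operator status above.
SAMPLE = """\
dns                                       4.6.18   True       False        False     6d
machine-config                            4.6.18   True       False        False     5h3m
network                                   4.6.18   True       False        False     8h57m
"""


def lagging_operators(table: str, target_prefix: str) -> List[str]:
    """Return operator names whose VERSION column does not start with target_prefix.

    Assumes whitespace-separated columns in the order
    NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    """
    lagging = []
    for line in table.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and an optional header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, version = fields[0], fields[1]
        if not version.startswith(target_prefix):
            lagging.append(name)
    return lagging


if __name__ == "__main__":
    print(lagging_operators(SAMPLE, "4.7."))
```

All three quoted operators are still at 4.6.18 while reporting Progressing as False, which is the combination being questioned here.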

I am moving this to MCO because I am curious why, given that MCO is still at the old version in the middle of an upgrade, the operator is not reporting Degraded or, at a minimum, Progressing.

Comment 7 David Hernández Fernández 2021-03-11 14:27:46 UTC

Created attachment 1762672
mcodump

Comment 8 David Hernández Fernández 2021-03-11 15:19:41 UTC

I don't think it's related to MCO. CVO is just following the upgrade order, and those operators (dns, MCO, network) are simply the next ones in the list once the storage operator issue is solved. The storage operator seemed to be failing due to an etcd storage spike, and it is now blocking the rest of the upgrade.

I think we're starting to run in circles because the root cause is not clear, but in my humble opinion we should aim for a workaround from storage/CVO instead.

Let me know if you need anything.

Comment 29 W. Trevor King 2021-03-24 15:46:08 UTC

I cleared blocker+ because the issue affects all existing 4.7.z releases [1]. While updating within 4.7.z does introduce some of the triggering problem-detector interruptions, the workaround of setting storage Unmanaged should help folks who need to resolve the Degraded condition before they can update to a 4.7.z with the fix.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1939555#c3
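The comment above does not spell out the command for the workaround. As a hedged sketch, the snippet below only builds (and does not run) an `oc patch` invocation with a JSON merge patch. It assumes the storage operator configuration is the `storage.operator.openshift.io/cluster` resource and that the field to change is `spec.managementState`; neither is stated in this bug, so verify both against your cluster before using it. The helper name `unmanaged_patch_command` is hypothetical.

```python
import json
import shlex
from typing import List


def unmanaged_patch_command(
    resource: str = "storage.operator.openshift.io/cluster",  # assumed resource name
) -> List[str]:
    """Build the argv for an `oc patch` that sets managementState to Unmanaged.

    The resource name and field path are assumptions, not taken from the bug.
    """
    patch = {"spec": {"managementState": "Unmanaged"}}
    return ["oc", "patch", resource, "--type=merge", "-p", json.dumps(patch)]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(unmanaged_patch_command()))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. Setting the value back to Managed after updating to a release with the fix would follow the same pattern.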

Comment 31 Qin Ping 2021-03-25 09:22:21 UTC

Verified with: 4.7.0-0.nightly-2021-03-25-013802

Comment 35 errata-xmlrpc 2021-04-05 13:56:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1005