Bug 1907333
Summary: | Node stuck in degraded state, mcp reports "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached" | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Martin André <m.andre> |
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | | |
Version: | 4.7 | CC: | walters |
Target Milestone: | --- | | |
Target Release: | 4.7.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2021-02-24 15:43:16 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Martin André
2020-12-14 09:29:08 UTC
Something seems weird there in how the client vanished immediately after starting the txn. Need to dig into that.

Anyways I think there are two root causes here:

First, removing the rollback should be part of our resync loop; i.e. it shouldn't be immediately fatal.

Second, there's a lot of I/O happening when the MCD first hits a node; e.g. we're pulling a lot of other container images too. We could move the cleanup to the firstboot process instead.

Also, we should be running workers with ephemeral storage: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg
(That way, container images wouldn't be competing with the OS root for disk I/O)

I am no longer seeing issues in https://search.ci.openshift.org/?search=error+pool+worker+is+not+ready%2C+retrying&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job. Closing as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633