Bug 1907333 - Node stuck in degraded state, mcp reports "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached"
Summary: Node stuck in degraded state, mcp reports "Failed to remove rollback: error r...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.7.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
Depends On:
TreeView+ depends on / blocked
Reported: 2020-12-14 09:29 UTC by Martin André
Modified: 2021-02-24 15:43 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-02-24 15:43:16 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2302 0 None closed Bug 1907333: daemon: Revert code to remove rollback 2021-01-04 09:38:51 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:43:41 UTC

Description Martin André 2020-12-14 09:29:08 UTC
Description of problem:

4.7 is _very_ sensitive to storage performance, causing nodes to be stuck in degraded state and as a consequence deployments to fail.

This looks like a regression of 4.7: the 4.6 and below jobs running on the same platform do not highlight the issue. Running the same 4.7 job on a different -- faster -- platform also works fine.

Take for instance this OpenStack CI job [1] running on Mass Open Cloud. If failed with:

level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.7.0-0.ci.test-2020-12-13-211208-ci-op-tp15nkj7 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3) 

Looking at the machine config pool [2], it shows:

"Node tp15nkj7-6817d-sk654-worker-0-gdv4p is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\", Node tp15nkj7-6817d-sk654-worker-0-lgmkt is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\", Node tp15nkj7-6817d-sk654-worker-0-2hbbv is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\""

Digging a little bit deeper in the journalctl logs [3] we can see that rpm-ostree initiated txn Cleanup for client and never returned:

Dec 13 21:41:46.594767 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree[3684]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Dec 13 21:41:46.595236 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree[3684]: client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0) vanished; remaining=0

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960/artifacts/e2e-openstack/machineconfigpools.json
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960/artifacts/e2e-openstack/nodes/tp15nkj7-6817d-sk654-worker-0-2hbbv/

Version-Release number of selected component (if applicable):

How reproducible: Almost 100% for 4.7 OpenStack jobs running on Mass Open Cloud

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Colin Walters 2020-12-14 14:17:49 UTC
Something seems weird there in how the client vanished immediately after starting the txn.  Need to dig into that.

Anyways I think there are two root causes here:

First, removing the rollback should be part of our resync loop; i.e. it shouldn't be immediately fatal.

Second, there's a lot of I/O happening when the MCD first hits a node; e.g. we're pulling a lot of other container images too.  We could move the cleanup to the firstboot process instead.

Also, we should be running workers with ephemeral storage: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg
(That way, container images wouldn't be competing with the OS root for disk I/O)

Comment 6 errata-xmlrpc 2021-02-24 15:43:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.