Bug 1907333

Summary: Node stuck in degraded state, mcp reports "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached"
Product: OpenShift Container Platform Reporter: Martin André <m.andre>
Component: Machine Config OperatorAssignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA QA Contact: Michael Nguyen <mnguyen>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.7CC: walters
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:43:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Martin André 2020-12-14 09:29:08 UTC
Description of problem:

4.7 is _very_ sensitive to storage performance, causing nodes to be stuck in degraded state and as a consequence deployments to fail.

This looks like a regression of 4.7: the 4.6 and below jobs running on the same platform do not highlight the issue. Running the same 4.7 job on a different -- faster -- platform also works fine.

Take for instance this OpenStack CI job [1] running on Mass Open Cloud. If failed with:

level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.7.0-0.ci.test-2020-12-13-211208-ci-op-tp15nkj7 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3) 

Looking at the machine config pool [2], it shows:

"Node tp15nkj7-6817d-sk654-worker-0-gdv4p is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\", Node tp15nkj7-6817d-sk654-worker-0-lgmkt is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\", Node tp15nkj7-6817d-sk654-worker-0-2hbbv is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\""

Digging a little bit deeper in the journalctl logs [3] we can see that rpm-ostree initiated txn Cleanup for client and never returned:

Dec 13 21:41:46.594767 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree[3684]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Dec 13 21:41:46.595236 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree[3684]: client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0) vanished; remaining=0

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960/artifacts/e2e-openstack/machineconfigpools.json
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960/artifacts/e2e-openstack/nodes/tp15nkj7-6817d-sk654-worker-0-2hbbv/

Version-Release number of selected component (if applicable):

How reproducible: Almost 100% for 4.7 OpenStack jobs running on Mass Open Cloud

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Colin Walters 2020-12-14 14:17:49 UTC
Something seems weird there in how the client vanished immediately after starting the txn.  Need to dig into that.

Anyways I think there are two root causes here:

First, removing the rollback should be part of our resync loop; i.e. it shouldn't be immediately fatal.

Second, there's a lot of I/O happening when the MCD first hits a node; e.g. we're pulling a lot of other container images too.  We could move the cleanup to the firstboot process instead.

Also, we should be running workers with ephemeral storage: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg
(That way, container images wouldn't be competing with the OS root for disk I/O)

Comment 6 errata-xmlrpc 2021-02-24 15:43:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.