Description of problem:
4.7 is _very_ sensitive to storage performance: on slow storage, nodes get stuck in a degraded state and, as a consequence, deployments fail.
This looks like a regression in 4.7: jobs for 4.6 and below running on the same platform do not show the issue. Running the same 4.7 job on a different -- faster -- platform also works fine.
Take for instance this OpenStack CI job running on Mass Open Cloud. It failed with:
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.7.0-0.ci.test-2020-12-13-211208-ci-op-tp15nkj7 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
Looking at the machine config pool, it shows:
"Node tp15nkj7-6817d-sk654-worker-0-gdv4p is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\", Node tp15nkj7-6817d-sk654-worker-0-lgmkt is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\", Node tp15nkj7-6817d-sk654-worker-0-2hbbv is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\""
Digging a little deeper into the journalctl logs, we can see that rpm-ostree initiated a Cleanup txn for the client and never returned:
Dec 13 21:41:46.594767 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Dec 13 21:41:46.595236 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree: client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0) vanished; remaining=0
Version-Release number of selected component (if applicable):
How reproducible: Almost 100% for 4.7 OpenStack jobs running on Mass Open Cloud
Steps to Reproduce:
Something seems weird there in how the client vanished immediately after starting the txn. Need to dig into that.
Anyways I think there are two root causes here:
First, removing the rollback should be part of our resync loop; i.e. it shouldn't be immediately fatal.
Second, there's a lot of I/O happening when the MCD first hits a node; e.g. we're pulling a lot of other container images too. We could move the cleanup to the firstboot process instead.
Also, we should be running workers with ephemeral storage: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg
(That way, container images wouldn't be competing with the OS root for disk I/O)
I am no longer seeing issues in https://search.ci.openshift.org/?search=error+pool+worker+is+not+ready%2C+retrying&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job. Closing as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.