Description of problem:

4.7 is _very_ sensitive to storage performance, causing nodes to be stuck in a degraded state and, as a consequence, deployments to fail. This looks like a regression in 4.7: the 4.6 and earlier jobs running on the same platform do not hit the issue, and running the same 4.7 job on a different, faster platform also works fine.

Take for instance this OpenStack CI job [1] running on Mass Open Cloud. It failed with:

level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.7.0-0.ci.test-2020-12-13-211208-ci-op-tp15nkj7 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)

Looking at the machine config pool [2], it shows:

"Node tp15nkj7-6817d-sk654-worker-0-gdv4p is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\",
Node tp15nkj7-6817d-sk654-worker-0-lgmkt is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\",
Node tp15nkj7-6817d-sk654-worker-0-2hbbv is reporting: \"Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\\n: exit status 1\""

Digging a little deeper into the journalctl logs [3], we can see that rpm-ostree initiated the Cleanup txn for the client and never returned:

Dec 13 21:41:46.594767 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree[3684]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Dec 13 21:41:46.595236 tp15nkj7-6817d-sk654-worker-0-2hbbv rpm-ostree[3684]: client(id:machine-config-operator dbus:1.174 unit:crio-ad6e06d53758526e0ff1677a1e450c5960631c93817130953328b1f6c25d4c25.scope uid:0) vanished; remaining=0

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960/artifacts/e2e-openstack/machineconfigpools.json
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-openstack/141/pull-ci-openshift-cluster-api-provider-openstack-master-e2e-openstack/1338230204597800960/artifacts/e2e-openstack/nodes/tp15nkj7-6817d-sk654-worker-0-2hbbv/

Version-Release number of selected component (if applicable):

How reproducible:
Almost 100% for 4.7 OpenStack jobs running on Mass Open Cloud
https://search.ci.openshift.org/?search=error+pool+worker+is+not+ready%2C+retrying&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
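For anyone trying to confirm the same symptom on a live cluster, a minimal sketch of how the stuck transaction could be inspected from the outside (assuming working `oc` access to the cluster; the node name is just the one from this job and should be substituted):

  # Check whether the worker pool is degraded with the rollback-cleanup error
  oc describe mcp worker

  # Look at rpm-ostree on the affected node; a hung txn typically shows up as
  # "State: busy" with a pending Cleanup transaction
  oc debug node/tp15nkj7-6817d-sk654-worker-0-2hbbv -- chroot /host rpm-ostree status

  # Pull the rpm-ostreed journal to see the "Initiated txn Cleanup ... vanished" pattern
  oc debug node/tp15nkj7-6817d-sk654-worker-0-2hbbv -- chroot /host journalctl -u rpm-ostreed --no-pager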
Something seems weird there in how the client vanished immediately after starting the txn; need to dig into that.

Anyway, I think there are two root causes here:

First, removing the rollback should be part of our resync loop; i.e. it shouldn't be immediately fatal.

Second, there's a lot of I/O happening when the MCD first hits a node; e.g. we're pulling a lot of other container images too. We could move the cleanup to the firstboot process instead.

Also, we should be running workers with ephemeral storage: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg (That way, container images wouldn't be competing with the OS root for disk I/O.)
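Not the MCO fix itself, but a hypothetical workaround sketch along the "retry instead of fail fatally" line above, assuming the cleanup eventually succeeds once the initial image-pull I/O storm dies down (node name is only an example):

  # Keep retrying the rollback cleanup on an affected node until rpm-ostreed
  # stops timing out, instead of giving up after the first timeout
  oc debug node/tp15nkj7-6817d-sk654-worker-0-2hbbv -- chroot /host \
    sh -c 'until rpm-ostree cleanup -r; do echo "cleanup timed out, retrying in 30s"; sleep 30; done'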
I am no longer seeing issues in https://search.ci.openshift.org/?search=error+pool+worker+is+not+ready%2C+retrying&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job. Closing as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633