Description of problem:
Upgrading from 4.1.0-0.nightly-2019-05-04-210601 to 4.1.0-0.nightly-2019-05-05-070156. The masters upgraded, but a worker got stuck with the following message:

Node ip-10-0-154-28.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-worker-125ec006ceb944bae2a3ae1f120bd9a7"

The upgrade proceeded and reported success, but the workers were never actually upgraded.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The worker pool was never upgraded, even though the upgrade reported success.

Expected results:
Master and worker pools are upgraded successfully.

Additional info:
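For context on what that error means, here is a rough, hypothetical sketch of the check the MCD is conceptually doing: compare every file from the rendered config against what is actually on disk and complain on any mismatch. The renderedFile type and validateOnDiskState helper below are made up for illustration, not the operator's real code.

```go
// Hypothetical sketch (not the actual MCD code) of the "unexpected on-disk
// state" check: for every file in the rendered config, compare the expected
// contents against what is currently on disk.
package main

import (
	"bytes"
	"fmt"
	"os"
)

// renderedFile is a simplified, assumed stand-in for a file entry in a
// rendered MachineConfig's Ignition section.
type renderedFile struct {
	Path     string
	Contents []byte
}

// validateOnDiskState returns an error if any file on disk does not match the
// rendered config -- roughly the condition the node is reporting.
func validateOnDiskState(renderedName string, files []renderedFile) error {
	for _, f := range files {
		onDisk, err := os.ReadFile(f.Path)
		if err != nil || !bytes.Equal(onDisk, f.Contents) {
			return fmt.Errorf("unexpected on-disk state validating against %s: %s", renderedName, f.Path)
		}
	}
	return nil
}

func main() {
	// Purely illustrative path and contents.
	files := []renderedFile{{Path: "/etc/example.conf", Contents: []byte("expected contents\n")}}
	if err := validateOnDiskState("rendered-worker-125ec006ceb944bae2a3ae1f120bd9a7", files); err != nil {
		fmt.Println(err)
	}
}
```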
I'm already investigating this kind of error in upgrade jobs; it started to happen recently. We have some PRs to help with debugging as well (https://github.com/openshift/machine-config-operator/pull/704 and https://github.com/openshift/machine-config-operator/pull/651). That osimageurl stuff is also new, though (ccing Colin as well).
This is also probably related to mounts: from the logs, we write the pending state, but on reboot it's not there anymore. The pending state lives under /etc/machine-config-daemon, which is a mount, but so is /rootfs, which we chroot into, so it might be a runtime/kubelet race as well, since it's not consistent. I've removed the additional, unneeded mounts in https://github.com/openshift/machine-config-operator/pull/704.
The issue here is that we're missing the actual pending state file written by the MCD, so on reboot the daemon thinks it's still on the previous (current) config, whereas it should now validate against the desired one:

May 04 19:30:12 ip-10-0-166-36 root[98419]: machine-config-daemon: Starting update from rendered-worker-c7c66dff5c89c13c04a4579220d7d1c6 to rendered-worker-44d8626e77bd20d83dd1cc04d3b85424
May 04 19:30:12 ip-10-0-166-36 root[98420]: machine-config-daemon: runcom about to write pending config rendered-worker-44d8626e77bd20d83dd1cc04d3b85424
May 04 19:30:12 ip-10-0-166-36 root[98421]: machine-config-daemon: runcom {"pendingConfig":"rendered-worker-44d8626e77bd20d83dd1cc04d3b85424","bootID":"fa0dad06-1c8e-4ccb-9dd5-8a05a3d865fa"}
May 04 19:30:12 ip-10-0-166-36 root[98422]: machine-config-daemon: runcom, written pending config
May 04 19:30:12 ip-10-0-166-36 root[98423]: machine-config-daemon: Update prepared; beginning drain

Then we reboot, and the MCD is now going to check the pending state to validate the on-disk state, but instead we see this:

May 04 19:32:47 ip-10-0-166-36 root[3037]: machine-config-daemon: error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory

So, something is eating that file (I'm not sure whether it's mount propagation races, runtime races, or kube races in unmounting the volumes).
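For what it's worth, here's a minimal sketch of the durability angle I keep coming back to. This is not the MCD's actual code (writePendingState and pendingState are made up for illustration); it's just the usual temp-file + fsync + rename + directory-fsync pattern, which would rule out "the write never hit the disk" as an explanation.

```go
// Minimal sketch, assuming nothing about the MCO's actual implementation or
// the fix in the PRs referenced in this bug: write the pending-config state
// file so that it is as reboot-safe as the filesystem can make it.
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

// pendingState mirrors the fields visible in the journal output above.
type pendingState struct {
	PendingConfig string `json:"pendingConfig"`
	BootID        string `json:"bootID"`
}

func writePendingState(path string, st pendingState) error {
	data, err := json.Marshal(st)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // flush file contents to stable storage
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil { // atomically replace state.json
		return err
	}
	dir, err := os.Open(filepath.Dir(path)) // fsync the directory so the rename is durable
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}

func main() {
	st := pendingState{
		PendingConfig: "rendered-worker-44d8626e77bd20d83dd1cc04d3b85424",
		BootID:        "fa0dad06-1c8e-4ccb-9dd5-8a05a3d865fa",
	}
	if err := writePendingState("/etc/machine-config-daemon/state.json", st); err != nil {
		log.Fatal(err)
	}
}
```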
> This is also probably related to mounts: from the logs, we write the pending
> state, but on reboot it's not there anymore. The pending state lives under
> /etc/machine-config-daemon, which is a mount, but so is /rootfs, which we
> chroot into, so it might be a runtime/kubelet race as well, since it's not
> consistent. I've removed the additional, unneeded mounts in
> https://github.com/openshift/machine-config-operator/pull/704.

Related to this, I did https://github.com/openshift/machine-config-operator/pull/708

Note that if my original hypothesis that chroot races are involved here is somehow right, then removing the mounts should make the race a lot more obvious.
*** Bug 1706632 has been marked as a duplicate of this bug. ***
High-level status: we're in a cycle of looking at the code and adding more debug logs, but we don't have a root cause yet. One thing that would help us a *lot* would be https://coreos.slack.com/archives/CBN38N3MW/p1557174614259400
One random theory I had is that this could be related to `ostree-finalize-staged.service` failing; see e.g. https://github.com/ostreedev/ostree/pull/1840

That would very much explain the symptoms here, as we'd be rebooting into the *previous* OS image. That said, I'm logged into a cluster and `journalctl -u ostree-finalize-staged` looks fine to me.
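To make that easier to check, here's a rough sketch of a helper (my own, not anything the MCD ships) that parses `rpm-ostree status --json` and prints the checksum of the deployment the node actually booted into. The JSON field names (deployments, booted, checksum) are what I recall from rpm-ostree's output, so double-check them.

```go
// Rough sketch of verifying which ostree deployment a node actually booted
// into, by shelling out to `rpm-ostree status --json`. The JSON field names
// used here are assumptions to be verified against real output.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

type rpmOstreeStatus struct {
	Deployments []struct {
		Checksum string `json:"checksum"`
		Booted   bool   `json:"booted"`
	} `json:"deployments"`
}

// bootedChecksum returns the ostree commit checksum of the booted deployment.
func bootedChecksum() (string, error) {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return "", err
	}
	var st rpmOstreeStatus
	if err := json.Unmarshal(out, &st); err != nil {
		return "", err
	}
	for _, d := range st.Deployments {
		if d.Booted {
			return d.Checksum, nil
		}
	}
	return "", fmt.Errorf("no booted deployment found")
}

func main() {
	sum, err := bootedChecksum()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("booted ostree commit:", sum)
}
```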
(In reply to Colin Walters from comment #11)
> One random theory I had is that this could be related to
> `ostree-finalize-staged.service` failing; see e.g.
> https://github.com/ostreedev/ostree/pull/1840
>
> That would very much explain the symptoms here as we'd be rebooting into the
> *previous* OS image. That said I'm logged into a cluster and `journalctl -u
> ostree-finalize-staged` looks fine to me.

I've always checked that we were indeed on the right ostree commit; can you point to a reboot where it landed in the *previous* ostree?
*** Bug 1707502 has been marked as a duplicate of this bug. ***
We've merged a PR to fix the missing pending state on disk: https://github.com/openshift/machine-config-operator/pull/711 - we're actively monitoring the e2e upgrade jobs.

I'm moving this to POST as we have one other PR still getting in: https://github.com/openshift/machine-config-operator/pull/719
The last relevant PR has now been merged: https://github.com/openshift/machine-config-operator/pull/719

Moving to MODIFIED.

NOTE: Annoyingly, the original error has actually been absent from CI test output for close to 24 hours; that is, the error condition stopped before this fix went in. Regardless, we believe this fix protects the MCO from the underlying issue, which we still do not fully understand. We'll keep an eye on it.
e2e on the last two nightly builds today has failed with:

level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-ssm1lmkd-0e31a.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-0.nightly-2019-05-08-133845 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-767f9a16f526388f8923fac41e523e28 expected 4.1.0-201905071832-dirty has 4.1.0-201905080833-dirty, retrying: timed out waiting for the condition"

Not sure if this is related to these PRs. I'll try a full install and file a new bz if needed.
It's not related; the above seems to be caused by a very recent GitHub hiccup. Thread going on here: https://coreos.slack.com/archives/CBN38N3MW/p1557266588303900
> ... controller version mismatch for rendered...

This got sorted in bug 1707928.
Verified by looking at https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=e2e-aws-upgrade&search=unexpected%20on-disk%20state which shows 0 occurrences (0% of all failures) of "unexpected on-disk state" in e2e-aws-upgrade from 2019-05-08T21:03:11Z through 2019-05-09T20:43:50Z.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758