Bug 1706606

Summary: Unexpected on disk state error updating workers
Product: OpenShift Container Platform
Reporter: Derek Carr <decarr>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA
QA Contact: Micah Abbott <miabbott>
Severity: high
Docs Contact:
Priority: high
Version: 4.1.0
CC: bpeterse, ccoleman, cwalters, eparis, imcleod, mifiedle, nstielau, pruan, walters, wking
Target Milestone: ---
Keywords: BetaBlocker
Target Release: 4.1.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1707928
Bug Blocks: 1707212

Description Derek Carr 2019-05-05 16:06:07 UTC
Description of problem:
Upgrading from 4.1.0-0.nightly-2019-05-04-210601 to 4.1.0-0.nightly-2019-05-05-070156.

The masters upgraded, but a worker got stuck with the following message:
Node ip-10-0-154-28.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-worker-125ec006ceb944bae2a3ae1f120bd9a7"

The upgrade proceeded and reported success, but the workers were never actually upgraded.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
master and worker pools are upgraded successfully.

Additional info:

Comment 4 Antonio Murdaca 2019-05-05 18:48:45 UTC
I'm already investigating this kind of error in upgrade jobs. It started happening recently. We have some PRs to help with debugging as well (https://github.com/openshift/machine-config-operator/pull/704 and https://github.com/openshift/machine-config-operator/pull/651)

That osimageurl stuff is also new, though (CCing Colin as well).

Comment 5 Antonio Murdaca 2019-05-05 18:51:36 UTC
This is also probably related to mounts: from the logs, we write the pending state, but on reboot it's not there anymore (the pending state lives under /etc/machine-config-daemon, which is a mount, but so is /rootfs, which we chroot into, so it might be a runtime/kubelet race as well, since it's not consistent). I've removed the additional, unneeded mounts in https://github.com/openshift/machine-config-operator/pull/704 now.
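
For context, the chroot pattern mentioned above looks roughly like this (a minimal sketch, assuming the daemon enters the host through the /rootfs hostPath mount as described; this is not the actual MCD code):

package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	// /rootfs is the host root filesystem bind-mounted into the daemon pod.
	// After this chroot, a path like /etc/machine-config-daemon resolves
	// through the /rootfs mount rather than through any separate volume
	// mount of the same directory, which is why the extra mounts were
	// redundant (and a possible source of the inconsistency described above).
	if err := syscall.Chroot("/rootfs"); err != nil {
		log.Fatalf("chroot into /rootfs: %v", err)
	}
	if err := os.Chdir("/"); err != nil {
		log.Fatalf("chdir after chroot: %v", err)
	}
	log.Print("now operating on the host filesystem")
}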

Comment 6 Antonio Murdaca 2019-05-05 18:58:38 UTC
The issue here is that we're missing the actual pending state file written by the MCD, so on reboot it thinks it's still on the previous (current) config, when it should now validate against the desired one:


May 04 19:30:12 ip-10-0-166-36 root[98419]: machine-config-daemon: Starting update from rendered-worker-c7c66dff5c89c13c04a4579220d7d1c6 to rendered-worker-44d8626e77bd20d83dd1cc04d3b85424
May 04 19:30:12 ip-10-0-166-36 root[98420]: machine-config-daemon: runcom about to write pending config rendered-worker-44d8626e77bd20d83dd1cc04d3b85424
May 04 19:30:12 ip-10-0-166-36 root[98421]: machine-config-daemon: runcom {"pendingConfig":"rendered-worker-44d8626e77bd20d83dd1cc04d3b85424","bootID":"fa0dad06-1c8e-4ccb-9dd5-8a05a3d865fa"}
May 04 19:30:12 ip-10-0-166-36 root[98422]: machine-config-daemon: runcom, written pending config
May 04 19:30:12 ip-10-0-166-36 root[98423]: machine-config-daemon: Update prepared; beginning drain

Then we reboot, and the MCD is supposed to read the pending state in order to validate the on-disk state, but instead we see this:

May 04 19:32:47 ip-10-0-166-36 root[3037]: machine-config-daemon: error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory

So something is eating that file (I'm not sure whether it's mount propagation, runtime races, or kube races when unmounting the volumes).
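
To make the sequence above concrete, here is a minimal sketch of the write/read handshake as it appears from the logs (the state file path and JSON field names are taken from the journal output above; the helper names and error strings are illustrative, not the actual MCD code):

package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"strings"
)

const pendingStatePath = "/etc/machine-config-daemon/state.json"

// pendingState mirrors the JSON seen in the journal above:
// {"pendingConfig":"rendered-worker-...","bootID":"..."}
type pendingState struct {
	PendingConfig string `json:"pendingConfig"`
	BootID        string `json:"bootID"`
}

// writePendingState records the config we are about to reboot into, plus the
// current boot ID, so that after the reboot the daemon knows it must validate
// the on-disk state against the desired (pending) config.
func writePendingState(desiredConfig string) error {
	bootID, err := ioutil.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return err
	}
	s := pendingState{
		PendingConfig: desiredConfig,
		BootID:        strings.TrimSpace(string(bootID)),
	}
	b, err := json.Marshal(&s)
	if err != nil {
		return err
	}
	return ioutil.WriteFile(pendingStatePath, b, 0644)
}

// loadPendingState runs after the reboot. If the file has disappeared (the
// failure above), the daemon has no record of the pending config and falls
// back to validating against the previous one, producing the
// "unexpected on-disk state" error.
func loadPendingState() (*pendingState, error) {
	b, err := ioutil.ReadFile(pendingStatePath)
	if err != nil {
		return nil, fmt.Errorf("error loading pending config: %v", err)
	}
	var s pendingState
	if err := json.Unmarshal(b, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	if err := writePendingState("rendered-worker-44d8626e77bd20d83dd1cc04d3b85424"); err != nil {
		log.Fatal(err)
	}
	s, err := loadPendingState()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("pending config %s written during boot %s\n", s.PendingConfig, s.BootID)
}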

Comment 8 Colin Walters 2019-05-06 17:15:41 UTC
> This is also probably something related to mounts as from logs, we write the pending state but on reboot, it's not there anymore (the pending state lives under /etc/machine-config-daemon which is a mount, but so is /rootfs where we chroot into it so it might be a Runtime/Kubelet race as well since it's not consistent). I've removed the additional, not needed mounts in https://github.com/openshift/machine-config-operator/pull/704 now.

Related to this I did https://github.com/openshift/machine-config-operator/pull/708

Note that if my original hypothesis that chroot races are involved here somehow turns out to be right, then removing the mounts would make the race a lot more obvious.

Comment 9 Colin Walters 2019-05-06 20:31:30 UTC
*** Bug 1706632 has been marked as a duplicate of this bug. ***

Comment 10 Colin Walters 2019-05-06 20:34:39 UTC
High-level status: we're in a cycle of looking at the code and adding more debug logging, but we don't have a root cause yet.

One thing that would help us a *lot* would be https://coreos.slack.com/archives/CBN38N3MW/p1557174614259400

Comment 11 Colin Walters 2019-05-07 15:49:54 UTC
One random theory I had is that this could be related to `ostree-finalize-staged.service` failing; see e.g. https://github.com/ostreedev/ostree/pull/1840

That would very much explain the symptoms here as we'd be rebooting into the *previous* OS image.  That said I'm logged into a cluster and `journalctl -u ostree-finalize-staged` looks fine to me.

Comment 12 Antonio Murdaca 2019-05-07 15:54:24 UTC
(In reply to Colin Walters from comment #11)
> One random theory I had is that this could be related to
> `ostree-finalize-staged.service` failing; see e.g.
> https://github.com/ostreedev/ostree/pull/1840
> 
> That would very much explain the symptoms here as we'd be rebooting into the
> *previous* OS image.  That said I'm logged into a cluster and `journalctl -u
> ostree-finalize-staged` looks fine to me.

I've always checked that we were indeed on the right ostree commit; can you see a reboot where it landed on the *previous* ostree?
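
Roughly, that check looks like this (a minimal sketch that shells out to `rpm-ostree status --json` and compares the booted deployment's checksum; the expected value is a placeholder and this is not the MCD's actual comparison code):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

// rpmOstreeStatus models the parts of `rpm-ostree status --json` we need.
type rpmOstreeStatus struct {
	Deployments []struct {
		Booted   bool   `json:"booted"`
		Checksum string `json:"checksum"`
		Version  string `json:"version"`
	} `json:"deployments"`
}

// bootedChecksum returns the ostree commit of the deployment we actually booted into.
func bootedChecksum() (string, error) {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return "", err
	}
	var st rpmOstreeStatus
	if err := json.Unmarshal(out, &st); err != nil {
		return "", err
	}
	for _, d := range st.Deployments {
		if d.Booted {
			return d.Checksum, nil
		}
	}
	return "", fmt.Errorf("no booted deployment found")
}

func main() {
	// Placeholder for the commit the desired config (its osImageURL) points at.
	expectedChecksum := "<ostree commit expected by the desired config>"
	got, err := bootedChecksum()
	if err != nil {
		log.Fatal(err)
	}
	if got != expectedChecksum {
		log.Fatalf("booted into %s, expected %s (previous OS image?)", got, expectedChecksum)
	}
	fmt.Println("booted into the expected ostree commit:", got)
}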

Comment 13 Antonio Murdaca 2019-05-07 16:14:53 UTC
*** Bug 1707502 has been marked as a duplicate of this bug. ***

Comment 14 Antonio Murdaca 2019-05-08 14:12:43 UTC
We've merged a PR to fix the missing pending state on disk (https://github.com/openshift/machine-config-operator/pull/711) and we're actively monitoring the e2e upgrade jobs.
I'm moving this to POST as we have one other PR still getting in: https://github.com/openshift/machine-config-operator/pull/719

Comment 15 Ian McLeod 2019-05-08 15:20:34 UTC
The last relevant PR has now been merged: https://github.com/openshift/machine-config-operator/pull/719

Moving to MODIFIED

NOTE: Annoyingly, the original error has actually been absent from CI test output for something close to 24 hours.
That is, the error condition stopped before this fix went in.

Regardless, we believe this fix protects the MCO from the underlying issue, which we still do not fully understand.  We'll keep an eye on it.

Comment 17 Mike Fiedler 2019-05-08 16:57:56 UTC
e2e on the last two nightly builds today have failed with:

level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-ssm1lmkd-0e31a.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-0.nightly-2019-05-08-133845 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-767f9a16f526388f8923fac41e523e28 expected 4.1.0-201905071832-dirty has 4.1.0-201905080833-dirty, retrying: timed out waiting for the condition"

Not sure if this is related to these PRs. I'll try a full install and file a new bz if needed.

Comment 18 Antonio Murdaca 2019-05-08 17:02:24 UTC
It's not related; the above seems to be caused by a very recent GitHub hiccup. Thread going on here: https://coreos.slack.com/archives/CBN38N3MW/p1557266588303900

Comment 19 W. Trevor King 2019-05-09 20:26:36 UTC
> ... controller version mismatch for rendered...

This got sorted in bug 1707928.

Comment 20 Peter Ruan 2019-05-09 21:09:01 UTC
Verified by looking at https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=e2e-aws-upgrade&search=unexpected%20on-disk%20state, which shows the "unexpected on-disk state" failure at 0 occurrences (0% of all failures) from 2019-05-08T21:03:11Z through 2019-05-09T20:43:50Z.

Comment 22 errata-xmlrpc 2019-06-04 10:48:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758