Looking at some upgrade jobs, we've come across something very odd: the MCD is being interrupted in the middle of a sync, which causes a restart and then a permanent failure (unless an admin fixes it on the node itself).
The MCD syncs like this:
1) The MCC generates a new config and sets it as desiredConfig on a node
2) The MCD notices that and starts rolling new files onto the node, then writes a "pending" config file (which is just desiredConfig)
3) Drain + reboot
4) When the MCD comes back online, it looks for the pending config and sets that as the currentConfig on the node, completing the sync
We've only seen this happen on master nodes, and only in upgrade jobs.
What we've noticed in the masters' logs is that the MCD comes online, but it's then killed and recreated.
If it's killed before we write the pending config to disk, then when it comes back online it cannot reconcile itself.
This is most probably a model we need to change in the MCD, but I wanted to understand:
1) Is it normal for the MCD to be killed like that? My answer is no.
2) Who is killing the MCD, and why, when it's already up and running?
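To make the failure window concrete, here is a minimal sketch of the sync steps above. It is a toy model, not MCD code: the `Node` class, the `files_on_disk` field, the function names and the returned status strings are all made up for illustration, and it assumes that on startup the daemon can only compare what is on disk with currentConfig.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for the state the MCD sees on one node."""
    current_config: str
    desired_config: str
    pending_on_disk: Optional[str] = None  # the "pending" config file
    files_on_disk: str = ""                # which config the node's files match

    def __post_init__(self):
        # Before any update, the files match currentConfig.
        if not self.files_on_disk:
            self.files_on_disk = self.current_config


def apply_update(node: Node, killed_before_pending: bool = False) -> None:
    """Step 2: roll new files, then write the pending config."""
    node.files_on_disk = node.desired_config
    if killed_before_pending:
        return  # daemon killed here: files changed, nothing recorded
    node.pending_on_disk = node.desired_config
    # Step 3 (drain + reboot) would follow here.


def startup(node: Node) -> str:
    """Step 4: what the daemon can conclude when it comes back online."""
    if node.pending_on_disk is not None:
        node.current_config = node.pending_on_disk
        node.pending_on_disk = None
        return "synced"
    if node.files_on_disk != node.current_config:
        # Assumption: files no longer match currentConfig and there is
        # no pending file explaining why, so the daemon is stuck.
        return "cannot-reconcile"
    return "nothing-pending"


ok = Node("rendered-a", "rendered-b")
apply_update(ok)
print(startup(ok), ok.current_config)

bad = Node("rendered-a", "rendered-b")
apply_update(bad, killed_before_pending=True)
print(startup(bad), bad.current_config)
```

In this model the second node ends up with new files on disk but still reports the old currentConfig, which is the state that needs manual fixing.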
Some examples of the failures are here:
Logs can probably tell you more
I'm cc'ing Mrunal as well for the Container Runtime part.
I see the probes failing in this log - https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89
> I see the probes failing in this log
(In reply to Colin Walters from comment #2)
> > I see the probes failing in this log
> Which probes?
Lots of probes?
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89/build-log.txt | sed -n 's|.*ns/\([a-z-]*\) pod/\([a-z0-9-]*\) \([A-Za-z]*\) probe \([a-z]*\):.*|\3\t\4\t\1/\2|p' | sort | uniq -c | sort -n
1 Liveness errored openshift-sdn/ovs-79jml
1 Liveness errored openshift-sdn/ovs-bd22c
1 Liveness errored openshift-sdn/ovs-vzq8s
1 Liveness failed openshift-console/console-5bb6bf7db4-6bwl7
1 Liveness failed openshift-operator-lifecycle-manager/catalog-operator-6478bf6988-f4d5l
1 Readiness errored kube-system/etcd-quorum-guard-69b7b4499b-6pqrv
1 Readiness failed openshift-marketplace/certified-operators-747d97b84-mvcds
2 Liveness errored openshift-marketplace/community-operators-6dd8c5c5f4-g7p7q
3 Liveness failed openshift-apiserver/apiserver-qxc4z
3 Liveness failed openshift-console/console-5bb6bf7db4-kjfvp
3 Liveness failed openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-6ntft
3 Readiness failed openshift-marketplace/community-operators-6dd8c5c5f4-g7p7q
3 Readiness failed openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-4gkkg
3 Readiness failed openshift-operator-lifecycle-manager/packageserver-79c89fc4bd-257l7
4 Liveness failed openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-4gkkg
4 Liveness failed openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-hkt49
4 Liveness failed openshift-operator-lifecycle-manager/packageserver-79c89fc4bd-257l7
4 Readiness failed openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-6ntft
5 Liveness failed openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-j6k9t
5 Readiness failed openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-j6k9t
5 Readiness failed openshift-sdn/sdn-579mq
6 Liveness failed openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-xgbgl
8 Readiness failed openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-hkt49
11 Readiness failed openshift-apiserver/apiserver-qxc4z
13 Liveness failed openshift-dns/dns-default-fjn84
17 Readiness failed openshift-console/console-5bb6bf7db4-kjfvp
18 Readiness failed kube-system/etcd-quorum-guard-69b7b4499b-6pqrv
28 Readiness failed openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-xgbgl
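For anyone who would rather not decode the sed expression, the same counting can be sketched in Python. This is a rough equivalent, not a drop-in replacement: the regex is copied from the one-liner above, but the sample lines are invented purely to exercise it (their shape is inferred from the pattern, not taken from the build log), and the function name is hypothetical.

```python
import re
from collections import Counter

# Same pattern as the sed expression: namespace, pod, probe kind, result.
PROBE_RE = re.compile(
    r".*ns/([a-z-]*) pod/([a-z0-9-]*) ([A-Za-z]*) probe ([a-z]*):"
)


def count_probe_events(lines):
    """Count (kind, result, ns/pod) triples, like `sort | uniq -c | sort -n`."""
    counts = Counter()
    for line in lines:
        m = PROBE_RE.match(line)
        if m:  # lines without a probe event are dropped, like `sed -n ... p`
            ns, pod, kind, result = m.groups()
            counts[(kind, result, f"{ns}/{pod}")] += 1
    # Ascending by count, as with the trailing `sort -n`.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


# Made-up lines, only to show the expected input shape.
sample = [
    "ns/example-ns pod/example-pod-1 Liveness probe failed: made-up message",
    "ns/example-ns pod/example-pod-1 Liveness probe failed: made-up message",
    "ns/example-ns pod/example-pod-2 Readiness probe errored: made-up message",
    "a line that does not mention a probe",
]

for (kind, result, pod), n in count_probe_events(sample):
    print(n, kind, result, pod)
```

To run it against a real job, feed it the lines of the downloaded build-log.txt instead of `sample`.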
This is not happening anymore. The most recent 1162 -e2e- jobs no longer show this, probably thanks to the systemd fix which went in as well.
*** Bug 1702390 has been marked as a duplicate of this bug. ***
I checked some of the recent failures in the CI jobs that were referenced in comment #1. I don't see any evidence of the same kind of failures anymore (thanks to Trevor for the handy oneliner!). Moving to VERIFIED.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.