Bug 1703699

Summary: MCD is being killed and recreated causing a failed sync
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: unspecified
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Antonio Murdaca <amurdaca>
Assignee: Antonio Murdaca <amurdaca>
QA Contact: Micah Abbott <miabbott>
CC: ccoleman, deads, mpatel, walters, wking
Keywords: Upgrades
Type: Bug
Last Closed: 2019-06-04 10:48:05 UTC
Bug Blocks: 1703879

Description Antonio Murdaca 2019-04-27 17:12:23 UTC
Looking at some upgrade jobs, we've come across something very strange that is causing the MCD to be interrupted in the middle of a sync, leading to a restart and a permanently failed sync (unless an admin fixes it on the node itself).

The MCD syncs like this (a rough sketch of the handshake follows the list):

1) The MCC generates a new config and sets it as the desiredConfig on the node.
2) The MCD notices that and starts writing the new files to the node, then writes a "pending" config file (which is just the desiredConfig).
3) Drain + reboot.
4) When it comes back online, it looks for the pending config and sets that as the currentConfig on the node, completing the sync.
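
To make the failure window concrete, here is a minimal shell sketch of that handshake. The file path, function names, and the node-annotation update are illustrative assumptions rather than the actual MCD implementation; the point is only that the "pending" marker must reach disk before the drain/reboot, otherwise the post-reboot daemon has nothing to finish.

  # Hypothetical sketch of the pending-config handshake (not the real MCD code).
  PENDING=/etc/machine-config-daemon/state.json   # illustrative location

  write_pending_config() {
      # Step 2: persist the desiredConfig name *before* the drain + reboot so
      # the post-reboot daemon knows which config it was applying.
      printf '{"pendingConfig":"%s"}\n' "$1" > "$PENDING"
  }

  complete_pending_config() {
      # Step 4: if a pending config exists, promote it to currentConfig and
      # clear the marker. If the daemon was killed before write_pending_config
      # ran, this file is missing and the sync can never be completed here.
      [ -f "$PENDING" ] || return 0
      desired=$(sed -n 's/.*"pendingConfig":"\([^"]*\)".*/\1/p' "$PENDING")
      oc patch node "$(hostname)" --type merge \
          -p "{\"metadata\":{\"annotations\":{\"machineconfiguration.openshift.io/currentConfig\":\"$desired\"}}}"
      rm -f "$PENDING"
  }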


We've only seen this happening on master nodes, and only in upgrade jobs.

What we've noticed in the master logs is that the MCD comes online, but is then killed and recreated.

If it's killed before we write the pending config to disk, then when it comes back online it cannot reconcile itself.

This is most probably a model we need to change in the MCD, but I wanted to understand:

1) Is it normal for the MCD to be killed like that? My answer to that is no.
2) Who is killing the MCD, and why, when it's already up and running? (A few ways to dig into this are sketched below.)
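
A few generic starting points for finding out who terminated a pod; the pod and node names below are placeholders, this assumes cluster-admin access with oc, and it's a triage sketch rather than anything specific to this bug:

  # Recent events in the MCO namespace (probe failures, kills, evictions):
  $ oc -n openshift-machine-config-operator get events --sort-by=.lastTimestamp

  # "Last State" of the daemon container shows its previous exit code and reason:
  $ oc -n openshift-machine-config-operator describe pod machine-config-daemon-xxxxx

  # The kubelet's view from the node itself (liveness-probe kills, pod deletions):
  $ oc debug node/<master-node> -- chroot /host journalctl -u kubelet -b | grep -iE 'killing|probe'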

Some examples of the failures are here:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/336/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-upgrade/36
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/156/pull-ci-openshift-cluster-network-operator-master-e2e-aws-upgrade/25

The logs can probably tell you more.

I'm cc'ing Mrunal as well for the Container Runtime part.

Comment 2 Colin Walters 2019-04-28 21:20:15 UTC
> I see the probes failing in this log

Which probes?

Comment 3 W. Trevor King 2019-04-28 22:18:44 UTC
(In reply to Colin Walters from comment #2)
> > I see the probes failing in this log
> 
> Which probes?

Lots of probes?

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89/build-log.txt | sed -n 's|.*ns/\([a-z-]*\) pod/\([a-z0-9-]*\) \([A-Za-z]*\) probe \([a-z]*\):.*|\3\t\4\t\1/\2|p' | sort | uniq -c | sort -n
      1 Liveness	errored	openshift-sdn/ovs-79jml
      1 Liveness	errored	openshift-sdn/ovs-bd22c
      1 Liveness	errored	openshift-sdn/ovs-vzq8s
      1 Liveness	failed	openshift-console/console-5bb6bf7db4-6bwl7
      1 Liveness	failed	openshift-operator-lifecycle-manager/catalog-operator-6478bf6988-f4d5l
      1 Readiness	errored	kube-system/etcd-quorum-guard-69b7b4499b-6pqrv
      1 Readiness	failed	openshift-marketplace/certified-operators-747d97b84-mvcds
      2 Liveness	errored	openshift-marketplace/community-operators-6dd8c5c5f4-g7p7q
      3 Liveness	failed	openshift-apiserver/apiserver-qxc4z
      3 Liveness	failed	openshift-console/console-5bb6bf7db4-kjfvp
      3 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-6ntft
      3 Readiness	failed	openshift-marketplace/community-operators-6dd8c5c5f4-g7p7q
      3 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-4gkkg
      3 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-79c89fc4bd-257l7
      4 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-4gkkg
      4 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-hkt49
      4 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-79c89fc4bd-257l7
      4 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-6ntft
      5 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-j6k9t
      5 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-j6k9t
      5 Readiness	failed	openshift-sdn/sdn-579mq
      6 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-xgbgl
      8 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-hkt49
     11 Readiness	failed	openshift-apiserver/apiserver-qxc4z
     13 Liveness	failed	openshift-dns/dns-default-fjn84
     17 Readiness	failed	openshift-console/console-5bb6bf7db4-kjfvp
     18 Readiness	failed	kube-system/etcd-quorum-guard-69b7b4499b-6pqrv
     28 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-xgbgl

Comment 4 Antonio Murdaca 2019-05-02 17:55:09 UTC
This is not happening anymore. The last 1162 -e2e- jobs aren't throwing this anymore, and it's probably related to the systemd fix that went in as well.

Comment 5 Antonio Murdaca 2019-05-02 17:57:42 UTC
*** Bug 1702390 has been marked as a duplicate of this bug. ***

Comment 7 Micah Abbott 2019-05-07 15:41:17 UTC
I checked some of the recent failures in the CI jobs that were referenced in comment #1. I don't see any evidence of the same kind of failures anymore (thanks to Trevor for the handy one-liner!). Moving to VERIFIED.

Comment 9 errata-xmlrpc 2019-06-04 10:48:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758