1703699 – MCD is being killed and recreated causing a failed sync

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Antonio Murdaca
QA Contact:	Micah Abbott
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1702390
Depends On:
Blocks:	1703879

Reported:	2019-04-27 17:12 UTC by Antonio Murdaca
Modified:	2019-06-04 10:48 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:48:05 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:48:13 UTC

Description Antonio Murdaca 2019-04-27 17:12:23 UTC

Looking at some upgrade jobs, we've come across something very odd: the MCD is interrupted in the middle of a sync, which causes a subsequent restart and a permanent failure (unless an admin fixes it on the node itself).

The MCD syncs like this:

1) The MCC generates a new config and sets it as desiredConfig on a node
2) The MCD notices that and starts rolling new files onto the node, then writes a "pending" config file (which is just desiredConfig)
3) Drain + reboot
4) When the MCD comes back online, it looks for the pending config and sets it as the currentConfig on the node, completing the sync
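
As a rough illustration of the four steps above, here is a toy model in Python. It is only a sketch under stated assumptions, not the MCD's real code: the `Node` class, its fields, and the function names are all hypothetical, and the drain and reboot are merely logged.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for a node; every name here is hypothetical."""
    current_config: str
    desired_config: str
    files_on_disk: str                     # which config's files are written
    pending_on_disk: Optional[str] = None  # models the "pending" config file
    log: list = field(default_factory=list)


def mcc_set_desired(node: Node, new_config: str) -> None:
    # Step 1: the MCC generates a new config and marks it as desired.
    node.desired_config = new_config


def mcd_apply(node: Node) -> None:
    # Step 2: roll the new files out, then record the pending config.
    node.files_on_disk = node.desired_config
    node.pending_on_disk = node.desired_config
    # Step 3: drain and reboot (only logged in this toy model).
    node.log.append("drain+reboot")


def mcd_after_reboot(node: Node) -> None:
    # Step 4: promote the pending config to currentConfig, ending the sync.
    if node.pending_on_disk is not None:
        node.current_config = node.pending_on_disk
        node.pending_on_disk = None
        node.log.append("sync complete")


node = Node(current_config="rendered-a", desired_config="rendered-a",
            files_on_disk="rendered-a")
mcc_set_desired(node, "rendered-b")
mcd_apply(node)
mcd_after_reboot(node)
print(node.current_config, node.log)
```

The point of the sketch is that the pending record is the only thing carrying the in-flight update across the reboot.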


We've only seen this happen on master nodes, and only in upgrade jobs.

What we've noticed in the masters' logs is that the MCD comes online, but it's then killed and recreated.

If it's killed before we write the pending config to disk, then when it comes back online it cannot reconcile itself.
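
To make that window concrete, here is a small hypothetical startup check in Python. It assumes the restarted daemon has only three things to go on: the node's currentConfig, the files actually on disk, and the pending record (if any). The function name and return values are invented for illustration and are not taken from the MCD source.

```python
from typing import Optional


def startup_check(current_config: str,
                  files_on_disk: str,
                  pending_on_disk: Optional[str]) -> str:
    """Hypothetical decision a restarted daemon could make."""
    if pending_on_disk is not None:
        # The update was recorded before the restart, so it can be finished.
        return "promote-pending"
    if files_on_disk == current_config:
        # Nothing was in flight.
        return "in-sync"
    # Files were changed, but nothing on disk says which update was running.
    return "stuck"


# Killed after rolling files but before writing the pending record.
print(startup_check("rendered-a", "rendered-b", None))
# Killed after the pending record was written: recoverable.
print(startup_check("rendered-a", "rendered-b", "rendered-b"))
```

In this sketch the "stuck" case only arises when the kill lands between rolling the files and writing the pending record, which is the window described above.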

This is most probably a model we need to change in the MCD, but I wanted to understand:

1) Is it normal for the MCD to be killed like that? My answer is no.
2) Who is killing the MCD, and why, when it's already up and running?

Some examples of the failures are here:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/336/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-upgrade/36
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/156/pull-ci-openshift-cluster-network-operator-master-e2e-aws-upgrade/25

Logs can probably tell you more

I'm cc'ing Mrunal as well for the Container Runtime part.

Comment 1 Mrunal Patel 2019-04-27 19:26:49 UTC

I see the probes failing in this log - https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89

Comment 2 Colin Walters 2019-04-28 21:20:15 UTC

> I see the probes failing in this log

Which probes?

Comment 3 W. Trevor King 2019-04-28 22:18:44 UTC

(In reply to Colin Walters from comment #2)
> > I see the probes failing in this log
> 
> Which probes?

Lots of probes?

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-upgrade/89/build-log.txt | sed -n 's|.*ns/\([a-z-]*\) pod/\([a-z0-9-]*\) \([A-Za-z]*\) probe \([a-z]*\):.*|\3\t\4\t\1/\2|p' | sort | uniq -c | sort -n
      1 Liveness	errored	openshift-sdn/ovs-79jml
      1 Liveness	errored	openshift-sdn/ovs-bd22c
      1 Liveness	errored	openshift-sdn/ovs-vzq8s
      1 Liveness	failed	openshift-console/console-5bb6bf7db4-6bwl7
      1 Liveness	failed	openshift-operator-lifecycle-manager/catalog-operator-6478bf6988-f4d5l
      1 Readiness	errored	kube-system/etcd-quorum-guard-69b7b4499b-6pqrv
      1 Readiness	failed	openshift-marketplace/certified-operators-747d97b84-mvcds
      2 Liveness	errored	openshift-marketplace/community-operators-6dd8c5c5f4-g7p7q
      3 Liveness	failed	openshift-apiserver/apiserver-qxc4z
      3 Liveness	failed	openshift-console/console-5bb6bf7db4-kjfvp
      3 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-6ntft
      3 Readiness	failed	openshift-marketplace/community-operators-6dd8c5c5f4-g7p7q
      3 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-4gkkg
      3 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-79c89fc4bd-257l7
      4 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-4gkkg
      4 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-hkt49
      4 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-79c89fc4bd-257l7
      4 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-6ntft
      5 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-j6k9t
      5 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-j6k9t
      5 Readiness	failed	openshift-sdn/sdn-579mq
      6 Liveness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-xgbgl
      8 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-6bb686cfbb-hkt49
     11 Readiness	failed	openshift-apiserver/apiserver-qxc4z
     13 Liveness	failed	openshift-dns/dns-default-fjn84
     17 Readiness	failed	openshift-console/console-5bb6bf7db4-kjfvp
     18 Readiness	failed	kube-system/etcd-quorum-guard-69b7b4499b-6pqrv
     28 Readiness	failed	openshift-operator-lifecycle-manager/packageserver-5ffdb9d78c-xgbgl
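
For readers less fluent in sed, the one-liner above can be approximated in Python as follows. The regular expression is a direct translation of the sed pattern; the sample lines are made up to fit that pattern and are not copied from build-log.txt, and the `tally` helper is a hypothetical name.

```python
import re
from collections import Counter

# Same pattern as the sed expression, in Python regex syntax.
PROBE = re.compile(
    r".*ns/([a-z-]*) pod/([a-z0-9-]*) ([A-Za-z]*) probe ([a-z]*):.*"
)


def tally(lines):
    """Count (probe type, result, namespace/pod) triples, fewest first."""
    counts = Counter()
    for line in lines:
        m = PROBE.match(line)
        if m:  # like `sed -n ... p`, non-matching lines are dropped
            ns, pod, kind, result = m.groups()
            counts[(kind, result, f"{ns}/{pod}")] += 1
    # Ascending by count, roughly `sort | uniq -c | sort -n`;
    # ties are broken by the key itself.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


# Made-up lines shaped to match the pattern.
sample = [
    "t1 W ns/openshift-dns pod/dns-default-fjn84 Liveness probe failed: timeout",
    "t2 W ns/openshift-dns pod/dns-default-fjn84 Liveness probe failed: timeout",
    "t3 W ns/openshift-sdn pod/ovs-79jml Liveness probe errored: rpc error",
    "t4 I ns/openshift-sdn pod/ovs-79jml node is ready",
]
for (kind, result, target), n in tally(sample):
    print(f"{n:7d} {kind}\t{result}\t{target}")
```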

Comment 4 Antonio Murdaca 2019-05-02 17:55:09 UTC

This is not happening anymore. The 1162 most recent -e2e- jobs aren't hitting this, and it's probably related to the systemd fix which went in as well.

Comment 5 Antonio Murdaca 2019-05-02 17:57:42 UTC

*** Bug 1702390 has been marked as a duplicate of this bug. ***

Comment 7 Micah Abbott 2019-05-07 15:41:17 UTC

I checked some of the recent failures in the CI jobs that were referenced in comment #1.  I don't see any evidence of the same kind of failures anymore (thanks to Trevor for the handy oneliner!).  Moving to VERIFIED.

Comment 9 errata-xmlrpc 2019-06-04 10:48:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
