Bug 1703877

Summary:	[stability] MCD pod is periodically exiting with error during some e2e runs
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Machine Config Operator	Assignee:	Antonio Murdaca <amurdaca>
Status:	CLOSED ERRATA	QA Contact:	Micah Abbott <miabbott>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.1.0	CC:	mnguyen, sponnaga
Target Milestone:	---	Keywords:	TestBlocker
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-10-16 06:28:21 UTC	Type:	Bug

Description Clayton Coleman 2019-04-29 01:20:45 UTC

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/313

Apr 27 05:25:47.689 E ns/openshift-monitoring pod/prometheus-adapter-575b89dd7d-55qtv node/ip-10-0-154-190.ec2.internal container=prometheus-adapter container exited with code 2: 
Apr 27 05:26:09.512 E ns/openshift-image-registry pod/node-ca-vhq7d node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 05:37:31.362 E ns/openshift-machine-config-operator pod/machine-config-daemon-f754s node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:43:13.101 E ns/openshift-machine-config-operator pod/machine-config-daemon-wx22g node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:43:43.147 E ns/openshift-image-registry pod/node-ca-244fh node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 05:44:30.213 E ns/openshift-machine-config-operator pod/machine-config-daemon-9jww2 node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:45:00.266 E ns/openshift-image-registry pod/node-ca-7tnpd node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 05:50:55.850 E ns/openshift-machine-config-operator pod/machine-config-daemon-jlvw2 node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:51:25.909 E ns/openshift-image-registry pod/node-ca-sf2h9 node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 06:02:21.213 E ns/openshift-machine-config-operator pod/machine-config-daemon-npfpc node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 06:02:51.265 E ns/openshift-image-registry pod/node-ca-fv7pz node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 06:05:00.488 E ns/openshift-machine-config-operator pod/machine-config-daemon-w27wx node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 06:05:30.535 E ns/openshift-image-registry pod/node-ca-xtfl4 node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 

This causes a test to fail, but it needs independent investigation to understand why the machine-config-daemon container is exiting and restarting every ~5 minutes.

Comment 1 Antonio Murdaca 2019-04-29 08:39:02 UTC

Apr 27 05:44:30.213 E ns/openshift-machine-config-operator pod/machine-config-daemon-9jww2 node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 

Something (most likely the kubelet) is sending us SIGTERM and killing us.

Comment 2 Antonio Murdaca 2019-05-02 18:49:31 UTC

The 143 exit code means the MCD is being killed after something sent it SIGTERM. In the MCD we currently have a handler for SIGTERM only during our sync; the rest of the code doesn't handle SIGTERM, so we don't catch it and we exit with 143 instead of 0 (which is what we would return if we had a handler).

Comment 3 Antonio Murdaca 2019-05-02 19:59:52 UTC

A PR that fixes this by adding a handler for SIGTERM and exiting cleanly is here: https://github.com/openshift/machine-config-operator/pull/697

Comment 4 Antonio Murdaca 2019-05-02 20:28:21 UTC

Alright, all daemonsets without a SIGTERM handler are exhibiting this behavior of being terminated (full conversation here: https://coreos.slack.com/archives/CEKNRGF25/p1556821026430400). The MCO in that job isn't erroring out either. As outlined in that conversation, this may just be noise (but we do have a PR anyway). I'm moving the target to 4.2.

Comment 6 Michael Nguyen 2019-06-28 14:37:19 UTC

No reports of 'container exited with code 143' in the last 14 days of test runs.  Closing as verified.

Comment 7 errata-xmlrpc 2019-10-16 06:28:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922