Bug 1703877 - [stability] MCD pod is periodically exiting with error during some e2e runs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-29 01:20 UTC by Clayton Coleman
Modified: 2019-10-16 06:28 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:21 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:28:33 UTC

Description Clayton Coleman 2019-04-29 01:20:45 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/313

Apr 27 05:25:47.689 E ns/openshift-monitoring pod/prometheus-adapter-575b89dd7d-55qtv node/ip-10-0-154-190.ec2.internal container=prometheus-adapter container exited with code 2: 
Apr 27 05:26:09.512 E ns/openshift-image-registry pod/node-ca-vhq7d node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 05:37:31.362 E ns/openshift-machine-config-operator pod/machine-config-daemon-f754s node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:43:13.101 E ns/openshift-machine-config-operator pod/machine-config-daemon-wx22g node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:43:43.147 E ns/openshift-image-registry pod/node-ca-244fh node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 05:44:30.213 E ns/openshift-machine-config-operator pod/machine-config-daemon-9jww2 node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:45:00.266 E ns/openshift-image-registry pod/node-ca-7tnpd node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 05:50:55.850 E ns/openshift-machine-config-operator pod/machine-config-daemon-jlvw2 node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 05:51:25.909 E ns/openshift-image-registry pod/node-ca-sf2h9 node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 06:02:21.213 E ns/openshift-machine-config-operator pod/machine-config-daemon-npfpc node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 06:02:51.265 E ns/openshift-image-registry pod/node-ca-fv7pz node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 
Apr 27 06:05:00.488 E ns/openshift-machine-config-operator pod/machine-config-daemon-w27wx node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 
Apr 27 06:05:30.535 E ns/openshift-image-registry pod/node-ca-xtfl4 node/ip-10-0-154-190.ec2.internal container=node-ca container exited with code 137: 

This causes a test to fail, but needs independent investigation to understand why it is exiting and restarting every ~5 minutes.

Comment 1 Antonio Murdaca 2019-04-29 08:39:02 UTC
Apr 27 05:44:30.213 E ns/openshift-machine-config-operator pod/machine-config-daemon-9jww2 node/ip-10-0-154-190.ec2.internal container=machine-config-daemon container exited with code 143: 

Something (most likely the kubelet) is sending us SIGTERM: exit code 143 is 128 + 15, and 15 is SIGTERM.
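
For reference, the exit codes in the log above can be decoded with a small Go sketch. It assumes the common convention that a process killed by a signal is reported with exit code 128 plus the signal number; the helper name signalFromExitCode is hypothetical and is not part of the MCD code.

```go
package main

import (
	"fmt"
	"syscall"
)

// signalFromExitCode is a hypothetical helper. It assumes the common
// "128 + signal number" convention for processes terminated by a signal
// and reports false for codes outside that range.
func signalFromExitCode(code int) (syscall.Signal, bool) {
	if code <= 128 || code > 128+64 {
		return 0, false
	}
	return syscall.Signal(code - 128), true
}

func main() {
	// The three exit codes that appear in the e2e log in the description.
	for _, code := range []int{143, 137, 2} {
		if sig, ok := signalFromExitCode(code); ok {
			fmt.Printf("exit code %d: signal %d (%v)\n", code, int(sig), sig)
		} else {
			fmt.Printf("exit code %d: ordinary exit status, not a signal\n", code)
		}
	}
}
```

Under that convention, 143 maps to signal 15 (SIGTERM) for machine-config-daemon and 137 maps to signal 9 (SIGKILL) for node-ca, while the code 2 from prometheus-adapter is an ordinary exit status.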

Comment 2 Antonio Murdaca 2019-05-02 18:49:31 UTC
The 143 exit code means the MCD was killed after something sent it a SIGTERM. In the MCD we only have a SIGTERM handler during our sync; the rest of the code doesn't handle SIGTERM, so the signal isn't caught and we exit with 143 instead of 0 (which is what we would return if we had a handler).

Comment 3 Antonio Murdaca 2019-05-02 19:59:52 UTC
A PR that fixes this by adding a SIGTERM handler and exiting cleanly is here: https://github.com/openshift/machine-config-operator/pull/697

Comment 4 Antonio Murdaca 2019-05-02 20:28:21 UTC
Alright, all daemonsets without a SIGTERM handler exhibit this behavior when they are terminated (full conversation here: https://coreos.slack.com/archives/CEKNRGF25/p1556821026430400). The MCO in that job isn't erroring out either. As outlined in that conversation, this may just be noise (but we do have a PR anyway). I'm moving the target to 4.2.

Comment 6 Michael Nguyen 2019-06-28 14:37:19 UTC
No reports of 'container exited with code 143' in the last 14 days of test runs.  Closing as verified.

Comment 7 errata-xmlrpc 2019-10-16 06:28:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

