Bug 1412087 - Hit panic and segment error in atomic-openshift-node log
Summary: Hit panic and segment error in atomic-openshift-node log
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assignee: Ben Bennett
QA Contact: Zhang Cheng
URL:
Whiteboard:
Depends On:
Blocks: 1415282
 
Reported: 2017-01-11 08:33 UTC by Zhang Cheng
Modified: 2017-01-31 20:19 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the admission controller that adds security contexts is disabled, the node can crash.
Consequence: The node crashes trying to process a security context that is not present.
Fix: Check that the pointer is defined before dereferencing it.
Result: The node doesn't crash.
Clone Of:
Clones: 1415282
Environment:
Last Closed: 2017-01-31 20:19:54 UTC
Target Upstream Version:


Attachments
openshift-sdn-debug.tgz (18.58 MB, application/x-gzip)
2017-01-11 09:41 UTC, DeShuai Ma


Links
System | ID | Priority | Status | Summary | Last Updated
Origin (Github) | 12446 | None | None | None | 2017-01-11 15:57:57 UTC
Red Hat Product Errata | RHBA-2017:0218 | normal | SHIPPED_LIVE | Red Hat OpenShift Container Platform 3.4.1.2 bug fix update | 2017-02-01 01:18:20 UTC

Comment 2 Zhang Cheng 2017-01-11 08:56:35 UTC
There is a mistake in the description above. It should read:
I found that the node is not ready, and saw a panic and segment error in the atomic-openshift-node log.

Comment 3 DeShuai Ma 2017-01-11 09:41:03 UTC
Created attachment 1239383
openshift-sdn-debug.tgz

The node runs on an EC2 t2.large instance. We hit this bug in one of our environments and don't know how to reproduce it.

Comment 4 Ben Bennett 2017-01-11 16:00:35 UTC
PR https://github.com/openshift/origin/pull/12446 resolves the symptom of the problem. Investigation is ongoing to determine whether there is a deeper problem where the SecurityContext is not being set when it should be.

Comment 5 Andy Goldstein 2017-01-11 19:44:04 UTC
I'm seeing a few pods that are missing the openshift.io/scc annotation, and their containers' SecurityContext fields are all nil when they shouldn't be. When you were using this cluster, were you just doing normal operations (oc create, oc run, etc)? Or was there anything out of the ordinary?

Comment 6 Andy Goldstein 2017-01-11 21:01:41 UTC
Did you ever change the admission control configuration in the master config file?

Comment 7 Weibin Liang 2017-01-11 21:48:18 UTC
Hi Zhang Cheng, I am trying to reproduce your bug locally. In my test environment, OCP 3.4.0.39 works with docker 1.10 but not with docker 1.12 (the master and node are stuck in the NotReady state). Could you let me know which steps you took to make OCP 3.4.30 work with docker 1.12?

Comment 8 openshift-github-bot 2017-01-12 01:35:09 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/6255e656cf97021047086d925192ca506b81ffae
Add a nil check to Container.SecurityContext

We were panicking sometimes when we dereferenced a nil pointer when
looking at the Container.SecurityContext, which is defined as optional.

This fix adds a check that it is not nil before dereferencing.

Fixes bug 1412087 (https://bugzilla.redhat.com/show_bug.cgi?id=1412087)
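
The kind of guard this commit message describes can be sketched as follows. This is a minimal illustration and not the code from the PR: the types are hypothetical stand-ins for the real API types, isPrivileged is a made-up helper, and the source does not say which field of the SecurityContext the node code was reading. The Privileged field is used only as an example of an optional pointer field.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the real API types.
type SecurityContext struct {
	Privileged *bool // optional, may be nil
}

type Container struct {
	Name            string
	SecurityContext *SecurityContext // optional, may be nil
}

// isPrivileged reports whether the container requests privileged mode.
// It checks every optional pointer before dereferencing, so a container
// without a SecurityContext yields false instead of a panic.
func isPrivileged(c Container) bool {
	if c.SecurityContext == nil || c.SecurityContext.Privileged == nil {
		return false
	}
	return *c.SecurityContext.Privileged
}

func main() {
	yes := true
	containers := []Container{
		{Name: "no-context"}, // SecurityContext left nil
		{Name: "privileged", SecurityContext: &SecurityContext{Privileged: &yes}},
	}
	for _, c := range containers {
		fmt.Printf("%s privileged=%v\n", c.Name, isPrivileged(c))
	}
}
```

Without the nil check, reading *c.SecurityContext.Privileged on the first container would trigger a nil pointer dereference at runtime, which is the class of panic reported here.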

Comment 11 Ben Bennett 2017-01-12 14:36:33 UTC
The PR https://github.com/openshift/origin/pull/12446 prevents the crash when the admission controller is disabled.

Fortunately, customers are unlikely to want to disable the admission controller that adds the security contexts, so this is not a release blocker for 3.4.0. It will be fixed in 3.4.1 and 3.3.x.

Comment 12 Weibin Liang 2017-01-12 15:39:35 UTC
@Zhang, by default docker 1.10 is installed when OCP 3.4 is installed using Flexy. Even when I rebuild my EC2 instances based on your previous build https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/AOS_V3_Installation/job/Launch%20Environment%20Flexy/9570/, the docker version is still 1.10.

When I remove docker 1.10 and install docker 1.12, neither the master nor the node comes up.

I know this bug is fixed, but I am curious how you upgraded docker to 1.12 in OCP 3.4.

Comment 14 Zhang Cheng 2017-01-13 01:53:59 UTC
@Gan Huang Thanks for your kind reply.
@Weibin, please refer to Gan Huang's comments.

Comment 16 Zhang Cheng 2017-01-22 06:35:48 UTC
Passed and verified on OCP 3.4.1.0. The test steps follow my comments above.
Test env:
OCP v3.4.1.0
kubernetes v1.4.0+776c994

Comment 18 errata-xmlrpc 2017-01-31 20:19:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218

