Bug 1412087

Summary: Hit panic and segment error in atomic-openshift-node log
Product: OpenShift Container Platform Reporter: Zhang Cheng <chezhang>
Component: Networking Assignee: Ben Bennett <bbennett>
Status: CLOSED ERRATA QA Contact: Zhang Cheng <chezhang>
Severity: high Docs Contact:
Priority: high    
Version: 3.4.0 CC: agoldste, aos-bugs, chezhang, dma, ghuang, mifiedle, weliang, wmeng, xtian
Target Milestone: ---   
Target Release: 3.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: When the admission controller that adds security contexts is disabled, the node can crash. Consequence: The node crashes trying to process a security context that is not present. Fix: Check the pointer is defined before dereferencing it. Result: The node doesn't crash.
Story Points: ---
Clone Of:
: 1415282 (view as bug list) Environment:
Last Closed: 2017-01-31 20:19:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1415282    
Attachments:
Description Flags
openshift-sdn-debug.tgz none

Comment 2 Zhang Cheng 2017-01-11 08:56:35 UTC
There is a mistake in the description above.
It should read:
I found that the node was not ready, and saw a panic and segmentation fault in the atomic-openshift-node log.

Comment 3 DeShuai Ma 2017-01-11 09:41:03 UTC
Created attachment 1239383 [details]
openshift-sdn-debug.tgz

The node runs on an EC2 t2.large instance. We hit this bug in one of our environments and do not know how to reproduce it.

Comment 4 Ben Bennett 2017-01-11 16:00:35 UTC
PR https://github.com/openshift/origin/pull/12446 resolves the symptom of the problem. Investigation is ongoing to determine whether there is a deeper problem where the SecurityContext is not being set when it should be.

Comment 5 Andy Goldstein 2017-01-11 19:44:04 UTC
I'm seeing a few pods that are missing the openshift.io/scc annotation, and their containers' SecurityContext fields are all nil when they shouldn't be. When you were using this cluster, were you just doing normal operations (oc create, oc run, etc)? Or was there anything out of the ordinary?
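
The check described in this comment can be sketched roughly as below. This is only an illustration: Pod, Container and SecurityContext are simplified stand-ins for the Kubernetes API types, and suspectPods is a hypothetical helper, not a function from openshift/origin. It flags pods that lack the openshift.io/scc annotation and have at least one container whose SecurityContext is nil.

```go
package main

import "fmt"

// Annotation named in the comment above.
const sccAnnotation = "openshift.io/scc"

// Simplified stand-ins for the Kubernetes API types (assumption: only the
// fields needed for this illustration are modelled).
type SecurityContext struct{}

type Container struct {
	Name            string
	SecurityContext *SecurityContext // optional, so it may be nil
}

type Pod struct {
	Name        string
	Annotations map[string]string
	Containers  []Container
}

// suspectPods (hypothetical helper) returns the names of pods that are
// missing the SCC annotation and have a container with a nil SecurityContext.
func suspectPods(pods []Pod) []string {
	var names []string
	for _, p := range pods {
		// Reading from a nil map is safe in Go and reports "not found".
		if _, ok := p.Annotations[sccAnnotation]; ok {
			continue
		}
		for _, c := range p.Containers {
			if c.SecurityContext == nil {
				names = append(names, p.Name)
				break
			}
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{
			Name:        "admitted",
			Annotations: map[string]string{sccAnnotation: "restricted"},
			Containers:  []Container{{Name: "c", SecurityContext: &SecurityContext{}}},
		},
		{
			Name:       "not-admitted",
			Containers: []Container{{Name: "c"}},
		},
	}
	fmt.Println(suspectPods(pods))
}
```

Against a live cluster the same information would come from the pod objects returned by the API server; the in-memory slice here only stands in for that.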

Comment 6 Andy Goldstein 2017-01-11 21:01:41 UTC
Did you ever change the admission control configuration in the master config file?

Comment 7 Weibin Liang 2017-01-11 21:48:18 UTC
Hi Zhang Cheng, I am trying to reproduce your bug locally. In my test environment, OCP 3.4.0.39 works with docker 1.10 but not with docker 1.12 (master and node are stuck in the NotReady state). Could you let me know which steps you took to make OCP 3.4.30 work with docker 1.12?

Comment 8 openshift-github-bot 2017-01-12 01:35:09 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/6255e656cf97021047086d925192ca506b81ffae
Add a nil check to Container.SecurityContext

We were panicking sometimes when we dereferenced a nil pointer when
looking at the Container.SecurityContext which is defined as optional.

This fix adds a check to see if it is not nil before dereferencing.

Fixes bug 1412087 (https://bugzilla.redhat.com/show_bug.cgi?id=1412087)
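
For readers unfamiliar with the pattern, the kind of guard the commit message describes might look like the sketch below. It is not the code from the commit: the types are simplified stand-ins for the Kubernetes API types, isPrivileged is a hypothetical helper, and the Privileged field is only an assumed example of something read through the optional pointer, since this report does not show which field the real code reads.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes API types (assumption: only the
// fields needed for this illustration are modelled).
type SecurityContext struct {
	Privileged *bool
}

type Container struct {
	Name            string
	SecurityContext *SecurityContext // optional, so it may be nil
}

// isPrivileged (hypothetical helper) reports whether the container requests
// privileged mode. A missing SecurityContext or a missing Privileged field
// is treated as "not privileged" instead of dereferencing a nil pointer.
func isPrivileged(c Container) bool {
	if c.SecurityContext == nil || c.SecurityContext.Privileged == nil {
		return false
	}
	return *c.SecurityContext.Privileged
}

func main() {
	yes := true
	containers := []Container{
		{Name: "no-context"},
		{Name: "empty-context", SecurityContext: &SecurityContext{}},
		{Name: "privileged", SecurityContext: &SecurityContext{Privileged: &yes}},
	}
	for _, c := range containers {
		fmt.Printf("%s privileged=%v\n", c.Name, isPrivileged(c))
	}
}
```

Without the first condition, the "no-context" container would trigger a nil pointer dereference at runtime, which is the class of panic reported in this bug.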

Comment 11 Ben Bennett 2017-01-12 14:36:33 UTC
The PR https://github.com/openshift/origin/pull/12446 prevents the crash when the admission controller is disabled.

Fortunately, disabling the admission controller that adds the security contexts is not likely to be desired at the customer site, so this is not a release blocker for 3.4.0 and will be fixed in 3.4.1 and 3.3.x.

Comment 12 Weibin Liang 2017-01-12 15:39:35 UTC
@Zhang, by default docker 1.10 is installed when OCP 3.4 is installed using Flexy. Even when I rebuild my EC2 instances based on your previous build https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/AOS_V3_Installation/job/Launch%20Environment%20Flexy/9570/, the docker version is still 1.10.

When I remove docker 1.10 and install docker 1.12, neither the master nor the node comes up.

I know this bug is fixed, but I am curious how you upgraded docker to 1.12 in OCP 3.4.

Comment 14 Zhang Cheng 2017-01-13 01:53:59 UTC
@Gan Huang Thanks for your kind reply.
@Weibin, please refer to Gan Huang's comments.

Comment 16 Zhang Cheng 2017-01-22 06:35:48 UTC
Passed and verified on OCP 3.4.1.0; the test steps follow my comments above.
Test env:
OCP v3.4.1.0
kubernetes v1.4.0+776c994

Comment 18 errata-xmlrpc 2017-01-31 20:19:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218