Bug 1512370

Summary: [free-stg] Long period ContainerCreating / Init:0/2
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED DUPLICATE
QA Contact: DeShuai Ma <dma>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.0
CC: aos-bugs, bingli, danw, haowang, jokerman, mmccomas, sjenning, wzheng, yasun, yufchang, zhaliu
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-13 19:49:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-11-13 03:06:29 UTC
Description of problem:

NAME                                 READY     STATUS              RESTARTS   AGE
po/dancer-mysql-persistent-1-build   0/1       Init:0/2            0          9m
po/database-1-deploy                 0/1       ContainerCreating   0          9m



Version-Release number of selected component (if applicable):
[root@free-stg-master-03fb6 ~]# oc version
oc v3.7.4
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.31.78.254:443
openshift v3.7.4
kubernetes v1.7.6+a08f5eeb62

Comment 4 Seth Jennings 2017-11-13 15:48:56 UTC
The CNI plugin is jammed (54k occurrences in the node log):

1304 cni.go:304] Error deleting network when building cni runtime conf: could not retrieve port mappings: checkpoint is corrupted.

While the checkpoint file should not be corrupt, the docker shim should remove any corrupt checkpoint file, but it is not due to a bug.

buildCNIRuntimeConf() is modifying the err from plugin.host.GetPodPortMappings() as it propagates to the caller.  However, the caller checks the error against errors.CorruptCheckpointError to determine if the checkpoint file should be removed.  This will never be true as buildCNIRuntimeConf() is modifying the error.
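The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the kubelet code: errCorruptCheckpoint, getPodPortMappings(), buildRuntimeConf(), and shouldRemoveCheckpoint() are hypothetical stand-ins for errors.CorruptCheckpointError, GetPodPortMappings(), buildCNIRuntimeConf(), and the caller's check. It assumes the caller compares the error with ==.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorruptCheckpoint is a hypothetical sentinel standing in for
// errors.CorruptCheckpointError.
var errCorruptCheckpoint = errors.New("checkpoint is corrupted")

// getPodPortMappings stands in for GetPodPortMappings() and always fails
// with the sentinel.
func getPodPortMappings() error {
	return errCorruptCheckpoint
}

// buildRuntimeConf stands in for buildCNIRuntimeConf(). Formatting the error
// with %v produces a brand new error value, so the sentinel identity is lost
// even though the message text still contains it.
func buildRuntimeConf() error {
	if err := getPodPortMappings(); err != nil {
		return fmt.Errorf("could not retrieve port mappings: %v", err)
	}
	return nil
}

// shouldRemoveCheckpoint is the caller-side check, assumed here to use ==.
func shouldRemoveCheckpoint(err error) bool {
	return err == errCorruptCheckpoint
}

func main() {
	fmt.Println(shouldRemoveCheckpoint(getPodPortMappings())) // unmodified error: true
	fmt.Println(shouldRemoveCheckpoint(buildRuntimeConf()))   // rewritten error: false
}
```

The second call reports false even though the underlying cause is the corrupt checkpoint, which is why the removal branch would never be reached in this arrangement.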

Comment 5 Seth Jennings 2017-11-13 16:02:17 UTC
Sorry, meant to keep this one. Working on a fix.

Comment 6 Seth Jennings 2017-11-13 19:49:19 UTC
Sorry for the delay. The corrupt checkpoint messages, while nasty, are not the cause of the delay in sandbox start.  It is the vnid issue again.

*** This bug has been marked as a duplicate of bug 1509799 ***

Comment 7 Dan Winship 2017-11-13 20:48:29 UTC
(In reply to Seth Jennings from comment #4)
> While the checkpoint file should not be corrupt, the docker shim should
> remove any corrupt checkpoint file, but it is not due to a bug.
> 
> buildCNIRuntimeConf() is modifying the err from
> plugin.host.GetPodPortMappings() as it propagates to the caller.  However,
> the caller checks the error against errors.CorruptCheckpointError to
> determine if the checkpoint file should be removed.  This will never be true
> as buildCNIRuntimeConf() is modifying the error.

The newly-added check in docker_sandbox.go is also too late: the CorruptCheckpointError we're getting isn't coming from StopContainer(), it's coming from TearDownPod() a few lines earlier (via buildCNIRuntimeConf() -> GetPodPortMappings() -> GetCheckpoint()).
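The ordering problem can be illustrated with a heavily simplified sketch. None of this is the actual docker_sandbox.go code: tearDownPod(), stopContainer(), and both stopSandbox variants are hypothetical, and the sketch assumes the error reaches the check with its identity intact, which comment #4 says is not currently the case. stopSandboxEarlyCheck() is only one possible arrangement, not a description of any merged fix.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorruptCheckpoint is a hypothetical sentinel standing in for
// errors.CorruptCheckpointError.
var errCorruptCheckpoint = errors.New("checkpoint is corrupted")

// tearDownPod stands in for TearDownPod(); in this sketch it is the call
// that runs into the corrupt checkpoint.
func tearDownPod() error { return errCorruptCheckpoint }

// stopContainer stands in for StopContainer(); in this sketch it succeeds.
func stopContainer() error { return nil }

// stopSandboxLateCheck only inspects the error from stopContainer, so the
// corrupt-checkpoint error returned earlier by tearDownPod goes unnoticed.
func stopSandboxLateCheck() (removeCheckpoint bool, err error) {
	tearDownErr := tearDownPod()
	if stopErr := stopContainer(); stopErr == errCorruptCheckpoint {
		removeCheckpoint = true
	}
	return removeCheckpoint, tearDownErr
}

// stopSandboxEarlyCheck also inspects the error from tearDownPod, so the
// corrupt checkpoint is detected where it actually surfaces.
func stopSandboxEarlyCheck() (removeCheckpoint bool, err error) {
	tearDownErr := tearDownPod()
	if tearDownErr == errCorruptCheckpoint {
		removeCheckpoint = true
	}
	if stopErr := stopContainer(); stopErr == errCorruptCheckpoint {
		removeCheckpoint = true
	}
	return removeCheckpoint, tearDownErr
}

func main() {
	late, _ := stopSandboxLateCheck()
	early, _ := stopSandboxEarlyCheck()
	fmt.Println(late, early) // false true
}
```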