1595256 – Error message does not lead to root cause.

Bug 1595256 - Error message does not lead to root cause.

Summary: Error message does not lead to root cause.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Urvashi Mohnani
QA Contact:	weiwei jiang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-06-26 12:38 UTC by Priyanka Kanthale
Modified:	2021-09-09 14:45 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:40:22 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:40:28 UTC

Description Priyanka Kanthale 2018-06-26 12:38:53 UTC

Description of problem:

The pod was stuck in ContainerCreating state. Which was build with a build config. 

Events noted: 

Events:
  FirstSeen     LastSeen        Count   From                                                    SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                                    -------------   --------        ------          -------
  4m            4m              1       default-scheduler                                                       Normal          Scheduled       Successfully assigned mongodb-replicaset-blockchain-7-build to ip-10-x-x-x.abc
  4m            4m              12      kubelet, ip-10-x-x-x.abc                Normal          SandboxChanged  Pod sandbox changed, it will be killed and re-created.
  4m            4m              13      kubelet, ip-10-x-x-x.abc                 Warning         FailedSync      Error syncing pod


On node the following error message was logged number of times: 

<snip>
un 18 17:16:30 xyz.com atomic-openshift-node[68991]: E0618 17:16:30.569465   69062 pod_workers.go:182] Error syncing pod 4d850f4a-72dd-11e8-8add-02a300ce655d ("mongodb-replica-5-build_abc(4d850f4a-72dd-11e8-8add-02a300ce655d)"), skipping: failed to "CreatePodSandbox" for "mongodb-replica-5-build_abc(4d850f4a-72dd-11e8-8add-02a300ce655d)" with CreatePodSandboxError: "CreatePodSandbox for pod \"mongodb-replica-5-build_abc(4d850f4a-72dd-11e8-8add-02a300ce655d)\" failed: rpc error: code = 2 desc = failed to start sandbox container for pod \"mongodb-replica-5-build\": Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"process_linux.go:286: decoding sync type from init pipe caused \\\\\\\\\\\\\\\"read parent: connection reset by peer\\\\\\\\\\\\\\\"\\\\\\\"\\\\n\\\"\"}"
</snip> 

The pod was running when the containers limits were removed from the build config. 
i.e for example 
    Limits:
      cpu:      1
      memory:   2


Actual results: Pod went into ContainerCreating state. 


Expected results: The logs should specify the root cause. 


Additional info:

Comment 1 Seth Jennings 2018-06-29 14:45:43 UTC

This is tricky unfortunately.

https://github.com/kubernetes/kubernetes/issues/49442

The problem is that the minimum memory limit is runtime dependent so we can code that knowledge into the kubelet. docker does not return a meaningful error to us from the CRI, so that is really an issue with docker.

I'll send this over to Containers team to see if they want to improve the error message in that situation i.e. a container tries to start with a memroy limit below the 4MB minimum docker enforces.

Comment 2 Mrunal Patel 2018-08-06 22:02:34 UTC

We can fix this in CRI-O to improve the error message when the memory is too low.

Comment 3 Mrunal Patel 2018-08-06 22:45:35 UTC

https://github.com/kubernetes-incubator/cri-o/pull/1723 created for cri-o 1.11

Comment 5 weiwei jiang 2018-08-27 05:59:02 UTC

Checked with 
cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64

And the fix is still not in, so move back to modified

Comment 7 weiwei jiang 2018-09-07 02:48:29 UTC

Checked on 
# oc version 
oc v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-wjiang-master-etcd-1:8443
openshift v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0

with cri-o-1.11.2-1.rhaos3.11.git3eac3b2.el7.x86_64

And looks like patch still not in.

# oc describe pods pod-lks6v
Name:               pod-lks6v                           
Namespace:          default                                                             
Priority:           0              
PriorityClassName:  <none>      
Node:               qe-wjiang-node-registry-router-1/10.240.0.115
Start Time:         Thu, 06 Sep 2018 22:29:08 -0400
Labels:             <none>
Annotations:        openshift.io/scc=anyuid
Status:             Pending
IP:                        
Containers:  
  c1:                                
    Container ID:                            
    Image:          openshift/hello-openshift:latest
    Image ID:    
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False       
    Restart Count:  0                   
    Limits:                       
      cpu:     1             
      memory:  1Ki   
    Requests:                     
      cpu:        1                                
      memory:     1Ki
    Environment:  <none>                                            
    Mounts:          
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6s2kq (ro)
Conditions:                            
  Type              Status                                                 
  Initialized       True         
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-6s2kq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6s2kq
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                  Age               From                                       Message
  ----     ------                  ----              ----                                       -------
  Normal   Scheduled               1m                default-scheduler                          Successfully assigned default/pod-lks6v to qe-wjiang-node-registry-router-1
  Warning  FailedCreatePodSandBox  15s (x8 over 1m)  kubelet, qe-wjiang-node-registry-router-1  Failed create pod sandbox: rpc error: code = Unknown desc = signal: killed

Comment 9 weiwei jiang 2018-09-17 08:38:19 UTC

Checked with cri-o-1.11.4-2.rhaos3.11.gite0c89d8.el7_5.x86_64
on
# oc version 
oc v3.11.6
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-wjiang-master-etcd-1:8443
openshift v3.11.6
kubernetes v1.11.0+d4cacc0


And the issue still not fixed in 1.11.4-2.

# oc describe pods cluster-capacity-stub-container
Name:               cluster-capacity-stub-container
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               qe-wjiang-master-etcd-1/10.240.0.112
Start Time:         Mon, 17 Sep 2018 04:33:56 -0400
Labels:             <none>
Annotations:        openshift.io/scc=anyuid
Status:             Pending
IP:
Containers:
  cluster-capacity-stub-container:
    Container ID:
    Image:          gcr.io/google_containers/pause:2.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Ki
    Requests:
      cpu:        100m
      memory:     1Ki
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sbldp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-sbldp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sbldp
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  node-role.kubernetes.io/master=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                  Age   From                              Message
  ----     ------                  ----  ----                              -------
  Normal   Scheduled               11s   default-scheduler                 Successfully assigned default/cluster-capacity-stub-container to qe-wjiang-master-etcd-1
  Warning  FailedCreatePodSandBox  11s   kubelet, qe-wjiang-master-etcd-1  Failed create pod sandbox: rpc error: code = Unknown desc = signal: killed

Comment 17 weiwei jiang 2018-12-05 04:44:49 UTC

Checked with 
$ oc version
oc v4.0.0-0.82.0
kubernetes v1.11.0+6855010f70
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.99.6:8443
openshift v4.0.0-0.81.0
kubernetes v1.11.0+ba8f89f8f9

$ oc get nodes -o wide 
NAME                                  STATUS    ROLES     AGE       VERSION              INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION              CONTAINER-RUNTIME
app-node-0.wjiang-ocp.example.com     Ready     compute   1h        v1.11.0+6855010f70   192.168.99.12   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.12.0
infra-node-0.wjiang-ocp.example.com   Ready     infra     1h        v1.11.0+6855010f70   192.168.99.15   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.12.0
master-0.wjiang-ocp.example.com       Ready     master    1h        v1.11.0+6855010f70   192.168.99.6    <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.12.0

$ rpm -qa|grep -i cri-o
cri-o-1.12.0-3.rhaos4.0.git774a29e.el7.x86_64


And still can not got the expected error message:
$ oc describe pods h-3-x975w
Name:               h-3-x975w
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               app-node-0.wjiang-ocp.example.com/192.168.99.12
Start Time:         Tue, 04 Dec 2018 23:38:02 -0500
Labels:             deployment=h-3
                    deploymentconfig=h
                    run=h
Annotations:        openshift.io/deployment-config.latest-version=3
                    openshift.io/deployment-config.name=h
                    openshift.io/deployment.name=h-3
                    openshift.io/scc=restricted
Status:             Pending
IP:
Controlled By:      ReplicationController/h-3
Containers:
  h:
    Container ID:
    Image:          openshift/hello-openshift
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  2Mi
    Requests:
      cpu:        100m
      memory:     2Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wrxzg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-wrxzg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-wrxzg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                  Age   From                                        Message
  ----     ------                  ----  ----                                        -------
  Normal   Scheduled               10s   default-scheduler                           Successfully assigned default/h-3-x975w to app-node-0.wjiang-ocp.example.com                                                                                                              
  Warning  FailedCreatePodSandBox  9s    kubelet, app-node-0.wjiang-ocp.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Comment 20 weiwei jiang 2019-04-12 09:30:52 UTC

Checked and still same output as https://bugzilla.redhat.com/show_bug.cgi?id=1595256#c17

➜  ~ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-10-182914   True        False         128m    Cluster version is 4.0.0-0.nightly-2019-04-10-182914
➜  ~ oc describe pods -l run=h
Name:               h-1-qllgc
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               compute-0.gpei-vsp-0412.qe.devcluster.openshift.com/139.178.73.5
Start Time:         Fri, 12 Apr 2019 17:29:11 +0800
Labels:             deployment=h-1
                    deploymentconfig=h
                    run=h
Annotations:        openshift.io/deployment-config.latest-version: 1
                    openshift.io/deployment-config.name: h
                    openshift.io/deployment.name: h-1
Status:             Pending
IP:                 
Controlled By:      ReplicationController/h-1
Containers:
  h:
    Container ID:   
    Image:          openshift/hello-openshift:latest
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Mi
    Requests:
      cpu:        100m
      memory:     1Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b99kl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-b99kl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-b99kl
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age               From                                                          Message
  ----     ------                  ----              ----                                                          -------
  Normal   Scheduled               36s               default-scheduler                                             Successfully assigned default/h-1-qllgc to compute-0.gpei-vsp-0412.qe.devcluster.openshift.com
  Warning  FailedCreatePodSandBox  5s (x3 over 34s)  kubelet, compute-0.gpei-vsp-0412.qe.devcluster.openshift.com  Failed create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
➜  ~ oc version               
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"7ad663e77", GitTreeState:"", BuildDate:"2019-04-11T22:43:58Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+f174899", GitCommit:"f174899", GitTreeState:"clean", BuildDate:"2019-04-10T17:18:27Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Comment 21 Mrunal Patel 2019-04-15 05:05:29 UTC

Recent changes in runc have needed a bump in minimum required memory. We will update the code to account for that.

Comment 22 Urvashi Mohnani 2019-04-16 20:17:17 UTC

Being fixed in https://github.com/cri-o/cri-o/pull/2263

Comment 23 Urvashi Mohnani 2019-04-18 14:03:58 UTC

The fix was merged and a new cri-o has been built: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=882052

Comment 24 weiwei jiang 2019-04-22 04:57:40 UTC

Checked with 4.1.0-0.nightly-2019-04-22-005054, and the issue finally fixed, thanks.
# oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-22-005054   True        False         130m    Cluster version is 4.1.0-0.nightly-2019-04-22-005054

# oc get nodes -o wide 
NAME                                         STATUS   ROLES    AGE    VERSION             INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION         CONTAINER-RUNTIME
ip-10-0-135-161.eu-west-1.compute.internal   Ready    worker   139m   v1.13.4+da48e8391   10.0.135.161   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-10-0-140-63.eu-west-1.compute.internal    Ready    master   144m   v1.13.4+da48e8391   10.0.140.63    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-10-0-147-114.eu-west-1.compute.internal   Ready    worker   139m   v1.13.4+da48e8391   10.0.147.114   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-10-0-157-194.eu-west-1.compute.internal   Ready    master   144m   v1.13.4+da48e8391   10.0.157.194   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-10-0-163-90.eu-west-1.compute.internal    Ready    worker   139m   v1.13.4+da48e8391   10.0.163.90    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8
ip-10-0-167-208.eu-west-1.compute.internal   Ready    master   144m   v1.13.4+da48e8391   10.0.167.208   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190418.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.6-4.rhaos4.1.gita4b40b7.el8


# oc describe pods h-1-dn9jm
Name:               h-1-dn9jm
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-135-161.eu-west-1.compute.internal/10.0.135.161
Start Time:         Mon, 22 Apr 2019 00:55:33 -0400
Labels:             deployment=h-1
                    deploymentconfig=h
                    run=h
Annotations:        openshift.io/deployment-config.latest-version: 1
                    openshift.io/deployment-config.name: h
                    openshift.io/deployment.name: h-1
Status:             Pending
IP:                 
Controlled By:      ReplicationController/h-1
Containers:
  h:
    Container ID:   
    Image:          openshift/hello-openshift:latest
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Mi
    Requests:
      cpu:        100m
      memory:     1Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p8297 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-p8297:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p8297
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                                                 Message
  ----     ------                  ----       ----                                                 -------
  Normal   Scheduled               <invalid>  default-scheduler                                    Successfully assigned default/h-1-dn9jm to ip-10-0-135-161.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  <invalid>  kubelet, ip-10-0-135-161.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = pod set memory limit 1048576 too low; should be at least 4194304

Comment 26 errata-xmlrpc 2019-06-04 10:40:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Note You need to log in before you can comment on or make changes to this bug.