Bug 1795185 - OCP cluster worker node goes down on deploying a job with disk stress workload with parallelism as 50
Summary: OCP cluster worker node goes down on deploying a job with disk stress workload with parallelism as 50
Keywords:
Status: CLOSED DUPLICATE of bug 1810136
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.z
Hardware: s390x
OS: Unspecified
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-27 11:52 UTC by Lakshmi Ravichandran
Modified: 2020-07-02 06:11 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-04 19:47:02 UTC
Target Upstream Version:
Embargoed:


Attachments
SIGSEGV : segmentation error (535.21 KB, image/png)
2020-01-27 11:52 UTC, Lakshmi Ravichandran
oc get nodes (28.79 KB, image/png)
2020-02-03 12:29 UTC, Lakshmi Ravichandran
oc status (123.25 KB, image/png)
2020-02-03 12:29 UTC, Lakshmi Ravichandran
oc describe node (12.71 KB, text/plain)
2020-02-03 12:30 UTC, Lakshmi Ravichandran
oc describe console pod (4.42 KB, text/plain)
2020-02-03 12:31 UTC, Lakshmi Ravichandran
oc describe console pod-2 (4.42 KB, text/plain)
2020-02-03 12:31 UTC, Lakshmi Ravichandran
oc describe disk stress (2.21 KB, text/plain)
2020-02-03 12:32 UTC, Lakshmi Ravichandran
x86_job.yaml (982 bytes, text/plain)
2020-02-04 15:45 UTC, Lakshmi Ravichandran
x86_CPU, Memory Utilization - grafana UI (272.70 KB, image/png)
2020-02-04 15:46 UTC, Lakshmi Ravichandran
x86_oc adm must-gather (575 bytes, text/plain)
2020-02-04 15:46 UTC, Lakshmi Ravichandran
x86_oc describe node (6.97 KB, text/plain)
2020-02-04 15:47 UTC, Lakshmi Ravichandran
x86_oc describe pod (8.16 KB, text/plain)
2020-02-04 15:47 UTC, Lakshmi Ravichandran
x86_oc status and oc get nodes (2.92 KB, text/plain)
2020-02-04 15:48 UTC, Lakshmi Ravichandran
Dockerfile (423 bytes, text/plain)
2020-02-07 09:20 UTC, Lakshmi Ravichandran
setup.sh (1.29 KB, application/x-shellscript)
2020-02-07 09:20 UTC, Lakshmi Ravichandran
workload_fileserver.f (1.37 KB, text/plain)
2020-02-07 09:21 UTC, Lakshmi Ravichandran

Description Lakshmi Ravichandran 2020-01-27 11:52:07 UTC
Created attachment 1655650 [details]
SIGSEGV : segmentation error

Description of problem:
When deploying a disk stress (fileserver) workload using the filebench tool, executed for one hour with the following parameters:
N_FILES=1000
MEAN_DIR_WIDTH=20
MEAN_FILE_SIZE=512k
N_THREADS=1000
IO_SIZE=1m
MEAN_APPEND_SIZE=16k

When a job is deployed using the above workload container with parallelism of 50 on a particular worker node (selected via "nodeSelector"), the job is deployed and the 50 corresponding pods are created, but immediately afterwards the web console breaks down and the worker node goes to the "Not Ready" state.
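For reference, a minimal sketch of this kind of Job; the image name below is a placeholder and the actual definition is in the attached x86_job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-stress
spec:
  parallelism: 50
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-0.s8343ocp.lnxne.boe
      containers:
      - name: disk-stress
        image: filebench-fileserver:latest   # placeholder image name
      restartPolicy: Never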

Issuing the oc status command shows a segmentation error.



Version-Release number of selected component (if applicable):
Client Version: openshift-clients-4.2.2-201910250432-12-g72076900
Server Version: 4.2.12-s390x
Kubernetes Version: v1.14.6+32dc4a0


How reproducible:


Steps to Reproduce:
1. Deploy a job for a disk workload using the filebench tool with parallelism 50, targeted at a particular worker node via the "nodeSelector" parameter
2. The OCP console breaks down and the worker node goes to the "Not Ready" state
3. Issue the oc status command in the oc CLI; it gives a memory segmentation error

Actual results:
The worker node goes to the "Not Ready" state and the console is unresponsive


Expected results:
While deploying the job, the excess pods should move to the "Unschedulable" state rather than the node breaking down


Additional info:
Only the particular worker node goes to the "Not Ready" state; the other worker and master nodes remain alive

Comment 1 Alex Crawford 2020-01-31 00:18:51 UTC
Can you run `oc describe` for the node, the workload pods, and the console pods? I believe the console runs at least two replicas, so the good worker should still be able to serve traffic. Since the ingress routers run on the workers by default, are you sure that you are connecting to the good worker? Normally a load balancer would help you out here, but I'm not sure what your infrastructure looks like.

The panic you are seeing with oc is definitely unexpected. Would you mind opening another bug specifically for that?

Lastly, I see that you have a bootstrap node in your cluster. That shouldn't be there. The bootstrap machine is never a part of the cluster, so something isn't right with this setup. How was this cluster created?

Comment 3 Lakshmi Ravichandran 2020-02-03 12:26:21 UTC
The logs of the following commands are attached:
oc status, oc get nodes, oc describe node worker-0.s8343ocp.lnxne.boe, oc describe pods disk-stress-**, oc describe pods console-**.

While repeating this exercise, it was observed that the console was still responsive even though the worker node was not ready, and the worker node became ready again after some time.

There is already a bug raised for the runtime panic: https://bugzilla.redhat.com/show_bug.cgi?id=1795177

Comment 4 Lakshmi Ravichandran 2020-02-03 12:29:01 UTC
Created attachment 1657332 [details]
oc get nodes

Comment 5 Lakshmi Ravichandran 2020-02-03 12:29:31 UTC
Created attachment 1657333 [details]
oc status

Comment 6 Lakshmi Ravichandran 2020-02-03 12:30:45 UTC
Created attachment 1657334 [details]
oc describe node

Comment 7 Lakshmi Ravichandran 2020-02-03 12:31:11 UTC
Created attachment 1657335 [details]
oc describe console pod

Comment 8 Lakshmi Ravichandran 2020-02-03 12:31:47 UTC
Created attachment 1657336 [details]
oc describe console pod-2

Comment 9 Lakshmi Ravichandran 2020-02-03 12:32:15 UTC
Created attachment 1657337 [details]
oc describe disk stress

Comment 10 Lakshmi Ravichandran 2020-02-03 12:45:50 UTC
Regarding the nodes in the cluster,
[root@s8343001 ~]# oc get nodes
NAME                             STATUS   ROLES    AGE     VERSION
bootstrap-0.s8343ocp.lnxne.boe   Ready    worker   3d22h   v1.14.6+97c81d00e
master-0.s8343ocp.lnxne.boe      Ready    master   4d18h   v1.14.6+97c81d00e
master-1.s8343ocp.lnxne.boe      Ready    master   4d18h   v1.14.6+97c81d00e
master-2.s8343ocp.lnxne.boe      Ready    master   4d18h   v1.14.6+97c81d00e
worker-0.s8343ocp.lnxne.boe      Ready    worker   4d18h   v1.14.6+97c81d00e
worker-1.s8343ocp.lnxne.boe      Ready    worker   4d18h   v1.14.6+97c81d00e


The bootstrap node was converted to a worker node after the installation; only the DNS hostname remains as bootstrap-0.s8343ocp.lnxne.boe.
The OCP cluster was created using the Ansible playbooks.

Comment 11 Lakshmi Ravichandran 2020-02-04 15:44:09 UTC
I would like to add that the same behaviour is observed on OCP clusters on x86 hardware.


oc status displays 
“panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25c9138]”

When the job (definition attached) is deployed with parallelism 50, one or two of the pods created by the job run; the other pods move to the unscheduled state and never recover after that.
The worker-1 node (worker1.ocp4.openshift.de.ibm.com) was stressed during this exercise; the logs of oc describe node and oc describe pod are in the attached files starting with the prefix x86_.

A notable point is that the node events report "System OOM encountered", after which nothing further happens in the job.
At that point, or just before it, oc status displays the message "panic: runtime error".
However, neither the CPU utilization nor the memory utilization of the node reaches its maximum; the metrics are shown in the attached snapshot of the Grafana UI.

Another observation: when the resource definition for the containers in the job is removed, namely:

resources:
  requests:
    cpu: 500m
    memory: 11700Ki

and the same test case is executed, it is observed that the worker node goes to the "Not Ready" state, comes back after some time, and terminates all the pods without running them.
At this point, oc status, oc describe node, and oc describe pod show similar messages.

However, the OCP console remained responsive throughout this exercise.
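For illustration only (the limit values below are not from the attached job definition): the same requests stanza with a memory limit added, so that a runaway container would be OOM-killed at its own cgroup boundary instead of pushing the whole node into memory pressure:

resources:
  requests:
    cpu: 500m
    memory: 11700Ki
  limits:
    cpu: "1"        # illustrative limit
    memory: 256Mi   # illustrative limit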


System specification:
Master nodes : (common for all masters)
status:
  capacity:
    cpu: '4'
    hugepages-2Mi: '0'
    memory: 16421232Ki
    pods: '250'
  allocatable:
    cpu: 3500m
    hugepages-2Mi: '0'
    memory: 15806832Ki
    pods: '250'

Worker nodes: (common for all workers)
status:
  capacity:
    cpu: '4'
    hugepages-2Mi: '0'
    memory: 16421264Ki
    pods: '250'
  allocatable:
    cpu: 3500m
    hugepages-2Mi: '0'
    memory: 15806864Ki
    pods: '250'


Storage specification:
No external storage was used

Comment 12 Lakshmi Ravichandran 2020-02-04 15:45:57 UTC
Created attachment 1657595 [details]
x86_job.yaml

Comment 13 Lakshmi Ravichandran 2020-02-04 15:46:21 UTC
Created attachment 1657596 [details]
x86_CPU, Memory Utilization - grafana UI

Comment 14 Lakshmi Ravichandran 2020-02-04 15:46:43 UTC
Created attachment 1657597 [details]
x86_oc adm must-gather

Comment 15 Lakshmi Ravichandran 2020-02-04 15:47:07 UTC
Created attachment 1657598 [details]
x86_oc describe node

Comment 16 Lakshmi Ravichandran 2020-02-04 15:47:38 UTC
Created attachment 1657599 [details]
x86_oc describe pod

Comment 17 Lakshmi Ravichandran 2020-02-04 15:48:11 UTC
Created attachment 1657600 [details]
x86_oc status and oc get nodes

Comment 18 Alex Crawford 2020-02-04 20:28:22 UTC
I was mistaken; the console pods run on the masters. If you are having trouble with connectivity, it's probably due to the routers, which _do_ run on the workers. Your confirmation that the console remained responsive the second time you ran the test supports this hypothesis.

Looking at the job definition you provided, I am confused as to what is actually being tested here. I don't know what that tool is doing, but without any volume mounts, it isn't going to be stressing the disk itself. My guess is that it's stressing the tmpfs that backs the container's filesystem (which would just result in memory pressure).

Comment 19 Lakshmi Ravichandran 2020-02-05 09:13:21 UTC
In this exercise, we are interested in bringing the disk IO utilisation metric to its maximum on a particular node (worker1.ocp4.openshift.de.ibm.com) while keeping its memory and CPU usage low, and observing how the cluster behaves.

The job uses filebench, a file system and storage benchmark tool (https://github.com/filebench/filebench/), packaged as a container, and accepts parameters such as workload_type=fileserver, N_FILES, MEAN_DIR_WIDTH, MEAN_FILE_SIZE, N_THREADS, and MEAN_APPEND_SIZE, which are required to run the fileserver workload.

Regarding the volume mounts you mentioned: the workload increases the disk space usage but does not drive memory to its maximum; the memory utilisation of the node is still at 43% as seen from the Grafana console.
Also, the node does not report any memory pressure, as can be seen in the oc describe node logs.
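For reference, a sketch of how these parameters could be passed to the container; the actual mechanism is defined in the attached Dockerfile and setup.sh, so the environment-variable form below is an assumption:

containers:
- name: disk-stress
  image: filebench-fileserver:latest   # placeholder image name
  env:
  - name: WORKLOAD_TYPE
    value: "fileserver"
  - name: N_FILES
    value: "1000"
  - name: MEAN_DIR_WIDTH
    value: "20"
  - name: MEAN_FILE_SIZE
    value: "512k"
  - name: N_THREADS
    value: "1000"
  - name: IO_SIZE
    value: "1m"
  - name: MEAN_APPEND_SIZE
    value: "16k"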

Comment 20 Holger Wolf 2020-02-05 13:46:29 UTC
Hi Lakshmi, 

On every Linux system there is a tmpfs storage which uses main memory. It looks like the containers are using tmpfs by default when data is written to the container's filesystem. Therefore, we would need to check whether tmpfs is being used and is thereby generating memory pressure.
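A minimal pod-spec sketch for making that check explicit: an emptyDir volume is backed by node-local disk by default and only becomes a tmpfs (RAM-backed) volume when medium: Memory is set, so mounting one of these explicitly removes the ambiguity about where the writes land:

volumes:
- name: scratch-disk
  emptyDir: {}            # backed by node-local disk
- name: scratch-tmpfs
  emptyDir:
    medium: Memory        # backed by RAM (tmpfs)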

Comment 21 Lakshmi Ravichandran 2020-02-07 09:19:08 UTC
I am attaching the Dockerfile and the files referenced by it, which were used to create the disk stress workload above (on x86):

1. Dockerfile
2. setup.sh
3. workload_fileserver.f

Comment 23 Lakshmi Ravichandran 2020-02-07 09:20:28 UTC
Created attachment 1661626 [details]
Dockerfile

Comment 24 Lakshmi Ravichandran 2020-02-07 09:20:55 UTC
Created attachment 1661627 [details]
setup.sh

Comment 25 Lakshmi Ravichandran 2020-02-07 09:21:20 UTC
Created attachment 1661628 [details]
workload_fileserver.f

Comment 26 Andy McCrae 2020-02-07 12:24:38 UTC
Hi Lakshmi,

What does the infrastructure look like for these tests? Is there shared infrastructure on disks/memory/networking etc - is it a virtualized configuration or are these all separate hosts with minimal shared infrastructure (both for x86_64 and s390x tests that you ran)?

Comment 27 Lakshmi Ravichandran 2020-02-07 16:11:34 UTC
Infrastructure on z/VM:

The master and worker nodes are z/VM guests.

CPU and Memory Configuration:
Master nodes : 
CPU: 4, Memory: 16 GB

Worker nodes:
CPU: 2, Memory: 8 GB

CPU and memory are virtualised

Disk Configuration:
master-0    FCP
master-1    DASD
master-2    DASD
worker-0    FCP
worker-1    DASD
worker-2    DASD

The disks are dedicated; in particular, no z/VM minidisks are used.

Network Configuration:
OSA cards (VSwitch)

-----------------------------------------------------------------------

Infrastructure on X86:

The master and worker nodes are VM guests on Intel hardware.

CPU and Memory Configuration:
Master nodes : 
CPU: 4, Memory: 16 GB

Worker nodes:
CPU: 4, Memory: 16 GB

CPU and memory are virtualized

Disk Configuration:
Both master and worker nodes:
The disks are SAN (Storage Area Network) disks and are typically shared; this is the default/standard setup for VMware (ESX).

Network Configuration:
The network is configured on a distributed switch and internet access is provided by a dedicated gateway (which is invisible for OpenShift).

The cluster configuration is derived from the document 'OpenShift 4.2 vSphere Install Quickstart' (https://blog.openshift.com/openshift-4-2-vsphere-install-quickstart/)
The vCenter and the ESX hypervisor are on version 6.7

Comment 28 Alex Crawford 2020-02-07 18:45:12 UTC
As far as I can tell, this still isn't actually testing against the disk. You need to mount a location from the underlying filesystem into your pod and then point filebench at that directory.
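A sketch of what that could look like in the pod template (names are illustrative; a hostPath or a pre-created PersistentVolumeClaim would both work, with filebench's target directory set to the mount path):

containers:
- name: disk-stress
  image: filebench-fileserver:latest   # placeholder image name
  volumeMounts:
  - name: stress-data
    mountPath: /data                   # point filebench at this path
volumes:
- name: stress-data
  persistentVolumeClaim:
    claimName: disk-stress-pvc         # illustrative, e.g. NFS-backed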

Comment 29 Lakshmi Ravichandran 2020-02-12 10:30:54 UTC
I repeated the disk stress test with persistent volumes backed by an NFS server. Initially the memory consumption differed from the earlier cases, but it increased gradually over time. The disk IO utilisation metrics were as expected at this point.

While the test was ongoing, I could observe that the memory consumption on the node reached about 98% and there was no progress in the status of the pods.

At this point, executing "oc get nodes" shows that the worker node is in the "Not Ready" state, and "oc describe node" displays the following:

  Type     Reason                   Age                  From                                    Message
  ----     ------                   ----                 ----                                    -------
  Warning  SystemOOM                15s (x14 over 6d7h)  kubelet, worker-10.ocp3lp50w.lnxne.boe  System OOM encountered
  Warning  ContainerGCFailed        15s                  kubelet, worker-10.ocp3lp50w.lnxne.boe  rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   NodeNotReady             15s                  kubelet, worker-10.ocp3lp50w.lnxne.boe  Node worker-10.ocp3lp50w.lnxne.boe status is now: NodeNotReady

So my notion is that the node is unable to handle the given memory pressure (approaching 98% memory utilization), irrespective of the kind of stress; it encounters an OOM, after which it is unable to recover its services.

While repeating this exercise several times over the past days on Z and x86, in some cases defining the resource requests for the container in the job prevented the pods from being scheduled and thereby prevented the node from encountering the OOM state. But this behaviour was not consistent on x86: in spite of defining the resource requests in the job, the node went down to the Not Ready state and recovered afterwards, but the pods were unsuccessful as well, rather than being made unschedulable in the first place.
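One possible node-side mitigation, assuming the KubeletConfig CR is available on this release (values are illustrative, not a recommendation from this bug): reserve memory for system daemons and set a hard eviction threshold so the kubelet evicts pods before the node itself hits OOM. The worker MachineConfigPool must carry the matching label:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-memory-protection
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: disk-stress      # label added to the worker MCP
  kubeletConfig:
    systemReserved:
      memory: 1Gi                      # illustrative reservation
    evictionHard:
      memory.available: "500Mi"        # evict before the node OOMs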

Comment 30 Alex Crawford 2020-02-26 21:35:32 UTC
Ryan, could you help us narrow down the failure here?

Comment 31 Ryan Phillips 2020-03-04 19:47:02 UTC
Duplicate of bug 1810136, which is getting backported.

*** This bug has been marked as a duplicate of bug 1810136 ***

