Created attachment 1655650 [details]
SIGSEGV : segmentation error

Description of problem:
A disk stress workload (file server workload) is deployed using the filebench tool with the parameters N_FILES=1000, MEAN_DIR_WIDTH=20, MEAN_FILE_SIZE=512k, N_THREADS=1000, IO_SIZE=1m, MEAN_APPEND_SIZE=16k, executing for an hour.
When a job is deployed using the above workload container with parallelism 50 on a particular worker node using "nodeSelector", the job is deployed and 50 corresponding pods are created. Immediately after this, the web console breaks down and the worker node goes to "Not Ready" state. Issuing the oc status command shows a segmentation error.

Version-Release number of selected component (if applicable):
Client Version: openshift-clients-4.2.2-201910250432-12-g72076900
Server Version: 4.2.12-s390x
Kubernetes Version: v1.14.6+32dc4a0

How reproducible:

Steps to Reproduce:
1. Deploy a job for the disk workload using the filebench tool with parallelism 50 on a particular worker node with the "nodeSelector" parameter
2. The OCP console breaks down and the worker node goes to "Not Ready" state
3. Issue the oc status command in the OC CLI; it gives a memory segmentation error

Actual results:
The worker node goes to "Not Ready" state and the console is unresponsive.

Expected results:
While deploying the job, the pods should move to "Unschedulable" state rather than the node breaking down.

Additional info:
Only the particular worker node goes to "Not Ready" state; the other worker and master nodes are alive.
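For reference, the Job used here is roughly of the following shape (a minimal sketch; the name and image are illustrative placeholders, and the way the filebench parameters are passed to the container is omitted):

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-stress                # illustrative name
spec:
  parallelism: 50                  # 50 pods created in parallel
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-0.s8343ocp.lnxne.boe   # pin every pod to one worker
      containers:
      - name: filebench
        image: <filebench-workload-image>    # placeholder for the workload container image
      restartPolicy: Never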
Can you run `oc describe` for the node, the workload pods, and the console pods? I believe the console runs at least two replicas, so the good worker should still be able to serve traffic. Since the ingress routers run on the workers by default, are you sure that you are connecting to the good worker? Normally a load balancer would help you out here, but I'm not sure what your infrastructure looks like. The panic you are seeing with oc is definitely unexpected. Would you mind opening another bug specifically for that? Lastly, I see that you have a bootstrap node in your cluster. That shouldn't be there. The bootstrap machine is never a part of the cluster, so something isn't right with this setup. How was this cluster created?
The logs of the following commands are attached: oc status, oc get nodes, oc describe node worker-0.s8343ocp.lnxne.boe, oc describe pods disk-stress-**, oc describe pods console-**.

While repeating this exercise, it was observed that the console was still responsive even though the worker node was not ready, and the worker node became ready again after some time.

There is already a bug raised for the runtime panic: https://bugzilla.redhat.com/show_bug.cgi?id=1795177
Created attachment 1657332 [details] oc get nodes
Created attachment 1657333 [details] oc status
Created attachment 1657334 [details] oc describe node
Created attachment 1657335 [details] oc describe console pod
Created attachment 1657336 [details] oc describe console pod-2
Created attachment 1657337 [details] oc describe disk stress
Regarding the nodes in the cluster:

[root@s8343001 ~]# oc get nodes
NAME                             STATUS   ROLES    AGE     VERSION
bootstrap-0.s8343ocp.lnxne.boe   Ready    worker   3d22h   v1.14.6+97c81d00e
master-0.s8343ocp.lnxne.boe      Ready    master   4d18h   v1.14.6+97c81d00e
master-1.s8343ocp.lnxne.boe      Ready    master   4d18h   v1.14.6+97c81d00e
master-2.s8343ocp.lnxne.boe      Ready    master   4d18h   v1.14.6+97c81d00e
worker-0.s8343ocp.lnxne.boe      Ready    worker   4d18h   v1.14.6+97c81d00e
worker-1.s8343ocp.lnxne.boe      Ready    worker   4d18h   v1.14.6+97c81d00e

The bootstrap node was converted to a worker node after the installation; only the DNS hostname remains bootstrap-0.s8343ocp.lnxne.boe. The OCP cluster was created using the ansible playbooks.
I would like to update that the same behaviour is noticed on OCP clusters on x86 hardware.

oc status displays:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25c9138]

When the job (definition is attached) is deployed with parallelism 50, one or two of the pods created by the job run; the other pods move to the unschedulable state and never recover after that. The node worker 1 (worker1.ocp4.openshift.de.ibm.com) was stressed during this exercise; the logs of oc describe node and oc describe pod are in the attached files prefixed with x86_. A notable point is that the events on the node report "System OOM encountered", after which nothing happens in the job. At this point, or just before it, oc status displays the "panic: runtime error" message. However, neither CPU nor memory utilization on the node reaches its maximum; the metrics are shown in the attached snapshot of the Grafana UI.

Another observation: when the resource definition for the containers in the job is removed, i.e. the following:

resources:
  requests:
    cpu: 500m
    memory: 11700Ki

and the same test case is executed, the worker node goes to "Not Ready" state, comes back after some time, and terminates all the pods without running them. oc status and oc describe node/pod show similar messages at this point. However, the OCP console remained responsive throughout this exercise.

System specification:

Master nodes (common for all masters):
status:
  capacity:
    cpu: '4'
    hugepages-2Mi: '0'
    memory: 16421232Ki
    pods: '250'
  allocatable:
    cpu: 3500m
    hugepages-2Mi: '0'
    memory: 15806832Ki
    pods: '250'

Worker nodes (common for all workers):
status:
  capacity:
    cpu: '4'
    hugepages-2Mi: '0'
    memory: 16421264Ki
    pods: '250'
  allocatable:
    cpu: 3500m
    hugepages-2Mi: '0'
    memory: 15806864Ki
    pods: '250'

Storage specification:
No external storage was used.
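For clarity, the resource section mentioned above sits in the pod template of the job, roughly like this (a sketch; the exact definition is in the attached x86_job.yaml):

      containers:
      - name: filebench
        image: <filebench-workload-image>   # placeholder
        resources:
          requests:
            cpu: 500m          # x 50 pods = 25 CPUs of requests, well above the 3500m
            memory: 11700Ki    # allocatable on the selected worker, so most pods stay Pending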
Created attachment 1657595 [details] x86_job.yaml
Created attachment 1657596 [details] x86_CPU, Memory Utilization - grafana UI
Created attachment 1657597 [details] x86_oc adm must-gather
Created attachment 1657598 [details] x86_oc describe node
Created attachment 1657599 [details] x86_oc describe pod
Created attachment 1657600 [details] x86_oc status and oc get nodes
I was mistaken, the console pods run on the masters. If you are having trouble with connectivity, it's probably due to the routers, which _do_ run on the workers. Your confirmation that the console remained responsive the second time you ran the test supports this hypothesis.

Looking at the job definition you provided, I am confused as to what is actually being tested here. I don't know what that tool is doing, but without any volume mounts, it isn't going to be stressing the disk itself. My guess is that it's stressing the tmpfs that backs the container's filesystem (which would just result in memory pressure).
In this exercise, we are interested in driving the disk IO utilisation metric to its maximum on a particular node (worker1.ocp4.openshift.de.ibm.com) while keeping its memory and CPU usage low, and observing how the cluster behaves. The job runs filebench - a file system and storage benchmark tool (https://github.com/filebench/filebench/) - as a container and accepts parameters such as workload_type = fileserver, N_FILES, MEAN_DIR_WIDTH, MEAN_FILE_SIZE, N_THREADS, MEAN_APPEND_SIZE, which are required to run the fileserver workload.

Regarding the volume mounts you mentioned: the workload increases disk space usage, but it does not drive memory to its maximum; the memory utilisation of the node is still at 43% as seen from the Grafana console. Also, the node does not report any memory pressure, which is visible from the logs of oc describe node.
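Schematically, the parameters are handed to the container along these lines (an illustrative env-style sketch only; the exact field names and wiring are in the attached x86_job.yaml):

        env:
        - name: WORKLOAD_TYPE      # variable names here are illustrative
          value: "fileserver"
        - name: N_FILES
          value: "1000"
        - name: MEAN_DIR_WIDTH
          value: "20"
        - name: MEAN_FILE_SIZE
          value: "512k"
        - name: N_THREADS
          value: "1000"
        - name: MEAN_APPEND_SIZE
          value: "16k"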
Hi Lakshmi, on every Linux system there is tmpfs storage, which uses main memory. It looks like the containers use tmpfs by default when data is written to the container's filesystem. Therefore we would need to check whether we are using tmpfs and thereby generating memory pressure.
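One way to check would be to mount a disk-backed emptyDir volume at the directory the workload writes to, so the data lands on the node's disk instead of a memory-backed filesystem - a sketch, assuming the workload writes under /data (the real path depends on the workload image):

      containers:
      - name: filebench
        image: <filebench-workload-image>   # placeholder
        volumeMounts:
        - name: scratch
          mountPath: /data                  # assumed write location of the workload
      volumes:
      - name: scratch
        emptyDir: {}                        # default medium is node-local disk; 'medium: Memory' would make it tmpfs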
I am attaching the Dockerfile and the files referenced by the Dockerfile to create the disk stress workload above (on x86):
1. Dockerfile
2. setup.sh
3. workload_fileserver.f
Created attachment 1661626 [details] Dockerfile
Created attachment 1661627 [details] setup.sh
Created attachment 1661628 [details] workload_fileserver.f
Hi Lakshmi,

What does the infrastructure look like for these tests? Is there shared infrastructure for disks/memory/networking, etc.? Is it a virtualized configuration, or are these all separate hosts with minimal shared infrastructure (for both the x86_64 and s390x tests that you ran)?
Infrastructure on ZVM:
The master and worker nodes are ZVM guests.

CPU and Memory Configuration:
Master nodes: CPU: 4, Memory: 16 GB
Worker nodes: CPU: 2, Memory: 8 GB
CPU and memory are virtualised.

Disk Configuration:
master-0  FCP
master-1  DASD
master-2  DASD
worker-0  FCP
worker-1  DASD
worker-2  DASD
The disks are dedicated; in particular, no ZVM mini-disks are used.

Network Configuration:
OSA cards (VSwitch)

-----------------------------------------------------------------------

Infrastructure on x86:
The master and worker nodes are VM guests on Intel hardware.

CPU and Memory Configuration:
Master nodes: CPU: 4, Memory: 16 GB
Worker nodes: CPU: 4, Memory: 16 GB
CPU and memory are virtualized.

Disk Configuration:
Both master and worker nodes: the disks are SAN (Storage Area Network) disks and are typically shared - this is the default/standard setup for VMware (ESX).

Network Configuration:
The network is configured on a distributed switch, and internet access is provided by a dedicated gateway (which is invisible to OpenShift).

The cluster configuration is derived from the document 'OpenShift 4.2 vSphere Install Quickstart' (https://blog.openshift.com/openshift-4-2-vsphere-install-quickstart/). The vCenter and the ESX hypervisors are on version 6.7.
As far as I can tell, this still isn't actually testing against the disk. You need to mount a location from the underlying filesystem into your pod and then point filebench at that directory.
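For example, something along these lines (a sketch; the paths are placeholders, and the pod's SCC must permit hostPath volumes):

        volumeMounts:
        - name: node-disk
          mountPath: /fbtest            # point filebench's target directory here
      volumes:
      - name: node-disk
        hostPath:
          path: /var/tmp/fbtest         # a directory on the worker's own filesystem
          type: DirectoryOrCreate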
I repeated the disk stress test with persistent volumes backed by an NFS server. Initially the memory consumption differed from the earlier cases, but it increased gradually over time. The disk IO utilisation metrics were as expected at this point. While the test was ongoing, the memory consumption on the node approached 98% and there was no progress in the status of the pods. At this point, "oc get nodes" showed that the worker node was in "Not Ready" state and "oc describe node" displayed the below:

Type     Reason             Age                  From                                    Message
----     ------             ----                 ----                                    -------
Warning  SystemOOM          15s (x14 over 6d7h)  kubelet, worker-10.ocp3lp50w.lnxne.boe  System OOM encountered
Warning  ContainerGCFailed  15s                  kubelet, worker-10.ocp3lp50w.lnxne.boe  rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal   NodeNotReady       15s                  kubelet, worker-10.ocp3lp50w.lnxne.boe  Node worker-10.ocp3lp50w.lnxne.boe status is now: NodeNotReady

So my notion is that the node is unable to handle the given memory pressure (approaching 98% memory utilization) irrespective of the kind of stress; it encounters an OOM, after which the node is unable to recover its services.

While repeating this exercise several times over the past days on Z and x86, in some cases defining the resource requests for the container in the job prevented the pods from being scheduled and thereby kept the node from reaching the OOM state. But this behaviour was not consistent on x86: in spite of defining the resource requests in the job, the node went to "Not Ready" state and recovered afterwards, but the pods were still unsuccessful, rather than being made unschedulable in the first place.
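Perhaps adding a memory limit alongside the request would confine the OOM to the container's cgroup (the container would be OOM-killed) rather than letting the whole node hit a system-level OOM; a sketch of what that could look like, with illustrative values:

        resources:
          requests:
            cpu: 500m
            memory: 512Mi        # illustrative values, not the ones used in the tests
          limits:
            memory: 1Gi          # the container is killed at this point instead of exhausting the node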
Ryan, could you help us narrow down the failure here?
Duplicate of 1810136 which is getting backported. *** This bug has been marked as a duplicate of bug 1810136 ***