Created attachment 1655647 [details]
SIGSEGV: segmentation error

Description of problem:
The "oc status" command of the OC CLI displays "panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 ...]".

Version-Release number of selected component (if applicable):
Client Version: openshift-clients-4.2.2-201910250432-12-g72076900
Server Version: 4.2.12-s390x
Kubernetes Version: v1.14.6+32dc4a0

How reproducible:
Introduce 100% disk stress on one of the worker nodes in the OCP cluster using the filebench command. The OCP console stops responding and the worker node goes down. Opening the OC CLI and executing "oc status" then displays "panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 ...]" rather than a proper error message.

Steps to Reproduce:
1. Introduce a 100% disk utilization workload on one of the worker nodes.
2. Observe that the worker node goes to the "Not ready" state and the OCP console stops responding.
3. Log in to the bastion and run "oc status"; it fails with the SIGSEGV error.

Actual results:
The "oc status" command panics with a segmentation violation.

Expected results:
"oc status" should display a proper error message for this scenario.

Additional info:
Only the worker node under stress went to the "Not ready" state; the other master and worker nodes in the cluster remained "Ready".
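For illustration, the class of bug reported here is a nil dereference in Go when the API server is unreachable. This is a minimal sketch, not the actual oc code path: the `clusterStatus` type and `describe` function are hypothetical, and they only show the pattern of guarding a possibly-nil response and returning an error instead of panicking, which is the behaviour the Expected results ask for.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// clusterStatus is a hypothetical stand-in for a server response that a
// status command dereferences; nil models an unreachable/unresponsive node.
type clusterStatus struct {
	message string
}

// describe checks for nil before dereferencing, so an unreachable cluster
// yields a readable error rather than a SIGSEGV panic.
func describe(status *clusterStatus) (string, error) {
	if status == nil {
		return "", errors.New("unable to retrieve cluster status: server did not respond")
	}
	return status.message, nil
}

func main() {
	// Simulate the failure scenario from the bug: no status available.
	if _, err := describe(nil); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
}
```

Without the nil check, `status.message` on a nil pointer would reproduce the "invalid memory address or nil pointer dereference" panic seen in the report.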
Can you provide more specific information about "Introduce 100% disk utilization workload on one of the worker nodes"? filebench does not appear to be included as part of RHCOS. Also, could you please provide the exact command used to run filebench, including arguments?
During the bugzappers call, it was decided that this bug will be followed up together with bug 1795185 (https://bugzilla.redhat.com/show_bug.cgi?id=1795185).
A possible fix for this landed in the latest 4.2 nightly. Can you re-test this and see if it can be reproduced?
The scenario was tested on OCP with:

Client Version: 4.4.0-0.nightly-s390x-2020-06-17-185805
Server Version: 4.4.0-0.nightly-s390x-2020-06-17-185805
Kubernetes Version: v1.17.1+912792b

The reported behaviour was not reproducible, so the fix appears to have landed. Can someone please help me understand whether the scenario still has to be tested on the latest 4.2 nightly as well? I assume that the versions of CRI-O and the other components carrying the OOM fixes in 4.4.0-0.nightly-s390x-2020-06-17-185805 are recent enough; please correct me if I am wrong.
Tested the bug scenario on OCP 4.2.34 and the reported behaviour is not observed.

oc version
-----------
Client Version: 4.4.0-0.nightly-s390x-2020-06-12-154108
Server Version: 4.2.34
Kubernetes Version: v1.14.6+20b13ba