Description of problem:

This is a reliability test run that was started several days ago on OCP 4.4.0-0.nightly-2020-02-15-084805, with 3 worker and 3 master nodes in GCP, instance type n1-standard-4.

The SVT reliability test creates namespaces, deploys quickstart apps (cakephp-mysql-persistent, nodejs-mongo-persistent, django-psql-persistent, rails-pgsql-persistent, dancer-mysql-persistent), visits the apps, scales the apps up and down, and deletes namespaces periodically over several consecutive days. During this test, node CPU requests average about 40% and memory requests stay between 60-70%.
https://github.com/openshift/svt/tree/master/reliability

Builds are pruned daily.

After 9+ days, one of the worker nodes went NotReady and stayed in that state until it was rebooted:

NAME                                             STATUS     ROLES    AGE   VERSION
walid4-g9mp2-m-0.c.openshift-qe.internal         Ready      master   9d    v1.17.1
walid4-g9mp2-m-1.c.openshift-qe.internal         Ready      master   9d    v1.17.1
walid4-g9mp2-m-2.c.openshift-qe.internal         Ready      master   9d    v1.17.1
walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal   NotReady   worker   9d    v1.17.1
walid4-g9mp2-w-b-qwtn8.c.openshift-qe.internal   Ready      worker   9d    v1.17.1
walid4-g9mp2-w-c-rb22r.c.openshift-qe.internal   Ready      worker   9d    v1.17.1

After the reboot, the worker node returned to the Ready state.

The oc adm must-gather tarball was collected after the reboot, because the command did not complete while worker node "walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal" was NotReady.

Version-Release number of selected component (if applicable):

Server Version: 4.4.0-0.nightly-2020-02-15-084805
Kubernetes Version: v1.17.1

How reproducible:

Happened once after 9 days of continuous running.

Steps to Reproduce:
1. Run the SVT reliability test as described in:
https://github.com/openshift/svt/tree/master/reliability
2.
Sample config file:

tasks:
  minute:
  - action: check
    resource: pods
  - action: check
    resource: projects
  hour:
  - action: check
    resource: projects
  - action: visit
    resource: apps
    applyPercent: 100
  - action: create
    resource: projects
    quantity: 3
  - action: scaleUp
    resource: apps
    applyPercent: 50
  - action: scaleDown
    resource: apps
    applyPercent: 50
  - action: build
    resource: apps
    applyPercent: 33
  - action: modify
    resource: projects
    applyPercent: 25
  - action: clusteroperators
    resource: monitor
  week:
  - action: delete
    resource: projects
    applyPercent: 25
  - action: login
    resource: session
    user: testuser-47
    password:

3. Monitor the cluster via oc commands: oc get nodes, oc get pods -A | grep Error, etc.

Actual results:

One of the 3 worker nodes enters the NotReady state and does not recover for several days, until it is rebooted.

Expected results:

All nodes should remain in the Ready state during the test run.

Additional info:

A link to the must-gather logs will be provided in the next comment.
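The step-3 monitoring of node readiness can be sketched as a small shell helper. This is a minimal illustration, not part of the SVT tooling: the function name flag_not_ready is hypothetical, and it simply parses `oc get nodes --no-headers` style output, printing any node whose STATUS column is not exactly "Ready".

```shell
# Hypothetical helper for the step-3 node check: reads
# "NAME STATUS ROLES AGE VERSION" lines on stdin and prints the
# name of every node whose STATUS field is not exactly "Ready".
flag_not_ready() {
    awk '$2 != "Ready" { print $1 }'
}

# On a live cluster this would be:  oc get nodes --no-headers | flag_not_ready
# Here it is fed the listing captured in this report (header omitted):
flag_not_ready <<'EOF'
walid4-g9mp2-m-0.c.openshift-qe.internal        Ready     master   9d   v1.17.1
walid4-g9mp2-m-1.c.openshift-qe.internal        Ready     master   9d   v1.17.1
walid4-g9mp2-m-2.c.openshift-qe.internal        Ready     master   9d   v1.17.1
walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal  NotReady  worker   9d   v1.17.1
walid4-g9mp2-w-b-qwtn8.c.openshift-qe.internal  Ready     worker   9d   v1.17.1
walid4-g9mp2-w-c-rb22r.c.openshift-qe.internal  Ready     worker   9d   v1.17.1
EOF
# prints: walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal
```

Run periodically (e.g. from cron), a non-empty result from this check is the condition the reliability run is watching for.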
*** This bug has been marked as a duplicate of bug 1802687 ***