Description of problem:

This is a reliability test run that was started several days ago on OCP 4.4.0-0.nightly-2020-02-15-084805, with 3 worker and 3 master nodes in GCP, instance type n1-standard-4.

The SVT reliability test creates namespaces, deploys quickstart apps (cakephp-mysql-persistent, nodejs-mongo-persistent, django-psql-persistent, rails-pgsql-persistent, dancer-mysql-persistent), visits the apps, scales the apps up and down, and deletes namespaces periodically over several consecutive days. During this test, node CPU requests average about 40% and memory requests stay between 60-70%.
https://github.com/openshift/svt/tree/master/reliability

Builds are pruned daily.

After 9+ days, one of the worker nodes went NotReady and stayed in that state until it was rebooted:

NAME                                             STATUS     ROLES    AGE   VERSION
walid4-g9mp2-m-0.c.openshift-qe.internal         Ready      master   9d    v1.17.1
walid4-g9mp2-m-1.c.openshift-qe.internal         Ready      master   9d    v1.17.1
walid4-g9mp2-m-2.c.openshift-qe.internal         Ready      master   9d    v1.17.1
walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal   NotReady   worker   9d    v1.17.1
walid4-g9mp2-w-b-qwtn8.c.openshift-qe.internal   Ready      worker   9d    v1.17.1
walid4-g9mp2-w-c-rb22r.c.openshift-qe.internal   Ready      worker   9d    v1.17.1

After the reboot, the worker node returned to the Ready state.

The oc adm must-gather tarball was collected after the reboot, because the command did not complete while worker node "walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal" was NotReady.

Version-Release number of selected component (if applicable):

Server Version: 4.4.0-0.nightly-2020-02-15-084805
Kubernetes Version: v1.17.1

How reproducible:

Happened once after 9 days of continuous running.

Steps to Reproduce:
1. Run the SVT reliability test as described in:
https://github.com/openshift/svt/tree/master/reliability
2.
Sample config file:

tasks:
  minute:
  - action: check
    resource: pods
  - action: check
    resource: projects
  hour:
  - action: check
    resource: projects
  - action: visit
    resource: apps
    applyPercent: 100
  - action: create
    resource: projects
    quantity: 3
  - action: scaleUp
    resource: apps
    applyPercent: 50
  - action: scaleDown
    resource: apps
    applyPercent: 50
  - action: build
    resource: apps
    applyPercent: 33
  - action: modify
    resource: projects
    applyPercent: 25
  - action: clusteroperators
    resource: monitor
  week:
  - action: delete
    resource: projects
    applyPercent: 25
  - action: login
    resource: session
    user: testuser-47
    password:

3. Monitor the cluster via oc commands: oc get nodes, oc get pods -A | grep Error, etc.

Actual results:

One of the 3 worker nodes enters the NotReady state and does not recover for several days, until it is rebooted.

Expected results:

All nodes should remain in the Ready state during the test run.

Additional info:

A link to the must-gather logs will be provided in the next comment.
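The step-3 monitoring of node readiness can be sketched as a small shell helper. This is a minimal illustration, not part of the SVT tooling: the function name flag_not_ready is hypothetical, and it simply parses `oc get nodes --no-headers` style output, printing any node whose STATUS column is not exactly "Ready".

```shell
# Hypothetical helper for the step-3 node check: reads
# "NAME STATUS ROLES AGE VERSION" lines on stdin and prints the
# name of every node whose STATUS field is not exactly "Ready".
flag_not_ready() {
    awk '$2 != "Ready" { print $1 }'
}

# On a live cluster this would be:  oc get nodes --no-headers | flag_not_ready
# Here it is fed the listing captured in this report (header omitted):
flag_not_ready <<'EOF'
walid4-g9mp2-m-0.c.openshift-qe.internal        Ready     master   9d   v1.17.1
walid4-g9mp2-m-1.c.openshift-qe.internal        Ready     master   9d   v1.17.1
walid4-g9mp2-m-2.c.openshift-qe.internal        Ready     master   9d   v1.17.1
walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal  NotReady  worker   9d   v1.17.1
walid4-g9mp2-w-b-qwtn8.c.openshift-qe.internal  Ready     worker   9d   v1.17.1
walid4-g9mp2-w-c-rb22r.c.openshift-qe.internal  Ready     worker   9d   v1.17.1
EOF
# prints: walid4-g9mp2-w-a-dpzdl.c.openshift-qe.internal
```

Run periodically (e.g. from cron), a non-empty result from this check is the condition the reliability run is watching for.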
*** This bug has been marked as a duplicate of bug 1802687 ***