Created attachment 1716080 [details]
journalctl logs of the OOM / reboot events

Description of problem:

Issue: SysTest - Noticed Worker0 node with OOM messages in the system logs before it rebooted itself.

Environment:
UPI initial install of OCP 4.3 on bare metal, UI upgraded to OCP 4.4 / OCS 4.4.
Load balancers: LB-(master0, master1, master2), LB-(worker0, worker1, worker2).
The bootstrap node and all cluster nodes are on a private network; public NICs are disabled.
1 infra node has dual NICs to access both the public and private networks.
3 worker nodes are labeled and configured with RHOCS.

Test: System Test
Type: Negative Testing
Name: Node Failure - Reboot / Power Cycle

Prerequisites:
- Have a running cluster as described above
- Apply continuous concurrent Admin / Developer client load against the environment and verify successful test results

Steps to Reproduce:
1. Reboot nodes in sequential order [1 hr+ delay between each]:
   - Worker2
   - Master2
   - Worker0
2. Power off the Worker1 node - Thu Sep 17 21:09:20 EDT 2020
3. Leave the Worker1 node off for 13 hours
4. Continue running concurrent client load against the test cluster
5. Power on the Worker1 node - Fri Sep 18 10:12:15 EDT 2020
6. Allow 6+ hours for node reconciliation

Actual results:
Noticed OOM messages and a reboot of the node in the early morning in the node's system logs, while trying to determine why we lost network connection to the node.
Expected results:
Expected the node not to run out of memory and reboot itself.

Additional info:

[core@worker0 ~]$ journalctl -p 3 --since "2020-09-19 04:10:00" --until "2020-09-23 23:00:00"
Sep 19 04:11:58 worker0 kernel: bnx2x: [bnx2x_panic_dump:1180(enp6s0f0)]end crash dump -----------------
Sep 19 04:11:58 worker0 kernel: bnx2x: [bnx2x_sp_rtnl_task:10322(enp6s0f0)]Indicating link is down due to Tx-timeout
Sep 19 04:12:04 worker0 kernel: Memory cgroup out of memory: Killed process 710762 (gunicorn) total-vm:189428kB, anon-rss:28852kB, file-rss:5968kB, shmem-rss:0kB, UID:1013050000
Sep 19 04:12:05 worker0 kernel: Memory cgroup out of memory: Killed process 710598 (gunicorn) total-vm:189464kB, anon-rss:28652kB, file-rss:5968kB, shmem-rss:0kB, UID:1013050000
Sep 19 04:12:05 worker0 kernel: Memory cgroup out of memory: Killed process 710715 (gunicorn) total-vm:186044kB, anon-rss:25288kB, file-rss:6224kB, shmem-rss:0kB, UID:1013050000
-- Reboot --

* Note: the reboot failed, requiring me to access the DRAC of the system and press <F1> to continue booting, on 09/22/2020.

Attached file: ocp-worker0-node-oom-errors.txt

Cluster state: The cluster is currently up and running at the time this defect was reported and may still be available for additional debugging / observation. Please also feel free to access the Worker0 node and query the OOM / crash-dump messages via journalctl.

Test script definitions:
t1 - generic-app-test-client7_scenario_1 - build project and apps, set replicas, verify endpoint, and clean up
t2 - postgres_load_test_scenario_1 - Postgres load; connect to an existing Postgres project and drive pbench DB load operations
t3 - ocp_app_httpd_scenario_1 - connect to Apache and verify access operations
t4 - ocp_dev_app_git_scenario_1 - Git build and deploy operations
t5 - ocp_app_jenkins_persistent_scenario_1 - Jenkins client load against Jenkins apps
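For anyone triaging the attached logs: the "Memory cgroup out of memory" lines indicate the kernel killed processes inside a container's memory cgroup (here gunicorn workers), which usually points at a pod memory limit rather than node-wide exhaustion. A minimal sketch of pulling the killed process name and its anonymous RSS out of such a line is below; it parses one sample line from this report rather than live journal output (on the node itself the input would come from something like `journalctl -k --since "2020-09-19 04:10:00" | grep "Memory cgroup out of memory"`):

```shell
# Sample kernel OOM-kill line copied from this report.
line='Sep 19 04:12:04 worker0 kernel: Memory cgroup out of memory: Killed process 710762 (gunicorn) total-vm:189428kB, anon-rss:28852kB, file-rss:5968kB, shmem-rss:0kB, UID:1013050000'

# Extract the killed process name and its anonymous RSS (kB) at kill time.
proc=$(echo "$line" | sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p')
rss=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

echo "killed=$proc anon_rss_kb=$rss"
```

This is only an assumption about the trigger; confirming it would mean matching UID 1013050000 back to the owning project/pod and comparing the numbers above against that pod's memory limit.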
The must-gather file is stored at: http://10.8.32.38/str/ocpdebug/must-gather_after_worker0_start.tar.gz
The QE system test team will attempt to perform a disaster recovery procedure today, 09/24/2020, to recover from the bad cluster state. We will follow the documented procedures to return our cluster to a working state. The goal is to get the system into a good state and then upgrade the cluster.