Created attachment 1716080 [details]
journalctl logs of the OOM / reboot events

Description of problem:

Issue: SysTest - Noticed Worker0 node with OOM messages in the system logs before it rebooted itself.

Environment:
UPI initial install of OCP 4.3 on bare metal, UI upgraded to OCP 4.4 / OCS 4.4.
Load balancers: LB-(master0, master1, master2), LB-(worker0, worker1, worker2).
The bootstrap node and all cluster nodes are on a private network; public NICs are disabled.
1 infra node has dual NICs to access both the public and private networks.
3 worker nodes are labeled and configured with RHOCS.

Test: System Test
Type: Negative Testing
Name: Node Failure - Reboot / Power Cycle

Prerequisites:
- Have a running cluster as described above
- Apply continuous concurrent Admin / Developer client load against the environment and verify successful test results

Steps to Reproduce:
1. Reboot nodes in sequential order [1 hr+ delay between each]:
   - Worker2
   - Master2
   - Worker0
2. Power off the Worker1 node - Thu Sep 17 21:09:20 EDT 2020
3. Leave the Worker1 node off for 13 hours
4. Continue running concurrent client load against the test cluster
5. Power on the Worker1 node - Fri Sep 18 10:12:15 EDT 2020
6. Allow 6+ hours for node reconciliation

Actual results:
Noticed OOM messages and a reboot of the node in the early morning in the node's system logs, while trying to determine why we lost network connection to the node.
Expected results:
Expected the node not to run out of memory and reboot itself.

Additional info:

[core@worker0 ~]$ journalctl -p 3 --since "2020-09-19 04:10:00" --until "2020-09-23 23:00:00"
Sep 19 04:11:58 worker0 kernel: bnx2x: [bnx2x_panic_dump:1180(enp6s0f0)]end crash dump -----------------
Sep 19 04:11:58 worker0 kernel: bnx2x: [bnx2x_sp_rtnl_task:10322(enp6s0f0)]Indicating link is down due to Tx-timeout
Sep 19 04:12:04 worker0 kernel: Memory cgroup out of memory: Killed process 710762 (gunicorn) total-vm:189428kB, anon-rss:28852kB, file-rss:5968kB, shmem-rss:0kB, UID:1013050000
Sep 19 04:12:05 worker0 kernel: Memory cgroup out of memory: Killed process 710598 (gunicorn) total-vm:189464kB, anon-rss:28652kB, file-rss:5968kB, shmem-rss:0kB, UID:1013050000
Sep 19 04:12:05 worker0 kernel: Memory cgroup out of memory: Killed process 710715 (gunicorn) total-vm:186044kB, anon-rss:25288kB, file-rss:6224kB, shmem-rss:0kB, UID:1013050000
-- Reboot --

* Note: the reboot failed, requiring me to access the DRAC of the system and press <F1> to continue booting, on 09/22/2020.

Attached file: ocp-worker0-node-oom-errors.txt

Cluster state: The cluster is currently up and running at the time this defect was reported and may still be available for additional debugging / observation. Please also feel free to access the Worker0 node and query the OOM / crash-dump messages via journalctl.

Test script definitions:
t1 - generic-app-test-client7_scenario_1 - build project and apps, set replicas, verify endpoint, and clean up
t2 - postgres_load_test_scenario_1 - Postgres load; connect to an existing Postgres project and drive pbench DB load operations
t3 - ocp_app_httpd_scenario_1 - connect to Apache and verify access operations
t4 - ocp_dev_app_git_scenario_1 - Git build and deploy operations
t5 - ocp_app_jenkins_persistent_scenario_1 - Jenkins client load against Jenkins apps
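For anyone triaging the attached logs: the "Memory cgroup out of memory" lines indicate the kernel killed processes inside a container's memory cgroup (here gunicorn workers), which usually points at a pod memory limit rather than node-wide exhaustion. A minimal sketch of pulling the killed process name and its anonymous RSS out of such a line is below; it parses one sample line from this report rather than live journal output (on the node itself the input would come from something like `journalctl -k --since "2020-09-19 04:10:00" | grep "Memory cgroup out of memory"`):

```shell
# Sample kernel OOM-kill line copied from this report.
line='Sep 19 04:12:04 worker0 kernel: Memory cgroup out of memory: Killed process 710762 (gunicorn) total-vm:189428kB, anon-rss:28852kB, file-rss:5968kB, shmem-rss:0kB, UID:1013050000'

# Extract the killed process name and its anonymous RSS (kB) at kill time.
proc=$(echo "$line" | sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p')
rss=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

echo "killed=$proc anon_rss_kb=$rss"
```

This is only an assumption about the trigger; confirming it would mean matching UID 1013050000 back to the owning project/pod and comparing the numbers above against that pod's memory limit.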
The must-gather file is stored at: http://10.8.32.38/str/ocpdebug/must-gather_after_worker0_start.tar.gz
The QE system test team will attempt to perform a disaster recovery procedure today, 09/24/2020, to recover from the bad cluster state. We will follow the documented procedures to return our cluster to a working state. The goal is to get the system into a good state and then upgrade the cluster.