Created attachment 1841416 [details]
du_reboot_console_log

Description of problem:
The DU node reboot during policy configuration sometimes takes a very long time.

Version-Release number of selected component (if applicable):
4.9.6

How reproducible:
Not always

Steps to Reproduce:
1. Deploy the DU node with the siteconfig and policygentemplates in http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for OCP to be deployed on the node
3. Wait for the policies to be created
4. Watch the node console while it is rebooting

Actual results:
Sometimes the reboot takes tens of minutes, during which the node is not reachable over SSH. The console shows messages such as 'irq Affinity broken due to vector space exhaustion' (screenshot attached).

Expected results:
Reboot times are consistent, and the node does not take tens of minutes to reboot.

Additional info:
Attaching a console screenshot and a must-gather from a node that experienced this issue.
In the system journal we can see:

Nov 12 10:21:56 sno.kni-qe-1.lab.eng.rdu2.redhat.com ovs-vswitchd[1889]: ovs|04858|bridge|WARN|could not open network device 2ff77135d388d0c (No such device)
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: reboot.target: Job reboot.target/start timed out.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Timed out starting Reboot.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: reboot.target: Job reboot.target/start failed with result 'timeout'.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Forcibly rebooting: job timed out
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Shutting down.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Hardware watchdog 'HPE iLO2+ HW Watchdog Timer', version 0
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Set hardware watchdog to 10min.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com kernel: watchdog: watchdog0: watchdog did not stop!
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd-shutdown[1]: Syncing filesystems and block devices.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd-journald[1430]: Journal stopped
-- Reboot --
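To put a number on the quiet period in the journal excerpt above, here is a minimal Python sketch that measures the time between two journal entries. The helper names parse_timestamp and gap_seconds are hypothetical, and the sketch assumes the short syslog-style timestamp format shown above (no year, so both entries must fall in the same year). It only measures the gap between two logged lines; it does not tell us when the reboot job was actually queued.

```python
from datetime import datetime

# Two entries copied from the journal excerpt in this comment.
LAST_ENTRY_BEFORE_TIMEOUT = (
    "Nov 12 10:21:56 sno.kni-qe-1.lab.eng.rdu2.redhat.com ovs-vswitchd[1889]: "
    "ovs|04858|bridge|WARN|could not open network device 2ff77135d388d0c (No such device)"
)
REBOOT_TIMEOUT_ENTRY = (
    "Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: "
    "reboot.target: Job reboot.target/start timed out."
)


def parse_timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' fields of a journal line.

    Assumption: the short journal format carries no year, so the parsed
    value defaults to year 1900 and is only useful for differences.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{month} {day} {clock}", "%b %d %H:%M:%S")


def gap_seconds(earlier_line, later_line):
    """Return the number of seconds between two journal lines."""
    delta = parse_timestamp(later_line) - parse_timestamp(earlier_line)
    return delta.total_seconds()


if __name__ == "__main__":
    gap = gap_seconds(LAST_ENTRY_BEFORE_TIMEOUT, REBOOT_TIMEOUT_ENTRY)
    print(f"Seconds between the two entries: {gap:.0f}")
```

For the two entries above this gives 292 seconds, that is, almost five minutes with nothing logged before systemd gave up on reboot.target and forced the reboot.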
It also looks like the number of reboots is higher when this condition occurs:

ssh -6 core.lab.eng.rdu2.redhat.com 'last|grep reboot|wc -l'
Warning: Permanently added 'sno.kni-qe-1.lab.eng.rdu2.redhat.com,2620:52:0:198::10' (ECDSA) to the list of known hosts.
10
*** This bug has been marked as a duplicate of bug 2021151 ***