Bug 2022665 - Sometimes the DU node reboot during policies configuration takes a lot of time (tens of minutes)
Keywords:
Status: CLOSED DUPLICATE of bug 2021151
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ian Miller
QA Contact: yliu1
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-11-12 10:28 UTC by Marius Cornea
Modified: 2021-11-24 13:50 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-24 13:50:56 UTC
Target Upstream Version:
Embargoed:


Attachments
du_reboot_console_log (220.35 KB, image/png)
2021-11-12 10:28 UTC, Marius Cornea

Description Marius Cornea 2021-11-12 10:28:38 UTC
Created attachment 1841416
du_reboot_console_log

Description of problem:

Sometimes the DU node reboot during policy configuration takes a long time (tens of minutes).

Version-Release number of selected component (if applicable):
4.9.6

How reproducible:
Not always

Steps to Reproduce:
1. Deploy DU node with the siteconfig and policygentemplates in http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for OCP to be deployed on the node
3. Wait for the policies to be created
4. Watch the node console while it is rebooting

Actual results:

Sometimes the node remains unreachable over SSH for tens of minutes. The console shows messages such as 'irq Affinity broken due to vector space exhaustion' (see the attached screenshot).

Expected results:
Reboot times are consistent, and the node does not take tens of minutes to reboot.

Additional info:

Attaching console screenshot and must-gather from a node which experienced this issue.

Comment 1 Marius Cornea 2021-11-12 10:32:21 UTC
In the system journal we can see:


Nov 12 10:21:56 sno.kni-qe-1.lab.eng.rdu2.redhat.com ovs-vswitchd[1889]: ovs|04858|bridge|WARN|could not open network device 2ff77135d388d0c (No such device)
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: reboot.target: Job reboot.target/start timed out.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Timed out starting Reboot.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: reboot.target: Job reboot.target/start failed with result 'timeout'.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Forcibly rebooting: job timed out
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Shutting down.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Hardware watchdog 'HPE iLO2+ HW Watchdog Timer', version 0
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Set hardware watchdog to 10min.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com kernel: watchdog: watchdog0: watchdog did not stop!
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd-shutdown[1]: Syncing filesystems and block devices.
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Nov 12 10:26:48 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd-journald[1430]: Journal stopped
-- Reboot --
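
A minimal sketch for pulling the same shutdown-timeout messages out of a journal, assuming the previous boot's journal is still available on the node. The helper name `reboot_timeout_lines` is hypothetical, and the patterns are simply copied from the excerpt above; they are not an exhaustive list of what a slow reboot can log.

```shell
# Hypothetical helper: keep only the shutdown-timeout and watchdog messages
# from journal text supplied on stdin. Patterns are taken from the excerpt.
reboot_timeout_lines() {
  grep -E 'reboot\.target: Job reboot\.target/start|Timed out starting Reboot|Forcibly rebooting|watchdog'
}

# Usage on the node (assumes the journal of the previous boot is persisted):
#   journalctl -b -1 --no-pager | reboot_timeout_lines
```

Reading from stdin keeps the filter usable on a saved journal file or a must-gather extract as well as on a live node.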

Comment 6 Marius Cornea 2021-11-12 11:28:17 UTC
It also looks like the number of reboots is higher when this condition occurs:

ssh -6 core.lab.eng.rdu2.redhat.com    'last|grep reboot|wc -l'
Warning: Permanently added 'sno.kni-qe-1.lab.eng.rdu2.redhat.com,2620:52:0:198::10' (ECDSA) to the list of known hosts.
10
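
The count above comes from `last`, which reads the wtmp file, so it only covers reboots recorded since that file was last rotated. A slightly stricter variant, as a sketch only: the helper name `count_reboots` is hypothetical, and the anchored pattern assumes the usual `last` layout in which reboot records start with the pseudo-user `reboot`.

```shell
# Hypothetical helper: count reboot records in `last` output read from stdin.
# Anchoring at line start avoids matching "reboot" in other columns.
count_reboots() {
  grep -c '^reboot '
}

# Usage on the node:
#   last | count_reboots
```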

Comment 9 Ken Young 2021-11-24 13:50:56 UTC

*** This bug has been marked as a duplicate of bug 2021151 ***

