Bug 2218577 - [OSP16.1] Systemd can't start/restart Chronyd in some nodes on OSP
Summary: [OSP16.1] Systemd can't start/restart Chronyd in some nodes on OSP
Keywords:
Status: CLOSED DUPLICATE of bug 1903091
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 16.1 (Train)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: RHOSP:NFV_Eng
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-29 14:17 UTC by Ricardo Ramos Thomas
Modified: 2023-07-19 12:20 UTC (History)
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-19 12:20:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26225 0 None None None 2023-06-29 14:18:22 UTC

Description Ricardo Ramos Thomas 2023-06-29 14:17:24 UTC
Description of problem:

After a deployment that added new Ceph nodes, the customer hit an issue where systemd cannot start chronyd:

"fatal: [xxx-controller-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-computeovsdpdkhtoff-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-controller-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-computesriov-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-computesriovhtoff-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-computesriov-2]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-computesriovhtoff-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "fatal: [xxx-controller-2]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service chronyd: Job for chronyd.service failed because a timeout was exceeded.\\nSee \\\"systemctl status chronyd.service\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
        "NO MORE HOSTS LEFT *************************************************************",
        "PLAY RECAP *********************************************************************",
        "xxx-cephstorage-0        : ok=39   changed=2    unreachable=0    failed=0    skipped=87   rescued=0    ignored=0   ",
        "xxx-cephstorage-1        : ok=39   changed=2    unreachable=0    failed=0    skipped=87   rescued=0    ignored=0   ",
        "xxx-cephstorage-2        : ok=39   changed=2    unreachable=0    failed=0    skipped=87   rescued=0    ignored=0   ",
        "xxx-cephstorage-3        : ok=39   changed=2    unreachable=0    failed=0    skipped=87   rescued=0    ignored=0   ",
        "xxx-computeovsdpdk-0     : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computeovsdpdk-1     : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computeovsdpdkhtoff-0 : ok=28   changed=1    unreachable=0    failed=1    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriov-0       : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriov-1       : ok=28   changed=1    unreachable=0    failed=1    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriov-2       : ok=28   changed=1    unreachable=0    failed=1    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriov-3       : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriov-4       : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriovhtoff-0  : ok=28   changed=1    unreachable=0    failed=1    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriovhtoff-1  : ok=28   changed=1    unreachable=0    failed=1    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriovhtoff-2  : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriovhtoff-3  : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-computesriovhtoff-4  : ok=29   changed=2    unreachable=0    failed=0    skipped=85   rescued=0    ignored=0   ",
        "xxx-controller-0         : ok=43   changed=2    unreachable=0    failed=1    skipped=86   rescued=0    ignored=0   ",
        "xxx-controller-1         : ok=33   changed=1    unreachable=0    failed=1    skipped=81   rescued=0    ignored=0   ",
        "xxx-controller-2         : ok=33   changed=1    unreachable=0    failed=1    skipped=81   rescued=0    ignored=0   ",
        "Monday 19 June 2023  17:39:22 +0200 (0:01:42.411)       0:04:54.795 *********** ",


We tried to restart chronyd manually, but it failed the same way.
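
For reference, the manual attempt amounted to something like the following (illustrative sketch; the exact invocations on the customer nodes may have differed):

~~~
# Restart attempt; this times out the same way the deploy task did.
systemctl restart chronyd

# Inspect the failure, as suggested by the error message above.
systemctl status chronyd.service
journalctl -u chronyd.service -xe
~~~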

From the strace output and the sos report we noticed the following:

~~~
$ grep /openvswitch proc/1/mountinfo | wc -l
8191
entries are similar to
82888 82887 0:23 /openvswitch /run/systemd/unit-root/run/openvswitch rw,nosuid,nodev master
~~~
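
A minimal sketch of how the duplicated entries can be quantified on an affected node (assuming shell access; same mountinfo paths as above):

~~~
# Count the openvswitch mounts visible to PID 1, including the copies
# replicated under /run/systemd/unit-root.
grep -c '/openvswitch' /proc/1/mountinfo

# Group the entries by mount point (field 5 of mountinfo) to see where
# the duplicates accumulate.
grep '/openvswitch' /proc/1/mountinfo | awk '{print $5}' | sort | uniq -c | sort -rn | head
~~~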


and

~~~
27798 26 0:23 /openvswitch /run/openvswitch rw,nosuid,nodev shared:26 - tmpfs tmpfs rw,seclabel,mode=755
~~~

Why do these mount points appear?
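
One hedged line of investigation: the shared:26 marker above suggests /run/openvswitch sits in a shared propagation peer group, so copies of the mount can propagate into new mount namespaces, including the per-unit sandbox systemd builds under /run/systemd/unit-root when it starts a service such as chronyd. The propagation flags can be checked directly with findmnt (part of util-linux):

~~~
# Show the propagation mode of /run and of the openvswitch tmpfs (sketch).
findmnt -o TARGET,FSTYPE,PROPAGATION /run
findmnt -o TARGET,FSTYPE,PROPAGATION /run/openvswitch
~~~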


A reboot appears to have resolved the issue.


Version-Release number of selected component (if applicable):

RHOSP 16.1.3 (Train)

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
chronyd fails to start, and roughly 8k openvswitch-related mount entries are present in /proc/1/mountinfo.

Expected results:

chronyd starts normally.

Additional info:

SOS reports, strace output, and further details are available on the case.

Comment 6 Robin Jarry 2023-07-19 12:20:09 UTC

*** This bug has been marked as a duplicate of bug 1903091 ***

