
Bug 2040915

Summary: Rebooting OSP16.1 nodes can leave chronyd with all of its time sources marked as offline
Product: Red Hat Enterprise Linux 8
Reporter: Mark Jones <marjones>
Component: chrony
Assignee: Miroslav Lichvar <mlichvar>
Status: CLOSED WONTFIX
QA Contact: rhel-cs-infra-services-qe <rhel-cs-infra-services-qe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 8.4
CC: aschultz, astupnik, bdobreli
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-25 22:28:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Mark Jones 2022-01-14 21:12:26 UTC
Description of problem:

A customer has deployed OSP 16.1 and finds that, after rebooting nodes, chronyd sometimes starts but has all of its time sources marked as offline. Those time sources remain offline unless chronyd is manually restarted via systemd.

This was not happening on all OSP 16.1 nodes but depended on whether they were fresh installs or whether they were upgraded from an earlier OSP release.

For those nodes that are impacted, checking the connectivity to the time sources in question shows they are accessible via the network.

Enabling debug logs for NetworkManager shows that, on a working node, the dispatcher script for chronyd (20-chrony), which triggers a "chronyc onoffline" command, was run after the management network interface carrying the NTP traffic came up.

However, in the failing case, that same chrony dispatcher script was not run after the management network interface was brought up. This suggests that chronyd was started before the management network was available and hence time sources are offline - but without the chrony dispatcher script being triggered there was no event that brought the time sources back online once the management network was functioning.

The OSP compute node's network configuration does not include any interfaces that are marked as being managed by NetworkManager.

However, on the working nodes, NetworkManager was acting on virbr0 (libvirt's default network), and it was the availability of virbr0 that triggered the chrony dispatcher. This observation was reinforced by confirming that the nodes with this issue did not have virbr0 defined, while the nodes that worked did.

This BZ is looking to ensure that chronyd is able to start and have active time sources reliably across all boots, regardless of the network configurations defined within OSP.

Version-Release number of selected component (if applicable):
16.1

How reproducible:
Consistently

Steps to Reproduce:
1. Configure an OSP 16.1.6 system with no networks being managed via NetworkManager
2. Make sure that the default network for libvirt has been removed
3. Reboot the node

Actual results:

Time sources for chronyd are not marked as online after the boot

Expected results:

Time sources for chronyd should be marked as online
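
One way to check the actual versus expected result after a reboot is to read the online/offline counts reported by "chronyc activity". The following is a minimal Python sketch; the helper name parse_activity and the sample text are illustrative assumptions, not output captured from the affected nodes.

```python
import re


def parse_activity(text):
    """Extract the online/offline source counts from `chronyc activity` output."""
    counts = {}
    for line in text.splitlines():
        # Match only the plain "N sources online" / "N sources offline" lines,
        # not the "doing burst (return to ...)" lines.
        m = re.match(r"(\d+) sources? (online|offline)$", line.strip())
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


# Illustrative sample resembling the failing case (all sources offline).
sample = """200 OK
0 sources online
3 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address
"""

result = parse_activity(sample)
```

On a real node the text would come from running "chronyc activity"; a non-zero offline count with zero online sources corresponds to the failure described here.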

Additional info:

Compute and controller environments where chronyd is not running properly can suffer time drift, which will eventually impact the functioning of the cloud.

Comment 2 Alex Schultz 2022-01-14 22:16:21 UTC
OSP does not control the startup of chrony, so this is likely an issue with the chrony shipped with RHEL8.4, as OSP16.2 is based on 8.4 while OSP16.1 is based on 8.2. I found Bug 1930468, which may be related to the reported issue. It should be noted that we do not use NetworkManager to manage interfaces in OSP, but it is still running.

Comment 4 Miroslav Lichvar 2022-01-17 10:52:29 UTC
The only difference between the RHEL8.2 and RHEL8.4 chrony packages is that the latter recommends dhcp-client in order to enable the NetworkManager dispatcher script (bug #1930468). In RHEL8.5 the script was reworked to not rely on the dhcp-client package and it is no longer recommended by the chrony package.

From the logs in the referenced sosreport it seems the management interface is controlled by NetworkManager. The NTP servers specified in chrony.conf are in a different network and they are specified by IP addresses. That means if the management interface is up before the interface needed to access the NTP servers (which is not controlled by NetworkManager), the chronyc onoffline command called from the dispatcher script will set the sources to the offline state and there is no other call of the command later to set them online.

This is a limitation of the chrony dispatcher script. It doesn't work with interfaces and routes (which can change over time) to set the state of the sources individually. The assumption of the script is that if it is called at least once with the up or down event, it will get all events needed to cover all configured NTP servers. Using NetworkManager only for an interface which doesn't give access to the servers breaks that assumption.
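
The ordering problem can be illustrated with a small Python model. It is a simplified sketch, not chronyd's implementation: it assumes a source counts as online only when some configured route covers its address, and the addresses, networks, and the function name onoffline are hypothetical.

```python
import ipaddress


def onoffline(sources, routes):
    """Model the route check: a source is online only if a route covers its address."""
    nets = [ipaddress.ip_network(r) for r in routes]
    return {
        s: "online" if any(ipaddress.ip_address(s) in n for n in nets) else "offline"
        for s in sources
    }


# Hypothetical addressing: NTP servers live outside the management network.
sources = ["192.0.2.10", "192.0.2.11"]
mgmt_route = "10.0.0.0/24"   # interface controlled by NetworkManager
ntp_route = "192.0.2.0/24"   # interface brought up outside NetworkManager

# The only dispatcher event fires when the management interface comes up,
# before the route to the NTP servers exists.
state_after_nm_event = onoffline(sources, [mgmt_route])

# What a second evaluation would report once the other interface is up.
# No dispatcher event triggers it, so this state is never applied.
state_if_rerun = onoffline(sources, [mgmt_route, ntp_route])
```

In this model the sources are marked offline at the single dispatcher event and would only flip to online if the check were repeated after the second interface came up.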

I don't see a good fix that would work in the default configuration. I guess we could add a new option to disable the dispatcher script completely, similarly to how PEERNTP=no disables NTP servers from DHCP. In RHEL9 the dispatcher scripts are located in /usr/lib/NetworkManager/dispatcher.d and can be disabled by making a /dev/null symlink of the same name in /etc/NetworkManager/dispatcher.d. I'm not sure if moving the chrony dispatcher to that directory in a RHEL8 update would be acceptable.
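
As a rough sketch of the RHEL9-style masking mentioned above, the following Python snippet creates a /dev/null symlink with the same name as a dispatcher script. The helper name mask_dispatcher_script is hypothetical, the script name "20-chrony" is taken from the description and may differ in the shipped package, and the demonstration writes to a temporary directory rather than /etc/NetworkManager/dispatcher.d.

```python
import os
import tempfile


def mask_dispatcher_script(etc_dir, script_name):
    """Create <etc_dir>/<script_name> as a symlink to /dev/null and return its path."""
    link = os.path.join(etc_dir, script_name)
    # Raises FileExistsError if a file of that name is already present.
    os.symlink("/dev/null", link)
    return link


# Demonstration against a scratch directory (assumption: a real system would
# use /etc/NetworkManager/dispatcher.d and require root privileges).
demo_dir = tempfile.mkdtemp()
masked = mask_dispatcher_script(demo_dir, "20-chrony")
```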

What is controlling the other interfaces? Could it be patched to call the "chronyc onoffline" command when the interfaces are up?
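
A minimal sketch of what such a call could look like from whatever brings the other interfaces up is shown below in Python. The function name refresh_chrony_sources and the injectable run parameter are assumptions for illustration; where the call would be hooked in depends on the tool managing the interfaces.

```python
import subprocess


def refresh_chrony_sources(run=subprocess.run):
    """Ask chronyd to re-evaluate its sources against the current network configuration."""
    # `chronyc onoffline` is the same command the dispatcher script triggers.
    return run(["chronyc", "onoffline"], check=False)
```

Calling refresh_chrony_sources() after each interface is brought up would give chronyd a later chance to switch the sources back to online.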

Comment 5 Alex Schultz 2022-01-17 14:23:11 UTC
The interfaces in OSP are managed via the legacy network-scripts package as we do not support NetworkManager due to other limitations.  OSP16.2 will never get updates to 8.5 as it is following 8.4 EUS.  A fix would be required to 8.4 to address the customer issue.