1837105 – Check for clock errors and time skew

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	389-ds-base
Sub Component:
Version:	7.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	mreynolds
QA Contact:	RHDS QE
Docs Contact:	Marc Muehlfeld
URL:
Whiteboard:
Depends On:
Blocks:	1843157

Reported:	2020-05-18 20:23 UTC by mreynolds
Modified:	2023-12-15 17:56 UTC
CC List:	9 users
Fixed In Version:	389-ds-base-1.3.10.2-4.el7
Doc Type:	Bug Fix
Doc Text:	.Directory Server rejects update operations if retrieving the system time fails or the time difference is too large
Previously, when calling the system time() function failed or the function returned an unexpected value, Change Sequence Numbers (CSN) in Directory Server could become corrupted. As a consequence, the administrator had to re-initialize all replicas in the environment. With this update, Directory Server rejects the update operation if the time() function fails, and Directory Server no longer generates corrupt CSNs in this scenario. Note that, if the time difference is greater than one day, the server logs an `INFO - csngen_new_csn - Detected large jump in CSN time` message in the `/var/log/dirsrv/slapd-<instance_name>/errors` file. However, Directory Server still creates the CSN and does not reject the update operation.
Clone Of:
Clones:	1843157
Environment:
Last Closed:	2020-09-29 19:46:56 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	389ds 389-ds-base issues 4148	0	None	closed	Check for clock errors and time skew and log message	2021-02-15 03:23:42 UTC
Red Hat Product Errata	RHBA-2020:3894	0	None	None	None	2020-09-29 19:48:24 UTC

Description mreynolds 2020-05-18 20:23:44 UTC

Description of problem:

In an effort to troubleshoot CSN time skew we should log a message when a large jump is detected, and we should also be checking for the errors returned by the "time" functions.

Right now, if a clock function fails, the failure goes unnoticed and a garbage value is used for the time.  This impacts things like replication and logging.

We should at least be detecting large jumps in time and errors returned by the clock/time functions.
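
As an illustration of the error check requested above, here is a minimal sketch in C. It is not the 389-ds-base patch: the helper name checked_current_time() is hypothetical, and the message text only mirrors the "Failed to get system time" line seen later in this bug. The idea is simply to test the return value of clock_gettime() and hand the failure to the caller instead of using an uninitialized timestamp.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not taken from 389-ds-base.
 * Reads the wall clock and reports a failure to the caller instead of
 * silently handing back a garbage value.
 * Returns 0 on success, -1 on failure (errno is set by clock_gettime). */
int checked_current_time(time_t *now_out)
{
    struct timespec ts;

    assert(now_out != NULL);
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
        /* Log the reason; the caller can then refuse the operation. */
        fprintf(stderr, "Failed to get system time (%s)\n", strerror(errno));
        return -1;
    }
    *now_out = ts.tv_sec;
    return 0;
}
```

A caller would treat a -1 return as a reason to abort the update rather than generate a CSN from an unknown time.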

Comment 1 mreynolds 2020-05-18 20:24:32 UTC

Upstream ticket:

https://pagure.io/389-ds-base/issue/51095

Comment 5 Viktor Ashirov 2020-07-03 15:48:37 UTC

Build tested: 389-ds-base-1.3.10.2-6.el7.x86_64

Testing was done using libfaketime:

systemctl edit dirsrv@server

[Service]
Environment=
Environment=LD_PRELOAD=/usr/lib64/faketime/libfaketimeMT.so.1
Environment=FAKETIME_NO_CACHE=1
Environment=FAKETIME_TIMESTAMP_FILE=/run/dirsrv/faketime.rc
Environment=FAKETIME_DONT_RESET=1

Changing time for ns-slapd in a replicated topology causes it to emit the following error messages:

echo "-1d x1" > /run/dirsrv/faketime.rc


==> /var/log/dirsrv/slapd-master1/errors <==
[01/Jul/2020:14:07:24.721287371 +0000] - INFO - csngen_new_csn - Detected large jump in CSN time.  Delta: -172681 (current time: 1593612444  vs  previous time: 1593785125)

==> /var/log/dirsrv/slapd-master2/errors <==
[01/Jul/2020:14:07:25.264048182 +0000] - INFO - csngen_new_csn - Detected large jump in CSN time.  Delta: -172680 (current time: 1593612445  vs  previous time: 1593785125)
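
The delta in these messages is the current time minus the previous time, in seconds. The sketch below reproduces that arithmetic for the values logged on master1. The function name detect_large_jump() and the LARGE_JUMP_SECONDS constant are hypothetical; the one-day threshold is taken from the Doc Text of this bug, not from the server source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed threshold: one day, as stated in the Doc Text of this bug. */
#define LARGE_JUMP_SECONDS (24LL * 60 * 60)

/* Hypothetical helper, not taken from 389-ds-base.
 * Stores cur - prev in *delta_out and returns 1 if the absolute
 * difference exceeds the threshold, 0 otherwise. */
int detect_large_jump(long long cur, long long prev, long long *delta_out)
{
    long long delta = cur - prev;

    assert(delta_out != NULL);
    *delta_out = delta;
    if (llabs(delta) > LARGE_JUMP_SECONDS) {
        printf("Detected large jump in CSN time.  Delta: %lld "
               "(current time: %lld  vs  previous time: %lld)\n",
               delta, cur, prev);
        return 1;
    }
    return 0;
}
```

With the master1 values, detect_large_jump(1593612444, 1593785125, &delta) sets delta to -172681, which matches the logged Delta and is larger in magnitude than one day (86400 seconds).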



To reproduce clock_gettime() errors I LD_PRELOADed the following library:

// gcc faulty_clock.c -ldl -shared -fPIC -o faulty_clock.so
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

typedef int (*original_clock_gettime_f_type)(clockid_t clk_id,
                                             struct timespec *tp);

// Interposes clock_gettime() and makes roughly 1 in 100 calls fail with EFAULT.
int clock_gettime(clockid_t clk_id, struct timespec *tp) {
  int result = -1;
  original_clock_gettime_f_type orig_clock;
  // Look up the real clock_gettime() in the next object in the load order.
  orig_clock = (original_clock_gettime_f_type)dlsym(RTLD_NEXT, "clock_gettime");
  int fault = rand() % 100;
  if (fault == 5) {
    // Simulated failure: set errno and return -1 as clock_gettime() would.
    printf("fair dice roll says %d\n", fault);
    errno = EFAULT;
    result = -1;
  } else {
    result = orig_clock(clk_id, tp);
  }
  return result;
}


If the failure happens too often, the server crashes pretty fast:
Jul 03 15:11:25 server.example.com ns-slapd[9497]: clock_gettime failed: 14 
Jul 03 15:11:25 server.example.com ns-slapd[9497]: fair dice roll says 5
...
Jul 03 15:11:25 server.example.com ns-slapd[9497]: fair dice roll says 5
Jul 03 15:11:25 server.example.com systemd[1]: dirsrv: main process exited, code=killed, status=6/ABRT
Jul 03 15:11:25 server.example.com systemd[1]: Unit dirsrv entered failed state.
Jul 03 15:11:25 server.example.com systemd[1]: dirsrv failed.

If I reduce the number of occurrences, I see the newly added debug messages:  

[03/Jul/2020:15:32:58.393416521 +0000] - ERR - csngen_new_csn - Failed to get system time (Bad address)
[03/Jul/2020:15:32:58.396762553 +0000] - ERR - ldbm_back_modify - failed to generate modify CSN for entry (cn=user00001,ou=People,dc=example,dc=com), aborting operation
[03/Jul/2020:15:32:58.615110395 +0000]  - EMERG - vslapd_log_access, Unable to determine system time for message :: conn=4 op=8 MOD dn="cn=user00003,ou=people,dc=example,dc=com"


Under normal conditions the server should not crash, because a situation where clock_gettime() returns this many errors is very rare.

Marking as VERIFIED.

Comment 11 errata-xmlrpc 2020-09-29 19:46:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3894
