Bug 1837105

Summary: Check for clock errors and time skew
Product: Red Hat Enterprise Linux 7 Reporter: mreynolds
Component: 389-ds-baseAssignee: mreynolds
Status: CLOSED ERRATA QA Contact: RHDS QE <ds-qe-bugs>
Severity: unspecified Docs Contact: Marc Muehlfeld <mmuehlfe>
Priority: unspecified    
Version: 7.9CC: kbanerje, mrhodes, msauton, pasik, rcain, spichugi, tbordaz, tmihinto, vashirov
Target Milestone: rcKeywords: TestCannotAutomate
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.3.10.2-4.el7 Doc Type: Bug Fix
Doc Text:
.Directory Server rejects update operations if retrieving the system time fails or the time difference is too large Previously, when calling the system time() function failed or the function returned an unexpected value, Change Sequence Numbers (CSN) in Directory Server could become corrupted. As a consequence, the administrator had to re-initialize all replicas in the environment. With this update, Directory Server rejects the update operation if the time() function failed, and Directory Server no longer generates corrupt CSNs in the mentioned scenario. Note that, if the time difference is greater than one day, the server logs a `INFO - csngen_new_csn - Detected large jump in CSN time` message in the `/var/log/dirsrv/slapd-<instance_name>/error` file. However, Directory Server still creates the CSN and does not reject the update operation.
Story Points: ---
Clone Of:
: 1843157 (view as bug list) Environment:
Last Closed: 2020-09-29 19:46:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1843157    

Description mreynolds 2020-05-18 20:23:44 UTC
Description of problem:

In an effort to troubleshoot CSN time skew we should log a message when a large jump is detected, and we should also be checking for the errors returned by the "time" functions.

Right now if a clock function fails it goes unnoticed and a garbage value would be used for the time.  This impacts things like replication, and logging.

We should at least be detecting large jumps in time and errors returned by the clock/time functions.

Comment 1 mreynolds 2020-05-18 20:24:32 UTC
Upstream ticket:

https://pagure.io/389-ds-base/issue/51095

Comment 5 Viktor Ashirov 2020-07-03 15:48:37 UTC
Build tested: 389-ds-base-1.3.10.2-6.el7.x86_64

Testing was done using libfaketime:

systemctl edit dirsrv@server

[Service]
Environment=
Environment=LD_PRELOAD=/usr/lib64/faketime/libfaketimeMT.so.1
Environment=FAKETIME_NO_CACHE=1
Environment=FAKETIME_TIMESTAMP_FILE=/run/dirsrv/faketime.rc
Environment=FAKETIME_DONT_RESET=1

Changing time for ns-slapd in a replicated topology causes it to emit the following error messages:

echo "-1d x1" > /run/dirsrv/faketime.rc


=> /var/log/dirsrv/slapd-master1/errors <==
[01/Jul/2020:14:07:24.721287371 +0000] - INFO - csngen_new_csn - Detected large jump in CSN time.  Delta: -172681 (current time: 1593612444  vs  previous time: 1593785125)

==> /var/log/dirsrv/slapd-master2/errors <==
[01/Jul/2020:14:07:25.264048182 +0000] - INFO - csngen_new_csn - Detected large jump in CSN time.  Delta: -172680 (current time: 1593612445  vs  previous time: 1593785125)



To reproduce clock_gettime() errors I LD_PRELOADed the following library:

// gcc faulty_clock.c -ldl -shared -fPIC -o faulty_clock.so
#define _GNU_SOURCE
#include "constant.h"
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

typedef int (*original_clock_gettime_f_type)(clockid_t clk_id,
                                             struct timespec *tp);

int clock_gettime(clockid_t clk_id, struct timespec *tp) {
  int result = err;
  original_clock_gettime_f_type orig_clock;
  orig_clock = (original_clock_gettime_f_type)dlsym(RTLD_NEXT, "clock_gettime");
  int fault = rand() % 100;
  if (fault == 5) {
    printf("fair dice roll says %d\n", fault);
    errno = EFAULT;
    result = err;
  } else
    result = orig_clock(clk_id, tp);
  return result;
}


If the failure happens too often, it crashes pretty fast: 
Jul 03 15:11:25 server.example.com ns-slapd[9497]: clock_gettime failed: 14 
Jul 03 15:11:25 server.example.com ns-slapd[9497]: fair dice roll says 5
...
Jul 03 15:11:25 server.example.com ns-slapd[9497]: fair dice roll says 5
Jul 03 15:11:25 server.example.com systemd[1]: dirsrv: main process exited, code=killed, status=6/ABRT
Jul 03 15:11:25 server.example.com systemd[1]: Unit dirsrv entered failed state.
Jul 03 15:11:25 server.example.com systemd[1]: dirsrv failed.

If I reduce the number of occurrences, I see the newly added debug messages:  

[03/Jul/2020:15:32:58.393416521 +0000] - ERR - csngen_new_csn - Failed to get system time (Bad address)
[03/Jul/2020:15:32:58.396762553 +0000] - ERR - ldbm_back_modify - failed to generate modify CSN for entry (cn=user00001,ou=People,dc=example,dc=com), aborting operation
[03/Jul/2020:15:32:58.615110395 +0000]  - EMERG - vslapd_log_access, Unable to determine system time for message :: conn=4 op=8 MOD dn="cn=user00003,ou=people,dc=example,dc=com"


Under normal conditions server should not crash, as the situation where clock_gettime() returns too many errors is very rare.

Marking as VERIFIED.

Comment 11 errata-xmlrpc 2020-09-29 19:46:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3894