Description of problem: In an effort to troubleshoot CSN time skew we should log a message when a large jump is detected, and we should also be checking for the errors returned by the "time" functions. Right now if a clock function fails it goes unnoticed and a garbage value would be used for the time. This impacts things like replication, and logging. We should at least be detecting large jumps in time and errors returned by the clock/time functions.
Upstream ticket: https://pagure.io/389-ds-base/issue/51095
Build tested: 389-ds-base-1.3.10.2-6.el7.x86_64 Testing was done using libfaketime: systemctl edit dirsrv@server [Service] Environment= Environment=LD_PRELOAD=/usr/lib64/faketime/libfaketimeMT.so.1 Environment=FAKETIME_NO_CACHE=1 Environment=FAKETIME_TIMESTAMP_FILE=/run/dirsrv/faketime.rc Environment=FAKETIME_DONT_RESET=1 Changing time for ns-slapd in a replicated topology causes it to emit the following error messages: echo "-1d x1" > /run/dirsrv/faketime.rc => /var/log/dirsrv/slapd-master1/errors <== [01/Jul/2020:14:07:24.721287371 +0000] - INFO - csngen_new_csn - Detected large jump in CSN time. Delta: -172681 (current time: 1593612444 vs previous time: 1593785125) ==> /var/log/dirsrv/slapd-master2/errors <== [01/Jul/2020:14:07:25.264048182 +0000] - INFO - csngen_new_csn - Detected large jump in CSN time. Delta: -172680 (current time: 1593612445 vs previous time: 1593785125) To reproduce clock_gettime() errors I LD_PRELOADed the following library: // gcc faulty_clock.c -ldl -shared -fPIC -o faulty_clock.so #define _GNU_SOURCE #include "constant.h" #include <dlfcn.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <sys/time.h> #include <time.h> typedef int (*original_clock_gettime_f_type)(clockid_t clk_id, struct timespec *tp); int clock_gettime(clockid_t clk_id, struct timespec *tp) { int result = err; original_clock_gettime_f_type orig_clock; orig_clock = (original_clock_gettime_f_type)dlsym(RTLD_NEXT, "clock_gettime"); int fault = rand() % 100; if (fault == 5) { printf("fair dice roll says %d\n", fault); errno = EFAULT; result = err; } else result = orig_clock(clk_id, tp); return result; } If the failure happens too often, it crashes pretty fast: Jul 03 15:11:25 server.example.com ns-slapd[9497]: clock_gettime failed: 14 Jul 03 15:11:25 server.example.com ns-slapd[9497]: fair dice roll says 5 ... Jul 03 15:11:25 server.example.com ns-slapd[9497]: fair dice roll says 5 Jul 03 15:11:25 server.example.com systemd[1]: dirsrv: main process exited, code=killed, status=6/ABRT Jul 03 15:11:25 server.example.com systemd[1]: Unit dirsrv entered failed state. Jul 03 15:11:25 server.example.com systemd[1]: dirsrv failed. If I reduce the number of occurrences, I see the newly added debug messages: [03/Jul/2020:15:32:58.393416521 +0000] - ERR - csngen_new_csn - Failed to get system time (Bad address) [03/Jul/2020:15:32:58.396762553 +0000] - ERR - ldbm_back_modify - failed to generate modify CSN for entry (cn=user00001,ou=People,dc=example,dc=com), aborting operation [03/Jul/2020:15:32:58.615110395 +0000] - EMERG - vslapd_log_access, Unable to determine system time for message :: conn=4 op=8 MOD dn="cn=user00003,ou=people,dc=example,dc=com" Under normal conditions server should not crash, as the situation where clock_gettime() returns too many errors is very rare. Marking as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3894