Bug 1775708

Summary: High CPU consumption when log file I/O is slow
Product: Red Hat Enterprise Linux 8 Reporter: Renaud Métrich <rmetrich>
Component: unboundAssignee: Petr Menšík <pemensik>
Status: CLOSED ERRATA QA Contact: Petr Sklenar <psklenar>
Severity: medium Docs Contact:
Priority: medium    
Version: 8.1CC: aklimov, pemensik, psklenar, thozza, tkorbar
Target Milestone: betaKeywords: AutoVerified, Patch, Reproducer
Target Release: 8.2   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: unbound-1.7.3-9.el8 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-28 16:02:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Renaud Métrich 2019-11-22 16:05:58 UTC
This bug was initially created as a copy of Bug #1775706

I am copying this bug because: this also applies to RHEL8 (and Upstream)



Description of problem:

When having the log file hosted on a slow file system (e.g. NFS + sync option) or file system has "issues" (e.g. XFS + use of inotify on the log file), "unbound" threads start spinning on the CPUs.

This is due to having all locks (including I/O locks) implemented as "pthread_spin_lock" instead of "pthread_mutex_lock":

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

log.c:

194 void
195 log_vmsg(int pri, const char* type,
196         const char *format, va_list args)
197 {
 :
242         lock_quick_lock(&log_lock);
243         if(!logfile) {
244                 lock_quick_unlock(&log_lock);
245                 return;
246         }
 :
271         lock_quick_unlock(&log_lock);
272 }

locks.h: (we have HAVE_PTHREAD_SPINLOCK_T defined)

145 #define lock_quick_lock(lock) LOCKRET(pthread_spin_lock(lock))
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

I considered that using pthread_spin_lock() is a bug when it deals with I/Os, since these can take a long time.


Version-Release number of selected component (if applicable):

unbound-1.6.6-1.el7.x86_64 (RHEL 7.7)
unbound-1.7.3-8.el8.x86_64 (RHEL 8.1)
Also latest upstream (same code)

How reproducible:

Always


Steps to Reproduce:

1. vm-unbound7: install "unbound" and configure 4 threads (see attached file "unbound.tgz"), redirect log to NFS mount point with "sync" option

  Set system to "permissive" if needed.

2. From another system, execute "dnsperf"

  $ cat goog 
  www.google.co.jp a
  $ dnsperf -s vm-unbound7 -d goog -c 1024 -q 1024 -Q 2000 -l 600

3. vm-unbound7: execute pidstat (sysstat package) to see CPU consumption

  # pidstat -t -p $(pidof unbound) 10


Actual results:

04:14:46 PM   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
04:14:56 PM   998      1966         -  100.00    0.00    0.00  100.00     0  unbound
04:14:56 PM   998         -      1966   53.50    0.00    0.00   53.50     0  |__unbound
04:14:56 PM   998         -      1968   56.30    0.00    0.00   56.30     1  |__unbound
04:14:56 PM   998         -      1969   52.40    0.00    0.00   52.40     3  |__unbound
04:14:56 PM   998         -      1970   52.00    0.00    0.00   52.00     2  |__unbound


Expected results:

04:12:25 PM   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
04:12:35 PM   998      1841         -   19.50   25.80    0.00   45.30     2  unbound
04:12:35 PM   998         -      1841    4.70    6.80    0.00   11.50     2  |__unbound
04:12:35 PM   998         -      1844    4.40    6.10    0.00   10.50     0  |__unbound
04:12:35 PM   998         -      1845    4.80    7.00    0.00   11.80     3  |__unbound
04:12:35 PM   998         -      1846    5.60    5.80    0.00   11.40     1  |__unbound


Additional info:

Taking a coredump shows this:
- 3 threads locked on pthread_spin_lock() (consuming CPU)
- 1 thread writing to lock file

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) info threads 
  Id   Target Id         Frame 
  4    Thread 0x7fcb86f04840 (LWP 10875) pthread_spin_lock ()
    at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
  3    Thread 0x7fcb82acb700 (LWP 10879) pthread_spin_lock ()
    at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:34
  2    Thread 0x7fcb832cc700 (LWP 10878) pthread_spin_lock ()
    at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
* 1    Thread 0x7fcb83acd700 (LWP 10877) 0x00007fcb85b12a0d in write ()
    at ../sysdeps/unix/syscall-template.S:81


(gdb) thread 4
[Switching to thread 4 (Thread 0x7fcb86f04840 (LWP 10875))]
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
33		cmpl	$0, 0(%rdi)

(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
#1  0x00005611efbdea7c in log_vmsg (pri=6, type=0x5611efc0ab08 "info", 
    format=0x5611efc0a5a8 "%s %s %s %s %s %lld.%6.6d %d %d", args=0x7ffde1639220) at util/log.c:242
#2  0x00005611efbe4248 in log_info (format=<optimized out>) at util/log.c:283
#3  0x00005611efb8c93c in log_reply_info (qinf=<optimized out>, addr=<optimized out>, 
    addrlen=<optimized out>, dur=..., cached=1, rmsg=0x5611f1544050, v=NO_VERBOSE)
    at util/data/msgreply.c:849
#4  0x00005611efbc5b61 in worker_handle_request (c=<optimized out>, arg=0x5611f15033c0, 
    error=<optimized out>, repinfo=<optimized out>) at daemon/worker.c:1433
#5  0x00005611efbdb0f9 in comm_point_udp_ancil_callback (fd=3, event=<optimized out>, 
    arg=<optimized out>) at util/netevent.c:646
#6  0x00007fcb8684ca14 in event_process_active_single_queue (activeq=0x5611f1543b90, 
    base=0x5611f1543750) at event.c:1350
#7  event_process_active (base=<optimized out>) at event.c:1420
#8  event_base_loop (base=0x5611f1543750, flags=0) at event.c:1621
#9  0x00005611efbc1afc in ub_event_base_dispatch (base=<optimized out>) at util/ub_event.c:274
#10 comm_base_dispatch (b=<optimized out>) at util/netevent.c:241
#11 0x00005611efbc9531 in worker_work (worker=<optimized out>) at daemon/worker.c:1742
#12 daemon_fork (daemon=<optimized out>) at daemon/daemon.c:651
#13 0x00005611efb5ecb8 in run_daemon (need_pidfile=1, log_default_identity=0x7ffde163bf82 "unbound", 
    debug_mode=1, cmdline_verbose=0, cfgfile=0x5611efc246c6 "/etc/unbound/unbound.conf")
    at daemon/unbound.c:648
#14 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:745
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Comment 1 Renaud Métrich 2019-12-03 09:23:26 UTC
Fix accepted upstream: https://github.com/NLnetLabs/unbound/commit/4edb16296b342b9cc3809a7ae5758a72c44c9b3c

Comment 3 Tomáš Hozza 2019-12-06 12:04:54 UTC
*** Bug 1779156 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2020-04-28 16:02:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1716