Bug 1084747 - dnscache gets 100% CPU usage with version 1.05.9
Summary: dnscache gets 100% CPU usage with version 1.05.9
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: ndjbdns
Version: el6
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: pjp
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-04-06 08:17 UTC by Francisco Miguel Biete
Modified: 2020-06-30 09:31 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-30 09:31:07 UTC
Type: Bug


Attachments (Terms of Use)

Description Francisco Miguel Biete 2014-04-06 08:17:22 UTC
Description of problem:
Dnscache process gets 100% CPU usage after some minutes running.

top - 10:04:50 up 3 days, 19:25,  1 user,  load average: 1.01, 1.04, 1.03
Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.2%us, 43.5%sy,  0.0%ni, 31.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:    501240k total,   422808k used,    78432k free,   138028k buffers
Swap:  2064376k total,        0k used,  2064376k free,   181440k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                        
 1335 daemon    20   0  9908 6308  524 R 99.0  1.3   2466:49 dnscache


dns:~# strace -c -p 1335
Process 1335 attached - interrupt to quit
^CProcess 1335 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.23    0.092923           0    276703           poll
 35.54    0.052223           0    278385           gettimeofday
  0.24    0.000360           1       295           write
  0.24    0.000354           3       106           sendto
  0.21    0.000315           3       106           close
  0.19    0.000276           3       106           bind
  0.11    0.000167           1       147           recvmsg
  0.07    0.000107           1       106           socket
  0.04    0.000066           0       146           sendmsg
  0.04    0.000063           1       106           connect
  0.04    0.000059           0       212           fcntl
  0.03    0.000037           0       147           stat
  0.00    0.000000           0       106           recvfrom
  0.00    0.000000           0       106           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.146950                556777           total


I'm seeing too many gettimeofday calls

Version-Release number of selected component (if applicable):
ndjbdns-1.05.9-1.el6.x86_64
CentOS 6.5

How reproducible:

Steps to Reproduce:
1. Restart dnscache
2. Wait a few minutes, the process receives 10-25 request/sec
3. dnscache process will set to 100% CPU usage

Actual results:
DNSCACHE process gets 100% CPU usage

Expected results:
dnscache process has low CPU usage with that number of requests.


Additional info:
I have downgraded a server to ndjbdns-1.05.8-1.el6.x86_64, and with that version I'm not seeing 100% CPU usage

dns:~# strace -c -p 18682 
Process 18682 attached - interrupt to quit
^CProcess 18682 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 86.52    0.033993          28      1216           poll
  3.94    0.001549           1      1745           write
  3.38    0.001327           0      8532           gettimeofday
  0.90    0.000353           0       731           stat
  0.86    0.000336           1       454           close
  0.76    0.000300           1       454           connect
  0.76    0.000299           0       755           sendmsg
  0.76    0.000298           1       454           bind
  0.54    0.000214           0       731           recvmsg
  0.46    0.000179           0       454           socket
  0.42    0.000164           0       425           recvfrom
  0.28    0.000111           0       454           sendto
  0.28    0.000110           0       908           fcntl
  0.14    0.000056           0       454           setsockopt
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.039289                 17768           total

Comment 1 pjp 2014-04-06 12:26:50 UTC
  Hello Francisco,

Thank you for reporting this issue. Though it is not easily reproducible on my machine. Is there any specific setting/condition that needs to occur for it to hit the 99% cpu usage level?

'gettimeofday' call are made when processing new requests or when request times out.

I'll try to reproduce it to debug more.

Thank you.

Comment 2 pjp 2014-04-06 18:04:25 UTC
I'm able to reproduce it now. It seems to happen over a slow or saturated(~40% packet loss) connection, with MERGEQUERIES enabled(ie. non-null).

Comment 3 Francisco Miguel Biete 2014-04-07 07:27:45 UTC
Hi,

I don't see any error in the interface, but it's a virtualized server over XenServer, and my clients are over GPRS and 3G networks so they could be slow.

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16719340 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1087759 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5936484279 (5.5 GiB)  TX bytes:142185683 (135.5 MiB)
          Interrupt:247

This is my dnscache.conf, and yes MERGEQUERIES is enabled:

    DATALIMIT=8000000
    CACHESIZE=5000000
    IP=10.2.19.81
    IPSEND=0.0.0.0
    UID=2
    GID=2
    ROOT=/etc/ndjbdns
    HIDETTL=
    FORWARDONLY=
    MERGEQUERIES=1
    DEBUG_LEVEL=1


I'm going to test disabling MERGEQUERIES...

Regards,

Comment 4 Francisco Miguel Biete 2014-04-07 07:52:00 UTC
Setting MERGEQUERIES=0 doesn't help. After a few minutes dnscache CPU is 100% again.
Removing the line MERGEQUERIES helps, I'm not seeing high CPU usage.

So, it seems to be a change between 1.05.8 and 1.05.9 related with MERGEQUERIES.

Comment 5 pjp 2014-04-07 12:00:57 UTC
(In reply to Francisco Miguel Biete from comment #4)
> Setting MERGEQUERIES=0 doesn't help. After a few minutes dnscache CPU is
> 100% again.

  Setting it to 0 would not disable it, but not setting any value at all would. Ie. it needs to be NULL to disable it.

MERGEQUERIES=

> Removing the line MERGEQUERIES helps, I'm not seeing high CPU usage.

  Right. It needs to be NULL to disable it.

> So, it seems to be a change between 1.05.8 and 1.05.9 related with
> MERGEQUERIES.

  I think the issue is with the MERGEQUERIES function. I'm debugging it more to see how to fix it best.

Thank you so much for the update and for confirming the issue.

Thank you.

Comment 6 Matthew Kent 2015-07-04 06:14:55 UTC
Seeing this issue here as well with ndjbdns 1.06: after a brief period of time dnscache starts using 100% of a cpu and won't come down. Initially I thought it was from cache eviction, but DEBUG=3 confirmed we aren't using all the cache. Disabling MERGEQUERIES seems to have fixed it.

Comment 7 Tim Stewart 2017-03-08 00:27:21 UTC
I have submitted a patch to the ndjbdns project on GitHub that may resolve this issue.  It can be found at https://github.com/pjps/ndjbdns/pull/29.

Comment 8 pjp 2017-03-08 06:18:30 UTC
Hello Tim,

(In reply to Tim Stewart from comment #7)
> I have submitted a patch to the ndjbdns project on GitHub that may resolve
> this issue.  It can be found at https://github.com/pjps/ndjbdns/pull/29.

  Great! Thank you so much for the patch. I'll pull and merge it soon.

Thank you.

Comment 9 pjp 2017-03-13 06:15:27 UTC
Proposed patch(above) from Tim has been merged upstream
  -> https://github.com/pjps/ndjbdns/commit/73527fbe28c6e18229b28f9d437be0ab5960c21f

Comment 10 Simone Caronni 2020-06-30 09:31:07 UTC
Package has been retired, project is unmaintained since 2017.


Note You need to log in before you can comment on or make changes to this bug.