Bug 1084747

Summary:	dnscache gets 100% CPU usage with version 1.05.9
Product:	[Fedora] Fedora EPEL	Reporter:	Francisco Miguel Biete <fbiete>
Component:	ndjbdns	Assignee:	pjp <pj.pandit>
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	el6	CC:	amb, mkent, negativo17, pj.pandit, tstewart, weeks
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-06-30 09:31:07 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Francisco Miguel Biete 2014-04-06 08:17:22 UTC

Description of problem:
Dnscache process gets 100% CPU usage after some minutes running.

top - 10:04:50 up 3 days, 19:25,  1 user,  load average: 1.01, 1.04, 1.03
Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.2%us, 43.5%sy,  0.0%ni, 31.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:    501240k total,   422808k used,    78432k free,   138028k buffers
Swap:  2064376k total,        0k used,  2064376k free,   181440k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                        
 1335 daemon    20   0  9908 6308  524 R 99.0  1.3   2466:49 dnscache


dns:~# strace -c -p 1335
Process 1335 attached - interrupt to quit
^CProcess 1335 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.23    0.092923           0    276703           poll
 35.54    0.052223           0    278385           gettimeofday
  0.24    0.000360           1       295           write
  0.24    0.000354           3       106           sendto
  0.21    0.000315           3       106           close
  0.19    0.000276           3       106           bind
  0.11    0.000167           1       147           recvmsg
  0.07    0.000107           1       106           socket
  0.04    0.000066           0       146           sendmsg
  0.04    0.000063           1       106           connect
  0.04    0.000059           0       212           fcntl
  0.03    0.000037           0       147           stat
  0.00    0.000000           0       106           recvfrom
  0.00    0.000000           0       106           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.146950                556777           total


I'm seeing too many gettimeofday calls

Version-Release number of selected component (if applicable):
ndjbdns-1.05.9-1.el6.x86_64
CentOS 6.5

How reproducible:

Steps to Reproduce:
1. Restart dnscache
2. Wait a few minutes, the process receives 10-25 request/sec
3. dnscache process will set to 100% CPU usage

Actual results:
DNSCACHE process gets 100% CPU usage

Expected results:
dnscache process has low CPU usage with that number of requests.


Additional info:
I have downgraded a server to ndjbdns-1.05.8-1.el6.x86_64, and with that version I'm not seeing 100% CPU usage

dns:~# strace -c -p 18682 
Process 18682 attached - interrupt to quit
^CProcess 18682 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 86.52    0.033993          28      1216           poll
  3.94    0.001549           1      1745           write
  3.38    0.001327           0      8532           gettimeofday
  0.90    0.000353           0       731           stat
  0.86    0.000336           1       454           close
  0.76    0.000300           1       454           connect
  0.76    0.000299           0       755           sendmsg
  0.76    0.000298           1       454           bind
  0.54    0.000214           0       731           recvmsg
  0.46    0.000179           0       454           socket
  0.42    0.000164           0       425           recvfrom
  0.28    0.000111           0       454           sendto
  0.28    0.000110           0       908           fcntl
  0.14    0.000056           0       454           setsockopt
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.039289                 17768           total

Comment 1 pjp 2014-04-06 12:26:50 UTC

  Hello Francisco,

Thank you for reporting this issue. Though it is not easily reproducible on my machine. Is there any specific setting/condition that needs to occur for it to hit the 99% cpu usage level?

'gettimeofday' call are made when processing new requests or when request times out.

I'll try to reproduce it to debug more.

Thank you.

Comment 2 pjp 2014-04-06 18:04:25 UTC

I'm able to reproduce it now. It seems to happen over a slow or saturated(~40% packet loss) connection, with MERGEQUERIES enabled(ie. non-null).

Comment 3 Francisco Miguel Biete 2014-04-07 07:27:45 UTC

Hi,

I don't see any error in the interface, but it's a virtualized server over XenServer, and my clients are over GPRS and 3G networks so they could be slow.

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16719340 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1087759 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5936484279 (5.5 GiB)  TX bytes:142185683 (135.5 MiB)
          Interrupt:247

This is my dnscache.conf, and yes MERGEQUERIES is enabled:

    DATALIMIT=8000000
    CACHESIZE=5000000
    IP=10.2.19.81
    IPSEND=0.0.0.0
    UID=2
    GID=2
    ROOT=/etc/ndjbdns
    HIDETTL=
    FORWARDONLY=
    MERGEQUERIES=1
    DEBUG_LEVEL=1


I'm going to test disabling MERGEQUERIES...

Regards,

Comment 4 Francisco Miguel Biete 2014-04-07 07:52:00 UTC

Setting MERGEQUERIES=0 doesn't help. After a few minutes dnscache CPU is 100% again.
Removing the line MERGEQUERIES helps, I'm not seeing high CPU usage.

So, it seems to be a change between 1.05.8 and 1.05.9 related with MERGEQUERIES.

Comment 5 pjp 2014-04-07 12:00:57 UTC

(In reply to Francisco Miguel Biete from comment #4)
> Setting MERGEQUERIES=0 doesn't help. After a few minutes dnscache CPU is
> 100% again.

  Setting it to 0 would not disable it, but not setting any value at all would. Ie. it needs to be NULL to disable it.

MERGEQUERIES=

> Removing the line MERGEQUERIES helps, I'm not seeing high CPU usage.

  Right. It needs to be NULL to disable it.

> So, it seems to be a change between 1.05.8 and 1.05.9 related with
> MERGEQUERIES.

  I think the issue is with the MERGEQUERIES function. I'm debugging it more to see how to fix it best.

Thank you so much for the update and for confirming the issue.

Thank you.

Comment 6 Matthew Kent 2015-07-04 06:14:55 UTC

Seeing this issue here as well with ndjbdns 1.06: after a brief period of time dnscache starts using 100% of a cpu and won't come down. Initially I thought it was from cache eviction, but DEBUG=3 confirmed we aren't using all the cache. Disabling MERGEQUERIES seems to have fixed it.

Comment 7 Tim Stewart 2017-03-08 00:27:21 UTC

I have submitted a patch to the ndjbdns project on GitHub that may resolve this issue.  It can be found at https://github.com/pjps/ndjbdns/pull/29.

Comment 8 pjp 2017-03-08 06:18:30 UTC

Hello Tim,

(In reply to Tim Stewart from comment #7)
> I have submitted a patch to the ndjbdns project on GitHub that may resolve
> this issue.  It can be found at https://github.com/pjps/ndjbdns/pull/29.

  Great! Thank you so much for the patch. I'll pull and merge it soon.

Thank you.

Comment 9 pjp 2017-03-13 06:15:27 UTC

Proposed patch(above) from Tim has been merged upstream
  -> https://github.com/pjps/ndjbdns/commit/73527fbe28c6e18229b28f9d437be0ab5960c21f

Comment 10 Simone Caronni 2020-06-30 09:31:07 UTC

Package has been retired, project is unmaintained since 2017.