Bug 2209031

Summary: regression: can not start dnsmasq / systemctl timeout / 100% cpu [rhel9]
Product: Red Hat Enterprise Linux 9 Reporter: Leon Fauster <leonfauster>
Component: dnsmasq Assignee: Petr Menšík <pemensik>
Status: CLOSED ERRATA QA Contact: Petr Dancak <pdancak>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: CentOS Stream CC: bstinson, jwboyer, pdancak, psklenar, sbroz
Target Milestone: rc Keywords: Regression, TestCaseNeeded, TestCaseProvided, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: dnsmasq-2.85-11.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2233542 Environment:
Last Closed: 2023-11-07 08:36:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2188712    
Bug Blocks: 2233542    

Description Leon Fauster 2023-05-22 10:36:59 UTC
Description of problem:

We tested dnsmasq-2.85-10.el9.x86_64 and it introduces a regression here:

May 21 16:20:53 s2 systemd[1]: dnsmasq.service: start operation timed out. Terminating.
May 21 16:20:53 s2 systemd[1]: dnsmasq.service: Control process exited, code=exited status=5
May 21 16:20:53 s2 systemd[1]: dnsmasq.service: Failed with result 'timeout'.
May 21 16:20:58 s2 systemd[1]: dnsmasq.service: Service RestartSec=5s expired, scheduling restart.
May 21 16:20:58 s2 systemd[1]: dnsmasq.service: Scheduled restart job, restart counter is at 1.
May 21 16:22:31 s2 systemd[1]: dnsmasq.service: start operation timed out. Terminating.
May 21 16:22:31 s2 systemd[1]: dnsmasq.service: Control process exited, code=exited status=5
May 21 16:22:31 s2 systemd[1]: dnsmasq.service: Failed with result 'timeout'.
May 21 16:22:36 s2 systemd[1]: dnsmasq.service: Service RestartSec=5s expired, scheduling restart.
May 21 16:22:36 s2 systemd[1]: dnsmasq.service: Scheduled restart job, restart counter is at 2.
May 21 16:23:43 s2 systemd[1]: dnsmasq.service: Control process exited, code=exited status=5
May 21 16:23:43 s2 systemd[1]: dnsmasq.service: Failed with result 'exit-code'.


dnsmasq does not start and the CPUs stay at 100% until systemd times out (the restarts come from a local drop-in config).

Downgrading to dnsmasq-2.85-7.el9.x86_64 resolves the startup problem.


Version-Release number of selected component (if applicable):
dnsmasq-2.85-10.el9.x86_64

How reproducible:
Update to dnsmasq-2.85-10.el9.x86_64 from dnsmasq-2.85-6.el9.x86_64 or dnsmasq-2.85-7.el9.x86_64

Actual results:
does not start


Expected results:
it starts normally


Additional info:

The cause is most likely that we have a big list (>100000 entries) here:

# tail /etc/dnsmasq.d/dns-hosts-void.conf 
...
address=/foo.bar/0.0.0.0
address=/bob.alice/0.0.0.0

but dnsmasq-2.85-6.el9.x86_64 and dnsmasq-2.85-7.el9.x86_64 have no problem reading it and starting the daemon.

Does this regression come from the change for #2188712 in release 2.85-8?

Comment 1 Petr Menšík 2023-06-09 17:58:19 UTC
Would you mind attaching your /etc/dnsmasq.d/dns-hosts-void.conf, compressed with gzip?

I can try to create my own long list of similar addresses, but I cannot guarantee that it will reproduce the problem.

Comment 2 Leon Fauster 2023-06-09 18:22:18 UTC
Thanks for taking a look. Please try the following process to create the list:

# Assumption: CURRENTDAY is not defined in the original comment; any date stamp works
CURRENTDAY="$(date +%F)"

# Download raw list
curl -s "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts" | \
  grep -v newrelic.com | grep -v '^#' | grep -v '27--' | grep '^0.0.0.0' | \
  sort > "${CURRENTDAY}-dns-void-StevenBlack.conf"

# Translate to dnsmasq config
sed 's/0\.0\.0\.0 www\./0.0.0.0 /' "${CURRENTDAY}-dns-void-StevenBlack.conf" | \
  awk '{print $2}' | sed 's/^/address=\//' | sed 's/$/\/0.0.0.0/' | \
  sort | uniq > "local-${CURRENTDAY}-dns-void-StevenBlack.conf"
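
The translation step above can also be sketched as a single filter. This is a simplified variant under stated assumptions, not the exact pipeline from the comment: the function name hosts_to_dnsmasq is hypothetical, and the newrelic.com and "27--" exclusions are left out.

```shell
# Hypothetical helper: convert hosts-format lines on stdin
# ("0.0.0.0 name") into dnsmasq "address=/name/0.0.0.0" lines.
hosts_to_dnsmasq() {
  awk 'NF >= 2 && $1 == "0.0.0.0" {
         host = $2
         sub(/^www\./, "", host)          # drop a leading "www." as the sed step does
         print "address=/" host "/0.0.0.0"
       }' | LC_ALL=C sort -u              # sorted output, duplicates removed
}
```

It reads the downloaded hosts file on stdin and writes the dnsmasq config on stdout, for example: hosts_to_dnsmasq < hosts > local-dns-void.conf (both file names are placeholders).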

Comment 3 Petr Menšík 2023-06-09 18:25:43 UTC
Oh, I can confirm that there is a problem. dnsmasq now searches for previously used domain entries. It uses a simple linear walk without any optimization, which does not scale and gets slow when the number of used domains is high.

The slowdown is visible even in plain --test mode.

# for I in {1..10000}; do printf "address=/block.%x.%x/0.0.0.0\n" $RANDOM $RANDOM >> block-10k.conf; done
# time dnsmasq --test --conf-file=block-10k.conf 
dnsmasq: syntax check OK.

real	0m0.968s
user	0m0.958s
sys	0m0.005s
# for I in {1..50000}; do printf "address=/block.%x.%x/0.0.0.0\n" $RANDOM $RANDOM >> block-50k.conf; done
# time dnsmasq --test --conf-file=block-50k.conf 
dnsmasq: syntax check OK.

real	0m21.197s
user	0m21.060s
sys	0m0.030s

# for I in {1..100000}; do printf "address=/block.%x.%x/0.0.0.0\n" $RANDOM $RANDOM >> block-100k.conf; done
# time dnsmasq --test --conf-file=block-100k.conf 
dnsmasq: syntax check OK.

real	1m33.076s
user	1m32.534s
sys	0m0.060s
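
The timings above can be turned into a rough growth estimate. The sketch below computes the exponent k in "time grows like n^k" from two measurements; the function name growth_exponent is made up for illustration, and it only assumes that awk is available.

```shell
# Hypothetical helper: estimate k in time ~ n^k from two (n, seconds) pairs.
# Usage: growth_exponent N1 T1 N2 T2
growth_exponent() {
  LC_ALL=C awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" 'BEGIN {
    # k = log(t2/t1) / log(n2/n1)
    printf "%.2f\n", log(t2 / t1) / log(n2 / n1)
  }'
}
```

For the measurements above, growth_exponent 10000 0.968 50000 21.197 and growth_exponent 50000 21.197 100000 93.076 both give values close to 2, so the run time grows roughly with the square of the number of entries.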

Comment 4 Petr Menšík 2023-06-09 18:44:07 UTC
This is exactly the problem that a larger rewrite solved in version 2.86, but that rewrite caused a lot of issues later, which is why I chose not to simply rebase to a newer version. I tried a simpler method to achieve a similar result without changing so much. But I am not sure this can be solved within my downstream changes.

I have avoided introducing a sorted array, but I need to save last_server per domain somewhere. That requires searching whether the domain already has a record, which leads to quadratic complexity: each new domain is compared against all domains recorded so far.

Personally, I would use unbound for this many blocked-domain entries. But fixing this would not be simple.

I guess I could use a trick and skip the search for existing domains for record types like --local=/blocked/ or --address=/blocked/#. Those do not need records stored in struct server_domain; such records are relevant only for forwarding to servers that have an IP. Similar logic exists in the upstream version too.
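
The cost of searching the existing records on every insert can be illustrated with a toy model. This is not dnsmasq code: the function names and the block.N.example domains are invented for the sketch, which only counts how many comparisons a "walk the list, then append" insert needs compared with a keyed lookup.

```shell
# Toy model: insert N distinct domains, walking the whole list before each append.
linear_insert_comparisons() {
  awk -v n="$1" 'BEGIN {
    count = 0; comparisons = 0
    for (i = 0; i < n; i++) {
      domain = "block." i ".example"
      found = 0
      for (j = 0; j < count; j++) {       # linear walk over existing records
        comparisons++
        if (seen[j] == domain) { found = 1; break }
      }
      if (!found) seen[count++] = domain
    }
    print comparisons
  }'
}

# Same inserts, but with one keyed (hash) lookup per domain.
hashed_insert_lookups() {
  awk -v n="$1" 'BEGIN {
    lookups = 0
    for (i = 0; i < n; i++) {
      domain = "block." i ".example"
      lookups++
      if (!(domain in seen)) seen[domain] = 1
    }
    print lookups
  }'
}
```

With N distinct domains the linear variant performs N*(N-1)/2 comparisons, so 100000 entries need about 5 billion of them, while the keyed variant needs only N lookups.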

Comment 5 Petr Menšík 2023-06-09 21:02:46 UTC
Prepared a fix candidate. It walks the existing records only for normal servers, not for --local or --address=/x/#.

Not yet properly tested.

https://gitlab.com/redhat/centos-stream/rpms/dnsmasq/-/merge_requests/19

Comment 6 Petr Menšík 2023-06-09 22:04:05 UTC
Needed a fix for default forwarders; seems okay after basic checks.

Comment 7 Petr Menšík 2023-06-09 22:32:15 UTC
Created a simple regression test:
https://src.fedoraproject.org/tests/dnsmasq/c/e54f9dd42fc33e8a7e63d4dc278f631bda34ed49

Comment 18 errata-xmlrpc 2023-11-07 08:36:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: dnsmasq security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6524