Bug 1109651 - Unbound attempts to repopulate cache on reconnect, resulting in large quantity of outbound requests.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: unbound
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Paul Wouters
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: dnssec
 
Reported: 2014-06-16 04:03 UTC by William Brown
Modified: 2015-01-28 17:43 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:



Description William Brown 2014-06-16 04:03:33 UTC
Description of problem:
Unbound attempts to prefetch records as they are about to expire. Consider google.com: it has a 300-second TTL, which means that every 300 seconds unbound will expire the record, and the next lookup to google generates ~400 requests/responses. For a mobile 3G user this does not help me; in fact it wastes my quota by repeatedly querying DNS for records I am not using (and because I am on a mobile hotspot, unbound won't use the hotspot's DNS, which is free; it resolves recursively, which does cost me data).

When a zone is flushed, all of its records are expired, but they are only marked expired; they are not actually deleted from the cache memory. So when you change networks and make the first DNS query, unbound attempts to prefetch *every single* DNS record you have visited since your OS booted. When I went from home to mobile data on my laptop, this generated 2976 DNS lookups and responses. Let's assume a 1:1 mapping of requests to responses; wireshark shows a request is 81 bytes, and the average response was ~400 bytes. That means my laptop generated 715728 bytes, or 698 KB, of traffic. Given that I live in Australia, which has practically second-world mobile data, I get about 100 KB/s, so the initial network connect, with no packet loss, takes 7 seconds of DNS queries alone to repopulate my cache. My telco charges $0.10 per 1 MB (which is cheap in Australia ...), which means unbound literally costs me approximately $0.07 every time I connect to my mobile hotspot. Which, for me, is twice daily.
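As a sanity check on the arithmetic above (a sketch using the figures quoted in the report: 2976 request/response packets, 81-byte requests, ~400-byte average responses, 100 KB/s, $0.10 per MB):

```python
# Back-of-the-envelope check of the traffic and cost figures above.
packets = 2976            # DNS requests + responses seen in wireshark
request_bytes = 81        # size of one request, per wireshark
response_bytes = 400      # approximate average response size

# Assume a 1:1 mapping of requests to responses.
pairs = packets // 2
total_bytes = pairs * (request_bytes + response_bytes)

total_kb = total_bytes / 1024          # ~698 KB of traffic
seconds = total_kb / 100               # ~7 s at roughly 100 KB/s
cost = total_bytes / 1_000_000 * 0.10  # ~$0.07 at $0.10 per MB

print(total_bytes, int(total_kb), round(seconds), round(cost, 2))
```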

The point of unbound was to make DNS faster. On a mobile device this behaviour is wasteful and costly to the user, and I end up with an unresponsive network for several seconds while it runs. Prefetching must be disabled for workstation installs.
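For reference, prefetching is governed by a single option in unbound.conf; a minimal sketch of disabling it (note that upstream's default for prefetch is already "no", so this only matters where a distribution ships it enabled):

```
server:
    # Do not refresh cache entries shortly before their TTL expires;
    # re-resolve only when a client actually asks again.
    prefetch: no
    # Likewise for the DNSKEY records fetched for DNSSEC validation.
    prefetch-key: no
```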

How reproducible:
Always

Comment 1 Paul Wouters 2014-06-16 17:00:39 UTC
First of all, prefetching is only enabled for cached entries that are still being asked for. So not pre-fetching google.com will just lead to identical behaviour a few seconds later, except then the user is actively waiting on it because there is no cached entry.

Second, there should not be 400 requests just to refresh a google.com A record. That looks like a bug in the interaction of flushing and prefetching.

(another shot in the foot by split DNS views and the requirement to flush caches)

I'll talk to upstream about this one.

Comment 2 William Brown 2014-06-16 23:26:22 UTC
(In reply to Paul Wouters from comment #1)
> First of all, prefetching is only enabled for cached entries that are still
> being asked for.

And how do you determine that something is "still being asked for"?

Comment 3 Paul Wouters 2014-06-17 01:18:27 UTC
By whether it is still being queried for, i.e. by the number of cache hits on the entry.

Comment 4 Paul Wouters 2014-06-17 14:23:39 UTC
from upstream:

Unbound does not pre-fetch expired records.  They are fetched
normally, i.e. when a user asks for them.  So the cache flush does not
cause lookups.  It also does not cause pre-fetches.

When a record is 'near expiry' then a pre-fetch happens (if prefetch
is enabled).  This pre-fetch only happens when a user queries for that
record.  The pre-fetch does not happen if the user does not query for
the record.

As for the 400-queries part of the report: unbound will only
generate an outbound query (except for scheduled trust-anchor
updates) when there is an inbound user query.  This also holds for
pre-fetch.  So those 400 queries would only happen if the user is
performing 400 lookups.

I am trying to deduce what unbound would do in theory.  If in practice
this is not happening, some bug hunting is required.  Does the user
have real data or are these his estimates?  Perhaps his mobile
performs a lot of lookups when it connects to a hotspot, which is
possible, with, I don't know, a lot of apps.

Comment 5 William Brown 2014-06-18 03:27:12 UTC
Then the issue isn't prefetching, but that doesn't explain the symptoms I see.

I closed all my applications (except gedit) and opened wireshark. I made sure to disable hostname-to-IP resolution via DNS so as not to pollute the data. I then ran "unbound-control flush_zone ." as NetworkManager does on a network change. ~4000 DNS requests/responses were made. I have provided stats before and after below.
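The procedure above corresponds roughly to the following (illustrative commands; `stats_noreset` is assumed here because it reports counters without zeroing them, which is what cumulative before/after snapshots rely on):

```
unbound-control stats_noreset > before.txt   # first counter snapshot
unbound-control flush_zone .                 # what NetworkManager issues on a network change
# ... observe the burst of DNS traffic in wireshark ...
unbound-control stats_noreset > after.txt    # second snapshot (the "AFTER" block below)
```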

thread0.num.queries=2761597
thread0.num.cachehits=1848794
thread0.num.cachemiss=912803
thread0.num.prefetch=0
thread0.num.recursivereplies=912803
thread0.requestlist.avg=1.90029
thread0.requestlist.max=370
thread0.requestlist.overwritten=0
thread0.requestlist.exceeded=0
thread0.requestlist.current.all=0
thread0.requestlist.current.user=0
thread0.recursion.time.avg=0.040583
thread0.recursion.time.median=0.0325666
thread1.num.queries=3107851
thread1.num.cachehits=2121638
thread1.num.cachemiss=986213
thread1.num.prefetch=0
thread1.num.recursivereplies=986213
thread1.requestlist.avg=1.9175
thread1.requestlist.max=292
thread1.requestlist.overwritten=0
thread1.requestlist.exceeded=0
thread1.requestlist.current.all=0
thread1.requestlist.current.user=0
thread1.recursion.time.avg=0.041769
thread1.recursion.time.median=0.0322179
total.num.queries=5869448
total.num.cachehits=3970432
total.num.cachemiss=1899016
total.num.prefetch=0
total.num.recursivereplies=1899016
total.requestlist.avg=1.90923
total.requestlist.max=370
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.041199
total.recursion.time.median=0.0323922
time.now=1403061635.517116
time.up=69012.793916
time.elapsed=69012.793916
mem.total.sbrk=16527360
mem.cache.rrset=2040582
mem.cache.message=2622140
mem.mod.iterator=16532
mem.mod.validator=245809
histogram.000000.000000.to.000000.000001=677
histogram.000000.000001.to.000000.000002=0
histogram.000000.000002.to.000000.000004=4
histogram.000000.000004.to.000000.000008=87
histogram.000000.000008.to.000000.000016=286
histogram.000000.000016.to.000000.000032=376
histogram.000000.000032.to.000000.000064=457
histogram.000000.000064.to.000000.000128=1407
histogram.000000.000128.to.000000.000256=5171
histogram.000000.000256.to.000000.000512=5931
histogram.000000.000512.to.000000.001024=12765
histogram.000000.001024.to.000000.002048=48677
histogram.000000.002048.to.000000.004096=62429
histogram.000000.004096.to.000000.008192=93367
histogram.000000.008192.to.000000.016384=146763
histogram.000000.016384.to.000000.032768=584974
histogram.000000.032768.to.000000.065536=739025
histogram.000000.065536.to.000000.131072=141874
histogram.000000.131072.to.000000.262144=15671
histogram.000000.262144.to.000000.524288=36335
histogram.000000.524288.to.000001.000000=1514
histogram.000001.000000.to.000002.000000=855
histogram.000002.000000.to.000004.000000=210
histogram.000004.000000.to.000008.000000=114
histogram.000008.000000.to.000016.000000=26
histogram.000016.000000.to.000032.000000=20
histogram.000032.000000.to.000064.000000=1
histogram.000064.000000.to.000128.000000=0
histogram.000128.000000.to.000256.000000=0
histogram.000256.000000.to.000512.000000=0
histogram.000512.000000.to.001024.000000=0
histogram.001024.000000.to.002048.000000=0
histogram.002048.000000.to.004096.000000=0
histogram.004096.000000.to.008192.000000=0
histogram.008192.000000.to.016384.000000=0
histogram.016384.000000.to.032768.000000=0
histogram.032768.000000.to.065536.000000=0
histogram.065536.000000.to.131072.000000=0
histogram.131072.000000.to.262144.000000=0
histogram.262144.000000.to.524288.000000=0
num.query.type.A=2935965
num.query.type.SOA=8
num.query.type.PTR=56
num.query.type.AAAA=2933218
num.query.type.SRV=201
num.query.class.IN=5869448
num.query.opcode.QUERY=5869448
num.query.tcp=0
num.query.ipv6=0
num.query.flags.QR=0
num.query.flags.AA=0
num.query.flags.TC=0
num.query.flags.RD=5869448
num.query.flags.RA=0
num.query.flags.Z=0
num.query.flags.AD=1
num.query.flags.CD=0
num.query.edns.present=1
num.query.edns.DO=0
num.answer.rcode.NOERROR=15361
num.answer.rcode.SERVFAIL=1897707
num.answer.rcode.NXDOMAIN=3956380
num.answer.rcode.nodata=2225
num.answer.secure=358
num.answer.bogus=1971
num.rrset.bogus=68
unwanted.queries=0
unwanted.replies=8


AFTER

thread0.num.queries=2761606
thread0.num.cachehits=1848799
thread0.num.cachemiss=912807
thread0.num.prefetch=0
thread0.num.recursivereplies=912807
thread0.requestlist.avg=1.90028
thread0.requestlist.max=370
thread0.requestlist.overwritten=0
thread0.requestlist.exceeded=0
thread0.requestlist.current.all=0
thread0.requestlist.current.user=0
thread0.recursion.time.avg=0.040586
thread0.recursion.time.median=0.0325667
thread1.num.queries=3107864
thread1.num.cachehits=2121644
thread1.num.cachemiss=986220
thread1.num.prefetch=0
thread1.num.recursivereplies=986220
thread1.requestlist.avg=1.91751
thread1.requestlist.max=292
thread1.requestlist.overwritten=0
thread1.requestlist.exceeded=0
thread1.requestlist.current.all=0
thread1.requestlist.current.user=0
thread1.recursion.time.avg=0.041777
thread1.recursion.time.median=0.0322181
total.num.queries=5869470
total.num.cachehits=3970443
total.num.cachemiss=1899027
total.num.prefetch=0
total.num.recursivereplies=1899027
total.requestlist.avg=1.90923
total.requestlist.max=370
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.041205
total.recursion.time.median=0.0323924
time.now=1403061790.716212
time.up=69167.993012
time.elapsed=69167.993012
mem.total.sbrk=16527360
mem.cache.rrset=2041436
mem.cache.message=2624320
mem.mod.iterator=16532
mem.mod.validator=245531
histogram.000000.000000.to.000000.000001=677
histogram.000000.000001.to.000000.000002=0
histogram.000000.000002.to.000000.000004=4
histogram.000000.000004.to.000000.000008=87
histogram.000000.000008.to.000000.000016=286
histogram.000000.000016.to.000000.000032=376
histogram.000000.000032.to.000000.000064=457
histogram.000000.000064.to.000000.000128=1407
histogram.000000.000128.to.000000.000256=5171
histogram.000000.000256.to.000000.000512=5931
histogram.000000.000512.to.000000.001024=12765
histogram.000000.001024.to.000000.002048=48677
histogram.000000.002048.to.000000.004096=62429
histogram.000000.004096.to.000000.008192=93367
histogram.000000.008192.to.000000.016384=146763
histogram.000000.016384.to.000000.032768=584974
histogram.000000.032768.to.000000.065536=739025
histogram.000000.065536.to.000000.131072=141874
histogram.000000.131072.to.000000.262144=15673
histogram.000000.262144.to.000000.524288=36335
histogram.000000.524288.to.000001.000000=1515
histogram.000001.000000.to.000002.000000=862
histogram.000002.000000.to.000004.000000=211
histogram.000004.000000.to.000008.000000=114
histogram.000008.000000.to.000016.000000=26
histogram.000016.000000.to.000032.000000=20
histogram.000032.000000.to.000064.000000=1
histogram.000064.000000.to.000128.000000=0
histogram.000128.000000.to.000256.000000=0
histogram.000256.000000.to.000512.000000=0
histogram.000512.000000.to.001024.000000=0
histogram.001024.000000.to.002048.000000=0
histogram.002048.000000.to.004096.000000=0
histogram.004096.000000.to.008192.000000=0
histogram.008192.000000.to.016384.000000=0
histogram.016384.000000.to.032768.000000=0
histogram.032768.000000.to.065536.000000=0
histogram.065536.000000.to.131072.000000=0
histogram.131072.000000.to.262144.000000=0
histogram.262144.000000.to.524288.000000=0
num.query.type.A=2935977
num.query.type.SOA=8
num.query.type.PTR=56
num.query.type.AAAA=2933226
num.query.type.SRV=203
num.query.class.IN=5869470
num.query.opcode.QUERY=5869470
num.query.tcp=0
num.query.ipv6=0
num.query.flags.QR=0
num.query.flags.AA=0
num.query.flags.TC=0
num.query.flags.RD=5869470
num.query.flags.RA=0
num.query.flags.Z=0
num.query.flags.AD=1
num.query.flags.CD=0
num.query.edns.present=1
num.query.edns.DO=0
num.answer.rcode.NOERROR=15383
num.answer.rcode.SERVFAIL=1897707
num.answer.rcode.NXDOMAIN=3956380
num.answer.rcode.nodata=2233
num.answer.secure=358
num.answer.bogus=1971
num.rrset.bogus=68
unwanted.queries=0
unwanted.replies=8

Comment 6 Pavel Šimerda (pavlix) 2015-01-22 09:39:09 UTC
Any news on why this happens?

Comment 7 Paul Wouters 2015-01-22 16:46:40 UTC
flushing "." means dropping the _entire_ cache. So a single application doing a DNS call will trigger a huge cache fill.

It has nothing to do with prefetching.

It might even be that unbound itself triggers these, for example by priming the root and fetching the root key and records to see if the built-in root hints need updating.

That is not a good test at all. If you want to test prefetching, query for a record with a short TTL that you know no one else will query for, then later dump the cache and look for the entry.
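The test Paul describes could be sketched as follows (illustrative commands; `short-ttl.example.com` is a hypothetical name standing in for a record only you query):

```
# 1. Query a low-TTL name that no one else will ask for
dig @127.0.0.1 short-ttl.example.com A

# 2. Wait until around the time the TTL expires
sleep 30

# 3. Without issuing a second client query, dump the cache and
#    see whether unbound refreshed the entry on its own
unbound-control dump_cache | grep short-ttl.example.com
```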

Comment 8 William Brown 2015-01-23 01:13:00 UTC
So a single lookup to foo.com, after a cache flush of ".", causes unbound to try to refetch every record it ever held in its cache? That seems very inefficient. Why not just fetch what's needed, when it's requested?

It may not have been prefetching that caused my issue, but the core of my concern is that when you flush the cache, unbound attempts to fetch a large number of records to repopulate itself.

I will update the description accordingly, but I think this behavior should be corrected.

Comment 9 Paul Wouters 2015-01-28 01:18:30 UTC
It is not "re-populating" the cache. It is querying, starting from an empty cache.

So yes, if you flush everything, it needs to go to the root, to COM, to ORG, to root-servers.net, to afilias.*, and issue all those thousands of queries whose answers you would normally already find in a cache that has been up for a while.

You seem to want to flush the cache and yet avoid the cost of starting from empty.

This is not a bug. This is the behaviour of a caching resolver that starts up from scratch.

Comment 11 William Brown 2015-01-28 08:10:40 UTC
No, I want it to flush the cache and not proactively attempt to fill the cache with every DNS record that I have ever looked at. For example, just because I used foo.example.com a few hours ago, doesn't mean that unbound should zealously repopulate the record unless I ask for it again.

I have tested this with the current version and it appears that this over-aggressive behaviour is no longer the case.

As a result, I will close the issue.

Comment 12 Pavel Šimerda (pavlix) 2015-01-28 10:11:47 UTC
(In reply to William Brown from comment #11)
> I have tested this with the current version and it appears that this
> over-aggressive behaviour is no longer the case.

Thanks!

Comment 13 Paul Wouters 2015-01-28 17:43:28 UTC
This is not about PRO_ACTIVELY filling the cache!

If you nuke the entire cache and then do one single lookup for fedoraproject.org, you need to:

- query the root nameservers for NS
  - query the ORG nameservers for .org
    - query for afilias-nst.info
      - query for info.
    - query for afilias-nst.info.

and so on. You will require hundreds or thousands of queries to fetch all the NS, DNSKEY, and address records for the domains involved.
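The chain of referrals described above can be seen directly with `dig +trace`, which performs the same iterative walk from the root instead of consulting a cache (illustrative; real output is much longer):

```
dig +trace fedoraproject.org A
# .                  -> referral to the org. nameservers (plus root DNSKEY)
# org.               -> referral to the fedoraproject.org. nameservers
#                       (plus org. DS/DNSKEY for validation)
# fedoraproject.org. -> the final A answer
```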

Please do not later decide that this is "happening again" and re-open the bug.

When you flush the cache, unbound has NO LIST OF ANYTHING you queried before, so it cannot be doing anything "proactively".

