Description of problem: Unbound caches DNS records even across interface state changes.

As a result, my workplace which uses Exchange at cas.ad.example.com is assigned a 10.0.0.X address. This is cached into unbound.

I then go home, or to some other network (Say a wireless hotspot). My laptop is *still* able to resolve cas.ad.example.com and attempts to connect to it.

This means an attacker could potentially fake 10.0.0.X on a network and get traffic destined for my work mail server. On a home network in 192.168.0.0/24, this traffic will likely be leaked to the ISP via default routes. If for some reason 10.0.0.X happens to be reachable, my laptop may begin to attempt to authenticate to it via HTTP, and may potentially disclose credentials or lead the way to other attacks.

This is a severe issue IMO as:

* It means the internal state of a network with split view DNS may be leaked outside of that network.
* It makes MITM attacks on some of my services more convenient.
* ISPs charge by upload, so if I am leaking traffic I am (literally) paying for that.

Unbound should not carry its cache across two networks. My work network and home network have fundamentally different views of the DNS world. Caching the DNS view from my workplace and imposing that on my home network (or vice versa) is not appropriate for an OS: you cannot assume that data is valid anymore.

Now, one may argue that BIND is already a caching server and that works. Yes: because it has one view of the DNS world, and that view is static. My laptop is mobile, and needs to perceive the many different views of the DNS world that exist.

In summary: Unbound should flush the cache on network interface state changes (default route changes).

Additional information: I will add that you cannot set the forwarder for ad.example.com as part of the DHCP process, as this is not advertised by DHCP on my work network (We have 100s of zones ...)
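One can confirm the stale entry by querying the local unbound cache directly after switching networks (assuming unbound is listening on 127.0.0.1, as in the default dnssec-trigger setup):

  # on the work network
  dig cas.ad.example.com @127.0.0.1 +short   # returns the internal 10.0.0.X address
  # after moving to the home network or a hotspot, with no flush in between
  dig cas.ad.example.com @127.0.0.1 +short   # still returns the cached 10.0.0.X address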
This is the same issue as with the negative cache; see https://bugzilla.redhat.com/show_bug.cgi?id=1089767 for a discussion. It seems a per-network option to flush the entire cache is an agreed-upon workaround. We will look at integrating that into NM and dnssec-trigger.
(In reply to William Brown from comment #0) > I then go home, or to some other network (Say a wireless hotspot). My laptop > is *still* able to resolve cas.ad.example.com and attempts to connect to it. At that time, you should have a different nameserver and with the current dnssec-trigger from git, you should already have the cache flushed (will prepare a test build soon). Moving to dnssec-trigger as that's where the integration work is being done.
(In reply to Pavel Šimerda (pavlix) from comment #2) > (In reply to William Brown from comment #0) > > I then go home, or to some other network (Say a wireless hotspot). My laptop > > is *still* able to resolve cas.ad.example.com and attempts to connect to it. > > At that time, you should have a different nameserver and with the current > dnssec-trigger from git, you should already have the cache flushed (will > prepare a test build soon). My bad, that doesn't apply to global DNS configuration.
This has happened again. Moving from home to work, my home server's IP (172.26.X.X) was cached by a program on my laptop. At work, this service still appeared functional. My workplace is testing a VPN to a new hospital and noticed that my laptop was sending LDAP traffic over their test infrastructure. This is because the new link uses 172.26.X.X/16. The root cause is that unbound has cached my home LDAP server's address, and other software, still able to resolve those names, "assumes" it is contactable. This is further evidence that unbound should flush the cache between networks.
I'm not convinced this is a DNS issue. If the application involved never did another lookup and kept using the resolved IP indefinitely, you would also have this problem. It looks like the application needs to become aware that a network change has happened, and respond to that. It might be that the application does do that, but that the TTL is too long and thus interfering. DNS setups with a "private" view should always use very short TTLs for this reason.

Two reasons not to flush the cache on every network switch are that it is costly and that it interferes with privacy. If a connection keeps flipping between 3G/LTE and wifi, you don't want to lose your entire cache all the time. And when joining a public wifi, it would be better if the application can start with TLS without performing a DNS lookup (assuming TLS 1.3 will fix leaking the SNI, as planned). Such DNS lookups can be used to fingerprint users (becoming more common in shops, although right now still done based on MAC) or otherwise learn a lot more about the user than they should.

unbound has implemented the flushing of the negative cache and related records, and we will flush those during a network change.

That said, I do want to have a switch for this per network, so that we can allow users to flush the entire cache on a per-network basis if they trust the network. This would need to be implemented by NM which would display the security level of this network for DNS so that dnssec-trigger can make the decision to flush the entire cache.
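For reference, this flushing is exposed through unbound-control subcommands; a minimal sketch of what the integration would run on a network change:

  unbound-control flush_negative     # drop cached NXDOMAIN/NODATA/SERVFAIL answers
  unbound-control flush_bogus        # drop answers that failed DNSSEC validation
  unbound-control flush_requestlist  # drop the queries currently being worked on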
(In reply to Paul Wouters from comment #5)
> That said, I do want to have a switch for this per network, so that we can allow users to flush the entire cache on a per-network basis if they trust the network. This would need to be implemented by NM which would display the security level of this network for DNS so that dnssec-trigger can make the decision to flush the entire cache.

That means that, since it's not handled per network yet, we need at least a global switch, which could later be used as a global default. I'm OK with providing a new option in dnssec.conf for that, and I'm OK with any default, as that can be changed either by the user or later in Fedora *if* the current default proves impractical. From what you already wrote, I assume you only want the negative flush to happen by default in Fedora.
What about (dnssec.conf): flush_global_zone=yes|negative-only|no
(In reply to Pavel Šimerda (pavlix) from comment #7)
> flush_global_zone=yes|negative-only|no

I don't like 'global' and 'zone' because they mean different things to different people. What about "flush_dns_cache"?
(In reply to Pavel Šimerda (pavlix) from comment #7)
> What about (dnssec.conf):
> 
> flush_global_zone=yes|negative-only|no

I'm not sure if a global option like this makes sense. If you're connecting to a network that provides a domain with it, you may want to flush the negative cache or flush the cache completely. Whereas when you are disconnecting from a network you may want to just flush cache completely (since you've cached RRs that are accessible only on that particular network), but not flush "only" negative cache.

As we see in this bug, not flushing also non-negative cache entries is an issue, so does it really make sense to flush "only" the negative cache in all situations? Flushing everything also solves the negative cache problem.

If we introduce a global option for negative cache only, then some users will start to use it and rely on it. In the future, if we come up with a better solution than flushing the cache completely, we may see complaints if we remove the global "flush negative-only" cache option.

I would welcome wider discussion before we introduce such a thing. I thought we somehow agreed that flushing cache completely is "good enough" for now.
(In reply to Tomas Hozza from comment #9)
> (In reply to Pavel Šimerda (pavlix) from comment #7)
> > What about (dnssec.conf):
> > 
> > flush_global_zone=yes|negative-only|no
> 
> I'm not sure if a global option like this makes sense. If you're connecting to a network that provides a domain with it, you may want to flush the negative cache or flush the cache completely. Whereas when you are disconnecting from a network you may want to just flush cache completely (since you've cached RRs that are accessible only on that particular network), but not flush "only" negative cache.
> 
> As we see in this bug, not flushing also non-negative cache entries is an issue,

Good point!

> I would welcome wider discussion before we introduce such a thing. I thought we somehow agreed that flushing cache completely is "good enough" for now.

For the record: I still think that full-flush is good enough for now. Do not over-engineer it from the beginning.

I can see privacy concerns but we are not making the situation worse because there is no cache at all nowadays. We can improve it iteratively...
(In reply to Petr Spacek from comment #10)
> (In reply to Tomas Hozza from comment #9)
> > (In reply to Pavel Šimerda (pavlix) from comment #7)
> > > What about (dnssec.conf):
> > > 
> > > flush_global_zone=yes|negative-only|no
> > 
> > I'm not sure if a global option like this makes sense. If you're connecting to a network that provides a domain with it, you may want to flush the negative cache or flush the cache completely.

Note that connection zones are currently flushed selectively and completely, exactly as you did in your original script.

> > Whereas when you are disconnecting from a network you may want to just flush cache completely (since you've cached RRs

Just note that the current solution only detects NetworkManager's view of the global DNS list, so it can theoretically flush the cache when not needed, depending on when NM calls the dispatcher.d script. This will be solved in the NM unbound plugin.

> > that are accessible only on that particular network), but not flush "only" negative cache.
> > 
> > As we see in this bug, not flushing also non-negative cache entries is an issue,
> Good point!
> 
> > I would welcome wider discussion before we introduce such a thing. I thought we somehow agreed that flushing cache completely is "good enough" for now.
> 
> For the record: I still think that full-flush is good enough for now. Do not over-engineer it from the beginning.

Note that I'm considering this feature on Paul's request.

> I can see privacy concerns but we are not making the situation worse because there is no cache at all nowadays. We can improve it iteratively...

I have no objections.
You are making it worse for everyone who is now running dnssec-trigger and unbound and don't have their cache flushed and haven't had any problems for years with that approach.
You cannot deny that the current situation is not correct. Flushing the cache completely will not break the current users' set-ups. From my point of view "worse" in this situation means "somehow (how?) worse performance" after changing, connecting to, and disconnecting from a network.

I'm OK with flushing only the negative cache when connecting to a new network that might provide a new nameserver capable of resolving more (internal) domain names. However, when changing or disconnecting from a network, flushing only the negative cache does not seem to be enough.
(In reply to Paul Wouters from comment #12)
> You are making it worse for everyone who is now running dnssec-trigger and unbound and don't have their cache flushed and haven't had any problems for years with that approach.

Works for me != works for everyone. I used this software for less than a day and had quite a number of problems, so please stop using this argument.

(In reply to Tomas Hozza from comment #13)
> You cannot deny that the current situation is not correct. Flushing the cache completely will not break the current users' set-ups. From my point of view "worse" in this situation means "somehow (how?) worse performance" after changing, connecting to, and disconnecting from a network.

But we currently have no cache, so we won't "lose performance"; it will be regained over time as the cache is populated. So we would actually gain performance by having it, and also have correct behaviour by flushing it. The only time it would be sub-optimal is right after a network state change, when the cache is flushed and you have to wait a little, just as you do now. So I fail to see how we "lose performance" compared to what we have now.

IMO, worse is when services don't work as a result of incorrect data caching. The default OS and its configuration shouldn't break networks. Consider also that Fedora is targeted not just at sysadmins and developers like you and me, but also at the population in general. Do you think, say, your <insert relative X here> would be able to solve this kind of networking issue?

(In reply to Pavel Šimerda (pavlix) from comment #11)
> > For the record: I still think that full-flush is good enough for now. Do not over-engineer it from the beginning.
> > I can see privacy concerns but we are not making the situation worse because there is no cache at all nowadays.

I agree that the full flush is needed - not only does it solve this issue, it would also solve the negative caching issues. What are the privacy concerns? If the record has no DNSSEC, we don't gain or lose anything because it "can't be trusted" anyway. If the record has DNSSEC, we will get the signed record back again, which can be validated.

> > We can improve it iteratively...

I don't see how it can be "iteratively improved". This is a world of imperfect networks, and I don't see how caching on a mobile device can "just work" without some issue, be it mis-cached data, leaking traffic, or something none of us have considered yet.
I hit this problem in my ISP's last billing cycle. As far as I can tell, what happened is that I used my Fedora laptop on the work wifi during my lunch break to stream TV video content. I left the laptop running and it went to sleep. I then took it home, fired it back up and gave it to my wife to watch the same TV service that night. The service uses a split view DNS config, and the cached entries caused it to connect to the public streaming service rather than my ISP's local unmetered server. I was not aware of this until I later noticed the usage on the ISP's usage meter. Luckily in this case I did not exceed the cap, though I chewed through 3GB that should have been free.

It seems to me that the proposed solution of saving money for 3G/wifi users (where they may save less than 512 bytes per initial lookup each time their network jumps between 3G and wifi) could cost others tens of GB or more, in addition to confusion over unexpected behavior.

IMHO the default setting should be to flush all cache entries on network change, as this will work for everyone. Then, if a 'power user' wishes for alternate behaviour for their environment, it should be up to them to change the setting. They are the only ones with the visibility into their situation to make that decision.
I'd like to chime in and say that full-flush is a desirable default. I have another stupid reason, but it is stupid only on the surface, as it allows non-experts to "fix" caching issues with a simple action. I.e.: if you have caching issues (see the examples above) you can simply turn your network card off and on again (kill switch with wifi, or plug/unplug if hardwired) and the cache is gone, problem solved. So I strongly favor a default of flush-all, and would defer to later, after careful testing, any optimizations where you may decide not to flush in specific circumstances.
This has been (hopefully) fixed in 0.12 now available in rawhide (and work in progress in F20), dnssec-trigger-script now flushes the whole cache on each change of the dynamically configured list of DNS servers. If necessary, please start a separate bug report for issues caused by excessive cache flushing and link to this bug report so that it's taken into account.
Is there a manual override for this in /etc/sysconfig/ that I can set to prevent this?
Currently there's no override in /etc/sysconfig/ or /etc/dnssec.conf. You would have to edit the /usr/libexec/dnssec-trigger-script Python code directly. Just search for 'unbound-control' if interested.
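For example:

  grep -n 'unbound-control' /usr/libexec/dnssec-trigger-script

will show the places where the script flushes or reconfigures the unbound cache.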
dnssec-trigger-0.12-12.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/dnssec-trigger-0.12-12.fc20
Package dnssec-trigger-0.12-12.fc20: * should fix your issue, * was pushed to the Fedora 20 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing dnssec-trigger-0.12-12.fc20' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2014-7942/dnssec-trigger-0.12-12.fc20 then log in and leave karma (feedback).
dnssec-trigger-0.12-13.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/dnssec-trigger-0.12-13.fc20
dnssec-trigger-0.12-13.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
Hi,

This version of dnssec-trigger does not behave as expected. Between interface state changes, no flush is triggered. One can test this with:

* Disable interface. ; unbound-control stats_noreset
* Enable interface ; unbound-control stats_noreset
* Disable interface. ; unbound-control stats_noreset
* Enable interface ; unbound-control stats_noreset

The values never reset to 0. This may also cause issues when dual interfaces are available. For example: on LAN, enable wireless, then disconnect LAN.

I need to ask that:

* When *any* interface is brought UP a flush cache should be run. IE ethernet + wireless on different networks, different default routes.
* When *any* interface is brought DOWN a flush cache should be run, IE ethernet + wireless on different network, move of default route etc.

Since the agreed outcome was a full flush, I would like to see dnssec-trigger fixed to work in the manner described above.
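Another way to check whether a flush actually happened, independent of the statistics counters (a rough sketch; dump_cache prints the entire cache contents):

  unbound-control dump_cache | wc -l   # note the cache size in lines
  # bring the interface down and up again via NetworkManager
  unbound-control dump_cache | wc -l   # roughly unchanged here, i.e. no flush happened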
(In reply to William Brown from comment #24) > Hi, > > This version of dnssec-trigger does not behave as expected. Between > interface state changes, no flush is triggered. One can test this with: > > * Disable interface. ; unbound-control stats_noreset > * Enable interface ; unbound-control stats_noreset > * Disable interface. ; unbound-control stats_noreset > * Enable interface ; unbound-control stats_noreset What exactly does that say? > I need to ask that: > > * When *any* interface is brought UP a flush cache should be run. IE > ethernet + wireless on different networks, different default routes. > * When *any* interface is brought DOWN a flush cache should be run, IE > ethernet + wireless on different network, move of default route etc. We currently perform a full flush when the list of global nameserver changes and a selective flush when forward zones are changed. That fits much better to the DNS workflow. Please specify whether you request to change this behavior (in that case please give us actual use cases) or whether the described behavior is ok and just differs from actual behavior.
(In reply to Pavel Šimerda (pavlix) from comment #25) > (In reply to William Brown from comment #24) > > Hi, > > > > This version of dnssec-trigger does not behave as expected. Between > > interface state changes, no flush is triggered. One can test this with: > > > > * Disable interface. ; unbound-control stats_noreset > > * Enable interface ; unbound-control stats_noreset > > * Disable interface. ; unbound-control stats_noreset > > * Enable interface ; unbound-control stats_noreset > > What exactly does that say? As in you want the output of stats_noreset? > > > I need to ask that: > > > > * When *any* interface is brought UP a flush cache should be run. IE > > ethernet + wireless on different networks, different default routes. > > * When *any* interface is brought DOWN a flush cache should be run, IE > > ethernet + wireless on different network, move of default route etc. > > We currently perform a full flush when the list of global nameserver changes > and a selective flush when forward zones are changed. That fits much better > to the DNS workflow. The behaviour I am observing does not match the behaviour you describe. I see no cache flushes happening on interface or DNS resolver change, or interface state change. As a result moving from network A to B, or connecting to a VPN still causes traffic leaks. > > Please specify whether you request to change this behavior (in that case > please give us actual use cases) or whether the described behavior is ok and > just differs from actual behavior. The theoretical behaviour you describe sounds like it would correct the issue, but as noted, this is not being applied.
(In reply to William Brown from comment #26) > The behaviour I am observing does not match the behaviour you describe. I > see no cache flushes happening on interface or DNS resolver change, or > interface state change. As a result moving from network A to B, or > connecting to a VPN still causes traffic leaks. Which version-release of dnssec-trigger do you use?
dnssec-trigger-0.12-15.fc21.x86_64
> Please specify whether you request to change this behavior (in that case please give us actual use cases) or whether the described behavior is ok and just differs from actual behavior.

Thinking about this behaviour, consider some user who has a cache populated with bad data, and they want to flush that. (Or they may not know how to flush it.) So they turn the interface off and on. In this scenario, with the cache only being flushed if the NS changes, they will continue to experience a broken network.

So I think that, in the interest of paranoia, the cache should be flushed on any interface state change, regardless of a change in nameservers. It will make it so that "off and on" in NetworkManager can clear up polluted cache issues.

E.g. say I'm roaming between access points and traffic starts being dropped between AP A and B. Unbound will cache all the dropped responses as negatives; NetworkManager will disconnect and reconnect when I come in range of B. If we don't flush on interface state change, my laptop will keep all the bad cached responses (which is, in itself, an issue). If the paranoid approach is taken, I won't have any issues when I reconnect to B.
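For reference, the manual equivalent of the flush I am describing (and a usable workaround in the meantime) is a full flush by hand:

  unbound-control flush_zone .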
This example is rather contrived, although I have no problem with the hardware kill switch for wifi triggering a complete cache drop. Additionally, NetworkManager simply needs to gain some DNS options for corner cases involving odd DNS setups, including networks with hundreds of internal-only zones, to offer a total DNS cache flush. Or someone can write an NM plugin or GNOME 3 applet that people who need this feature can run.

Caching of data is a feature, not a bug. If you get "bad data" or "bad responses", the source of the badness is not the resolver on your roaming laptop - it is the networks that depend on an obsolete split-view DNS.
Hi Paul,

I disagree with that approach. I know I do not carry weight in the Fedora project, but here goes. I think third-party plugins, and more code/code complexity to try and catch all corner cases, are not a suitable solution, and all corner cases is an impossibly hard target to hit. The problem is simple and a complex answer is not required. The amount of bandwidth saved by not performing a cache flush on some interface changes is very small, and a minor bandwidth gain is not a worthy trade-off for resilience, reliability and, most of all, usability.

I can see other times there will be problems. Some organisations have a common name server for different segregated networks, and a DNS configuration where different results are served up depending on the user's IP or interface. I could hypothesise many different scenarios where the flush is required, and certainly not everyone will agree on an approach for network configuration, but that is outside the scope of unbound. Network configuration should not need to be considered because of the design of a DNS caching mechanism on a particular OS.

I think unbound should take the most flexible and reliable approach, and that is to flush the cache on all network changes. This approach is simple and has no corner cases beyond caching bad responses (which is of course unavoidable with a cache).

But, having said all that, I think that was already agreed and the problem is that it is not functioning as described. Please, keep it simple.
(In reply to William Brown from comment #29)
> Thinking about this behaviour, consider some user who has a cache populated with bad data, and they want to flush that.

You can theoretically get away with all that by restarting unbound. There's a bug, so you need to wait for the next update to make it practical.

(In reply to Anthony Symons from comment #31)
> But, having said all that, I think that was already agreed and the problem is that it is not functioning as described. Please, keep it simple.

Yep. Let's focus on why users don't see the intended behavior.
Pavel: restarting unbound manually is dangerous, as you lose all your runtime forwarder entries (e.g. those obtained/installed from a VPN link).

Anthony: it's a little out of scope, but I do wish to answer your reply so others visiting here will see the other side of the story as well.

1) I often have interface bumps on the wifi where it reconnects. It would be bad to have a cache flush all the time. Especially if the link is already bad. Hundreds of queries are needed to fill up the initial cache for common queries.

2) every time you empty your cache, an attacker has a spoofing chance for non-DNSSEC data. E.g. sitting at Starbucks with an empty cache and a local attacker is quite dangerous.

3) an empty cache being filled with my specific programs that do DNS traffic will leave a "fingerprint" of DNS queries that can be used to de-anonymize me

The onus of fixing a problem should focus on the ultimate root cause of the problem. In this case, a network that requires split DNS.

I can see how I could end up in the rough of the "rough consensus" here, and we pick another default. But regardless of the default, the real issue is that we need more resources into updating NetworkManager with the various DNS option requirements we have.
> 1) I often have interface bumps on the wifi where it reconnects. It would be bad to have a cache flush all the time. Especially if the link is already bad. Hundreds of queries are needed to fill up the initial cache for common queries.

If the interface bumps, the following will occur:

* Interface loses wireless traffic.
* Unbound begins to populate bogus-data negative cache.
* Interface then goes down.
* Interface comes up.

So let's describe what happens next. In the "no cache flush" scenario, I now have a cache populated with negative entries, and my queries do not work. In the "cache flush" scenario, queries begin to operate, and the network resumes operation.

> 2) every time you empty your cache, an attacker has a spoofing chance for non-DNSSEC data. E.g. sitting at Starbucks with an empty cache and a local attacker is quite dangerous.

If you are at a Starbucks, on open wifi, you have plenty of other issues anyway. You also assume that my cache hasn't already been tampered with. Finally, you also assume that I have visited some site X before, and that the cache is still valid.

> 3) an empty cache being filled with my specific programs that do DNS traffic will leave a "fingerprint" of DNS queries that can be used to de-anonymize me

Unbound should not attempt to repopulate every record in the cache when the network resumes. This would prevent the attack you are describing. I have discussed this issue in https://bugzilla.redhat.com/show_bug.cgi?id=1109651

> The onus of fixing a problem should focus on the ultimate root cause of the problem. In this case, a network that requires split DNS.
> 
> I can see how I could end up in the rough of the "rough consensus" here, and we pick another default. But regardless of the default, the real issue is that we need more resources into updating NetworkManager with the various DNS option requirements we have.

No: there is more than split view at work here. Split view is one area that this configuration damages. I have described a number of others, such as unbound caching dropped traffic as negatives, attackers in said Starbucks populating bad data, or, in the case I saw the other day, a zone returning correct data but unbound automatically invalidating it due to an incorrect NS record in the zone. The world of networks is bigger than you or I.

We agreed earlier in this thread to flush the cache on interface state change. We have seen plenty of arguments around how networks are imperfect. Why are we back at this, trying to convince you to implement something that we already agreed you would go ahead with?
The unbound cache should be flushed on every network configuration change (interface goes up/down, connecting to another network, etc.). If it is not implemented that way, it should be; I'll check it. If there is no network configuration change, you cannot expect that the cache will be flushed.
Re: William

We went through this before and unbound upstream added:

unbound-control flush_negative

and

unbound-control flush_bogus

Together with

unbound-control flush_requestlist

that should resolve your issues.

We are trying to keep your cache secure, so yes we use the assumption that we are preventing the cache from getting poisoned. Just like selinux assumes you don't have root access to disable selinux entirely.

As for "site X" being valid or not when switching networks, you are shooting the messenger. We have talked about this in various email lists and bugzillas. Of course the nuclear option is easy to use. But it is not the most secure one to use. If your campus network uses hundreds of domains that you cannot whitelist, those domains should use a low TTL! And those domains are completely broken now already, even in your case where full cache flushes happen and the device switches from wifi to LTE when someone next to you turns on the microwave. Split view DNS is problematic from a security point of view.

unbound rejecting bad zones with incorrect NS records is a feature, not a bug.

Note that I am not changing or implementing anything with unbound right now. I'm merely advocating good design and talking to upstream for feature requests for legitimate issues we run into and can solve, such as negative cache flush.
(In reply to Paul Wouters from comment #36)
> Re: William
> 
> We went through this before and unbound upstream added:
> 
> unbound-control flush_negative
> 
> and
> 
> unbound-control flush_bogus
> 
> Together with
> 
> unbound-control flush_requestlist
> 
> that should resolve your issues.

I'm glad that these have been added, but this still doesn't change anything about the behaviour I am currently seeing in dnssec-trigger. When you fill a cache with bogus data and reconnect to the same network, cache flushing isn't performed.

> We are trying to keep your cache secure, so yes we use the assumption that we are preventing the cache from getting poisoned. Just like selinux assumes you don't have root access to disable selinux entirely.

The argument you pose is the age-old issue of security versus usability.

> you turns on the microwave. Split view DNS is problematic from a security point of view.

Split view DNS exists. No matter how much we preach that it's bad, we have to deal with it and make it work. I think the issue of traffic leaking due to cached DNS isn't the fault of split view, but of the caching process. However, we have flogged this dead horse enough.

> unbound rejecting bad zones with incorrect NS records is a feature, not a bug.

The world is full of horribly configured DNS. Be prepared to break things with this enforcement ... Perhaps a toggle in unbound to enforce NS validation? (Note: I didn't read the docs to check whether this already exists.)

> Note that I am not changing or implementing anything with unbound right now. I'm merely advocating good design and talking to upstream for feature requests for legitimate issues we run into and can solve, such as negative cache flush.

I'm also not advocating changing unbound right now. I want to have the behaviour of dnssec-trigger changed so that it performs a full flush even when DNS resolvers stay the same.
Let's get back to the topic. I'm testing a setup with NetworkManager, unbound and dnssec-trigger and no other tools that would intervene. We may talk about other scenarios later, but I would prefer a separate bug report. The code I'm testing will be delivered in the upcoming update.

The easiest way to clean up everything manually is to restart unbound. All data from NetworkManager will get back to unbound once it's restarted, via the implied restart of dnssec-trigger.

systemctl restart unbound

There's also an explicit way to flush the whole cache.

unbound-control flush_zone .

The whole cache is also flushed under well-defined circumstances, that is, when the source of information changes. The source of information is currently identified by a set of nameserver IP addresses. As far as I understand, this bug report is about *not* getting the cache flushed under those circumstances.

So far I identified a number of issues. What I'm missing is a specific issue identified as leading to such a situation, and also enough information about the bug. Thus I'm going to issue an update. I invite everyone to help with testing. And I would be especially happy if you could check the logs for the listing of global nameservers as follows:

Global forwarders: ... ...

It should appear in the logs but can come from different sources, the nm-dispatcher and dnssec-trigger services in particular. If you don't have those lines in your logs, then dnssec-trigger-script doesn't know about the change and we need to fix that. If you have them, then dnssec-trigger-script has already called the command above to flush the whole cache.

So far the bug report is incomplete. I would yet need to see a clear indication that the cache hasn't been flushed. Please compare the symptoms with those after a manual flush. Feel free to wait for the update, though. If everything goes well, you're getting it today.
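For example, to look for those "Global forwarders" lines in the current boot's journal:

  journalctl -b | grep 'Global forwarders'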
(In reply to William Brown from comment #37)
> I want to have the behaviour of dnssec-trigger changed so that it performs a full flush even when DNS resolvers stay the same.

We want to make dnssec-trigger usable for all sorts of people while providing the security features at the same time. You want to have a full flush on basically any change, whether it's related or not. Paul wants to avoid flushing positive answers entirely. Note that we are talking about a full flush, and that's about global resolution in the first place, and thus about resolution using a DNSSEC-capable server.

I think we certainly do not want to drop the cache on each and every event from NetworkManager. Flushing the cache should IMO always be for a good reason. Currently a modified set of name servers is generally regarded as such a good reason, because the source of information has changed and thus the information may be different. If you have another good reason that happens while the set of name servers stays the same, please let us know.

Timing is also important, so it's not as easy as saying that we should flush the cache at some random time like when any interface goes up or down (in whatever sense).

I'm afraid this bug report has become a general discussion forum instead. It would be good to start individual bug reports for individual bugs and feature requests so that we can focus on improving the software.
(In reply to William Brown from comment #0) > Description of problem: > Unbound caches DNS records even across interface state changes. > > As a result, my workplace which uses Exchange at cas.ad.example.com is > assigned a 10.0.0.X address. This is cached into unbound. > > I then go home, or to some other network (Say a wireless hotspot). My laptop > is *still* able to resolve cas.ad.example.com and attempts to connect to it. Expected behavior is to flush the cache when you disconnect from the original wireless hotspot. This can be tested with dig. dig cas.ad.example.com @localhost
(In reply to Pavel Šimerda (pavlix) from comment #38)
> Global forwarders: ... ...

Just realized it always gets logged; the action only happens when the set is changed.
(In reply to Pavel Šimerda (pavlix) from comment #40) > (In reply to William Brown from comment #0) > > Description of problem: > > Unbound caches DNS records even across interface state changes. > > > > As a result, my workplace which uses Exchange at cas.ad.example.com is > > assigned a 10.0.0.X address. This is cached into unbound. > > > > I then go home, or to some other network (Say a wireless hotspot). My laptop > > is *still* able to resolve cas.ad.example.com and attempts to connect to it. > > Expected behavior is to flush the cache when you disconnect from the > original wireless hotspot. This can be tested with dig. > > dig cas.ad.example.com @localhost I'd also need to know the result of the following commands before and after: unbound-control list_forwards unbound-control forward If such a fake domain comes from global configuration, it won't work because it breaks the DNSSEC chain. If that is part of a forward zone, then it has nothing to do with the global zone and you should rely on the partial flush for that zone that is also done by dnssec-trigger-script together with its removal. unbound-control forward_remove example.com unbound-control flush_zone example.com This also happens when you're disconnecting.
(In reply to Pavel Šimerda (pavlix) from comment #40)
> Expected behavior is to flush the cache when you disconnect from the original wireless hotspot. This can be tested with dig.

Please define "disconnect"? I am often at cafes with flaky wifi. Especially on flaky wifi, getting a full cache flush and requiring hundreds of DNS packets to successfully be sent and received would in many cases render the wifi connection useless. In other words, I think your use case is much more rare than my use case.

re: William

Can you use a different term for "bogus data", because that very specifically means something which I do not think you mean:

Bogus: The validating resolver has a trust anchor and a secure delegation indicating that subsidiary data is signed, but the response fails to validate for some reason: missing signatures, expired signatures, signatures with unsupported algorithms, data missing that the relevant NSEC RR says should be present, and so forth.

When you say bogus, I think you mean "data from a different DNS view than expected".
(In reply to Paul Wouters from comment #43)
> (In reply to Pavel Šimerda (pavlix) from comment #40)
> > Expected behavior is to flush the cache when you disconnect from the original wireless hotspot. This can be tested with dig.
> 
> Please define "disconnect"?

Sure. By disconnect I mean that a Fedora system with NetworkManager no longer maintains a wireless connection.

> I am often at cafes with flaky wifi. Especially on flaky wifi, getting a full cache flush and requiring hundreds of DNS packets to successfully be sent and received would in many cases render the wifi connection useless.

Discussion about negative effects of flushing is more suitable for bug #1105685 that I still keep in mind. Let's stay with William's case here as his experience seems to diverge from what the current dnssec-trigger-script is expected to do.

The outcome of this bug report should be one of the following:

1) We find out that there is a bug in dnssec-trigger-script that somehow alters the current expected behavior. That is to flush the cache once the list of nameservers changes, which happens upon disconnection from wifi.

2) We find out that there is a bug in NetworkManager that prevents dnssec-trigger-script from doing its job.

3) William learns that the behavior was affected by manual configuration of his own and thus is out of scope of both dnssec-trigger-script and NetworkManager.

4) The behavior is no longer observed, at least with the upcoming update that fixes a number of subtle issues that should not affect the behavior but it's hard to rule it out entirely.

5) The worst case is that we don't receive enough data to reproduce or explain William's observations. There's a high risk of ending up this way as I tried hard to reproduce yesterday (though not with the same version) and I didn't observe such behavior at all.

For anything else, please start new bug reports and assign them to me, so that I can handle one issue at a time.
> > The outcome of this bug report should be one of the following: > > 1) We find out that there is a bug in dnssec-trigger-script that somehow > alters the current expected behavior. That is to flush the cache once the > list of nameservers changes, which happens upon disconnection from wifi. As discussed, I believe there are cases where this is insufficient. The cache should be flushed on any interface state change. Please see comment 16 and 24. There are cases where we should be removing the full cache even if resolvers are not changing, IE, lost data being cached, to assist in basic troubleshooting etc. Additionally, I also can see an issue where I may be at site A, with dns 192.168.0.1, and I suspend my laptop. I move to site B, on a different ISP and different DNS view of the world, also with dns 192.168.0.1. In this case, a flush *should* happen to prevent traffic leaking. However under the current scheme it would not. We should be taking conservative defaults that do not break user experiences. > > 2) We find out that there is a bug in NetworkManager that prevents > dnssec-trigger-script from doing its job. I do not believe this is the case. > > 3) William learns that the behavior was affected by manual configuration of > his own and thus is out of scope of both dnssec-trigger-script and > NetworkManager. If this happens, I will communicate this. > > 4) The behavior is no longer observed, at least with the upcoming update > that fixes a number of subtle issues that should not affect the behavior but > it's hard to rule it out entirely. Does the update fix the issue above, IE, flushing even when name servers haven't changed? Because that is the core of my issue. > > 5) The worst case is that we don't receive enough data to reproduce or > explain William's observations. There's a high risk of ending up this way as > I tried hard to reproduce yesterday (though not with the same version) and I > didn't observe such behavior at all. What were you testing for? > > For anything else, please start new bug reports and assign them to me, so > that I can handle one issue at a time. Will do.
(In reply to William Brown from comment #45)
> As discussed, I believe there are cases where this is insufficient. The cache should be flushed on any interface state change. Please see comment 16
<snip>
> We should be taking conservative defaults that do not break user experiences.

In this case I agree with William. We could start with a conservative default at least in Fedora 22 so we can be reasonably sure that we do not break some weird configuration. We can be more strict in the next release.

We should not disgust people right in the first release. IMHO the first release is crucial: I don't want to see dnssec-trigger/DNSSEC in the same position where SELinux was a few years ago. Back then every how-to related to Fedora/CentOS had 'setenforce 0' right on the first line ;-)
(In reply to William Brown from comment #45) > As discussed, I believe there are cases where this is insufficient. I am still waiting for a description of a case where this applies. In a separate bug report preferably. > Please see comment 16 Comment 16 is a rationale against switching to negative-only flush. Therefore we haven't switched. > and 24. Comment 24 doesn't explain why you think the cache hasn't been flushed. > There are cases where we should be removing the full cache even if > resolvers are not changing, Still waiting for one. Any switch between networks results in two full flushes, one when nameservers are removed, one when they are added back. That doesn't depend on whether the networks use the same nameservers. Any situation where this doesn't happen should be reported as a bug. I'm not convinced that there is such a bug. > Additionally, I also can see an issue where I may be at site A, with dns > 192.168.0.1, and I suspend my laptop. I move to site B, on a different ISP > and different DNS view of the world, also with dns 192.168.0.1. In that case, you're getting two full flushes, one when disconnecting from A, one when connecting to B. > We should be taking conservative defaults that do not break user experiences. We're in 100% agreement. > > 4) The behavior is no longer observed, at least with the upcoming update > > that fixes a number of subtle issues that should not affect the behavior but > > it's hard to rule it out entirely. > > Does the update fix the issue above, IE, flushing even when name servers > haven't changed? Because that is the core of my issue. I couldn't confirm that there was an issue with the previous build nor that it would be the core of your issue. Therefore I can't possibly tell you that the (for me unconfirmed) issue is gone. > > 5) The worst case is that we don't receive enough data to reproduce or > > explain William's observations. There's a high risk of ending up this way as > > I tried hard to reproduce yesterday (though not with the same version) and I > > didn't observe such behavior at all. > > What were you testing for? The original scenario in the bug report. I used NetworkManager to connect to a network with split view DNS, successfully used `dig` to resolve a name specific to that network, disconnected, used `dig` to confirm that the name can no longer be resolved. Expected behavior: First dig succeeds, second dig fails. Actual behavior: Worked as expected. Tested version: My local development version which is pretty much the same as the latest koji build. http://koji.fedoraproject.org/koji/packageinfo?packageID=13240 > > For anything else, please start new bug reports and assign them to me, so > > that I can handle one issue at a time. > > Will do. Thanks! (In reply to Petr Spacek from comment #46) > (In reply to William Brown from comment #45) > > As discussed, I believe there are cases where this is insufficient. The > > cache should be flushed on any interface state change. Please see comment 16 > <snip> > > We should be taking conservative defaults that do not break user experiences. > > In this case I agree with William. Taking conservative defaults was our approach from the beginning. But that doesn't mean we're going to put cache flushes at random times and *hope* it will improve something. Let's do engineering, not black magic. > We could start with conservative default > at least in Fedora 22 so we can be be reasonably sure that we do not break > some weird configuration. We can be more strict in next release. 
> We should not disgust people right in the first release. IMHO the first release is crucial: I don't want to see dnssec-trigger/DNSSEC in the same position where SELinux was a few years ago. Back then every how-to related to Fedora/CentOS had 'setenforce 0' right on the first line ;-)

You can help by testing the latest builds in the described scenarios and checking for any issues, including the described one. I currently fail to see even a theoretical possibility of such behavior.

As described above, in the cases that William describes, we do *two full flushes*, one when disconnecting from the old network, one when connecting to the new one. If the cache doesn't get flushed (twice!), we need to *find the issue*, not change the expected behavior to perform three full flushes instead of two. Please refer to comment #44.
This time I'm setting a general needinfo to anyone who could: a) Confirm that he's getting the described issue, i.e. who can still resolve a site specific name after disconnecting from the site's network. Using `dig` or a similar tool is a preferred way to check that. Using the latest koji version is preferred, but any confirmed version is good. b) Confirm that the issue is absent. Preferably in the latest version or the suspected version. I did it for myself but I may have done something different. Please help me to avoid the following: > 5) The worst case is that we don't receive enough data to reproduce or > explain William's observations. There's a high risk of ending up this way as > I tried hard to reproduce yesterday (though not with the same version) and I > didn't observe such behavior at all.
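A minimal reproduction sketch, using the name from comment #0 as an example (assuming unbound listens on 127.0.0.1):

  # while connected to the network that serves the internal name
  dig cas.ad.example.com @127.0.0.1 +short   # expected: the internal 10.0.0.X address
  # disconnect from that network, then query again
  dig cas.ad.example.com @127.0.0.1 +short   # expected: no answer if the cache was flushed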
> In that case, you're getting two full flushes, one when disconnecting from A, one when connecting to B.

That doesn't bother me. It fixes issues where you are connected to ethernet and join a wireless etc.

> > > 4) The behavior is no longer observed, at least with the upcoming update that fixes a number of subtle issues that should not affect the behavior but it's hard to rule it out entirely.
...snip...
> > What were you testing for?
> 
> The original scenario in the bug report. I used NetworkManager to connect to a network with split view DNS, successfully used `dig` to resolve a name specific to that network, disconnected, used `dig` to confirm that the name can no longer be resolved.

This isn't the behaviour I have an issue with. I am on some network A, I can dig a name from cache. I disconnect and rejoin A. I run dig.

* Actual: the cache returns the name.
* Expected: a new lookup is completed.

> > > As discussed, I believe there are cases where this is insufficient. The cache should be flushed on any interface state change. Please see comment 16
> > <snip>
> > > We should be taking conservative defaults that do not break user experiences.
> > 
> > In this case I agree with William.
> 
> Taking conservative defaults was our approach from the beginning. But that doesn't mean we're going to put cache flushes at random times and *hope* it will improve something. Let's do engineering, not black magic.

I have been trying to provide examples and reasons why I want these defaults. I'm not trying to hand wave, I have actually been burnt in the last few months repeatedly by this DNS cache and the odd behaviour that it has caused. For example, it's not fun at work when your head of network security approaches you and asks what you are doing, because DNS caching caused a traffic leak from home to a network range that is shared by a hospital.

> > We should not disgust people right in the first release. IMHO the first release is crucial: I don't want to see dnssec-trigger/DNSSEC in the same position where SELinux was a few years ago. Back then every how-to related to Fedora/CentOS had 'setenforce 0' right on the first line ;-)

I have a suggestion then, in the spirit of engineering you suggest. You make the "flush policy" a configurable option of dnssec-trigger. I propose the following cache flush policies:

1) No flush is ever performed (Paul will like this one).
2) A negative flush is performed between all interface changes. No positive flush is performed.
3) A negative flush is performed between all interface changes. A positive flush is performed when resolvers change.
4) A full flush is performed between all interface changes.

I would like to suggest that policy 3 be the default. After thinking carefully about a great number of network configurations and potential states, I believe that 3 fixes 90% of the issues that can potentially arise. There are a few exceptions where things can fall through the cracks (i.e. moving between different networks that use the same DNS resolver IP would not trigger a positive flush, which may cause traffic leaks). However, that scenario tends to apply more to home DNS users.

3 would allow "turn it off and on" actions for fixing issues where DNS traffic was dropped / cached in the negative. It also means that moving between two sites (provided that they have unique resolver IPs) would cause no traffic leaks.

Finally, by making the policy configurable in this way, people like me would probably set 4.
Most people would get a good default that causes "few" issues. And others who want to run with the cache persisting until the end of time can also be free to do so, provided they accept responsibility for the consequences of their choice.

(PS: I returned the needinfo flag that you said you would add)
(In reply to William Brown from comment #49)
> > In that case, you're getting two full flushes, one when disconnecting from A, one when connecting to B.
> 
> That doesn't bother me. It fixes issues where you are connected to ethernet and join a wireless etc.

Then I don't see any issue at all, except that I hadn't explained the workings of the double flush in detail before.

> > Taking conservative defaults was our approach from the beginning. But that doesn't mean we're going to put cache flushes at random times and *hope* it will improve something. Let's do engineering, not black magic.
> 
> I have been trying to provide examples and reasons why I want these defaults.

I was convinced to use conservative defaults from the beginning. We had some discussions with Paul as he's also conservative but from an entirely different perspective, and the result is pretty good I think.

> I'm not trying to hand wave, I have actually been burnt in the last few months repeatedly by this DNS cache and the odd behaviour that it has caused. For example, it's not fun at work when your head of network security approaches you and asks what you are doing, because DNS caching caused a traffic leak from home to a network range that is shared by a hospital.

I understand you and the theoretical issue, but we can't handle it unless it's (1) proven practical and (2) debugged enough to show us the reason. I'm well aware of DNS query and traffic leaks and I advocated taking care of those from the beginning. Those leaks are not new and many of them are not at all related to unbound. In the end, getting unbound and dnssec-trigger into Fedora might well be the best way to avoid the leaks because it provides a single well managed cache. But you still need to be careful about your applications and their local caches and about any secondary system-wide caches like nscd or systemd-resolved.

> I have a suggestion then, in the spirit of engineering you suggest.

Sure.

> You make the "flush policy" a configurable option of dnssec-trigger. I propose the following cache flush policies:
> 
> 1) No flush is ever performed (Paul will like this one).
> 2) A negative flush is performed between all interface changes. No positive flush is performed.
> 3) A negative flush is performed between all interface changes. A positive flush is performed when resolvers change.
> 4) A full flush is performed between all interface changes.

1) No. There is no relation to this bug report, as in your case two cache flushes are performed. Performing a third one doesn't change anything.

2) No. I talked about the details with other developers. It doesn't help to flush the cache at random. There are *very specific* cases when the cache needs to be flushed. This request is very close to suggesting that we flush the cache every N seconds.

3) Please back up your requests by specific real world issues. I will try to handle any real issues as quickly as possible.

> I would like to suggest that policy 3 be the default. After thinking carefully about a great number of network configurations and potential states, I believe that 3 fixes 90% of the issues that can potentially arise.

I believe that 99% of cache flushes are connected with an actual configuration change. It may well be 100% of *your* real use cases but I will not know until you come up with something.
Those cases are already covered, with the rare exception of a theoretical race condition described in bug #1183981. The remaining 1% can only occur when you use multiple network interfaces and are switching between them. The majority of the 1% are cases where you don't really switch to another network. Still, it is perfectly fixable, but no one has yet complained, so there's not even a bugzilla ticket for it.

> There are a few exceptions where things can fall through the cracks (i.e. moving between different networks that use the same DNS resolver IP would not trigger a positive flush, which may cause traffic leaks).

No. That is expected to cause a double full flush just like moving to a network with a different DNS resolver IP.

> 3 would allow "turn it off and on" actions for fixing issues where DNS traffic was dropped / cached in the negative.

No. Turn it off and on already triggers a double full flush.

> It also means that moving between two sites (provided that they have unique resolver IPs) would cause no traffic leaks.

Same as above.

> Finally, by making the policy configurable in this way, people like me would probably set 4. Most people would get a good default that causes "few" issues. And others who want to run with the cache persisting until the end of time can also be free to do so, provided they accept responsibility for the consequences of their choice.

The cache flushing will certainly be configurable, because some people including Paul requested a policy that will reduce the issues caused by cache flushes. But the default will be a safe variant usable by people who expect no guarantees about DNS information consistency.

As no one came up with an actual scenario where the original issue would happen with a reasonably recent build of dnssec-trigger, I'm closing the bug report with INSUFFICIENT DATA. Please reopen the bug report if you experience the absence of flushing when disconnecting from a network, and provide additional information including the output of `dig` before and after and the system log with information from all related tools.

We do appreciate your feedback. Please start new bug reports for specific scenarios where dnssec-trigger doesn't work as you would expect. We'll see what can be done. If you want more detailed information on why your suggestions weren't included in the design of dnssec-trigger, feel free to reach me on IRC Freenode as `pavlix`.