Bug 1812333
| Summary: | egressnetworkpolicy cannot work if use custom dnsname | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | huirwang |
| Component: | Networking | Assignee: | Alexander Constantinescu <aconstan> |
| Networking sub component: | openshift-sdn | QA Contact: | huirwang |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | low | | |
| Priority: | medium | CC: | aconstan, aos-bugs, ehashman, mmasters, ricarril, weliang |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-26 09:32:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
huirwang
2020-03-11 02:24:07 UTC
Hi Huiran,

Could you confirm that your cluster is running the "redhat/openshift-ovs-multitenant" network plugin? This is not the default network plugin in 4.4, hence why I ask, and egressnetworkpolicy is only expected to work with that network plugin.

Do: oc get clusternetwork -o yaml to find out.

Thanks in advance!

-Alex

Hi again,

Could you also provide me with a kubeconfig when you reproduce this next time, so that I can jump on the cluster and have a look?

Thanks again

-Alex

Hi Alex,

QE tracked this bug and found it can also be reproduced in v3.11 with the "redhat/openshift-ovs-multitenant" network plugin. Here is the information to log in to the v3.11 cluster with the "redhat/openshift-ovs-multitenant" network plugin:

Master: sudo ssh -i "/home/weliang/.ssh/openshift-qe.pem" ci-vm-10-0-150-160.hosted.upshift.rdu2.redhat.com
Router node: sudo ssh -i "/home/weliang/.ssh/openshift-qe.pem" ci-vm-10-0-148-140.hosted.upshift.rdu2.redhat.com
Application node: sudo ssh -i "/home/weliang/.ssh/openshift-qe.pem" ci-vm-10-0-149-197.hosted.upshift.rdu2.redhat.com

According to the test case at https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-13503, both the pod and the egressnetworkpolicy policy-test are applied under project p1; you can easily reproduce the bug by pinging pod2.ec2.internal after you ssh to hello-pod.

Thanks,
Weibin

Hi Alex,

From the doc, https://docs.openshift.com/container-platform/4.3/networking/openshift_sdn/configuring-egress-firewall.html, "You must have OpenShift SDN configured to use either the network policy or multitenant modes to configure egress firewall policy." That means "egressnetworkpolicy" supports both the network policy and multitenant modes.

Thanks,
Huiran
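For reference, an egress firewall that uses a DNS name looks roughly like the sketch below. The policy name, project, and DNS name are taken from the OCP-13503 test case mentioned above; the actual rules of that test case are not reproduced here, so treat the Allow/Deny entries as illustrative assumptions rather than the policy under test:

```yaml
# Hypothetical EgressNetworkPolicy for project p1: allow traffic to the
# custom DNS name used in the test, deny everything else.
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: policy-test
  namespace: p1
spec:
  egress:
  - type: Allow
    to:
      dnsName: pod2.ec2.internal   # name served by the custom nameserver
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
```

For a dnsName rule to take effect, the SDN itself must be able to resolve that name, which is what the rest of this bug is about.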
Hi

> From the doc, https://docs.openshift.com/container-platform/4.3/networking/openshift_sdn/configuring-egress-firewall.html, "You must have OpenShift SDN configured to use either the network policy or multitenant modes to configure egress firewall policy." That means "egressnetworkpolicy" supports both the network policy and multitenant modes.

Fair enough.

I spent time looking at the 3.11 cluster Weibin created yesterday, and the problem is definitely not in the SDN; it has to do with the fact that /etc/resolv.conf is not populated with your custom nameserver when the container starts. Which means one of two things:

1) Either the way you're testing it (by adding it to the host's /etc/resolv.conf and restarting the SDN pod) is not supported
2) Or there's a bug

However, 3.11 is way too far back in time to be worth the effort to investigate properly. Could you guys thus create a cluster with >= 4.2? I could verify that the same issue happens there, investigate a bit and then send it off to the network-edge team for a final verdict (they are the ones handling DNS).
Thanks in advance guys!
-Alex
Hi again guys,

Just to finish on the last comment, I figured out how to make it work on 3.11: the SDN daemonset needs to be edited, and you need to change dnsPolicy: ClusterFirst -> dnsPolicy: Default.

But this is only if you want the nameserver record to be taken into account when modifying /etc/resolv.conf on the host, as you guys have been doing. I am not sure it's worth doing on 3.11 just to satisfy this use case (no customer has ever complained about this, to my knowledge), and it might impact other functionality, so I don't think it's worth pushing a PR to 3.11 for.

Like I said, get back to me with a 4.X cluster and I'll debug on a more recent tech stack.

-Alex
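As a sketch of that 3.11 edit, assuming the daemonset is named sdn in the openshift-sdn namespace (adjust both if your install differs):

```sh
# Switch the SDN pods to the node's resolver configuration instead of the
# cluster DNS service. Reverting is the same command with "ClusterFirst".
oc -n openshift-sdn patch daemonset sdn --type merge \
  -p '{"spec":{"template":{"spec":{"dnsPolicy":"Default"}}}}'
```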
Hi Alex,

I installed a v4.4 cluster and updated dnsPolicy: ClusterFirst -> dnsPolicy: Default, but the dnsPolicy value changed back to ClusterFirst after a while.
[weliang@weliang FILE]$ while true; do oc get daemonset.apps/sdn -o yaml | grep dnsPolicy;sleep 5;done
dnsPolicy: Default
dnsPolicy: Default
dnsPolicy: Default
[... "dnsPolicy: Default" repeated; 27 samples in total ...]
dnsPolicy: ClusterFirst
dnsPolicy: ClusterFirst
dnsPolicy: ClusterFirst
[... "dnsPolicy: ClusterFirst" repeated for the remaining 19 samples ...]
Hi Weibin,

Yes, that solution was for a 3.11 cluster, not 4.X. I suspect the network-operator modifies the SDN daemonset back to its original configuration. On a 4.X cluster I suspect you should follow this: https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#configuration-of-stub-domain-and-upstream-nameserver-using-coredns

If that does not work, then I'd recommend that you assign the bug to network-edge so that they can advise. Anyway, as I mentioned: this is not a bug in SDN. The SDN only tries to resolve the DNS entry in the egressnetworkpolicy object; it's up to the user to make sure that it's correctly configured and able to be resolved in every pod across their cluster.

-Alex

Network policy is SDN, reassigning so the right folks can take a look.

Alex, assigning to you since it seems you were looking into it. If you have too many bugs on your plate, just unassign or ping me and we'll find a better home.

@Dan

As mentioned in comment 9, the problem is not egressnetworkpolicy; the problem is how the custom nameserver in /etc/resolv.conf is (or is supposed to be) propagated from the node to pods, and whether what the QE guys are doing when testing this is valid or not.

Could you please have a look at the procedure they've described in comment 1 and advise if:

1) It's the proper way to add a custom nameserver to OpenShift in 4.X (confirm if what I advised in comment 9 is correct)
2) It's the proper way to add a custom nameserver to OpenShift in 3.11 (confirm if what I advised in comment 7 is correct)

-Alex

(In reply to Alexander Constantinescu from comment #13)

I don't know if user modifications to /etc/resolv.conf are supported in v4. I believe the machine-config-operator owns it. My team owns dns-operator and ingress-operator, neither of which manage /etc/resolv.conf on nodes.

If you're concerned only with DNS forwarding from _pods_, upstream nameservers can be added using the DNS operator's forwarding API[1]. Note this will have _no effect_ on resolution from the node context; the DNS forwarding API applies only to DNS queries from pod clients.

Looking at the procedure you cited, I suspect (but can't confirm at this very moment) those SDN pods are using host network. If that's true, I recommend engaging the MCO team to understand the options available for managing node /etc/resolv.conf. Make sense?

[1] https://docs.openshift.com/container-platform/4.3/networking/dns-operator.html#nw-dns-forward_dns-operator
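For context, the forwarding API referenced in [1] is configured on the cluster DNS operator object named "default". Below is a minimal sketch for the ec2.internal zone used in this test; the upstream address is a placeholder, and, as noted above, this only affects DNS queries from pod clients, not host-network pods such as the SDN:

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: ec2-internal        # arbitrary name for this forwarding rule
    zones:
    - ec2.internal            # queries for this zone go to the upstream below
    forwardPlugin:
      upstreams:
      - 10.0.0.2              # placeholder: the custom nameserver's address
```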
Hi,

Understood, thanks for the help!

I will re-assign to the MCO team in that case, as the SDN pods are indeed running hostNetwork: true.

Could the MCO team please advise the QE guys how they're supposed to add a custom nameserver to their cluster, so that it can in turn be picked up by the SDN pod and the SDN can perform DNS name resolution against it?

-Alex

(In reply to Alexander Constantinescu from comment #15)

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-nameserver
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,<base_64_encoding_of_etc_resolv_conf>
        filesystem: root
        mode: 0644
        path: /etc/resolv.conf

The above should work. You'd rewrite the whole resolv.conf though, so check what's inside before overwriting - and this is just for testing, I suppose.

I'm moving back to Routing. Clearly shipping the new resolv.conf worked, right? So if it gets overwritten or isn't picked up by the SDN pod, it's not MCO.

Based on the following, this appears to be an SDN issue:
* /etc/resolv.conf has the expected name server,
* ping can resolve the name, and
* SDN fails to resolve the name.

Looking at the getIPsAndMinTTL function (https://github.com/openshift/sdn/blob/3bc47150a058cce2e6c5d01518b6ff6239e0959b/pkg/network/common/dns.go#L130-L161), it looks like SDN tries every name server in /etc/resolv.conf and fails with "failed to get a valid answer" if any one of the name servers responds with NXDOMAIN. To solve this issue, I believe the getIPsAndMinTTL function should ignore NXDOMAIN responses. There isn't an obvious reason to return an error if some servers respond with NXDOMAIN, so long as at least one server responds with an address.

A possible workaround for QE would be to (1) create another name server that forwards the ec2.internal zone to the custom name server and forwards everything else to the cluster DNS service (172.30.0.10), (2) delete any existing nameserver entries from /etc/resolv.conf, and (3) add a single entry for the name server created in the first step to /etc/resolv.conf. This way, getIPsAndMinTTL could resolve pod2.ec2.internal as well as other addresses (such as the API's) and would not get any NXDOMAIN responses for pod2.ec2.internal or other resolvable addresses.

Bumping the priority and moving to 4.6 so we can think about this.

Unassigning as I'm on long vacation.

Hi Huiran,

Could you reproduce this bug and provide me with a kubeconfig?

Thanks in advance!

/Alex

*** Bug 1861925 has been marked as a duplicate of this bug. ***
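As a footnote to the getIPsAndMinTTL analysis above, the sketch below illustrates the proposed NXDOMAIN-tolerant behavior using the miekg/dns library. It is a hypothetical, self-contained example, not the actual openshift/sdn code or a proposed patch; the function name resolveAny and the nameserver addresses are made up for illustration:

```go
package main

import (
	"fmt"
	"net"

	"github.com/miekg/dns"
)

// resolveAny queries every configured nameserver for A records and succeeds
// if at least one of them returns an address. An NXDOMAIN answer from an
// individual server is skipped instead of being treated as a fatal error.
func resolveAny(name string, nameservers []string) ([]net.IP, error) {
	client := &dns.Client{}
	msg := &dns.Msg{}
	msg.SetQuestion(dns.Fqdn(name), dns.TypeA)

	var ips []net.IP
	for _, server := range nameservers {
		reply, _, err := client.Exchange(msg, net.JoinHostPort(server, "53"))
		if err != nil {
			continue // unreachable server: try the next one
		}
		if reply.Rcode == dns.RcodeNameError {
			continue // NXDOMAIN here: ignore, another server may know the name
		}
		for _, rr := range reply.Answer {
			if a, ok := rr.(*dns.A); ok {
				ips = append(ips, a.A)
			}
		}
	}
	if len(ips) == 0 {
		return nil, fmt.Errorf("no nameserver returned an address for %s", name)
	}
	return ips, nil
}

func main() {
	// Example: ask both a custom nameserver and the cluster DNS service
	// (both addresses are placeholders) for the test name from this bug.
	ips, err := resolveAny("pod2.ec2.internal", []string{"10.0.0.2", "172.30.0.10"})
	fmt.Println(ips, err)
}
```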