Description of problem:

When using the ClusterAutoscaler to scale nodes in and out, DNS name resolution sometimes fails on pods running on nodes that are NOT scale-in targets.

In OCP 4, a dns-default pod runs on each node as part of a daemonset, but because pods resolve names via the dns-default service, each query is routed to a randomly chosen dns-default pod. If the node hosting that dns-default pod is being scaled in, the dns-default pod terminates with it, so the DNS query is rejected and the lookup fails.

NEC has the following ideas for a workaround and for permanent fixes, but would like to start with the workaround. Could you please check whether the workaround is acceptable and, if it is not, show us another workaround as soon as possible?

Version-Release number of selected component (if applicable):
4.5

How reproducible:
DNS name resolution sometimes fails on pods running on nodes that are not scale-in targets.

Steps to Reproduce:
See "Description of problem:"

Actual results:
DNS name resolution fails.

Expected results:
It should not fail.

Additional info:

Workaround
- Set a nodeSelector on the openshift-dns project so that the dns-default pods run only on master or infra nodes.
- In addition, instead of relying on the dns-node-resolver container in the dns-default pod, use a MachineConfig to embed a service that writes the image-registry.openshift-image-registry.svc entries to /etc/hosts.

Suggested fixes to the problem
- Plan A: Run the dns-default pod on the hostNetwork, so that each pod uses the dns-default pod on its own node.
  => Kubernetes can achieve the same mechanism with a node-local DNS cache: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
- Plan B: Add a preStop hook with a sleep to the dns-default pod so that it waits until it has been removed from the service endpoints. (In Kubernetes, pod removal and removal of the pod from the service endpoints happen asynchronously, so a preStop hook is needed to prevent problems like this one.) A rough sketch of such a hook is shown below.
- Plan C: As in the workaround, make the dns-default pod run only on a subset of nodes.

Since the current implementation does not benefit from the performance improvement of running a dns-default pod on each node, NEC believes that Plan A is the best.
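For illustration only, a minimal sketch of what Plan B's preStop hook could look like as a change to the dns-default daemonset; the container name and sleep duration are assumptions, not a tested or endorsed configuration:

~~~
# Hypothetical Plan B sketch (not the actual operator-managed manifest).
# The sleep gives the endpoints controller and kube-proxy time to drop the
# pod from the dns-default service before CoreDNS stops answering.
spec:
  template:
    spec:
      containers:
      - name: dns                        # assumed container name
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]   # assumed grace period
~~~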
> Instead of the dns-node-resolver container in the dns-default pod
> image-registry.openshift-image-registry.svc entries
> Embed the service to write to /etc/hosts in MachineConfig

The following is the MachineConfig we defined for the above.

---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-my-etc-hosts-update
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKc2V0IC11byBwaXBlZmFpbAp0cmFwICdqb2JzIC1wIHwgeGFyZ3Mga2lsbCB8fCB0cnVlOyB3YWl0OyBleGl0IDAnIFRFUk0KT1BFTlNISUZUX01BUktFUj0ib3BlbnNoaWZ0LWdlbmVyYXRlZC1ub2RlLXJlc29sdmVyIgpTRVJWSUNFUz0iaW1hZ2UtcmVnaXN0cnkub3BlbnNoaWZ0LWltYWdlLXJlZ2lzdHJ5LnN2YyIKQ0xVU1RFUl9ET01BSU49Y2x1c3Rlci5sb2NhbApOQU1FU0VSVkVSPTE3Mi4zMC4wLjEwCkhPU1RTX0ZJTEU9Ii9ldGMvaG9zdHMiClRFTVBfRklMRT0iL2V0Yy9ob3N0cy50bXAiCklGUz0nLCAnIHJlYWQgLXIgLWEgc2VydmljZXMgPDw8ICIke1NFUlZJQ0VTfSIKIyBNYWtlIGEgdGVtcG9yYXJ5IGZpbGUgd2l0aCB0aGUgb2xkIGhvc3RzIGZpbGUncyBhdHRyaWJ1dGVzLgpjcCAtZiAtLWF0dHJpYnV0ZXMtb25seSAiJHtIT1NUU19GSUxFfSIgIiR7VEVNUF9GSUxFfSIKd2hpbGUgdHJ1ZTsgZG8KICBkZWNsYXJlIC1BIHN2Y19pcHMKICBpZiBbICQoY3JpY3RsIHBvZHMgLS1sYWJlbCBkbnMub3BlcmF0b3Iub3BlbnNoaWZ0LmlvL2RhZW1vbnNldC1kbnM9ZGVmYXVsdCB8IHdjIC1sKSAtZXEgMSBdCiAgdGhlbgogICAgZm9yIHN2YyBpbiAiJHtzZXJ2aWNlc1tAXX0iOyBkbwogICAgICAjIEZldGNoIHNlcnZpY2UgSVAgZnJvbSBjbHVzdGVyIGRucyBpZiBwcmVzZW50LiBXZSBtYWtlIHNldmVyYWwgdHJpZXMKICAgICAgIyB0byBkbyBpdDogSVB2NCwgSVB2NiwgSVB2NCBvdmVyIFRDUCBhbmQgSVB2NiBvdmVyIFRDUC4gVGhlIHR3byBsYXN0IG9uZXMKICAgICAgIyBhcmUgZm9yIGRlcGxveW1lbnRzIHdpdGggS3VyeXIgb24gb2xkZXIgT3BlblN0YWNrIChPU1AxMykgLSB0aG9zZSBkbyBub3QKICAgICAgIyBzdXBwb3J0IFVEUCBsb2FkYmFsYW5jZXJzIGFuZCByZXF1aXJlIHJlYWNoaW5nIEROUyB0aHJvdWdoIFRDUC4KICAgICAgY21kcz0oJ2RpZyAtdCBBIEAiJHtOQU1FU0VSVkVSfSIgK3Nob3J0ICIke3N2Y30uJHtDTFVTVEVSX0RPTUFJTn0iJwogICAgICAgICAgICAnZGlnIC10IEFBQUEgQCIke05BTUVTRVJWRVJ9IiArc2hvcnQgIiR7c3ZjfS4ke0NMVVNURVJfRE9NQUlOfSInCiAgICAgICAgICAgICdkaWcgLXQgQSArdGNwIEAiJHtOQU1FU0VSVkVSfSIgK3Nob3J0ICIke3N2Y30uJHtDTFVTVEVSX0RPTUFJTn0iJwogICAgICAgICAgICAnZGlnIC10IEFBQUEgK3RjcCBAIiR7TkFNRVNFUlZFUn0iICtzaG9ydCAiJHtzdmN9LiR7Q0xVU1RFUl9ET01BSU59IicpCiAgICAgIGZvciBpIGluICR7IWNtZHNbKl19CiAgICAgIGRvCiAgICAgICAgaXBzPSgkKGV2YWwgIiR7Y21kc1tpXX0iKSkKICAgICAgICBpZiBbWyAiJD8iIC1lcSAwICYmICIkeyNpcHNbQF19IiAtbmUgMCBdXTsgdGhlbgogICAgICAgICAgc3ZjX2lwc1siJHtzdmN9Il09IiR7aXBzW0BdfSIKICAgICAgICAgIGJyZWFrCiAgICAgICAgZmkKICAgICAgZG9uZQogICAgZG9uZQogICAgIyBVcGRhdGUgL2V0Yy9ob3N0cyBvbmx5IGlmIHdlIGdldCB2YWxpZCBzZXJ2aWNlIElQcwogICAgIyBXZSB3aWxsIG5vdCB1cGRhdGUgL2V0Yy9ob3N0cyB3aGVuIHRoZXJlIGlzIGNvcmVkbnMgc2VydmljZSBvdXRhZ2Ugb3IgYXBpIHVuYXZhaWxhYmlsaXR5CiAgICAjIFN0YWxlIGVudHJpZXMgY291bGQgZXhpc3QgaW4gL2V0Yy9ob3N0cyBpZiB0aGUgc2VydmljZSBpcyBkZWxldGVkCiAgICBpZiBbWyAtbiAiJHtzdmNfaXBzWypdLX0iIF1dOyB0aGVuCiAgICAgICMgQnVpbGQgYSBuZXcgaG9zdHMgZmlsZSBmcm9tIC9ldGMvaG9zdHMgd2l0aCBvdXIgY3VzdG9tIGVudHJpZXMgZmlsdGVyZWQgb3V0CiAgICAgIGdyZXAgLXYgIiMgJHtPUEVOU0hJRlRfTUFSS0VSfSIgIiR7SE9TVFNfRklMRX0iID4gIiR7VEVNUF9GSUxFfSIKICAgICAgIyBBcHBlbmQgcmVzb2x2ZXIgZW50cmllcyBmb3Igc2VydmljZXMKICAgICAgZm9yIHN2YyBpbiAiJHshc3ZjX2lwc1tAXX0iOyBkbwogICAgICAgIGZvciBpcCBpbiAke3N2Y19pcHNbJHtzdmN9XX07IGRvCiAgICAgICAgICBlY2hvICIke2lwfSAke3N2Y30gJHtzdmN9LiR7Q0xVU1RFUl9ET01BSU59ICMgJHtPUEVOU0hJRlRfTUFSS0VSfSIgPj4gIiR7VEVNUF9GSUxFfSIKICAgICAgICBkb25lCiAgICAgIGRvbmUKICAgICAgIyBUT0RPOiBVcGRhdGUgL2V0Yy9ob3N0cyBhdG9taWNhbGx5IHRvIGF2b2lkIGFueSBpbmNvbnNpc3RlbnQgYmVoYXZpb3IKICAgICAgIyBSZXBsYWNlIC9ldGMvaG9zdHMgd2l0aCBvdXIgbW9kaWZpZWQgdmVyc2lvbiBpZiBuZWVkZWQKICAgICAgY21wICIke1RFTVBfRklMRX0iICIke0hPU1RTX0ZJTEV9IiB8fCBjcCAtZiAiJHtURU1QX0ZJTEV9IiAiJHtIT1NUU19GSUxFfSIKICAgICAgIyBURU1QX0ZJTEUgaXMgbm90IHJlbW92ZWQgdG8gYXZvaWQgZmlsZSBjcmVhdGUvZGVsZXRlIGFuZCBhdHRyaWJ1dGVzIGNvcHkgY2h1cm4KICAgIGZpCiAgZmkKICBzbGVlcCA2MCAmIHdhaXQKICB1bnNldCBzdmNfaXBzCmRvbmUKCg==
        mode: 0640
        overwrite: true
        filesystem: root
        path: /usr/local/my-etc-hosts-update.sh
    systemd:
      units:
      - contents: |
          [Unit]
          Description=My /etc/hosts update script instead of coredns
          Wants=kubelet.service crio.service
          After=kubelet.service crio.service

          [Service]
          Type=simple
          Restart=always
          RestartSec=10
          ExecStart=/bin/bash /usr/local/my-etc-hosts-update.sh

          [Install]
          WantedBy=multi-user.target
        enabled: true
        name: my-etc-hosts-update.service
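Once the MachineConfig above has rolled out, one way to confirm that the unit is running and that the entry is being maintained (the node name below is only an example, not from the affected cluster):

~~~
$ oc debug node/worker-0 -- chroot /host systemctl is-active my-etc-hosts-update.service
$ oc debug node/worker-0 -- chroot /host grep openshift-generated-node-resolver /etc/hosts
~~~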
There may be a readiness probe change that we could pull out of https://github.com/openshift/cluster-dns-operator/pull/205#discussion_r502509116.
Dear Red Hat,

Thank you for the update.

However:
- Why is the target version 4.7? It's a big problem and a regression compared with OCP3, as Furuta-san commented at Comment #0. This must be fixed even in 4.5 and 4.6.
- Why is this problem related to the readiness probe? This is caused by the asynchronous behavior between the removal of pods and their exclusion from the service endpoints, as Furuta-san mentioned at Comment #0. What effect can a readiness probe have on this issue?

And, as Furuta-san mentioned at Comment #0, we would like to know first whether Red Hat will approve the workaround we suggested. Our customer is still facing this issue, and their migration plan (OCP3 -> OCP4) has been delayed due to that.

Please provide workaround information first. We already suggested the workaround, as Furuta-san mentioned at Comment #0. Why does it take so long to answer? Please consider that our customer is in deep trouble due to this issue even now.

Best Regards,
Masaki Hatada
(In reply to Andrew McDermott from comment #2)

Hello Andrew McDermott,

Thank you for your feedback. I look forward to that being incorporated in 4.7 and backported to the current versions. However, as Hatada-san said, NEC needs the workaround clarified for the time being.

As NEC asked:

> Additional info:
>
> Workaround
>
> Set the nodeSelector to the openshift-dns project, and
> Make the dns-default pod work only on master or infra

NEC will do this as follows. I (TAM) think this is completely valid, as it is simply working around the fact that "taints" are currently not working here.

~~~
Step 1: Set the node-selector of the openshift-dns project to "node-role.kubernetes.io/master=".

  $ oc annotate namespace openshift-dns openshift.io/node-selector="node-role.kubernetes.io/master=" --overwrite

Step 2: Delete all pods in the openshift-dns project so that the dns pods are recreated.

  $ oc delete pods --all -n openshift-dns
~~~

And:

> +
> Instead of the dns-node-resolver container in the dns-default pod
> image-registry.openshift-image-registry.svc entries
> Embed the service to write to /etc/hosts in MachineConfig

NEC will use the MachineConfig shown in comment 1 to update /etc/hosts to match this change.

TAM personally thinks this setting is valid too, but would you please confirm whether it is valid for this case, or whether there is an alternative? Since this issue has already affected the customer's production environment, NEC strongly needs to confirm with Red Hat's cooperation, as soon as possible, that this is a valid temporary workaround.

I am grateful for your help and clarification.

Thank you,

BR,
Masaki
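As a quick sanity check (not part of NEC's original steps), one way to confirm that the dns-default pods were rescheduled onto master nodes only after the two steps above:

~~~
$ oc get pods -n openshift-dns -o wide
# All dns-default pods should now be scheduled on master nodes only.
~~~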
(In reply to Masaki Hatada from comment #3)
> Dear Red Hat,
>
> Thank you for the update.
>
> However:
> - Why is the target version 4.7? It's a big problem and a regression
> compared with OCP3 as Furuta-san commented at Comment #0. This must be fixed
> even in 4.5 and 4.6.

The fix would first be made to 4.7 (i.e., the master branch) and then, if appropriate, we would do a backport. We would first fix it there.

> - Why is this problem related to readiness probe? This is caused by the

This was an observation that occurred during our bug triage process. The intent was to capture the comment here so that it wasn't lost, and to double-check whether the lack of a readiness probe was contributing to this issue.

> asynchronous behavior between the removal of pods and destination exclusion
> from the service as Furuta-san mentioned at Comment #0. What can readiness
> probe have a effect for this issue?
>
> And, as Furuta-san mentioned at Comment #0, we would like to know first
> whether Red Hat will approve the workaround we suggested.
> Our customer is still facing this issue and their migrating plan(OCP3->OCP4)
> has been delayed due to that.
>
> Please provide workaround information first.
> We already suggested the workaround as Furuta-san mentioned at Comment #0.
> Why does it take for a long time to answer?
> Please consider that our customer is in a deep trouble due to this issue
> even now.
>
> Best Regards,
> Masaki Hatada
(In reply to Andrew McDermott from comment #5)

Hello Andrew McDermott,

Thank you for responding to Hatada-san, NEC. I understand that this will be backported to 4.5.

> (In reply to Masaki Hatada from comment #3)
<...>
> We would first fix
> > - Why is this problem related to readiness probe? This is caused by the
>
> This was an observation that occurred during our bug triage process.
> The intent was to capture the comment here so that it wasn't lost and to
> double-check whether the lack of a readiness probe was not helping this
> issue.

To be on the same page, would you please share the exact steps to confirm the above?

> > asynchronous behavior between the removal of pods and destination exclusion
> > from the service as Furuta-san mentioned at Comment #0. What can readiness
> > probe have a effect for this issue?
> >
> > And, as Furuta-san mentioned at Comment #0, we would like to know first
> > whether Red Hat will approve the workaround we suggested.
> > Our customer is still facing this issue and their migrating plan(OCP3->OCP4)
> > has been delayed due to that.
> >
> > Please provide workaround information first.
> > We already suggested the workaround as Furuta-san mentioned at Comment #0.
> > Why does it take for a long time to answer?
> > Please consider that our customer is in a deep trouble due to this issue
> > even now.

As I forwarded NEC's additional feedback in comment 7, and as Hatada-san of NEC said in comment 3, NEC needs an urgent response from Red Hat on their proposed workaround, as the problem is ongoing and is having a huge negative impact on their end customer's business.

So, my huge apologies for urging you again, but would you please take a look at their comment 3 and share any thoughts on their workaround, i.e., whether it is valid or not?

comment 7:
~~~
Sorry for bothering you again, but could you show when we got the answer?

As we mentioned before, now customer is facing the problem.
And we already shared our idea of the workaround with Red Hat. Then the
only task of Red Hat is just checking whether it is valid or not.
Why has it taken for a long time? We don't even know what Red Hat is
doing for it... How can we explain this situation to our customer?
~~~
(In reply to Masaki Furuta from comment #8)
> (In reply to Andrew McDermott from comment #5)
>
> Hello Andrew McDermott ,
>
> Thank you for responding to Hatada-san, NEC.
> I understand that this'll be going to backport to 4.5.
>
> > (In reply to Masaki Hatada from comment #3)
> <...>
> > We would first fix
> > > - Why is this problem related to readiness probe? This is caused by the
> >
> > This was an observation that occurred during our bug triage process.
> > The intent was to capture the comment here so that it wasn't lost and to
> > double-check whether the lack of a readiness probe was not helping this
> > issue.
>
> To be on the same page, would you please share the exact steps to confirm
> above ?
>
> > > asynchronous behavior between the removal of pods and destination exclusion
> > > from the service as Furuta-san mentioned at Comment #0. What can readiness
> > > probe have a effect for this issue?
> > >
> > > And, as Furuta-san mentioned at Comment #0, we would like to know first
> > > whether Red Hat will approve the workaround we suggested.
> > > Our customer is still facing this issue and their migrating plan(OCP3->OCP4)
> > > has been delayed due to that.
> > >
> > > Please provide workaround information first.
> > > We already suggested the workaround as Furuta-san mentioned at Comment #0.
> > > Why does it take for a long time to answer?
> > > Please consider that our customer is in a deep trouble due to this issue
> > > even now.
>
> As I've forwarded addtional feedback from NEC onto comment 7 , and
> Hatada-san, NEC said at comment 3, NEC needs urgent response from RH to
> obtain comment on their idea of workaround, as the problem is on-going and
> affecting huge negative impact on their end customer's business.
>
> So,, My huge apologies for urging you again, but would you please take a
> look at their comment at comment 3 and any thought on their workaround; if
> that would be a valid or not ?
>
> comment 7:
> ~~~
> Sorry for bothering you again, but could you show when we got the answer?
>
> As we mentioned before, now customer is facing the problem.
> And we already shared our idea of the workaround with Red Hat. Then the
> only task of Red Hat is just checking whether it is valid or not.
> Why has it taken for a long time? We don't even know what Red Hat is
> doing for it... How can we explain this situation to our customer?
> ~~~

We (Miciah) did some testing and due diligence with the proposed workaround and are happy with it. Quoting Miciah:

I experimented by annotating the namespace:

  $ oc annotate --overwrite namespaces openshift-dns openshift.io/node-selector=node-role.kubernetes.io/master=

3 of the 6 pods were deleted, as expected. Then I got the yaml for one of the remaining pods, deleted it, waited for its replacement to become available, got the replacement's yaml, and diffed. I see nodeSelector has the expected change.

I've checked `oc -n openshift-dns get pods` a couple times and don't see any churn. So the daemonset controller and scheduler are not fighting.
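For reference, a rough sketch of the check Miciah describes, assuming placeholder pod names (the actual dns-default pod names will differ per cluster):

~~~
# Hypothetical pod names for illustration.
$ oc -n openshift-dns get pod dns-default-abcde -o yaml > before.yaml
$ oc -n openshift-dns delete pod dns-default-abcde
$ oc -n openshift-dns get pods                       # wait for the replacement to become Ready
$ oc -n openshift-dns get pod dns-default-fghij -o yaml > after.yaml
$ diff before.yaml after.yaml                        # nodeSelector should now include the master role
~~~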
Dear Andrew,

Thank you for testing!

Did you guys check Comment #3 as well?
As Comment #0 explained, we have to create a service to update /etc/hosts instead of the dns-default pod if we set a node-selector in the openshift-dns namespace.

Did you check whether Comment #3 is valid?

Best Regards,
Masaki Hatada
(In reply to Masaki Hatada from comment #11)

Hello Andrew McDermott,

I'd like to provide some additional input from the TAM.

Regarding Hatada-san's request: the quoted comment from Miciah only explicitly covers the annotation that stops the dns pods from running on workers and its result (which works fine), but please also clarify the proposed workaround for DNS name resolution itself.

Here is the reason:

- Because of the default resolver setting in pods, name resolution through the service is randomly directed to any node's dns-default pod and has to return a result within a certain time window, regardless of whether a scale in/out is in progress.

- As NEC has already emphasized, the customer's clusters are in production, and NEC considers that there is a non-negligible risk even in applying this workaround on the fly on their sites.

- NEC therefore thinks that, in addition to preventing dns pods from running on worker nodes, the resolver settings need to be modified as well so that nodes do not rely on the dns operator's pods, in order to take their customer out of this situation completely.

- This is why NEC is also thinking of falling back from the dns-default pod to a file-based static table (the /etc/hosts file) at the same time.

Dear Hatada-san,

Please feel free to add anything missing from my understanding, and please correct me if my understanding is incorrect.

I am grateful for your continued help and clarification.

Thank you so much,

BR,
Masaki
Dear Furuta-san,

Thank you for assisting.

> - Thus NEC seems to think that not only preventing running dns pods on worker nodes, need to modify resolver settings as well so that it won't refer to dns-operator and take some response to remove their customers from such situations completely.
>
> - This would be the reason why NEC's also thinking of fall back from dns-default pod to file based static table (/etc/hosts file) at the same time .

The reason we want to configure the setting in Comment #1 is that something needs to update /etc/hosts on new worker nodes in place of the dns-default pod.

The dns-default pod maintains the following entry in /etc/hosts. Thanks to that, each worker node can access the OpenShift image registry.

---
172.30.32.221 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
---

If no dns-default pod is launched on a new node after it is added by the ClusterAutoscaler, /etc/hosts has no image-registry entry, so the new node cannot access the OpenShift image registry. So the setting in Comment #1 is needed to update /etc/hosts instead of the dns-default pod.

(Sorry, I wrote Comment #3 by mistake.. Comment #1 is right)

Best Regards,
Masaki Hatada
(In reply to Masaki Hatada from comment #13)

Thank you so much for the clarification, Hatada-san.

Andrew McDermott,

Would you please go ahead and discuss this with the engineering team, then clarify it and share feedback with us?

I am grateful for your continued help and support.

Thank you,

BR,
Masaki
(In reply to Masaki Hatada from comment #11)
> Dear Andrew,
>
> Thank you for testing!
>
> Did you guys check Comment #3 as well?
> As Comment #0 explained, we have to create a service to update /etc/hosts
> instead of dns-default pod if we set node-selector in openshift-dns
> namespace.
>
> Did you check whether Comment #3 is valid?
>
> Best Regards,
> Masaki Hatada

We discussed this today and had some concerns and observations.

The script is a copy of the node-resolver script. There is a known issue with that script as it stands today: under some circumstances it can result in a busted/corrupt /etc/hosts. See the following bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=1882485#c11

The customer needs to watch for and apply fixes to the resolver script for as long as the workaround is in place.

There is also a lot of customisation going on here, between annotating the namespace and adding the custom script to manage /etc/hosts. Right now we don't reconcile the namespace at all, so we won't stomp on the annotation. The customer is using the MachineConfig API, so that should be fine in itself, and the modifications to /etc/hosts look safe - though heed the warning about the corruption just mentioned.

Looking to the future, an actual fix for this issue would look like this PR:

  https://github.com/openshift/cluster-dns-operator/pull/209
Dear Andrew,

Thank you for the information.

> The script is a copy of the node resolver script. There is this
> possible issue we have with the script as it is today and under some
> circumstances it can result in a busted/corrupt /etc/hosts. See the
> following bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1882485#c11

Just yesterday we hit the above issue, so we modified the script as follows.

  if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]]; then
   |
   V
  if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]] && [ "${ips[0]}" != ";;" ]; then

> Looking to the future, An actual fix for this issue would look like
> this PR:
>
> https://github.com/openshift/cluster-dns-operator/pull/209

I understand that the above PR is not for Bug 1882485 but for the original issue of this bugzilla (Bug 1903451).

That means:
- Red Hat took "Plan C" from Comment #0 to fix the issue
- the node-resolver container will be separated from the dns-default pod, because otherwise the blanket toleration cannot be removed from the dns-default pod

Please let us know if I am wrong.

Best Regards,
Masaki Hatada
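For context, a hedged sketch of how that modified guard sits inside the lookup loop from the comment 1 script (variable names follow that script; the surrounding lines are abbreviated):

~~~
for i in ${!cmds[*]}
do
  ips=($(eval "${cmds[i]}"))
  # Reject answers whose first token is dig's ";;" comment/diagnostic marker,
  # which previously slipped through and could corrupt /etc/hosts (Bug 1882485).
  if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]] && [ "${ips[0]}" != ";;" ]; then
    svc_ips["${svc}"]="${ips[@]}"
    break
  fi
done
~~~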
Dear Furuta-san,

We got the comment from Andrew.

Although the script we implemented with the MachineConfig has the problem he mentioned, the situation is the same even if we don't apply it, since the node-resolver script executed by the dns-default pod has the same problem, according to Bug 1882485. (And there is no information yet in Bug 1882485 about how to fix it.)

So the workaround we suggested is still the only way to avoid the issue the customer faced.

Now we have a question. On the Red Hat side, who will give the green light to our workaround? Or does Comment #12 mean that Red Hat has accepted our workaround?

Best Regards,
Masaki Hatada
(In reply to Masaki Hatada from comment #18)
> Now we have a question. In Red Hat side, who will give the green light to
> our workaround?
> Or, does Comment #12 mean that Red Had accepted our workaround?

Hatada-san,

Thank you for your confirmation.

In this case, the TAM's view is that the workaround NEC is requesting must be backed by technical grounds firmly agreed with the Red Hat engineering team. So I recommend continuing to confirm it with the engineering team via Andrew McDermott, so that it will work for the customer in their use case.

Hello Andrew McDermott,

Regarding Hatada-san's request, would you please take a look and share any further suggestions (from the team) on the revised script?

I am grateful for your help and clarification.

Thank you so much,

BR,
Masaki
(In reply to Masaki Hatada from comment #17)
> Dear Andrew,
>
> Thank you for your information.
>
> > The script is a copy of the node resolver script. There is this
> > possible issue we have with the script as it is today and under some
> > circumstances it can result in a busted/corrupt /etc/hosts. See the
> > following bug:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1882485#c11
>
> Just yesterday we faced the above issue then we modified the script as
> follows.
>
>    if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]]; then
>     |
>     V
>    if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]] && [ "${ips[0]}" != ";;" ]; then

Thank you for this.

> > Looking to the future, An actual fix for this issue would look like
> > this PR:
> >
> > https://github.com/openshift/cluster-dns-operator/pull/209
>
> I understood that the above is not for Bug 1882485 but the original issue of
> this bugzilla(Bug 1903451).
>
> That means:
> - Red Hat took "Plan C" of Comment #0 for fixing the issue
> - node-resolver container will be separated from dns-default pod because a
> blanket toleration cannot be removed from dns-default pod due to that
>
> Please let us know if I was wrong.
>
> Best Regards,
> Masaki Hatada
(In reply to Masaki Hatada from comment #17)
> Dear Andrew,
>
> Thank you for your information.
>
> > The script is a copy of the node resolver script. There is this
> > possible issue we have with the script as it is today and under some
> > circumstances it can result in a busted/corrupt /etc/hosts. See the
> > following bug:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1882485#c11
>
> Just yesterday we faced the above issue then we modified the script as
> follows.
>
>    if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]]; then
>     |
>     V
>    if [[ "$?" -eq 0 && "${#ips[@]}" -ne 0 ]] && [ "${ips[0]}" != ";;" ]; then
>
> > Looking to the future, An actual fix for this issue would look like
> > this PR:
> >
> > https://github.com/openshift/cluster-dns-operator/pull/209
>
> I understood that the above is not for Bug 1882485 but the original issue of
> this bugzilla(Bug 1903451).

That is correct.
(In reply to Masaki Furuta from comment #12)
> (In reply to Masaki Hatada from comment #11)
> Hello Andrew McDermott,
>
> I'd like to provide some additional input from TAM.
>
> As for Hatada-san's request, your quoted comment from Miciah only expilitly
> mentioned about annotation not to run dns pod and its result (that works
> fine.), but please also clarify as well about proposed workaround for DNS
> name resolution;
>
> Here's reason;
>
> - Because of default resolver setting in dns pods, the resolver on the
> service would be randomly pointing to any of nodes to resolve name and
> trying to get a result within a certain time window,regardless of whether in
> the middle of scaling in/out or not.
>
> - This time , as NEC has already emphasized, customer's clusters are in
> production and considers that there would be a non-negligible risk even on
> applying this workaround on the fly on their sites.

We haven't explicitly said "OK, supported" because this is not a configuration that is actively tested, and it is being applied to a production cluster.

I will work through "Plan C" today - I would like to be able to reproduce this to understand the consequences.

> - Thus NEC seems to think that not only preventing running dns pods on
> worker nodes, need to modify resolver settings as well so that it won't
> refer to dns-operator and take some response to remove their customers from
> such situations completely.
>
> - This would be the reason why NEC's also thinking of fall back from
> dns-default pod to file based static table (/etc/hosts file) at the same
> time .
>
> Dear Hatada-san,
>
> Please feel free to add anything insufficient in my understanding, and/or
> would you please correct me if my understanding is incorrect ?
>
> I am grateful for your continued help and clarification.
>
> Thank you so much,
>
> BR,
> Masaki
Dear Furuta-san, Andrew,

So, what should we do? Is the only option for the customer to wait until Red Hat releases the errata?

> We haven't explicitly said "OK, supported" because this is not a configuration that is actively tested and it is being applied to a production cluster.

But it is a workaround. Any workaround is created after someone hits an issue; nobody has tested it deeply right after it is found. We know that it will take time to release the errata, so providing a workaround is a VERY important part of supporting any product.

Why did we file a support case for this issue? The reason is that we want Red Hat to approve our workaround. But the Red Hat support team does nothing and the developer team just creates an errata... What is "Support"? Is there no point in our having created the support case? It's terrible for the customer... Who will handle this?

Honestly, it doesn't matter who handles this. But it is a big problem if nobody handles it.

Best Regards,
Masaki Hatada
(In reply to Masaki Hatada from comment #23)
> Dear Furuta-san, Andrew,
>
> So, what should we do?
> Is the only option for customer is to wait until Red Hat releases the errata?

No. As I mentioned previously, today I was testing "plan C" again to make doubly sure there is nothing we have missed.

>
> > We haven't explicitly said "OK, supported" because this is not a configuration that is actively tested and it is being applied to a production cluster.
>
> Since it's a workaround. Any workaround is made after someone faced an
> issue. Nobody has tested it deeply just after the workaround was found.
> We know that it will take a time to release the errata. So providing a
> workaround is VERY important task for supporting any product.
>
> Why did we file a support case for this issue? The reason is that we want
> Red Hat to approve our workaround.
> But Red Hat support team does nothing and developer team just creates an
> errata...
> What is "Support"? Is there no meaning that we created the support case?
> It's too terrible for customer... Who will handle this?
> Honestly, it doesn't matter who will handle this. But it's a big problem if
> nobody handle this.
>
> Best Regards,
> Masaki Hatada
(In reply to Masaki Furuta from comment #19)
> (In reply to Masaki Hatada from comment #18)
> > Now we have a question. In Red Hat side, who will give the green light to
> > our workaround?
> > Or, does Comment #12 mean that Red Had accepted our workaround?
>
> Hatada-san,
>
> Thank you for your confirmation.
>
> In this case, TAM think that NEC request RH the workaround, which must be
> backed by technical grounds agreed with RH engineering team firmly.
>
> So I do recommend to continue to confirm with engineering team via Andrew
> McDermott so that it'll work for the customers in their use case.
>
> Hello Andrew McDermott,
>
> As for Hatada-san's request, would you please take a look and any further
> suggestions (from the team) on revised script ?
>
> I am grateful for your help and clarification.
>
> Thank you so much,
>
> BR,
> Masaki

I ran through the workaround steps today:

- used the steps from https://bugzilla.redhat.com/show_bug.cgi?id=1903451#c10
- applied the machineconfig from comment #1 that writes /etc/hosts
- did some sanity testing with the autoscaler, scaling out/in
- scaled in/out both existing and new worker nodes
- verified that /etc/hosts gets written in all cases on the worker nodes
- exercised DNS queries in test pods

I'm comfortable giving the go-ahead for the proposed workaround.
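For reference, one way to exercise DNS from a throwaway test pod, along the lines of the checks described above (the image and query name here are only examples, not the exact commands used in the verification):

~~~
$ oc run dns-test --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    getent hosts kubernetes.default.svc.cluster.local
~~~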
Dear Andrew,

Thank you for evaluating!

We deployed our workaround at our customer and have confirmed that this issue no longer occurs. We really appreciate your great help.

Best Regards,
Masaki Hatada
Updating target-release as it appears this will not make the 4.7 release schedule.
I'll work on restoring the readiness probe change from <https://github.com/openshift/cluster-dns-operator/pull/205#discussion_r502509116> in the next sprint. In addition, we'll look into the possibility of separating the container that updates /etc/hosts and the container that runs CoreDNS into separate daemonsets, continuing work started in <https://github.com/openshift/cluster-dns-operator/pull/209>. We are also looking at ways to route queries to the local DNS pod, but this work is more tentative.
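Purely as an illustration of the kind of readiness probe being discussed (the actual change is whatever lands from the linked PR discussion; the port and timings below assume CoreDNS's "ready" plugin defaults and are not taken from the operator's manifest):

~~~
# Illustrative sketch only - not the operator-managed configuration.
readinessProbe:
  httpGet:
    path: /ready
    port: 8181          # assumed CoreDNS ready-plugin endpoint
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 3
~~~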
*** This bug has been marked as a duplicate of bug 1919737 ***