Bug 2027836 - The spoke cluster deployment gets stuck when api-int entry for the cluster is missing in the external DNS.
Summary: The spoke cluster deployment gets stuck when the api-int entry for the cluster is missing in the external DNS.
Keywords:
Status: CLOSED DUPLICATE of bug 2029438
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.4.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: rhacm-2.5
Assignee: Ori Amizur
QA Contact: bjacot
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-30 20:37 UTC by Alexander Chuzhoy
Modified: 2022-09-07 02:29 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-29 13:34:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github open-cluster-management backlog issues 18211 0 None None None 2021-12-05 21:21:48 UTC
Red Hat Issue Tracker MGMTBUGSM-551 0 None None None 2022-09-07 02:29:29 UTC

Description Alexander Chuzhoy 2021-11-30 20:37:22 UTC
Version:
OCP: 4.10.0-0.nightly-2021-11-29-142540
ACM: 2.4.1-DOWNSTREAM-2021-11-22-20-58-05

Steps to reproduce:
Try to deploy a spoke compact cluster with 3 controllers (no workers).
Only the api and the ingress entries exist in the external DNS; there is no api-int entry.
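
For reference, the relevant zone in the external DNS looked roughly like this at that point (a sketch with illustrative record syntax; the names and IPs are taken from the manifests below):

api.elvis2.qe.lab.redhat.com.     IN A   192.168.123.106   ; apiVIP
*.apps.elvis2.qe.lab.redhat.com.  IN A   192.168.123.105   ; ingressVIP
; no api-int.elvis2.qe.lab.redhat.com record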


Result:
The spoke compact cluster deployment gets stuck during the bootstrap phase.
The 2 masters don't complete starting all of their containers.

Looking at the 2 started masters, the last container (which is constantly restarting) is verify-api-int-resolvable.


After about one hour, I added an entry for api-int to the DNS server used by the setup. Right after verify-api-int-resolvable restarted, many more containers started.
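
The added record was presumably along these lines (illustrative; it is assumed here that api-int points at the same apiVIP as api):

api-int.elvis2.qe.lab.redhat.com.  IN A   192.168.123.106   ; apiVIP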

I re-attempted the spoke deployment on the same setup with the same config, and it went smoothly with the api-int entry in DNS.




oc get agentclusterinstalls.extensions.hive.openshift.io -o json|jq .items[].spec
{
  "apiVIP": "192.168.123.106",
  "clusterDeploymentRef": {
    "name": "elvis2"
  },
  "clusterMetadata": {
    "adminKubeconfigSecretRef": {
      "name": "elvis2-admin-kubeconfig"
    },
    "adminPasswordSecretRef": {
      "name": "elvis2-admin-password"
    },
    "clusterID": "dce6b348-e15e-4a1b-8c43-ca326e41efad",
    "infraID": "f76b1183-e76d-42b3-95f2-095cb7ebbbc7"
  },
  "imageSetRef": {
    "name": "4.10"
  },
  "ingressVIP": "192.168.123.105",
  "networking": {
    "clusterNetwork": [
      {
        "cidr": "10.128.0.0/14",
        "hostPrefix": 23
      }
    ],
    "machineNetwork": [
      {
        "cidr": "192.168.123.0/24"
      }
    ],
    "serviceNetwork": [
      "172.30.0.0/16"
    ]
  },
  "provisionRequirements": {
    "controlPlaneAgents": 3
  },
  "sshPublicKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzwAz3fnZcrca7mY/kVFpQGS2yI1uGd/+t3PMJn/C7Ppj1uIG32ufHkTq+SXh8Zg3xcy9v/Uome1mo3FP7PoGsWms5B9wzbooGhbA3rdph0/NxSzrHO3qcudcJsBM4GVJhcbFfbkzJVCPZQ94O/Y17oKjKuaBz69clPD29BlzKF4xCWzzbJW5Q8Y9tvWvDpCdVBM7VorpAn3MaA95xL6e15douWwwlhdI4dIOk/+8HcfgJnZGyOeLTnLVpjxQaFzTj3ScEud/5yd5wHcICrHH8Fbq419nN7VWjxbMNWUn182mcCCs0RXx2eyYq27yJvgkJS86n09SyLynX6ySqkFXN"
}



 oc get cd -o json|jq .items[].spec
{
  "baseDomain": "qe.lab.redhat.com",
  "clusterInstallRef": {
    "group": "extensions.hive.openshift.io",
    "kind": "AgentClusterInstall",
    "name": "elvis2",
    "version": "v1beta1"
  },
  "clusterMetadata": {
    "adminKubeconfigSecretRef": {
      "name": "elvis2-admin-kubeconfig"
    },
    "adminPasswordSecretRef": {
      "name": "elvis2-admin-password"
    },
    "clusterID": "dce6b348-e15e-4a1b-8c43-ca326e41efad",
    "infraID": "f76b1183-e76d-42b3-95f2-095cb7ebbbc7"
  },
  "clusterName": "elvis2",
  "controlPlaneConfig": {
    "servingCertificates": {}
  },
  "installed": true,
  "platform": {
    "agentBareMetal": {
      "agentSelector": {
        "matchLabels": {
          "bla": "aaa"
        }
      }
    }
  },
  "pullSecretRef": {
    "name": "pull-secret"
  }
}



oc get infraenv -o json|jq .items[].spec
{
  "clusterRef": {
    "name": "elvis2",
    "namespace": "elvis2"
  },
  "nmStateConfigLabelSelector": {
    "matchLabels": {
      "nmstate_config_cluster_name": "ha-static"
    }
  },
  "pullSecretRef": {
    "name": "pull-secret"
  },
  "sshAuthorizedKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzwAz3fnZcrca7mY/kVFpQGS2yI1uGd/+t3PMJn/C7Ppj1uIG32ufHkTq+SXh8Zg3xcy9v/Uome1mo3FP7PoGsWms5B9wzbooGhbA3rdph0/NxSzrHO3qcudcJsBM4GVJhcbFfbkzJVCPZQ94O/Y17oKjKuaBz69clPD29BlzKF4xCWzzbJW5Q8Y9tvWvDpCdVBM7VorpAn3MaA95xL6e15douWwwlhdI4dIOk/+8HcfgJnZGyOeLTnLVpjxQaFzTj3ScEud/5yd5wHcICrHH8Fbq419nN7VWjxbMNWUn182mcCCs0RXx2eyYq27yJvgkJS86n09SyLynX6ySqkFXN"
}






oc get nmstateconfig -o json|jq .items[].spec
{
  "config": {
    "dns-resolver": {
      "config": {
        "server": [
          "192.168.123.1"
        ]
      }
    },
    "interfaces": [
      {
        "ipv4": {
          "address": [
            {
              "ip": "192.168.123.142",
              "prefix-length": 24
            }
          ],
          "dhcp": false,
          "enabled": true
        },
        "ipv6": {
          "enabled": false
        },
        "name": "eth0",
        "state": "up",
        "type": "ethernet"
      }
    ],
    "routes": {
      "config": [
        {
          "destination": "0.0.0.0/0",
          "next-hop-address": "192.168.123.1",
          "next-hop-interface": "eth0",
          "table-id": 254
        }
      ]
    }
  },
  "interfaces": [
    {
      "macAddress": "52:54:00:f7:d4:d1",
      "name": "eth0"
    }
  ]
}
{
  "config": {
    "dns-resolver": {
      "config": {
        "server": [
          "192.168.123.1"
        ]
      }
    },
    "interfaces": [
      {
        "ipv4": {
          "address": [
            {
              "ip": "192.168.123.143",
              "prefix-length": 24
            }
          ],
          "dhcp": false,
          "enabled": true
        },
        "ipv6": {
          "enabled": false
        },
        "name": "eth0",
        "state": "up",
        "type": "ethernet"
      }
    ],
    "routes": {
      "config": [
        {
          "destination": "0.0.0.0/0",
          "next-hop-address": "192.168.123.1",
          "next-hop-interface": "eth0",
          "table-id": 254
        }
      ]
    }
  },
  "interfaces": [
    {
      "macAddress": "52:54:00:f7:d4:d2",
      "name": "eth0"
    }
  ]
}
{
  "config": {
    "dns-resolver": {
      "config": {
        "server": [
          "192.168.123.1"
        ]
      }
    },
    "interfaces": [
      {
        "ipv4": {
          "address": [
            {
              "ip": "192.168.123.144",
              "prefix-length": 24
            }
          ],
          "dhcp": false,
          "enabled": true
        },
        "ipv6": {
          "enabled": false
        },
        "name": "eth0",
        "state": "up",
        "type": "ethernet"
      }
    ],
    "routes": {
      "config": [
        {
          "destination": "0.0.0.0/0",
          "next-hop-address": "192.168.123.1",
          "next-hop-interface": "eth0",
          "table-id": 254
        }
      ]
    }
  },
  "interfaces": [
    {
      "macAddress": "52:54:00:f7:d4:d3",
      "name": "eth0"
    }
  ]
}

Comment 2 Alexander Chuzhoy 2021-12-02 19:56:54 UTC
[core@master-1-2 ~]$ sudo crictl ps -a
CONTAINER           IMAGE                                                                                                                    CREATED              STATE               NAME                        ATTEMPT             POD ID
42aacf4f29b30       20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                         About a minute ago   Exited              verify-api-int-resolvable   5                   3014109f5f9a5
fc88e276d90e0       20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                         4 minutes ago        Running             keepalived-monitor          0                   98b4211786f3a
33c140ab88267       20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                         4 minutes ago        Running             coredns-monitor             0                   af32fe20d7308
687f6f0cdb075       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e96c1755163ecb2827bf4b4d1dfdabf2a125e6aeef620a0b8ba52d0c450432c   4 minutes ago        Running             keepalived                  0                   98b4211786f3a
8285d01ab5def       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f0c1b89092c1966baa30586089f8698f2768b346717194f925cd80dfd84ed040   4 minutes ago        Running             coredns                     0                   af32fe20d7308
19dfcb9eba661       20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                         4 minutes ago        Exited              render-config-keepalived    0                   98b4211786f3a
3a669c85ef122       20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                         4 minutes ago        Exited              render-config-coredns       0                   af32fe20d7308
[core@master-1-2 ~]$ sudo crictl logs 42aacf4f29b30
Error in configuration: 
* unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory
* unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory
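
The resolution failure itself can be checked directly on the stuck master (a sketch; getent is used here because dig may not be available on RHCOS):

[core@master-1-2 ~]$ cat /etc/resolv.conf
[core@master-1-2 ~]$ getent hosts api-int.elvis2.qe.lab.redhat.com; echo rc=$?

With no api-int record in the external DNS and no node-local nameserver in resolv.conf, the lookup fails (non-zero rc), which matches the constant restarts of verify-api-int-resolvable shown above.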

Comment 3 Ori Amizur 2021-12-05 09:08:53 UTC
The problem doesn't happen in 4.9. When it happens, the resolv.conf file does not contain a local IP nameserver. Since it is the node-local coredns that returns the address for api-int, api-int is not resolvable when that nameserver is missing from resolv.conf.
In addition, if the cluster is installed without an nmstate config, installation completes successfully.
Seems like an nmstate issue.
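
For illustration (a sketch, not captured from the failing node): with the nmstate config above, resolv.conf on the affected masters ends up containing only the upstream server,

nameserver 192.168.123.1

whereas in the working case the node's own IP is prepended as the first nameserver, so the node-local coredns (which serves api-int) is queried first, e.g.

nameserver 192.168.123.142
nameserver 192.168.123.1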

Comment 4 Michael Filanov 2021-12-08 06:33:25 UTC
@phoracek is there anyone who can take a look at this issue?

Comment 5 Petr Horáček 2021-12-08 08:09:19 UTC
The nmstate config here is going through the assisted installer, right? If so, you may want to direct this question to the assisted installer team on #forum-kni-assisted-deployment. There was a similar thread about nmstate, the dispatcher script, and DNS, updated yesterday: https://coreos.slack.com/archives/CUPJTHQ5P/p1638483562237600.

Comment 6 Michael Filanov 2021-12-08 08:51:50 UTC
I am from the assisted-installer team. We use nmstate as-is: we validate the YAML format and just put it on the host.
This is why I'm asking someone from nmstate to take a look.

Comment 7 Petr Horáček 2021-12-08 09:02:54 UTC
Adding @bnemec for the dispatcher and @fge for nmstate. My team only works on kubernetes-nmstate.

Comment 8 Ben Nemec 2021-12-17 22:37:15 UTC
This sounds an awful lot like https://bugzilla.redhat.com/show_bug.cgi?id=2029438. I'm not sure why it has suddenly become a problem now, but we've had multiple reports of api-int resolution failing on the bootstrap that appear to be caused by the same thing.

Comment 9 Michael Filanov 2021-12-29 13:34:41 UTC

*** This bug has been marked as a duplicate of bug 2029438 ***

