Description of problem:
'oc get node' does not return a node whose name is missing the AWS DNS suffix, on a cluster created with the feature gate enabled.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-11-054135

How reproducible:
Always

Steps to Reproduce:
1. Create a dhcp-options-set
liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS dopt-0c9dfcde919f49105 301721915996
DHCPCONFIGURATIONS domain-name-servers
VALUES AmazonProvidedDNS
liuhuali@Lius-MacBook-Pro huali-test %

2. Install a cluster with the feature gate, like this:
./openshift-install create install-config --log-level=debug --dir=cluster1
./openshift-install create manifests --log-level=debug --dir=cluster1
vi cluster1/manifests/manifest_feature_gate.yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
./openshift-install create cluster --log-level=debug --dir=cluster1

3. After installation, verify the cluster is healthy; 'oc get node' returns 6 nodes
liuhuali@Lius-MacBook-Pro huali-test % oc get machines.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws411ccm-ktgjv-master-0 Running m6i.xlarge us-east-2 us-east-2a 59m ip-10-0-142-194.us-east-2.compute.internal aws:///us-east-2a/i-05d96395caa887d8a running
huliu-aws411ccm-ktgjv-master-1 Running m6i.xlarge us-east-2 us-east-2b 59m ip-10-0-188-250.us-east-2.compute.internal aws:///us-east-2b/i-062357b65874125d0 running
huliu-aws411ccm-ktgjv-master-2 Running m6i.xlarge us-east-2 us-east-2c 59m ip-10-0-193-79.us-east-2.compute.internal aws:///us-east-2c/i-0a220248387b666a8 running
huliu-aws411ccm-ktgjv-worker-us-east-2a-68lcj Running m6i.large us-east-2 us-east-2a 55m ip-10-0-137-131.us-east-2.compute.internal aws:///us-east-2a/i-07835e479d27914ea running
huliu-aws411ccm-ktgjv-worker-us-east-2b-wsdr9 Running m6i.large us-east-2 us-east-2b 55m ip-10-0-190-236.us-east-2.compute.internal aws:///us-east-2b/i-0ff467ae0b64f5e97 running
huliu-aws411ccm-ktgjv-worker-us-east-2c-mhf4h Running m6i.large us-east-2 us-east-2c 55m ip-10-0-193-47.us-east-2.compute.internal aws:///us-east-2c/i-0cda097a70aca5373 running
liuhuali@Lius-MacBook-Pro huali-test % oc get machines.machine.openshift.io
NAME PHASE TYPE REGION ZONE AGE
huliu-aws411ccm-ktgjv-master-0 Running m6i.xlarge us-east-2 us-east-2a 60m
huliu-aws411ccm-ktgjv-master-1 Running m6i.xlarge us-east-2 us-east-2b 60m
huliu-aws411ccm-ktgjv-master-2 Running m6i.xlarge us-east-2 us-east-2c 60m
huliu-aws411ccm-ktgjv-worker-us-east-2a-68lcj Running m6i.large us-east-2 us-east-2a 56m
huliu-aws411ccm-ktgjv-worker-us-east-2b-wsdr9 Running m6i.large us-east-2 us-east-2b 56m
huliu-aws411ccm-ktgjv-worker-us-east-2c-mhf4h Running m6i.large us-east-2 us-east-2c 56m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-137-131.us-east-2.compute.internal Ready worker 54m v1.23.3+69213f8
ip-10-0-142-194.us-east-2.compute.internal Ready master 59m v1.23.3+69213f8
ip-10-0-188-250.us-east-2.compute.internal Ready master 60m v1.23.3+69213f8
ip-10-0-190-236.us-east-2.compute.internal Ready worker 54m v1.23.3+69213f8
ip-10-0-193-47.us-east-2.compute.internal Ready worker 54m v1.23.3+69213f8
ip-10-0-193-79.us-east-2.compute.internal Ready master 59m v1.23.3+69213f8

4. Swap the dhcp-options-set for the VPC with the one above (see the CLI sketch at the end of this comment)

5. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
liuhuali@Lius-MacBook-Pro huali-test % oc delete machines.machine.openshift.io huliu-aws411ccm-ktgjv-worker-us-east-2c-mhf4h
machine.machine.openshift.io "huliu-aws411ccm-ktgjv-worker-us-east-2c-mhf4h" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-137-131.us-east-2.compute.internal Ready worker 63m v1.23.3+69213f8
ip-10-0-142-194.us-east-2.compute.internal Ready master 68m v1.23.3+69213f8
ip-10-0-188-250.us-east-2.compute.internal Ready master 69m v1.23.3+69213f8
ip-10-0-190-236.us-east-2.compute.internal Ready worker 63m v1.23.3+69213f8
ip-10-0-193-79.us-east-2.compute.internal Ready master 68m v1.23.3+69213f8
liuhuali@Lius-MacBook-Pro huali-test % oc get machines.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws411ccm-ktgjv-master-0 Running m6i.xlarge us-east-2 us-east-2a 70m ip-10-0-142-194.us-east-2.compute.internal aws:///us-east-2a/i-05d96395caa887d8a running
huliu-aws411ccm-ktgjv-master-1 Running m6i.xlarge us-east-2 us-east-2b 70m ip-10-0-188-250.us-east-2.compute.internal aws:///us-east-2b/i-062357b65874125d0 running
huliu-aws411ccm-ktgjv-master-2 Running m6i.xlarge us-east-2 us-east-2c 70m ip-10-0-193-79.us-east-2.compute.internal aws:///us-east-2c/i-0a220248387b666a8 running
huliu-aws411ccm-ktgjv-worker-us-east-2a-68lcj Running m6i.large us-east-2 us-east-2a 66m ip-10-0-137-131.us-east-2.compute.internal aws:///us-east-2a/i-07835e479d27914ea running
huliu-aws411ccm-ktgjv-worker-us-east-2b-wsdr9 Running m6i.large us-east-2 us-east-2b 66m ip-10-0-190-236.us-east-2.compute.internal aws:///us-east-2b/i-0ff467ae0b64f5e97 running
huliu-aws411ccm-ktgjv-worker-us-east-2c-58cml Running m6i.large us-east-2 us-east-2c 8m44s ip-10-0-200-145 aws:///us-east-2c/i-00c3c1b8ac9e27704 running
liuhuali@Lius-MacBook-Pro huali-test %
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME READY STATUS RESTARTS AGE
machine-approver-5955745c76-5z6rq 2/2 Running 0 73m
machine-approver-capi-687b57b66d-lpv2q 2/2 Running 0 73m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-5955745c76-5z6rq -c machine-approver-controller
…
I0512 02:23:42.526825 1 controller.go:121] Reconciling CSR: csr-9n978
I0512 02:23:42.545659 1 csr_check.go:157] csr-9n978: CSR does not appear to be client csr
I0512 02:23:42.552248 1 csr_check.go:545] retrieving serving cert from ip-10-0-200-145 (10.0.200.145:10250)
I0512 02:23:42.553087 1 csr_check.go:182] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 02:23:42.553099 1 csr_check.go:202] Falling back to machine-api authorization for ip-10-0-200-145
I0512 02:23:42.558665 1 controller.go:240] CSR csr-9n978 approved

Actual results:
'oc get machines.machine.openshift.io -o wide' shows that the newly created node (ip-10-0-200-145) is missing the AWS DNS suffix; 'oc get node' returns only 5 nodes and is missing the newly created one.
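For step 4, the swap can be done with the AWS CLI; a minimal sketch, assuming the dhcp-options-set created in step 1 and a placeholder for the cluster's VPC ID:

```
# hedged sketch for step 4: associate the custom DHCP options set with the cluster's VPC
# <vpc-id> is a placeholder for the VPC created by the installer
aws ec2 associate-dhcp-options --dhcp-options-id dopt-0c9dfcde919f49105 --vpc-id <vpc-id>
```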
Expected results:
'oc get machines.machine.openshift.io -o wide' should show all nodes with the AWS DNS suffix; 'oc get node' should return 6 nodes

Additional info:
Seems related to https://bugzilla.redhat.com/show_bug.cgi?id=2072195

Some other cases:

Case1: Repeat the above steps but change step 4 to 'Swap the dhcp-options-set for the VPC with one with domain-name'.
'oc get node' does return the newly created node.
liuhuali@Lius-MacBook-Pro huali-test % oc delete machines.machine.openshift.io huliu-aws411ccm-ktgjv-worker-us-east-2a-68lcj
machine.machine.openshift.io "huliu-aws411ccm-ktgjv-worker-us-east-2a-68lcj" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machines.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws411ccm-ktgjv-master-0 Running m6i.xlarge us-east-2 us-east-2a 123m ip-10-0-142-194.us-east-2.compute.internal aws:///us-east-2a/i-05d96395caa887d8a running
huliu-aws411ccm-ktgjv-master-1 Running m6i.xlarge us-east-2 us-east-2b 123m ip-10-0-188-250.us-east-2.compute.internal aws:///us-east-2b/i-062357b65874125d0 running
huliu-aws411ccm-ktgjv-master-2 Running m6i.xlarge us-east-2 us-east-2c 123m ip-10-0-193-79.us-east-2.compute.internal aws:///us-east-2c/i-0a220248387b666a8 running
huliu-aws411ccm-ktgjv-worker-us-east-2a-q6gwt Running m6i.large us-east-2 us-east-2a 11m ip-10-0-128-73.us-east-2.compute.internal aws:///us-east-2a/i-079457d03825b1a8e running
huliu-aws411ccm-ktgjv-worker-us-east-2b-wsdr9 Running m6i.large us-east-2 us-east-2b 119m ip-10-0-190-236.us-east-2.compute.internal aws:///us-east-2b/i-0ff467ae0b64f5e97 running
huliu-aws411ccm-ktgjv-worker-us-east-2c-58cml Running m6i.large us-east-2 us-east-2c 61m ip-10-0-200-145 aws:///us-east-2c/i-00c3c1b8ac9e27704 running
liuhuali@Lius-MacBook-Pro huali-test %
liuhuali@Lius-MacBook-Pro huali-test %
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-128-73.us-east-2.compute.internal Ready worker 8m25s v1.23.3+69213f8
ip-10-0-142-194.us-east-2.compute.internal Ready master 122m v1.23.3+69213f8
ip-10-0-188-250.us-east-2.compute.internal Ready master 123m v1.23.3+69213f8
ip-10-0-190-236.us-east-2.compute.internal Ready worker 117m v1.23.3+69213f8
ip-10-0-193-79.us-east-2.compute.internal Ready master 122m v1.23.3+69213f8

Case2: Repeat the above steps but change step 2 to 'install a cluster without the feature gate'.
'oc get node' does return the newly created node; 'oc get machine -o wide' shows all nodes with the AWS DNS suffix.
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws411org-n9znk-worker-us-east-2c-g6png
machine.machine.openshift.io "huliu-aws411org-n9znk-worker-us-east-2c-g6png" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-143-96.us-east-2.compute.internal Ready worker 64m v1.23.3+69213f8
ip-10-0-158-115.us-east-2.compute.internal Ready master 69m v1.23.3+69213f8
ip-10-0-161-97.us-east-2.compute.internal Ready worker 64m v1.23.3+69213f8
ip-10-0-183-83.us-east-2.compute.internal Ready master 67m v1.23.3+69213f8
ip-10-0-207-171.us-east-2.compute.internal Ready master 68m v1.23.3+69213f8
ip-10-0-211-24.us-east-2.compute.internal Ready worker 4m28s v1.23.3+69213f8
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws411org-n9znk-master-0 Running m6i.xlarge us-east-2 us-east-2a 69m ip-10-0-158-115.us-east-2.compute.internal aws:///us-east-2a/i-015848c984c27f208 running
huliu-aws411org-n9znk-master-1 Running m6i.xlarge us-east-2 us-east-2b 69m ip-10-0-183-83.us-east-2.compute.internal aws:///us-east-2b/i-05d5e5f3928e1f0cc running
huliu-aws411org-n9znk-master-2 Running m6i.xlarge us-east-2 us-east-2c 69m ip-10-0-207-171.us-east-2.compute.internal aws:///us-east-2c/i-0b3e2d804b47bb401 running
huliu-aws411org-n9znk-worker-us-east-2a-6595z Running m6i.large us-east-2 us-east-2a 66m ip-10-0-143-96.us-east-2.compute.internal aws:///us-east-2a/i-0caef8be0317db87c running
huliu-aws411org-n9znk-worker-us-east-2b-nnprl Running m6i.large us-east-2 us-east-2b 66m ip-10-0-161-97.us-east-2.compute.internal aws:///us-east-2b/i-0315216923c19c195 running
huliu-aws411org-n9znk-worker-us-east-2c-kfpwh Running m6i.large us-east-2 us-east-2c 8m23s ip-10-0-211-24.us-east-2.compute.internal aws:///us-east-2c/i-09981802d2b381bdf running

Then enable the feature gate:
liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate cluster
featuregate.config.openshift.io/cluster edited
After waiting more than four hours, the node is still NotReady:
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-143-96.us-east-2.compute.internal Ready worker 6h11m v1.23.3+69213f8
ip-10-0-158-115.us-east-2.compute.internal Ready master 6h16m v1.23.3+69213f8
ip-10-0-161-97.us-east-2.compute.internal Ready worker 6h11m v1.23.3+69213f8
ip-10-0-183-83.us-east-2.compute.internal Ready master 6h14m v1.23.3+69213f8
ip-10-0-207-171.us-east-2.compute.internal Ready master 6h15m v1.23.3+69213f8
ip-10-0-211-24.us-east-2.compute.internal NotReady,SchedulingDisabled worker 5h12m v1.23.3+69213f8
liuhuali@Lius-MacBook-Pro huali-test % oc get machines.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws411org-n9znk-master-0 Running m6i.xlarge us-east-2 us-east-2a 6h20m ip-10-0-158-115.us-east-2.compute.internal aws:///us-east-2a/i-015848c984c27f208 running
huliu-aws411org-n9znk-master-1 Running m6i.xlarge us-east-2 us-east-2b 6h20m ip-10-0-183-83.us-east-2.compute.internal aws:///us-east-2b/i-05d5e5f3928e1f0cc running
huliu-aws411org-n9znk-master-2 Running m6i.xlarge us-east-2 us-east-2c 6h20m ip-10-0-207-171.us-east-2.compute.internal aws:///us-east-2c/i-0b3e2d804b47bb401 running
huliu-aws411org-n9znk-worker-us-east-2a-6595z Running m6i.large us-east-2 us-east-2a 6h17m ip-10-0-143-96.us-east-2.compute.internal aws:///us-east-2a/i-0caef8be0317db87c running
huliu-aws411org-n9znk-worker-us-east-2b-nnprl Running m6i.large us-east-2 us-east-2b 6h17m ip-10-0-161-97.us-east-2.compute.internal aws:///us-east-2b/i-0315216923c19c195 running
huliu-aws411org-n9znk-worker-us-east-2c-kfpwh Running m6i.large us-east-2 us-east-2c 5h19m ip-10-0-211-24.us-east-2.compute.internal aws:///us-east-2c/i-09981802d2b381bdf running
Must Gather - https://drive.google.com/file/d/1rmd--OqfVQRUODuVKzlT1IJyXDvdO-Hi/view?usp=sharing
Repeating the same steps on cluster version 4.9.25 gives the result below, for reference.
liuhuali@Lius-MacBook-Pro huali-test % oc delete machines.machine.openshift.io huliu-aws49ccm-hcfg7-worker-us-east-2b-mvkvn
machine.machine.openshift.io "huliu-aws49ccm-hcfg7-worker-us-east-2b-mvkvn" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-139-217.us-east-2.compute.internal Ready master 135m v1.22.5+5c84e52
ip-10-0-143-244.us-east-2.compute.internal Ready worker 130m v1.22.5+5c84e52
ip-10-0-164-233.us-east-2.compute.internal Ready master 138m v1.22.5+5c84e52
ip-10-0-201-170.us-east-2.compute.internal Ready master 138m v1.22.5+5c84e52
ip-10-0-208-249.us-east-2.compute.internal Ready worker 130m v1.22.5+5c84e52
liuhuali@Lius-MacBook-Pro huali-test % oc get machines.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws49ccm-hcfg7-master-0 Running m5.xlarge us-east-2 us-east-2a 138m ip-10-0-139-217.us-east-2.compute.internal aws:///us-east-2a/i-041caf20f5ef1d6d8 running
huliu-aws49ccm-hcfg7-master-1 Running m5.xlarge us-east-2 us-east-2b 138m ip-10-0-164-233.us-east-2.compute.internal aws:///us-east-2b/i-020bb645e5b9aca85 running
huliu-aws49ccm-hcfg7-master-2 Running m5.xlarge us-east-2 us-east-2c 138m ip-10-0-201-170.us-east-2.compute.internal aws:///us-east-2c/i-07f068321f91c0613 running
huliu-aws49ccm-hcfg7-worker-us-east-2a-7rl8z Running m5.large us-east-2 us-east-2a 134m ip-10-0-143-244.us-east-2.compute.internal aws:///us-east-2a/i-05ab8f42d3bf587d2 running
huliu-aws49ccm-hcfg7-worker-us-east-2b-67q4h Provisioned m5.large us-east-2 us-east-2b 24m aws:///us-east-2b/i-002b452c26e1bdbd7 running
huliu-aws49ccm-hcfg7-worker-us-east-2c-8gwfh Running m5.large us-east-2 us-east-2c 134m ip-10-0-208-249.us-east-2.compute.internal aws:///us-east-2c/i-0791493e38c3bdb89 running
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME READY STATUS RESTARTS AGE
machine-approver-6f4f5f79bc-6cgrg 2/2 Running 0 146m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-6f4f5f79bc-6cgrg -c machine-approver-controller
...
I0512 11:11:57.594066 1 controller.go:114] Reconciling CSR: csr-74ttz
E0512 11:11:57.610904 1 csr_check.go:257] csr-74ttz: failed to find machine for node ip-10-0-175-75, cannot approve
I0512 11:11:57.610942 1 controller.go:199] csr-74ttz: CSR not authorized
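A hedged way to inspect this kind of mismatch (the CSR and machine names are simply the ones from this run; the machine-approver compares the node name requested in the CSR against the addresses recorded on the corresponding Machine):

```
# list CSRs that are stuck pending and inspect the rejected one
oc get csr | grep Pending
oc describe csr csr-74ttz

# check which hostnames/DNS names the Machine has recorded in its status;
# a node name that matches none of these cannot be authorized by the machine-approver
oc -n openshift-machine-api get machine huliu-aws49ccm-hcfg7-worker-us-east-2b-67q4h \
  -o jsonpath='{.status.addresses}{"\n"}'
```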
So I had a deep dive into this; here is what I found.

The issue manifests _only_ in TechPreview clusters which have the VPC configured with a custom DHCP option set that has an empty `domain-name` field.

So first I started a TechPreview cluster from the latest nightly `4.11.0-0.nightly-2022-05-11-054135`, following the reproduction steps described here: https://bugzilla.redhat.com/show_bug.cgi?id=2084450#c0

Once I had swapped the DHCP option set with the custom, empty one (described at step [4] of the reproduction sequence), I deleted a random worker machine. When the replacement machine came up, it followed the correct Provisioning -> Provisioned -> Running state path.

```
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
ddonati-test101-qwv69-master-0 Running m6i.xlarge eu-west-2 eu-west-2a 5h32m ip-10-0-134-185.eu-west-2.compute.internal aws:///eu-west-2a/i-0ce2502bfe9ac9aac running
ddonati-test101-qwv69-master-1 Running m6i.xlarge eu-west-2 eu-west-2b 5h32m ip-10-0-191-104.eu-west-2.compute.internal aws:///eu-west-2b/i-07c88436bc3b33dc8 running
ddonati-test101-qwv69-master-2 Running m6i.xlarge eu-west-2 eu-west-2c 5h32m ip-10-0-208-251.eu-west-2.compute.internal aws:///eu-west-2c/i-04690ad386afae09e running
ddonati-test101-qwv69-worker-eu-west-2a-6b85z Running m6i.large eu-west-2 eu-west-2a 5h28m ip-10-0-159-197.eu-west-2.compute.internal aws:///eu-west-2a/i-021381afd17f4ad8d running
ddonati-test101-qwv69-worker-eu-west-2b-nhpnk Running m6i.large eu-west-2 eu-west-2b 5h28m ip-10-0-189-50.eu-west-2.compute.internal aws:///eu-west-2b/i-027935c1104797421 running
ddonati-test101-qwv69-worker-eu-west-2c-rptjk Running m6i.large eu-west-2 eu-west-2c 4m20s ip-10-0-192-204 aws:///eu-west-2c/i-0beb8c11e90ace35c running <-- NEW
```

The CSRs for serving and client certs were correctly created and both were approved.
```
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-csdxx 51s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Approved,Issued
csr-bgszn 45s kubernetes.io/kubelet-serving system:node:ip-10-0-192-204 <none> Approved,Issued
```

Detail of the Cert Request for the kubelet serving certificate:

```
[~] $ openssl req -in <(oc get csr csr-bgszn -o json | jq -r '.spec.request' | base64 --decode) -noout -text
Certificate Request:
    Data:
        Version: 0 (0x0)
        Subject: O=system:nodes, CN=system:node:ip-10-0-192-204
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:0d:ad:33:27:cd:b0:2d:0b:24:94:31:e8:3d:54:
                    39:d3:c6:8e:7e:8e:98:38:a1:51:f5:07:dc:60:d4:
                    f0:bb:79:95:3b:d7:80:89:61:52:51:bd:7d:e3:73:
                    ff:ab:57:dc:cc:f9:74:40:7c:e9:f9:4d:b8:c1:4f:
                    75:7f:ec:70:83
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        Attributes:
        Requested Extensions:
            X509v3 Subject Alternative Name:
                DNS:ip-10-0-192-204, IP Address:10.0.192.204
    Signature Algorithm: ecdsa-with-SHA256
         30:46:02:21:00:e7:07:65:d7:93:a0:cb:17:f9:87:8c:49:62:
         57:dc:aa:42:b9:73:fc:08:0c:c1:87:fb:9a:ae:99:a7:02:37:
         0c:02:21:00:8d:a9:f8:16:01:2f:68:87:ca:c2:f0:23:f9:87:
         11:18:09:ae:a9:79:4e:03:4d:4b:42:f3:c3:7c:79:fa:3e:d4
```

The corresponding Kubernetes Node object, however, was never created for the new machine:

```
NAME STATUS ROLES AGE VERSION INTERNAL-IP
node/ip-10-0-134-185.eu-west-2.compute.internal Ready master 5h37m v1.23.3+69213f8 10.0.134.185
node/ip-10-0-159-197.eu-west-2.compute.internal Ready worker 5h30m v1.23.3+69213f8 10.0.159.197
node/ip-10-0-189-50.eu-west-2.compute.internal Ready worker 5h25m v1.23.3+69213f8 10.0.189.50
node/ip-10-0-191-104.eu-west-2.compute.internal Ready master 5h37m v1.23.3+69213f8 10.0.191.104
node/ip-10-0-208-251.eu-west-2.compute.internal Ready master 5h37m v1.23.3+69213f8 10.0.208.251
```

So first I went looking into the kubelet logs:

```
May 17 13:43:03 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:03.565802 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:03 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:03.666704 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:03 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:03.767604 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:03 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:03.811619 1432 eviction_manager.go:254] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-10-0-192-204\" not found"
May 17 13:43:03 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:03.867852 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:03 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:03.968457 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.069138 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.169819 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.269901 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.370412 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.448190 1432 kubelet.go:2408] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.471383 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.571786 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.625232 1432 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="nodes \"ip-10-0-192-204\" not found" node="ip-10-0-192-204"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.672592 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.772659 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
May 17 13:43:04 ip-10-0-192-204 hyperkube[1432]: E0513 13:43:04.873089 1432 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
```

The kubelet was constantly throwing an "Error getting node".

I then went looking into the kube-apiserver logs, which were throwing:

```
I0517 13:55:02.593962 16 node_authorizer.go:203] "NODE DENY" err="unknown node 'ip-10-0-192-204' cannot get secret openshift-dns/node-resolver-dockercfg-v5rvg"
I0517 13:55:02.594130 16 node_authorizer.go:203] "NODE DENY" err="unknown node 'ip-10-0-192-204' cannot get configmap openshift-dns/openshift-service-ca.crt"
```

I then examined a non-TechPreview cluster with the same custom DHCP Option Set with empty `domain-name`, and observed that the CSR requests from the kubelet of the new machine were slightly different from the ones in a TechPreview cluster. They in fact had a different Common Name (CN) for the Certificate Request:

```
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-skdvb 23m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Approved,Issued
csr-7p4hg 23m kubernetes.io/kubelet-serving system:node:ip-10-0-202-210.ec2.internal <none> Approved,Issued
```

```
$ openssl req -in <(oc get csr csr-7p4hg -o json | jq -r '.spec.request' | base64 --decode) -noout -text
Certificate Request:
    Data:
        Version: 0 (0x0)
        Subject: O=system:nodes, CN=system:node:ip-10-0-202-210.ec2.internal
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:16:90:22:fe:c4:4e:8f:01:13:db:64:ee:eb:5d:
                    a5:13:26:9c:d4:7c:22:06:05:30:3a:6c:ac:0c:03:
                    57:80:52:8d:3f:17:fa:26:4f:f5:39:ba:ef:7a:da:
                    2f:e6:bb:d6:f0:25:37:2b:d6:9a:47:ee:5e:9c:94:
                    94:db:ce:26:53
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        Attributes:
        Requested Extensions:
            X509v3 Subject Alternative Name:
                DNS:ip-10-0-202-210.ec2.internal, IP Address:10.0.202.210
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:21:00:9c:74:4a:d9:07:b9:a7:c3:40:a0:af:6b:77:
         fd:0e:09:66:94:02:4d:7f:cc:97:82:f1:48:43:e9:b7:98:f2:
         0f:02:20:72:65:a8:21:64:be:cc:e7:23:18:60:6f:ca:fd:b2:
         0b:63:e4:69:b2:38:fb:ec:be:ab:e7:56:c0:33:c2:cc:f1
```

I thus took a dive into the kubelet code to understand what was setting the Common Name (CN) in the Certificate Request, and with what value.
The corresponding line https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/pkg/kubelet/certificate/kubelet.go#L228 showed that the `nodeName` variable is what sets the Common Name.

Following on from there, `nodeName` is defined here: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L1129-L1137 and is then passed down through various hops to the kubelet certificate logic mentioned previously.

At that specific line the `nodeName` value is set to the value returned by the `getNodeName()` invocation, which in turn:
- checks if `cloud cloudprovider.Interface` is `nil`. If nil, it falls back to the `hostname` var value.
- if it's not nil, it fetches the Instances interface and calls `instance.CurrentNodeName()` on it

ref: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L1009-L1029

To understand which path the code execution takes for these clusters (the TechPreview and the non-TechPreview ones) we need to trace back to where the cloud interface is set.
- it starts off by being nil: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L424
- then it is only set if the cloud provider is NOT `external`: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L578-L588

To find the cloud provider value set for these clusters we can check the kubelet args in journalctl.

1) For the non-TechPreview cluster the cloud provider is set to:

```
May 17 13:47:56 ip-10-0-201-35 hyperkube[1428]: I0513 13:47:56.747144 1428 flags.go:64] FLAG: --cloud-provider="aws" # <-- the legacy in-tree aws provider
```

So cloud will NOT be `nil`, and we can follow the `instance.CurrentNodeName()` logic from here to understand what value it will be set to: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L952-L955
which after various hops hits: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L1927
and ends up being set to `instance.PrivateDnsName`: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L4925-L4928

This value turns out to be an FQDN, `ip-10-0-145-35.eu-west-2.compute.internal`, and ends up populating `nodeName` and the Common Name (CN) of the Certificate Request.

2) For the TechPreview cluster the cloud provider is set to:

```
May 17 13:55:58 ip-10-0-201-63 hyperkube[1432]: I0513 13:55:58.328858 1432 flags.go:64] FLAG: --cloud-provider="external" # <-- external cloud provider (out of tree)
```

In this case the cloud provider will be `external`, so `kubeDeps.Cloud` will stay nil here: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L578-L588
Which means at `getNodeName()` the code path taken will be this: https://github.com/openshift/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L1012-L1014
And `nodeName` will be set to whatever the `hostname` variable holds.
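A quick way to see the two candidate values these two code paths produce on an affected instance is to compare its kernel hostname with its EC2 `PrivateDnsName`; a hedged sketch (the instance ID is a placeholder, and `hostname` would be run on the instance itself, e.g. over SSH or the serial console):

```
# value the external-provider path falls back to (run on the instance)
hostname                                     # e.g. ip-10-0-192-204

# value the legacy in-tree AWS provider would have used (run anywhere with AWS credentials)
aws ec2 describe-instances --instance-ids <instance-id> \
  --query 'Reservations[].Instances[].PrivateDnsName' --output text   # the FQDN form, e.g. ip-10-0-192-204.ec2.internal
```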
The `hostname` is set here by `getHostname`: https://github.com/openshift/machine-config-operator/blob/ae016596f42e39209f2f87ba34e5d2dc674e31e2/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L42
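A minimal shell sketch of the idea behind that verification change, assuming the AWS instance metadata service is reachable from the node and that `dig` is available (the script in the linked commit may differ in detail):

```
# hedged sketch: derive an FQDN node name via a reverse (PTR) lookup on the instance's IP,
# falling back to the IMDS-provided hostname if the lookup returns nothing
IP="$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)"
FQDN="$(dig +short -x "${IP}" | sed 's/\.$//')"
echo "KUBELET_NODE_NAME=${FQDN:-${AFTERBURN_AWS_HOSTNAME}}"
```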
> So first I went looking into the kubelet logs:

Was there anything in the logs about it trying to create the Node object? Do you happen to have a dump of the journal? I would like to know what happened to the Node object. Kubelet should have tried to create one, and AFAIK should not have issued the serving cert CSR until/unless the Node existed. I would be surprised if the Node Authorizer didn't deny the request to create the CSR if the Node object didn't exist; does that mean it did exist briefly and was then removed?

While having the FQDN is good for consistency between in-tree and out-of-tree, since the CSRs were approved it doesn't seem that this is necessarily the issue. I don't think we have the root cause here.

> Why is this happening only for TP vs non-TP clusters in this custom, domain-name empty, DHCP Option set scenario?
> Because in TP vs non-TP clusters, with default DHCP Option set the AWS instance Hostname and PrivateDnsName both equal to the same FQDN.

I suspect we need to look at the Node and NodeLifecycle controllers in the AWS CCM. I wonder if somehow the AWS CCM isn't finding the instance and is therefore deleting the Node shortly after creation. How certain are you that it never existed? We should be able to work it out from a combination of kubelet or kube API audit logs.
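A hedged sketch of how that audit trail could be pulled, assuming the default OpenShift audit log location on the control-plane nodes (the node name is just the one from this reproduction):

```
# search the kube-apiserver audit logs on the masters for create/delete events on the Node object
oc adm node-logs --role=master --path=kube-apiserver/audit.log \
  | grep '"resource":"nodes"' \
  | grep 'ip-10-0-192-204' \
  | grep -E '"verb":"(create|delete)"'
```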
After further digging, it turns out that what Joel was hinting at is what's actually happening.

The node is registered by its kubelet:

```
May 17 13:53:59 ip-10-0-144-157 hyperkube[1437]: I0517 13:53:59.635278 1437 kubelet_node_status.go:75] "Successfully registered node" node="ip-10-0-144-157"
```

The Node object is then stored in etcd and briefly appears among the cluster nodes.

```
$ kubectl get nodes -w
...
ip-10-0-144-157 NotReady worker 0s v1.23.3+69213f8
ip-10-0-144-157 NotReady worker 0s v1.23.3+69213f8
ip-10-0-144-157 NotReady worker 0s v1.23.3+69213f8
ip-10-0-144-157 NotReady worker 0s v1.23.3+69213f8
```

But it then disappears within the next few seconds, and the kubelet can't find it anymore either:

```
...
May 17 13:54:02 ip-10-0-192-204 hyperkube[1437]: E0517 13:54:02.015659 1437 kubelet.go:2483] "Error getting node" err="node \"ip-10-0-192-204\" not found"
```

Looking at the node controller logs in the AWS CCM, it turns out the controller is not able to find the instance backing the node, and thus the node is deleted:

```
E0518 11:37:33.302686 1 node_controller.go:213] error syncing 'ip-10-0-144-157': failed to get provider ID for node ip-10-0-144-157 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0518 11:37:33.197748 1 aws.go:5212] Unable to convert node name "ip-10-0-144-157" to aws instanceID, fall back to findInstanceByNodeName: node has no providerID
```

This is because of how the nodeName is computed in the kubelet vs. the assumptions made in the AWS CCM.

The kubelet computes the nodeName by invoking `getNodeName()`: https://github.com/kubernetes/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L1129-L1137 which in turn behaves differently depending on whether an in-tree or an external provider is used: https://github.com/kubernetes/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L1009-L1029

In more detail, when `--cloud-provider=external` is set on the kubelet, `cloud` will be `nil` and the hostname will be used as the value for `nodeName`: https://github.com/kubernetes/kubernetes/blob/69213f85dee380a180f1c4f141f482a8c7cbcac3/cmd/kubelet/app/server.go#L1012-L1014

The AWS cloud provider, when syncing the Node in the node-controller, tries to find the instance backing the node by describing all instances and filtering for the one with `private-dns-name` matching the nodeName (which in this case is the hostname): https://github.com/kubernetes/cloud-provider-aws/blob/30a02a65f107eda576f644682efa1b11cf2c351b/pkg/providers/v1/aws.go#L5181-L5186

This works when the hostname has the same value as the `private-dns-name`, but doesn't in cases where they differ. For example, when a node is created with the custom DHCP Option Set previously described, the `hostname` will be of the form `ip-10-0-144-157`, as opposed to its `private-dns-name`, which will be of the form `ip-10-0-144-157.ec2.internal`.

I've created an issue upstream on the AWS CCM to track this bug: https://github.com/kubernetes/cloud-provider-aws/issues/384
I'll keep posting updates on it.
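The mismatch can be reproduced by hand with a hedged sketch of the same kind of lookup the CCM performs (the node name is the one from this run; the short hostname matches nothing, while the FQDN does):

```
# CCM-style lookup: filter instances by private-dns-name equal to the node name
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-10-0-144-157" \
  --query 'Reservations[].Instances[].InstanceId' --output text      # -> empty, no match

aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-10-0-144-157.ec2.internal" \
  --query 'Reservations[].Instances[].InstanceId' --output text      # -> the backing instance ID
```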
I tried to verify the issue on 4.11.0-0.nightly-2022-06-04-014713, but the issue still exists; 'oc get node' does not return the newly created node. ddonati, can you please take a look? Thanks!

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-04-014713 True False 4h36m Cluster version is 4.11.0-0.nightly-2022-06-04-014713
liuhuali@Lius-MacBook-Pro huali-test % oc get machine.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws115-cv2gj-master-0 Running m6i.xlarge us-east-2 us-east-2a 4h53m ip-10-0-146-141.us-east-2.compute.internal aws:///us-east-2a/i-0be13a66c3f43b557 running
huliu-aws115-cv2gj-master-1 Running m6i.xlarge us-east-2 us-east-2b 4h53m ip-10-0-179-175.us-east-2.compute.internal aws:///us-east-2b/i-0b9fd3f7c2fe20340 running
huliu-aws115-cv2gj-master-2 Running m6i.xlarge us-east-2 us-east-2c 4h53m ip-10-0-192-54.us-east-2.compute.internal aws:///us-east-2c/i-0c7ae2b72736bf731 running
huliu-aws115-cv2gj-worker-us-east-2a-pfrwg Running m6i.xlarge us-east-2 us-east-2a 4h51m ip-10-0-129-73.us-east-2.compute.internal aws:///us-east-2a/i-034fe202d6051b02b running
huliu-aws115-cv2gj-worker-us-east-2b-bwjs8 Running m6i.xlarge us-east-2 us-east-2b 95m ip-10-0-175-194 aws:///us-east-2b/i-0dadbec158cb84111 running
huliu-aws115-cv2gj-worker-us-east-2c-9d8wd Running m6i.xlarge us-east-2 us-east-2c 4h51m ip-10-0-221-82.us-east-2.compute.internal aws:///us-east-2c/i-07ea4f1df3a1acf30 running
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-129-73.us-east-2.compute.internal Ready worker 4h47m v1.24.0+bb9c2f1
ip-10-0-146-141.us-east-2.compute.internal Ready master 4h54m v1.24.0+bb9c2f1
ip-10-0-179-175.us-east-2.compute.internal Ready master 4h54m v1.24.0+bb9c2f1
ip-10-0-192-54.us-east-2.compute.internal Ready master 4h54m v1.24.0+bb9c2f1
ip-10-0-221-82.us-east-2.compute.internal Ready worker 4h47m v1.24.0+bb9c2f1

Must Gather - https://drive.google.com/file/d/1h8QwnLGeu48NBt5wN6yaO2A5KBmVW52x/view?usp=sharing
Hey @huliu, it is expected that `4.11.0-0.nightly-2022-06-04-014713` still has the issue. Here is why:

1 - A fix for this issue was merged via https://github.com/openshift/machine-config-operator/pull/3162 on 2022-05-30 16:13:00 UTC.
2 - The fix merged at 1. was later reverted via https://github.com/openshift/machine-config-operator/pull/3175 as it caused issues in CI, and no accepted nightly was produced that included 1.
3 - A newer version of 1. was proposed (reworked with a slight change to avoid the CI issues described at 2.) via https://github.com/openshift/machine-config-operator/pull/3170. The PR was merged on 2022-06-07 20:38:00 UTC.

For this reason `4.11.0-0.nightly-2022-06-04-014713` didn't contain any fix for this BZ issue. The upcoming accepted nightly should have the fix introduced by https://github.com/openshift/machine-config-operator/pull/3170
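A hedged way to confirm whether a given nightly actually contains the machine-config-operator commit for that PR (the release pullspec is a placeholder for whichever nightly is being checked):

```
# show which commit each component of the release image was built from, and pick out the MCO
oc adm release info <release-pullspec> --commits | grep machine-config-operator
```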
@ddonati Thanks for your findings! When the upcoming nightly build is ready, I'll verify again. Thanks!
Checked on 4.11.0-0.nightly-2022-06-11-054027; the common case is solved, and 'oc get node' returns the newly created node.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-11-054027 True False 138m Cluster version is 4.11.0-0.nightly-2022-06-11-054027
liuhuali@Lius-MacBook-Pro huali-test % oc get machine.machine.openshift.io -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
huliu-aws134-4dh6t-master-0 Running m6i.xlarge us-east-2 us-east-2a 3h25m ip-10-0-158-25.us-east-2.compute.internal aws:///us-east-2a/i-0c0f6e27611b79529 running
huliu-aws134-4dh6t-master-1 Running m6i.xlarge us-east-2 us-east-2b 3h25m ip-10-0-181-251.us-east-2.compute.internal aws:///us-east-2b/i-01b54ebd9e0bb2f21 running
huliu-aws134-4dh6t-master-2 Running m6i.xlarge us-east-2 us-east-2c 3h25m ip-10-0-206-180.us-east-2.compute.internal aws:///us-east-2c/i-0554223996c0a030f running
huliu-aws134-4dh6t-worker-us-east-2a-rbpr9 Running m6i.xlarge us-east-2 us-east-2a 113m ip-10-0-159-163 aws:///us-east-2a/i-00babecef0457aee5 running
huliu-aws134-4dh6t-worker-us-east-2b-8xxws Running m6i.xlarge us-east-2 us-east-2b 3h19m ip-10-0-180-54.us-east-2.compute.internal aws:///us-east-2b/i-042118ce41ddd4a96 running
huliu-aws134-4dh6t-worker-us-east-2c-zmlxx Running m6i.xlarge us-east-2 us-east-2c 3h19m ip-10-0-210-28.us-east-2.compute.internal aws:///us-east-2c/i-06b3b94751f25a880 running
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-158-25.us-east-2.compute.internal Ready master 3h24m v1.24.0+cb71478
ip-10-0-159-163 Ready worker 111m v1.24.0+cb71478
ip-10-0-180-54.us-east-2.compute.internal Ready worker 3h9m v1.24.0+cb71478
ip-10-0-181-251.us-east-2.compute.internal Ready master 3h25m v1.24.0+cb71478
ip-10-0-206-180.us-east-2.compute.internal Ready master 3h25m v1.24.0+cb71478
ip-10-0-210-28.us-east-2.compute.internal Ready worker 3h9m v1.24.0+cb71478
liuhuali@Lius-MacBook-Pro huali-test %

I also checked Case2 mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2084450#c0, but that issue still exists. After enabling the feature gate, the newly created node cannot become Ready. I had thought it was the same issue and would be fixed together. Not sure why it still exists; @ddonati, can you please take a look? Thanks!
Steps:
1. Create a dhcp-options-set
liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS dopt-01dfde88223fcc2cf 301721915996
DHCPCONFIGURATIONS domain-name-servers
VALUES AmazonProvidedDNS

2. Install a regular cluster (without the feature gate)

3. After installation, verify the cluster is healthy; 'oc get node' returns 6 nodes and all are Ready
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-aws133-nz2f8-master-0 Running m6i.xlarge us-east-2 us-east-2a 178m
huliu-aws133-nz2f8-master-1 Running m6i.xlarge us-east-2 us-east-2b 178m
huliu-aws133-nz2f8-master-2 Running m6i.xlarge us-east-2 us-east-2c 178m
huliu-aws133-nz2f8-worker-us-east-2a-nj486 Running m6i.xlarge us-east-2 us-east-2a 174m
huliu-aws133-nz2f8-worker-us-east-2b-hf9sf Running m6i.xlarge us-east-2 us-east-2b 174m
huliu-aws133-nz2f8-worker-us-east-2c-dtgkp Running m6i.xlarge us-east-2 us-east-2c 174m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-144-225.us-east-2.compute.internal Ready worker 162m v1.24.0+cb71478
ip-10-0-158-86.us-east-2.compute.internal Ready master 176m v1.24.0+cb71478
ip-10-0-167-186.us-east-2.compute.internal Ready worker 168m v1.24.0+cb71478
ip-10-0-176-31.us-east-2.compute.internal Ready master 176m v1.24.0+cb71478
ip-10-0-195-129.us-east-2.compute.internal Ready worker 168m v1.24.0+cb71478
ip-10-0-222-99.us-east-2.compute.internal Ready master 176m v1.24.0+cb71478

4. Swap the dhcp-options-set for the VPC with the one above

5. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws133-nz2f8-worker-us-east-2b-hf9sf
machine.machine.openshift.io "huliu-aws133-nz2f8-worker-us-east-2b-hf9sf" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-aws133-nz2f8-master-0 Running m6i.xlarge us-east-2 us-east-2a 3h4m
huliu-aws133-nz2f8-master-1 Running m6i.xlarge us-east-2 us-east-2b 3h4m
huliu-aws133-nz2f8-master-2 Running m6i.xlarge us-east-2 us-east-2c 3h4m
huliu-aws133-nz2f8-worker-us-east-2a-nj486 Running m6i.xlarge us-east-2 us-east-2a 3h
huliu-aws133-nz2f8-worker-us-east-2b-4h9pc Running m6i.xlarge us-east-2 us-east-2b 4m16s
huliu-aws133-nz2f8-worker-us-east-2c-dtgkp Running m6i.xlarge us-east-2 us-east-2c 3h
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-144-225.us-east-2.compute.internal Ready worker 168m v1.24.0+cb71478
ip-10-0-158-86.us-east-2.compute.internal Ready master 3h2m v1.24.0+cb71478
ip-10-0-168-28.us-east-2.compute.internal Ready worker 53s v1.24.0+cb71478
ip-10-0-176-31.us-east-2.compute.internal Ready master 3h2m v1.24.0+cb71478
ip-10-0-195-129.us-east-2.compute.internal Ready worker 174m v1.24.0+cb71478
ip-10-0-222-99.us-east-2.compute.internal Ready master 3h2m v1.24.0+cb71478

6. Enable the feature gate (a non-interactive equivalent is sketched at the end of this comment)
liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate cluster
featuregate.config.openshift.io/cluster edited

7. After waiting more than two hours, the newly created node is still NotReady and the cluster is degraded.
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-144-225.us-east-2.compute.internal Ready worker 5h4m v1.24.0+cb71478
ip-10-0-158-86.us-east-2.compute.internal Ready master 5h18m v1.24.0+cb71478
ip-10-0-168-28.us-east-2.compute.internal NotReady,SchedulingDisabled worker 136m v1.24.0+cb71478
ip-10-0-176-31.us-east-2.compute.internal Ready master 5h18m v1.24.0+cb71478
ip-10-0-195-129.us-east-2.compute.internal Ready worker 5h10m v1.24.0+cb71478
ip-10-0-222-99.us-east-2.compute.internal Ready master 5h18m v1.24.0+cb71478
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-11-054027 True False 4h53m Error while reconciling 4.11.0-0.nightly-2022-06-11-054027: the cluster operator network is degraded
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.nightly-2022-06-11-054027 True False False 122m
baremetal 4.11.0-0.nightly-2022-06-11-054027 True False False 5h16m
cloud-controller-manager 4.11.0-0.nightly-2022-06-11-054027 True False False 5h18m
cloud-credential 4.11.0-0.nightly-2022-06-11-054027 True False False 5h19m
cluster-api 4.11.0-0.nightly-2022-06-11-054027 True False False 130m
cluster-autoscaler 4.11.0-0.nightly-2022-06-11-054027 True False False 5h16m
config-operator 4.11.0-0.nightly-2022-06-11-054027 True False False 5h17m
console 4.11.0-0.nightly-2022-06-11-054027 True False False 5h
csi-snapshot-controller 4.11.0-0.nightly-2022-06-11-054027 True False False 5h17m
dns 4.11.0-0.nightly-2022-06-11-054027 True True False 5h16m DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd 4.11.0-0.nightly-2022-06-11-054027 True False False 5h9m
image-registry 4.11.0-0.nightly-2022-06-11-054027 True False False 5h4m
ingress 4.11.0-0.nightly-2022-06-11-054027 True False False 5h4m
insights 4.11.0-0.nightly-2022-06-11-054027 True False False 5h6m
kube-apiserver 4.11.0-0.nightly-2022-06-11-054027 True False False 5h6m
kube-controller-manager 4.11.0-0.nightly-2022-06-11-054027 True False False 5h14m
kube-scheduler 4.11.0-0.nightly-2022-06-11-054027 True False False 5h8m
kube-storage-version-migrator 4.11.0-0.nightly-2022-06-11-054027 True False False 131m
machine-api 4.11.0-0.nightly-2022-06-11-054027 True False False 5h8m
machine-approver 4.11.0-0.nightly-2022-06-11-054027 True False False 5h17m
machine-config 4.11.0-0.nightly-2022-06-11-054027 False False True 112m Cluster not available for [{operator 4.11.0-0.nightly-2022-06-11-054027}]
marketplace 4.11.0-0.nightly-2022-06-11-054027 True False False 5h16m
monitoring 4.11.0-0.nightly-2022-06-11-054027 True False False 5h4m
network 4.11.0-0.nightly-2022-06-11-054027 True True True 5h18m DaemonSet "/openshift-multus/multus" rollout is not making progress - last change 2022-06-14T05:59:59Z...
node-tuning 4.11.0-0.nightly-2022-06-11-054027 True False False 5h16m
openshift-apiserver 4.11.0-0.nightly-2022-06-11-054027 True False False 5h5m
openshift-controller-manager 4.11.0-0.nightly-2022-06-11-054027 True False False 5h8m
openshift-samples 4.11.0-0.nightly-2022-06-11-054027 True False False 5h4m
operator-lifecycle-manager 4.11.0-0.nightly-2022-06-11-054027 True False False 5h16m
operator-lifecycle-manager-catalog 4.11.0-0.nightly-2022-06-11-054027 True False False 5h17m
operator-lifecycle-manager-packageserver 4.11.0-0.nightly-2022-06-11-054027 True False False 5h5m
service-ca 4.11.0-0.nightly-2022-06-11-054027 True False False 5h17m
storage 4.11.0-0.nightly-2022-06-11-054027 True True False 131m AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
liuhuali@Lius-MacBook-Pro huali-test %

Must Gather - https://drive.google.com/file/d/1vodjVMb5GPsRMhJ2gh6J6dNQXOzzWhcL/view?usp=sharing
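For step 6, a hedged non-interactive equivalent of `oc edit featuregate cluster` (it applies the same change, nothing more):

```
# set the TechPreviewNoUpgrade feature set directly instead of editing interactively
oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
```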
This is a corner case that causes issues when the CCM is enabled. Damiano has parked this for the moment and will come back to it as we prioritise CCM stability later this release cycle.
We've scheduled time for Damiano to look into this within the current sprint; he will update here once he has reproduced the issue and worked out a bit more about what's going on.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9268