Description of problem:
This is similar to BZ#1798272 but slightly different. Now worker nodes do show up in the nodes list, but they are in NotReady state:

[kni@provisionhost-0 ~]$ oc get nodes
NAME                                          STATUS     ROLES    AGE   VERSION
master-0.ocp-edge-cluster.qe.lab.redhat.com   Ready      master   29m   v1.17.1
master-1.ocp-edge-cluster.qe.lab.redhat.com   Ready      master   28m   v1.17.1
master-2.ocp-edge-cluster.qe.lab.redhat.com   Ready      master   30m   v1.17.1
worker-0.ocp-edge-cluster.qe.lab.redhat.com   NotReady   worker   14m   v1.17.1
worker-1.ocp-edge-cluster.qe.lab.redhat.com   NotReady   worker   14m   v1.17.1

[kni@provisionhost-0 ~]$ oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2rc6z   29m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-6t67p   29m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bdlsw   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cq4nh   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-h6q64   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kljnb   14m   system:node:worker-1.ocp-edge-cluster.qe.lab.redhat.com                     Pending
csr-ltc92   29m   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com                     Approved,Issued
csr-mn5lw   29m   system:node:master-1.ocp-edge-cluster.qe.lab.redhat.com                     Approved,Issued
csr-vzwsh   14m   system:node:worker-0.ocp-edge-cluster.qe.lab.redhat.com                     Pending
csr-z27vg   30m   system:node:master-2.ocp-edge-cluster.qe.lab.redhat.com                     Approved,Issued

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-02-042029

How reproducible:
100%

Steps to Reproduce:
1. Deploy a 4.4 bare metal cluster with 3 x master and 2 x worker nodes

Actual results:
Deployment is blocked because the worker nodes' CSRs are in Pending state.

Expected results:
Deployment succeeds.

Additional info:
Note: even after manually approving the CSRs, the worker nodes do not reach Ready state because ovnkube-node keeps looping through:

[root@worker-0 core]# crictl logs 0eba49f3ad7c8
+ [[ -f /env/worker-0.ocp-edge-cluster.qe.lab.redhat.com ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
+ [[ 1 -gt 40 ]]
+ echo 'waiting for db endpoint'
waiting for db endpoint
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
+ [[ 2 -gt 40 ]]
+ echo 'waiting for db endpoint'
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
So is it a test blocker?
(In reply to Wei Sun from comment #2)
> So is it a test blocker?

Yes, it's a test blocker.
The ovnkube-node issue is a known bug, which is fixed by:
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/359

I would test again, as at least that issue should be gone. As for the CSR approval issue, there's nothing I can do without either logs or access to a cluster in this state.
Marius gave me access to a cluster showing the problem. The cluster-machine-approver log shows:

I0311 18:10:32.635909       1 csr_check.go:418] retrieving serving cert from worker-0.ocp-edge-cluster.qe.lab.redhat.com ([fd2e:6f44:5dd8:c956::13b]:10250)
W0311 18:10:32.636926       1 csr_check.go:178] Failed to retrieve current serving cert: remote error: tls: internal error
I0311 18:10:32.636948       1 csr_check.go:183] Falling back to machine-api authorization for worker-0.ocp-edge-cluster.qe.lab.redhat.com
I0311 18:10:32.636959       1 main.go:181] CSR csr-nlgss not authorized: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"
I0311 18:10:32.636966       1 main.go:217] Error syncing csr csr-nlgss: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"

... meaning that it failed to map a Node back to a corresponding Machine. This usually happens because our addresses don't match up correctly.

The addresses on the worker-0 Node are:

  addresses:
  - address: fd2e:6f44:5dd8:c956::13b
    type: InternalIP
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname

The addresses on the Machine are:

  addresses:
  - address: 172.22.0.59
    type: InternalIP
  - address: ""
    type: InternalIP
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: InternalDNS

and the relevant info from the BareMetalHost:

    hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    nics:
    - ip: 172.22.0.59
      mac: 52:54:00:09:9d:d2
      model: 0x1af4 0x0001
      name: enp4s0
      pxe: true
      speedGbps: 0
      vlanId: 0
    - ip: ""
      mac: 52:54:00:50:57:ca
      model: 0x1af4 0x0001
      name: enp5s0
      pxe: false
      speedGbps: 0
      vlanId: 0

The problem here is that we failed to collect the host's IPv6 address during Ironic introspection. That interface has a blank ip field on the BareMetalHost, which got copied to the Machine. We need to determine what went wrong during introspection.
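The failure mode can be illustrated with a small, hypothetical sketch. The real matching logic lives in cluster-machine-approver's csr_check.go and is more involved; the function name and the restriction to InternalIP addresses here are assumptions for illustration only. The point is that the blank ip copied from the BareMetalHost nic leaves the Node's IPv6 InternalIP with no counterpart on the Machine:

```python
# Hypothetical, simplified sketch of the machine-api fallback described by the
# approver log above: assume a Node maps to a Machine when their (non-empty)
# InternalIP sets intersect. This is NOT the real csr_check.go code.

def find_machine_for_node(node_ips, machines):
    # Drop empty strings so a blank BareMetalHost nic ip can never
    # spuriously match another blank entry.
    wanted = {ip for ip in node_ips if ip}
    for name, ips in machines.items():
        if wanted & {ip for ip in ips if ip}:
            return name
    return None  # -> "No target machine for node ...": the CSR stays Pending

# Data taken from the Node and Machine shown above.
node_ips = ["fd2e:6f44:5dd8:c956::13b"]            # worker-0's only InternalIP (IPv6)
machines = {
    "ocp-edge-cluster-worker-0": ["172.22.0.59", ""],  # IPv6 slot is blank
}

print(find_machine_for_node(node_ips, machines))   # prints None

# Had introspection collected the IPv6 address, the match would succeed:
machines["ocp-edge-cluster-worker-0"].append("fd2e:6f44:5dd8:c956::13b")
print(find_machine_for_node(node_ips, machines))   # prints ocp-edge-cluster-worker-0
```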
Either the host didn't get an IP (or didn't even ask for one), or it failed to report it back to Ironic.
I removed the TestBlocker keyword, as deployment passes now (tested on 4.4.0-0.ci-2020-03-11-095511) and we can work around the Pending CSRs by approving them manually:

oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
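The selection step in that pipeline can be sanity-checked offline; a minimal sketch, using a hypothetical sample modeled on the `oc get csr` listing earlier in this report (in a live cluster the awk one-liner above does the same job):

```python
# Sketch of the Pending-CSR selection done by `grep Pending | awk '{print $1}'`.
# The sample output is hypothetical, shaped like real `oc get csr` output.
sample = """\
NAME        AGE   REQUESTOR                                                 CONDITION
csr-kljnb   14m   system:node:worker-1.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-ltc92   29m   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
csr-vzwsh   14m   system:node:worker-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
"""

def pending_csrs(oc_get_csr_output):
    # Keep the first column of every row whose last column is exactly "Pending".
    names = []
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == "Pending":
            names.append(fields[0])
    return names

print(pending_csrs(sample))  # prints ['csr-kljnb', 'csr-vzwsh']
```

Note that kubelet serving-certificate CSRs are only created after the client CSRs have been approved and the kubelet has started, so the approval command may need to be run more than once.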
Marius, if you still have this environment, can you please provide the raw introspection data for the interfaces of one of the workers? E.g. similar to:

$ oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
  Provisioning:
    ID:  cfc24b87-d922-492a-a504-22a5d13057c3

$ curl http://172.22.0.3:5050/v1/introspection/cfc24b87-d922-492a-a504-22a5d13057c3/data | jq .all_interfaces,.interfaces
{
  "enp2s0": {
    "ip": "192.168.111.23",
    "mac": "00:24:28:5f:06:1e",
    "client_id": null,
    "pxe": false
  },
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
{
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
I don't have the same environment anymore, but this is what I got from a new one which shows the same issue:

$ oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
  Provisioning:
    ID:  a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
--
  Normal  ProvisioningStarted   66m  metal3-baremetal-controller  Image provisioning started for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  Normal  ProvisioningComplete  62m  metal3-baremetal-controller  Image provisioning completed for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  Normal  Registered            20m  metal3-baremetal-controller  Registered new host

[kni@provisionhost-0 ~]$ curl -g http://[fd00:1101::3]:5050/v1/introspection/a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a/data
{"error":{"message":"Introspection data not found for node a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a, processed=True"}}

$ oc -n openshift-machine-api get bmh/openshift-worker-0 -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2020-03-12T15:17:51Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 2
  name: openshift-worker-0
  namespace: openshift-machine-api
  resourceVersion: "53866"
  selfLink: /apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-worker-0
  uid: 7230aad7-e738-4a25-8f63-74159839b001
spec:
  bmc:
    address: redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/934a8d73-0eef-4e01-80b8-07ae714f2eb1
    credentialsName: openshift-worker-0-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:51:2a:08
  consumerRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    name: ocp-edge-cluster-worker-0-fzttk
    namespace: openshift-machine-api
  hardwareProfile: unknown
  image:
    checksum:
      http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
    url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  online: true
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 2199.996
      count: 8
      flags:
      - 3dnowprefetch
      - abm
      - adx
      - aes
      - apic
      - arat
      - arch_capabilities
      - arch_perfmon
      - avx
      - avx2
      - bmi1
      - bmi2
      - clflush
      - cmov
      - constant_tsc
      - cpuid
      - cpuid_fault
      - cx16
      - cx8
      - de
      - ept
      - erms
      - f16c
      - flexpriority
      - fma
      - fpu
      - fsgsbase
      - fxsr
      - hle
      - hypervisor
      - invpcid
      - invpcid_single
      - lahf_lm
      - lm
      - mca
      - mce
      - mmx
      - movbe
      - msr
      - mtrr
      - nopl
      - nx
      - pae
      - pat
      - pcid
      - pclmulqdq
      - pdpe1gb
      - pge
      - pni
      - popcnt
      - pse
      - pse36
      - pti
      - rdrand
      - rdseed
      - rdtscp
      - rep_good
      - rtm
      - sep
      - smap
      - smep
      - ss
      - sse
      - sse2
      - sse4_1
      - sse4_2
      - ssse3
      - syscall
      - tpr_shadow
      - tsc
      - tsc_adjust
      - tsc_deadline_timer
      - tsc_known_freq
      - umip
      - vme
      - vmx
      - vnmi
      - vpid
      - x2apic
      - xsave
      - xsaveopt
      - xtopology
      model: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    nics:
    - ip: ""
      mac: 52:54:00:4d:38:bc
      model: 0x1af4 0x0001
      name: enp5s0
      pxe: false
      speedGbps: 0
      vlanId: 0
    - ip: ""
      mac: 52:54:00:51:2a:08
      model: 0x1af4 0x0001
      name: enp4s0
      pxe: true
      speedGbps: 0
      vlanId: 0
    ramMebibytes: 16384
    storage:
    - hctl: "0:0:0:0"
      model: QEMU HARDDISK
      name: /dev/sda
      rotational: true
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 55834574848
      vendor: QEMU
    systemVendor:
      manufacturer: Red Hat
      productName: KVM
      serialNumber: ""
  hardwareProfile: unknown
  lastUpdated: "2020-03-12T16:47:59Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2020-03-12T15:36:53Z"
      start: "2020-03-12T15:33:38Z"
    provision:
      end: "2020-03-12T15:43:53Z"
      start: "2020-03-12T15:40:08Z"
    register:
      end: "2020-03-12T15:33:38Z"
      start: "2020-03-12T15:33:18Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
    image:
      checksum: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
      url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
    state: provisioned
  triedCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
Created attachment 1669718 [details]
metal3-ironic-inspector.log

Attaching the metal3-ironic-inspector log.
*** This bug has been marked as a duplicate of bug 1816121 ***