Bug 1809238
Summary: Workers node deployment on bare metal with IPv6 control plane is blocked because worker nodes CSRs are not automatically approved
Product: OpenShift Container Platform
Component: Installer
Installer sub component: OpenShift on Bare Metal IPI
Reporter: Marius Cornea <mcornea>
Assignee: Russell Bryant <rbryant>
QA Contact: Amit Ugol <augol>
CC: jfan, mifiedle, rbryant, sasha, scuppett, shardy, stbenjam, vvoronko, wsun, yprokule
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Version: 4.4
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2020-04-16 14:00:34 UTC
Type: Bug
Bug Depends On: 1811530
Bug Blocks: 1788155

Description
Marius Cornea
2020-03-02 16:21:47 UTC
Note: even after manually approving the CSRs, the worker nodes do not get into Ready state because ovnkube-node keeps looping through:

[root@worker-0 core]# crictl logs 0eba49f3ad7c8
+ [[ -f /env/worker-0.ocp-edge-cluster.qe.lab.redhat.com ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
+ [[ 1 -gt 40 ]]
+ echo 'waiting for db endpoint'
waiting for db endpoint
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
+ [[ 2 -gt 40 ]]
+ echo 'waiting for db endpoint'
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint

So is it one testblocker ?

(In reply to Wei Sun from comment #2)
> So is it one testblocker ?

Yes, it's a test blocker.

The ovnkube-node issue is a known bug which is fixed by:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/359

I would test again, as at least that issue should be gone.

As for the CSR approval thing, there's nothing I can do without either logs or access to a cluster in this state.

Marius gave me access to a cluster showing the problem.

The cluster-machine-approver shows:

I0311 18:10:32.635909 1 csr_check.go:418] retrieving serving cert from worker-0.ocp-edge-cluster.qe.lab.redhat.com ([fd2e:6f44:5dd8:c956::13b]:10250)
W0311 18:10:32.636926 1 csr_check.go:178] Failed to retrieve current serving cert: remote error: tls: internal error
I0311 18:10:32.636948 1 csr_check.go:183] Falling back to machine-api authorization for worker-0.ocp-edge-cluster.qe.lab.redhat.com
I0311 18:10:32.636959 1 main.go:181] CSR csr-nlgss not authorized: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"
I0311 18:10:32.636966 1 main.go:217] Error syncing csr csr-nlgss: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"

... meaning that it failed to map a Node back to a corresponding Machine. This usually happens because our addresses don't match up correctly.

The addresses on the worker-0 Node are:

  addresses:
  - address: fd2e:6f44:5dd8:c956::13b
    type: InternalIP
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname

The addresses on the Machine are:

  addresses:
  - address: 172.22.0.59
    type: InternalIP
  - address: ""
    type: InternalIP
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: InternalDNS

and the relevant info from the BareMetalHost:

  hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
  nics:
  - ip: 172.22.0.59
    mac: 52:54:00:09:9d:d2
    model: 0x1af4 0x0001
    name: enp4s0
    pxe: true
    speedGbps: 0
    vlanId: 0
  - ip: ""
    mac: 52:54:00:50:57:ca
    model: 0x1af4 0x0001
    name: enp5s0
    pxe: false
    speedGbps: 0
    vlanId: 0

The problem here is that we failed to collect the host's IPv6 address during Ironic introspection. That interface has a blank ip field on the BareMetalHost (which got copied to the Machine).

We need to determine what went wrong during introspection. Either the host didn't get an IP (or didn't even ask for one), or it failed to report it back to ironic.

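To make the mismatch above concrete, here is a minimal Python sketch. It assumes that the Node is linked to a Machine when the two share at least one InternalIP; that is a simplification of the explanation in this comment, not the actual cluster-machine-approver or machine-api code. The function names are hypothetical, and the address lists are copied from the Node and Machine shown above.

```python
from ipaddress import ip_address


def internal_ips(addresses):
    """Return the set of parsed InternalIP addresses, skipping blank entries."""
    ips = set()
    for entry in addresses:
        if entry.get("type") != "InternalIP":
            continue
        raw = entry.get("address", "")
        if not raw:
            # Blank entry, e.g. a NIC whose IP was never collected.
            continue
        ips.add(ip_address(raw))
    return ips


def shares_internal_ip(node_addresses, machine_addresses):
    """Assumed matching rule: Node and Machine share at least one InternalIP."""
    return bool(internal_ips(node_addresses) & internal_ips(machine_addresses))


# Addresses reported on the worker-0 Node (from this bug).
node_addresses = [
    {"address": "fd2e:6f44:5dd8:c956::13b", "type": "InternalIP"},
    {"address": "worker-0.ocp-edge-cluster.qe.lab.redhat.com", "type": "Hostname"},
]

# Addresses reported on the Machine (from this bug).
machine_addresses = [
    {"address": "172.22.0.59", "type": "InternalIP"},
    {"address": "", "type": "InternalIP"},
    {"address": "worker-0.ocp-edge-cluster.qe.lab.redhat.com", "type": "Hostname"},
    {"address": "worker-0.ocp-edge-cluster.qe.lab.redhat.com", "type": "InternalDNS"},
]

print(shares_internal_ip(node_addresses, machine_addresses))  # False
```

Under that assumption the only usable InternalIP on the Machine is the IPv4 provisioning address, so nothing lines up with the Node's IPv6 address until the blank entry is populated.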
I removed the TestBlocker as deployment passes now (tested on 4.4.0-0.ci-2020-03-11-095511), and as a workaround we can manually approve the certificates:

oc get csr | grep Pending | awk {'print $1'} | xargs oc adm certificate approve

Marius, if you still have this environment, can you please provide the raw introspection data for the interfaces for one of the workers, e.g. similar to:

$ oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
  Provisioning:
    ID: cfc24b87-d922-492a-a504-22a5d13057c3

$ curl http://172.22.0.3:5050/v1/introspection/cfc24b87-d922-492a-a504-22a5d13057c3/data | jq .all_interfaces,.interfaces
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3091  100  3091    0     0   754k      0 --:--:-- --:--:-- --:--:--  754k
{
  "enp2s0": {
    "ip": "192.168.111.23",
    "mac": "00:24:28:5f:06:1e",
    "client_id": null,
    "pxe": false
  },
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
{
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}

I didn't have the same environment anymore but this is what I got from a new one which shows the same issue:

oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
  Provisioning:
    ID: a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
--
  Normal ProvisioningStarted 66m metal3-baremetal-controller Image provisioning started for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  Normal ProvisioningComplete 62m metal3-baremetal-controller Image provisioning completed for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  Normal Registered 20m metal3-baremetal-controller Registered new host

[kni@provisionhost-0 ~]$ curl -g http://[fd00:1101::3]:5050/v1/introspection/a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a/data
{"error":{"message":"Introspection data not found for node a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a, processed=True"}}

oc -n openshift-machine-api get bmh/openshift-worker-0 -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2020-03-12T15:17:51Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 2
  name: openshift-worker-0
  namespace: openshift-machine-api
  resourceVersion: "53866"
  selfLink: /apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-worker-0
  uid: 7230aad7-e738-4a25-8f63-74159839b001
spec:
  bmc:
    address: redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/934a8d73-0eef-4e01-80b8-07ae714f2eb1
    credentialsName: openshift-worker-0-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:51:2a:08
  consumerRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    name: ocp-edge-cluster-worker-0-fzttk
    namespace: openshift-machine-api
  hardwareProfile: unknown
  image:
    checksum: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
    url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  online: true
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 2199.996
      count: 8
      flags:
      - 3dnowprefetch
      - abm
      - adx
      - aes
      - apic
      - arat
      - arch_capabilities
      - arch_perfmon
      - avx
      - avx2
      - bmi1
      - bmi2
      - clflush
      - cmov
      - constant_tsc
      - cpuid
      - cpuid_fault
      - cx16
      - cx8
      - de
      - ept
      - erms
      - f16c
      - flexpriority
      - fma
      - fpu
      - fsgsbase
      - fxsr
      - hle
      - hypervisor
      - invpcid
      - invpcid_single
      - lahf_lm
      - lm
      - mca
      - mce
      - mmx
      - movbe
      - msr
      - mtrr
      - nopl
      - nx
      - pae
      - pat
      - pcid
      - pclmulqdq
      - pdpe1gb
      - pge
      - pni
      - popcnt
      - pse
      - pse36
      - pti
      - rdrand
      - rdseed
      - rdtscp
      - rep_good
      - rtm
      - sep
      - smap
      - smep
      - ss
      - sse
      - sse2
      - sse4_1
      - sse4_2
      - ssse3
      - syscall
      - tpr_shadow
      - tsc
      - tsc_adjust
      - tsc_deadline_timer
      - tsc_known_freq
      - umip
      - vme
      - vmx
      - vnmi
      - vpid
      - x2apic
      - xsave
      - xsaveopt
      - xtopology
      model: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    nics:
    - ip: ""
      mac: 52:54:00:4d:38:bc
      model: 0x1af4 0x0001
      name: enp5s0
      pxe: false
      speedGbps: 0
      vlanId: 0
    - ip: ""
      mac: 52:54:00:51:2a:08
      model: 0x1af4 0x0001
      name: enp4s0
      pxe: true
      speedGbps: 0
      vlanId: 0
    ramMebibytes: 16384
    storage:
    - hctl: "0:0:0:0"
      model: QEMU HARDDISK
      name: /dev/sda
      rotational: true
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 55834574848
      vendor: QEMU
    systemVendor:
      manufacturer: Red Hat
      productName: KVM
      serialNumber: ""
  hardwareProfile: unknown
  lastUpdated: "2020-03-12T16:47:59Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2020-03-12T15:36:53Z"
      start: "2020-03-12T15:33:38Z"
    provision:
      end: "2020-03-12T15:43:53Z"
      start: "2020-03-12T15:40:08Z"
    register:
      end: "2020-03-12T15:33:38Z"
      start: "2020-03-12T15:33:18Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
    image:
      checksum: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
      url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
    state: provisioned
  triedCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"

Created attachment 1669718: metal3-ironic-inspector.log

Attaching metal3-ironic-inspector log

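For anyone trying to confirm the same symptom on another host, a small Python sketch like the following can list the NICs whose ip field came back blank from inspection. It assumes the BareMetalHost has been loaded as a dict (for example from `oc ... -o json` output) with the same `status.hardware.nics` layout as the YAML in this bug; the function name is hypothetical, and the sample data is trimmed from the openshift-worker-0 host above.

```python
def nics_missing_ip(bmh):
    """Return the names of NICs whose ip field is blank or absent.

    `bmh` is a BareMetalHost object parsed into a dict.
    """
    nics = bmh.get("status", {}).get("hardware", {}).get("nics", [])
    return [nic["name"] for nic in nics if not nic.get("ip")]


# Trimmed copy of the openshift-worker-0 BareMetalHost status from this bug.
bmh = {
    "status": {
        "hardware": {
            "hostname": "worker-0.ocp-edge-cluster.qe.lab.redhat.com",
            "nics": [
                {"ip": "", "mac": "52:54:00:4d:38:bc", "name": "enp5s0", "pxe": False},
                {"ip": "", "mac": "52:54:00:51:2a:08", "name": "enp4s0", "pxe": True},
            ],
        }
    }
}

print(nics_missing_ip(bmh))  # ['enp5s0', 'enp4s0']
```

On this host both interfaces are reported without an address, which is consistent with the Machine ending up with no InternalIP that matches the Node.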
*** This bug has been marked as a duplicate of bug 1816121 ***