Bug 1809238
| Summary: | Workers node deployment on bare metal with IPv6 control plane is blocked because worker nodes CSRs are not automatically approved | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Marius Cornea <mcornea> | ||||
| Component: | Installer | Assignee: | Russell Bryant <rbryant> | ||||
| Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Amit Ugol <augol> | ||||
| Status: | CLOSED DUPLICATE | Docs Contact: | |||||
| Severity: | urgent | ||||||
| Priority: | urgent | CC: | jfan, mifiedle, rbryant, sasha, scuppett, shardy, stbenjam, vvoronko, wsun, yprokule | ||||
| Version: | 4.4 | ||||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.4.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Last Closed: | 2020-04-16 14:00:34 UTC | Type: | Bug | ||||
| Bug Depends On: | 1811530 | ||||||
| Bug Blocks: | 1788155 | ||||||
Description
Marius Cornea
2020-03-02 16:21:47 UTC
Note: even after manually approving the CSRs the worker nodes do not get into Ready state because ovnkube-node keeps looping through:
[root@worker-0 core]# crictl logs 0eba49f3ad7c8
+ [[ -f /env/worker-0.ocp-edge-cluster.qe.lab.redhat.com ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
+ [[ 1 -gt 40 ]]
+ echo 'waiting for db endpoint'
waiting for db endpoint
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
+ [[ 2 -gt 40 ]]
+ echo 'waiting for db endpoint'
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
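The trace above shows a bounded polling loop: the script queries the ovnkube-db endpoint and, while the result is empty, increments a retry counter, compares it against a limit of 40, prints "waiting for db endpoint", and sleeps 5 seconds. A minimal Python sketch of that pattern follows. The names `wait_for_value`, `fetch`, and the injectable `sleep` are hypothetical and used only for illustration; what the real script does once the limit is exceeded is not visible in the trace, so returning None here is an assumption.

```python
import time
from typing import Callable, Optional


def wait_for_value(
    fetch: Callable[[], str],
    max_retries: int = 40,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll fetch() until it returns a non-empty string.

    Mirrors the traced loop: an empty result bumps the retry counter,
    and the loop gives up once the counter exceeds max_retries.
    Returning None on give-up is an assumption of this sketch.
    """
    retries = 0
    while True:
        value = fetch()
        if value:
            return value
        retries += 1
        if retries > max_retries:
            return None
        print("waiting for db endpoint")
        sleep(delay_seconds)
```

In the failing case described here, `fetch` would keep returning an empty string because the ovnkube-db endpoint has no address yet, so the loop never gets past the wait.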
So is this a test blocker?
(In reply to Wei Sun from comment #2)
> So is it one testblocker ?
Yes, it's a test blocker.
The ovnkube-node issue is a known bug which is fixed by: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/359
I would test again, as at least that issue should be gone. As for the CSR approval issue, there's nothing I can do without either logs or access to a cluster in this state.
Marius gave me access to a cluster showing the problem.
The cluster-machine-approver shows:
I0311 18:10:32.635909 1 csr_check.go:418] retrieving serving cert from worker-0.ocp-edge-cluster.qe.lab.redhat.com ([fd2e:6f44:5dd8:c956::13b]:10250)
W0311 18:10:32.636926 1 csr_check.go:178] Failed to retrieve current serving cert: remote error: tls: internal error
I0311 18:10:32.636948 1 csr_check.go:183] Falling back to machine-api authorization for worker-0.ocp-edge-cluster.qe.lab.redhat.com
I0311 18:10:32.636959 1 main.go:181] CSR csr-nlgss not authorized: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"
I0311 18:10:32.636966 1 main.go:217] Error syncing csr csr-nlgss: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"
... meaning that it failed to map a Node back to a corresponding Machine. This usually happens because our addresses don't match up correctly.
The addresses on the worker-0 Node are:
addresses:
- address: fd2e:6f44:5dd8:c956::13b
  type: InternalIP
- address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
  type: Hostname
The addresses on the Machine are:
addresses:
- address: 172.22.0.59
  type: InternalIP
- address: ""
  type: InternalIP
- address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
  type: Hostname
- address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
  type: InternalDNS
and the relevant info from the BareMetalHost:
hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
nics:
- ip: 172.22.0.59
  mac: 52:54:00:09:9d:d2
  model: 0x1af4 0x0001
  name: enp4s0
  pxe: true
  speedGbps: 0
  vlanId: 0
- ip: ""
  mac: 52:54:00:50:57:ca
  model: 0x1af4 0x0001
  name: enp5s0
  pxe: false
  speedGbps: 0
  vlanId: 0
The problem here is that we failed to collect the host's IPv6 address during Ironic introspection. That interface has a blank ip field on the BareMetalHost (which got copied to the Machine).
We need to determine what went wrong during introspection. Either the host didn't get an IP (or didn't even ask for one), or it failed to report it back to ironic.
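To make the mismatch concrete, the following Python sketch compares the Node and Machine addresses quoted above. It only illustrates the idea that a Node needs to share an address with some Machine. Which component performs the match, and by what exact rule, is not stated in this report, so comparing on non-empty InternalIP entries is an assumption, and the helper names `internal_ips` and `shared_internal_ips` are hypothetical.

```python
from typing import Dict, List, Set

Address = Dict[str, str]

# Addresses copied from the Node and Machine listings in this report.
NODE_ADDRESSES: List[Address] = [
    {"address": "fd2e:6f44:5dd8:c956::13b", "type": "InternalIP"},
    {"address": "worker-0.ocp-edge-cluster.qe.lab.redhat.com", "type": "Hostname"},
]
MACHINE_ADDRESSES: List[Address] = [
    {"address": "172.22.0.59", "type": "InternalIP"},
    {"address": "", "type": "InternalIP"},
    {"address": "worker-0.ocp-edge-cluster.qe.lab.redhat.com", "type": "Hostname"},
    {"address": "worker-0.ocp-edge-cluster.qe.lab.redhat.com", "type": "InternalDNS"},
]


def internal_ips(addresses: List[Address]) -> Set[str]:
    """Return the non-empty InternalIP values from an address list."""
    return {
        entry["address"]
        for entry in addresses
        if entry["type"] == "InternalIP" and entry["address"]
    }


def shared_internal_ips(node: List[Address], machine: List[Address]) -> Set[str]:
    """Return the InternalIP values present on both the Node and the Machine."""
    return internal_ips(node) & internal_ips(machine)


if __name__ == "__main__":
    # Assumed rule: a Node and a Machine correspond if they share an InternalIP.
    print(shared_internal_ips(NODE_ADDRESSES, MACHINE_ADDRESSES))
```

With the data from this report the intersection is empty (the script prints `set()`): the Node only has the IPv6 address, while the Machine only has the provisioning-network IPv4 address plus a blank entry where the IPv6 address should have been.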
I removed the TestBlocker, as deployment passes now (tested on 4.4.0-0.ci-2020-03-11-095511), and as a workaround we can manually approve the certificates:
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
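The one-liner above selects every line that contains "Pending" anywhere. A slightly stricter selection can be sketched in Python, assuming the CSR name is the first whitespace-separated column of the `oc get csr` output and the condition is the last one. That column layout, the sample rows (only the name csr-nlgss comes from this report; csr-abcde and the other values are made up), and the helper name `pending_csr_names` are assumptions for illustration.

```python
from typing import List


def pending_csr_names(oc_get_csr_output: str) -> List[str]:
    """Return names of CSRs whose last column is exactly 'Pending'.

    Assumes the name is the first column and the condition the last one.
    """
    names = []
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == "Pending":
            names.append(fields[0])
    return names


# Hypothetical sample output; the column layout is an assumption.
SAMPLE = """\
NAME        AGE   REQUESTOR                                                 CONDITION
csr-nlgss   5m    system:node:worker-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-abcde   40m   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
"""

if __name__ == "__main__":
    for name in pending_csr_names(SAMPLE):
        # Print the approval command rather than running it.
        print("oc adm certificate approve", name)
```

The sketch only prints the commands it would run, so the list can be reviewed before anything is approved.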
Marius, if you still have this environment, can you please provide the raw introspection data for the interfaces of one of the workers, e.g., similar to:
$ oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
Provisioning:
ID: cfc24b87-d922-492a-a504-22a5d13057c3
$ curl http://172.22.0.3:5050/v1/introspection/cfc24b87-d922-492a-a504-22a5d13057c3/data | jq .all_interfaces,.interfaces
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3091 100 3091 0 0 754k 0 --:--:-- --:--:-- --:--:-- 754k
{
  "enp2s0": {
    "ip": "192.168.111.23",
    "mac": "00:24:28:5f:06:1e",
    "client_id": null,
    "pxe": false
  },
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
{
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
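Once such introspection data is available, the check being asked for can be sketched in Python: list the interfaces for which no IP was recorded. The sample below reuses the values from the example output above, wrapped under the `all_interfaces` and `interfaces` keys that the jq filter selects. The helper name `interfaces_without_ip` is hypothetical, and treating both an empty string and null as "no IP" is an assumption, since the report does not show how a missing address appears in the raw data.

```python
import json
from typing import Any, Dict, List

# Example data from this report, wrapped under the keys the jq filter selects.
SAMPLE: Dict[str, Any] = json.loads("""
{
  "all_interfaces": {
    "enp2s0": {"ip": "192.168.111.23", "mac": "00:24:28:5f:06:1e",
               "client_id": null, "pxe": false},
    "enp1s0": {"ip": "172.22.0.90", "mac": "00:24:28:5f:06:1c",
               "client_id": null, "pxe": true}
  },
  "interfaces": {
    "enp1s0": {"ip": "172.22.0.90", "mac": "00:24:28:5f:06:1c",
               "client_id": null, "pxe": true}
  }
}
""")


def interfaces_without_ip(data: Dict[str, Any], key: str = "all_interfaces") -> List[str]:
    """Return sorted interface names under data[key] whose 'ip' is empty or missing."""
    nics = data.get(key, {})
    return sorted(name for name, nic in nics.items() if not nic.get("ip"))


if __name__ == "__main__":
    print(interfaces_without_ip(SAMPLE))
```

For the healthy example above the result is an empty list; an interface reported with a blank `ip`, as on the affected BareMetalHost, would show up in the returned names.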
I didn't have the same environment anymore but this is what I got from a new one which shows the same issue:
oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
Provisioning:
ID: a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
--
Normal ProvisioningStarted 66m metal3-baremetal-controller Image provisioning started for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
Normal ProvisioningComplete 62m metal3-baremetal-controller Image provisioning completed for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
Normal Registered 20m metal3-baremetal-controller Registered new host
[kni@provisionhost-0 ~]$ curl -g http://[fd00:1101::3]:5050/v1/introspection/a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a/data
{"error":{"message":"Introspection data not found for node a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a, processed=True"}}
oc -n openshift-machine-api get bmh/openshift-worker-0 -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2020-03-12T15:17:51Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 2
  name: openshift-worker-0
  namespace: openshift-machine-api
  resourceVersion: "53866"
  selfLink: /apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-worker-0
  uid: 7230aad7-e738-4a25-8f63-74159839b001
spec:
  bmc:
    address: redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/934a8d73-0eef-4e01-80b8-07ae714f2eb1
    credentialsName: openshift-worker-0-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:51:2a:08
  consumerRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    name: ocp-edge-cluster-worker-0-fzttk
    namespace: openshift-machine-api
  hardwareProfile: unknown
  image:
    checksum: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
    url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  online: true
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 2199.996
      count: 8
      flags:
      - 3dnowprefetch
      - abm
      - adx
      - aes
      - apic
      - arat
      - arch_capabilities
      - arch_perfmon
      - avx
      - avx2
      - bmi1
      - bmi2
      - clflush
      - cmov
      - constant_tsc
      - cpuid
      - cpuid_fault
      - cx16
      - cx8
      - de
      - ept
      - erms
      - f16c
      - flexpriority
      - fma
      - fpu
      - fsgsbase
      - fxsr
      - hle
      - hypervisor
      - invpcid
      - invpcid_single
      - lahf_lm
      - lm
      - mca
      - mce
      - mmx
      - movbe
      - msr
      - mtrr
      - nopl
      - nx
      - pae
      - pat
      - pcid
      - pclmulqdq
      - pdpe1gb
      - pge
      - pni
      - popcnt
      - pse
      - pse36
      - pti
      - rdrand
      - rdseed
      - rdtscp
      - rep_good
      - rtm
      - sep
      - smap
      - smep
      - ss
      - sse
      - sse2
      - sse4_1
      - sse4_2
      - ssse3
      - syscall
      - tpr_shadow
      - tsc
      - tsc_adjust
      - tsc_deadline_timer
      - tsc_known_freq
      - umip
      - vme
      - vmx
      - vnmi
      - vpid
      - x2apic
      - xsave
      - xsaveopt
      - xtopology
      model: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    nics:
    - ip: ""
      mac: 52:54:00:4d:38:bc
      model: 0x1af4 0x0001
      name: enp5s0
      pxe: false
      speedGbps: 0
      vlanId: 0
    - ip: ""
      mac: 52:54:00:51:2a:08
      model: 0x1af4 0x0001
      name: enp4s0
      pxe: true
      speedGbps: 0
      vlanId: 0
    ramMebibytes: 16384
    storage:
    - hctl: "0:0:0:0"
      model: QEMU HARDDISK
      name: /dev/sda
      rotational: true
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 55834574848
      vendor: QEMU
    systemVendor:
      manufacturer: Red Hat
      productName: KVM
      serialNumber: ""
  hardwareProfile: unknown
  lastUpdated: "2020-03-12T16:47:59Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2020-03-12T15:36:53Z"
      start: "2020-03-12T15:33:38Z"
    provision:
      end: "2020-03-12T15:43:53Z"
      start: "2020-03-12T15:40:08Z"
    register:
      end: "2020-03-12T15:33:38Z"
      start: "2020-03-12T15:33:18Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
    image:
      checksum: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
      url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
    state: provisioned
  triedCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
Created attachment 1669718 [details]
metal3-ironic-inspector.log
Attaching metal3-ironic-inspector log
*** This bug has been marked as a duplicate of bug 1816121 ***