Description of problem:
This is similar to BZ#1798272 but slightly different. Now worker nodes do show up in the nodes list, but they are in NotReady state:

[kni@provisionhost-0 ~]$ oc get nodes
NAME                                          STATUS     ROLES    AGE   VERSION
master-0.ocp-edge-cluster.qe.lab.redhat.com   Ready      master   29m   v1.17.1
master-1.ocp-edge-cluster.qe.lab.redhat.com   Ready      master   28m   v1.17.1
master-2.ocp-edge-cluster.qe.lab.redhat.com   Ready      master   30m   v1.17.1
worker-0.ocp-edge-cluster.qe.lab.redhat.com   NotReady   worker   14m   v1.17.1
worker-1.ocp-edge-cluster.qe.lab.redhat.com   NotReady   worker   14m   v1.17.1

[kni@provisionhost-0 ~]$ oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2rc6z   29m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-6t67p   29m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bdlsw   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cq4nh   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-h6q64   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kljnb   14m   system:node:worker-1.ocp-edge-cluster.qe.lab.redhat.com                     Pending
csr-ltc92   29m   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com                     Approved,Issued
csr-mn5lw   29m   system:node:master-1.ocp-edge-cluster.qe.lab.redhat.com                     Approved,Issued
csr-vzwsh   14m   system:node:worker-0.ocp-edge-cluster.qe.lab.redhat.com                     Pending
csr-z27vg   30m   system:node:master-2.ocp-edge-cluster.qe.lab.redhat.com                     Approved,Issued

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-02-042029

How reproducible:
100%

Steps to Reproduce:
1. Deploy a 4.4 bare metal cluster with 3 x master and 2 x worker nodes

Actual results:
Deployment is blocked because the worker nodes' CSRs are in Pending state.

Expected results:
Deployment succeeds.

Additional info:
Note: even after manually approving the CSRs, the worker nodes do not reach Ready state because ovnkube-node keeps looping through:

[root@worker-0 core]# crictl logs 0eba49f3ad7c8
+ [[ -f /env/worker-0.ocp-edge-cluster.qe.lab.redhat.com ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
+ [[ 1 -gt 40 ]]
+ echo 'waiting for db endpoint'
waiting for db endpoint
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
+ [[ 2 -gt 40 ]]
+ echo 'waiting for db endpoint'
+ sleep 5
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=
+ [[ -n '' ]]
+ (( retries += 1 ))
waiting for db endpoint
So is it a test blocker?
(In reply to Wei Sun from comment #2)
> So is it a test blocker?

Yes, it's a test blocker.
The ovnkube-node issue is a known bug, which is fixed by:
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/359

I would test again, as at least that issue should be gone. As for the CSR approval issue, there's nothing I can do without either logs or access to a cluster in this state.
Marius gave me access to a cluster showing the problem. The cluster-machine-approver log shows:

I0311 18:10:32.635909       1 csr_check.go:418] retrieving serving cert from worker-0.ocp-edge-cluster.qe.lab.redhat.com ([fd2e:6f44:5dd8:c956::13b]:10250)
W0311 18:10:32.636926       1 csr_check.go:178] Failed to retrieve current serving cert: remote error: tls: internal error
I0311 18:10:32.636948       1 csr_check.go:183] Falling back to machine-api authorization for worker-0.ocp-edge-cluster.qe.lab.redhat.com
I0311 18:10:32.636959       1 main.go:181] CSR csr-nlgss not authorized: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"
I0311 18:10:32.636966       1 main.go:217] Error syncing csr csr-nlgss: No target machine for node "worker-0.ocp-edge-cluster.qe.lab.redhat.com"

... meaning that it failed to map a Node back to a corresponding Machine. This usually happens because our addresses don't match up correctly.

The addresses on the worker-0 Node are:

  addresses:
  - address: fd2e:6f44:5dd8:c956::13b
    type: InternalIP
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname

The addresses on the Machine are:

  addresses:
  - address: 172.22.0.59
    type: InternalIP
  - address: ""
    type: InternalIP
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: Hostname
  - address: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    type: InternalDNS

and the relevant info from the BareMetalHost:

    hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    nics:
    - ip: 172.22.0.59
      mac: 52:54:00:09:9d:d2
      model: 0x1af4 0x0001
      name: enp4s0
      pxe: true
      speedGbps: 0
      vlanId: 0
    - ip: ""
      mac: 52:54:00:50:57:ca
      model: 0x1af4 0x0001
      name: enp5s0
      pxe: false
      speedGbps: 0
      vlanId: 0

The problem here is that we failed to collect the host's IPv6 address during Ironic introspection. That interface has a blank ip field on the BareMetalHost, which got copied to the Machine. We need to determine what went wrong during introspection.
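The failure mode can be illustrated with a small, hypothetical sketch. The real matching logic lives in cluster-machine-approver's csr_check.go and is more involved; the function name and the restriction to InternalIP addresses here are assumptions for illustration only. The point is that the blank ip copied from the BareMetalHost nic leaves the Node's IPv6 InternalIP with no counterpart on the Machine:

```python
# Hypothetical, simplified sketch of the machine-api fallback described by the
# approver log above: assume a Node maps to a Machine when their (non-empty)
# InternalIP sets intersect. This is NOT the real csr_check.go code.

def find_machine_for_node(node_ips, machines):
    # Drop empty strings so a blank BareMetalHost nic ip can never
    # spuriously match another blank entry.
    wanted = {ip for ip in node_ips if ip}
    for name, ips in machines.items():
        if wanted & {ip for ip in ips if ip}:
            return name
    return None  # -> "No target machine for node ...": the CSR stays Pending

# Data taken from the Node and Machine shown above.
node_ips = ["fd2e:6f44:5dd8:c956::13b"]            # worker-0's only InternalIP (IPv6)
machines = {
    "ocp-edge-cluster-worker-0": ["172.22.0.59", ""],  # IPv6 slot is blank
}

print(find_machine_for_node(node_ips, machines))   # prints None

# Had introspection collected the IPv6 address, the match would succeed:
machines["ocp-edge-cluster-worker-0"].append("fd2e:6f44:5dd8:c956::13b")
print(find_machine_for_node(node_ips, machines))   # prints ocp-edge-cluster-worker-0
```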
Either the host didn't get an IP (or didn't even ask for one), or it failed to report it back to Ironic.
I removed the TestBlocker keyword, as deployment passes now (tested on 4.4.0-0.ci-2020-03-11-095511) and we can work around the Pending CSRs by approving them manually:

oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
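The selection step in that pipeline can be sanity-checked offline; a minimal sketch, using a hypothetical sample modeled on the `oc get csr` listing earlier in this report (in a live cluster the awk one-liner above does the same job):

```python
# Sketch of the Pending-CSR selection done by `grep Pending | awk '{print $1}'`.
# The sample output is hypothetical, shaped like real `oc get csr` output.
sample = """\
NAME        AGE   REQUESTOR                                                 CONDITION
csr-kljnb   14m   system:node:worker-1.ocp-edge-cluster.qe.lab.redhat.com   Pending
csr-ltc92   29m   system:node:master-0.ocp-edge-cluster.qe.lab.redhat.com   Approved,Issued
csr-vzwsh   14m   system:node:worker-0.ocp-edge-cluster.qe.lab.redhat.com   Pending
"""

def pending_csrs(oc_get_csr_output):
    # Keep the first column of every row whose last column is exactly "Pending".
    names = []
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == "Pending":
            names.append(fields[0])
    return names

print(pending_csrs(sample))  # prints ['csr-kljnb', 'csr-vzwsh']
```

Note that kubelet serving-certificate CSRs are only created after the client CSRs have been approved and the kubelet has started, so the approval command may need to be run more than once.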
Marius, if you still have this environment, can you please provide the raw introspection data for the interfaces of one of the workers? E.g. similar to:

$ oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
  Provisioning:
    ID:  cfc24b87-d922-492a-a504-22a5d13057c3

$ curl http://172.22.0.3:5050/v1/introspection/cfc24b87-d922-492a-a504-22a5d13057c3/data | jq .all_interfaces,.interfaces
{
  "enp2s0": {
    "ip": "192.168.111.23",
    "mac": "00:24:28:5f:06:1e",
    "client_id": null,
    "pxe": false
  },
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
{
  "enp1s0": {
    "ip": "172.22.0.90",
    "mac": "00:24:28:5f:06:1c",
    "client_id": null,
    "pxe": true
  }
}
I don't have the same environment anymore, but this is what I got from a new one which shows the same issue:

$ oc describe bmh openshift-worker-0 -n openshift-machine-api | grep -A1 Provisioning
  Provisioning:
    ID:  a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
--
  Normal  ProvisioningStarted   66m  metal3-baremetal-controller  Image provisioning started for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  Normal  ProvisioningComplete  62m  metal3-baremetal-controller  Image provisioning completed for http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  Normal  Registered            20m  metal3-baremetal-controller  Registered new host

[kni@provisionhost-0 ~]$ curl -g http://[fd00:1101::3]:5050/v1/introspection/a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a/data
{"error":{"message":"Introspection data not found for node a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a, processed=True"}}

$ oc -n openshift-machine-api get bmh/openshift-worker-0 -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2020-03-12T15:17:51Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 2
  name: openshift-worker-0
  namespace: openshift-machine-api
  resourceVersion: "53866"
  selfLink: /apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-worker-0
  uid: 7230aad7-e738-4a25-8f63-74159839b001
spec:
  bmc:
    address: redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/934a8d73-0eef-4e01-80b8-07ae714f2eb1
    credentialsName: openshift-worker-0-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: 52:54:00:51:2a:08
  consumerRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    name: ocp-edge-cluster-worker-0-fzttk
    namespace: openshift-machine-api
  hardwareProfile: unknown
  image:
    checksum:
      http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
    url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
  online: true
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 2199.996
      count: 8
      flags:
      - 3dnowprefetch
      - abm
      - adx
      - aes
      - apic
      - arat
      - arch_capabilities
      - arch_perfmon
      - avx
      - avx2
      - bmi1
      - bmi2
      - clflush
      - cmov
      - constant_tsc
      - cpuid
      - cpuid_fault
      - cx16
      - cx8
      - de
      - ept
      - erms
      - f16c
      - flexpriority
      - fma
      - fpu
      - fsgsbase
      - fxsr
      - hle
      - hypervisor
      - invpcid
      - invpcid_single
      - lahf_lm
      - lm
      - mca
      - mce
      - mmx
      - movbe
      - msr
      - mtrr
      - nopl
      - nx
      - pae
      - pat
      - pcid
      - pclmulqdq
      - pdpe1gb
      - pge
      - pni
      - popcnt
      - pse
      - pse36
      - pti
      - rdrand
      - rdseed
      - rdtscp
      - rep_good
      - rtm
      - sep
      - smap
      - smep
      - ss
      - sse
      - sse2
      - sse4_1
      - sse4_2
      - ssse3
      - syscall
      - tpr_shadow
      - tsc
      - tsc_adjust
      - tsc_deadline_timer
      - tsc_known_freq
      - umip
      - vme
      - vmx
      - vnmi
      - vpid
      - x2apic
      - xsave
      - xsaveopt
      - xtopology
      model: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: worker-0.ocp-edge-cluster.qe.lab.redhat.com
    nics:
    - ip: ""
      mac: 52:54:00:4d:38:bc
      model: 0x1af4 0x0001
      name: enp5s0
      pxe: false
      speedGbps: 0
      vlanId: 0
    - ip: ""
      mac: 52:54:00:51:2a:08
      model: 0x1af4 0x0001
      name: enp4s0
      pxe: true
      speedGbps: 0
      vlanId: 0
    ramMebibytes: 16384
    storage:
    - hctl: "0:0:0:0"
      model: QEMU HARDDISK
      name: /dev/sda
      rotational: true
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 55834574848
      vendor: QEMU
    systemVendor:
      manufacturer: Red Hat
      productName: KVM
      serialNumber: ""
  hardwareProfile: unknown
  lastUpdated: "2020-03-12T16:47:59Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2020-03-12T15:36:53Z"
      start: "2020-03-12T15:33:38Z"
    provision:
      end: "2020-03-12T15:43:53Z"
      start: "2020-03-12T15:40:08Z"
    register:
      end: "2020-03-12T15:33:38Z"
      start: "2020-03-12T15:33:18Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: a8c6d4fd-74a2-4f7e-b57e-4b900c63b58a
    image:
      checksum: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2.md5sum
      url: http://[fd00:1101::3]:6180/images/rhcos-44.81.202002241126-0-openstack.x86_64.qcow2/rhcos-44.81.202002241126-0-compressed.x86_64.qcow2
    state: provisioned
  triedCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "7733"
Created attachment 1669718 [details]
metal3-ironic-inspector.log

Attaching the metal3-ironic-inspector log.
*** This bug has been marked as a duplicate of bug 1816121 ***