Bug 1796344

Summary: OCP 4.4: CSRs not automatically approved when scaling machineset with providerSpec publicIP set to true
Product: OpenShift Container Platform
Component: Cloud Compute
Cloud Compute sub component: Other Providers
Version: 4.4
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Release: 4.4.0
Hardware: Unspecified
OS: Linux
Reporter: Walid A. <wabouham>
Assignee: Alberto <agarcial>
QA Contact: Jianwei Hou <jhou>
CC: mifiedle, nelluri, vlaad
Keywords: NeedsTestCase
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-05-15 15:55:51 UTC

Description Walid A. 2020-01-30 08:47:52 UTC
Description of problem:
This was not an issue on OCP 4.3. On OCP 4.4, it occurs when adding a new workload node for automation by scaling an existing machineset on an AWS IPI cluster with the providerSpec value publicIP set to true.

oc debug node/<newly_added_node> fails with:
# oc debug node/ip-10-0-7-124.us-west-2.compute.internal 
Starting pod/ip-10-0-7-124us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP:
If you don't see a command prompt, try pressing enter.

Removing debug pod ...
Error from server: error dialing backend: remote error: tls: internal error

# oc logs -n openshift-cluster-machine-approver machine-approver-7b9ffbdbd5-67k4x -c machine-approver-controller | grep "ip-10-0-7-124.us-west-2.compute.internal"
I0128 18:02:41.296779       1 csr_check.go:418] retrieving serving cert from ip-10-0-7-124.us-west-2.compute.internal (
I0128 18:02:41.300586       1 csr_check.go:183] Falling back to machine-api authorization for ip-10-0-7-124.us-west-2.compute.internal
I0128 18:02:41.300600       1 main.go:181] CSR csr-7z7f6 not authorized: DNS name 'ec2-54-203-167-77.us-west-2.compute.amazonaws.com' not in machine names: ip-10-0-7-124.us-west-2.compute.internal ip-10-0-7-124.us-west-2.compute.internal
I0128 18:02:41.300609       1 main.go:217] Error syncing csr csr-7z7f6: DNS name 'ec2-54-203-167-77.us-west-2.compute.amazonaws.com' not in machine names: ip-10-0-7-124.us-west-2.compute.internal ip-10-0-7-124.us-west-2.compute.internal
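
The rejection in the log above comes down to a name comparison: the CSR requests a DNS name that is not in the list of names recorded for the machine. A minimal sketch of that kind of check follows. It is not the machine-approver code; the function name and the assumed CSR contents are illustrative only (the log confirms just that the public DNS name was requested).

```python
def unauthorized_dns_names(csr_dns_names, machine_names):
    """Return the CSR DNS names that are not among the machine's known names."""
    known = set(machine_names)
    return [name for name in csr_dns_names if name not in known]


# Machine names exactly as printed in the log above (the entry appears twice).
machine_names = [
    "ip-10-0-7-124.us-west-2.compute.internal",
    "ip-10-0-7-124.us-west-2.compute.internal",
]

# Assumed CSR contents: internal name plus the public DNS name from the log.
csr_dns_names = [
    "ip-10-0-7-124.us-west-2.compute.internal",
    "ec2-54-203-167-77.us-west-2.compute.amazonaws.com",
]

rejected = unauthorized_dns_names(csr_dns_names, machine_names)
print(rejected)
```

Under these assumptions the public DNS name is the only one rejected. If the machine's recorded names also contained that public DNS name, the same check would return an empty list.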

Manually approving the pending CSRs restores `oc debug node` functionality on that cluster:

oc adm certificate approve certificatesigningrequest.certificates.k8s.io/csr-zhp5l

After the CSRs are approved manually, `oc debug node/<node_ip>` succeeds.
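
When several CSRs are pending, the workaround has to be repeated for each one. The sketch below picks the pending names out of `oc get csr` table output and prints one approve command per name. The helper name and the sample table are illustrative assumptions (the sample reuses names from this report and is not captured output); it relies only on CONDITION being the last column.

```python
def pending_csr_names(oc_get_csr_output):
    """Return names of CSRs whose last column (CONDITION) is exactly 'Pending'."""
    names = []
    for line in oc_get_csr_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[-1] == "Pending":
            names.append(fields[0])
    return names


# Illustrative sample only, shaped like `oc get csr` output.
sample_output = """\
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-7z7f6   5m    system:node:ip-10-0-7-124.us-west-2.compute.internal                        Pending
csr-vmk49   9m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
"""

pending = pending_csr_names(sample_output)
for name in pending:
    # Print the command rather than running it, so each approval can be reviewed.
    print(f"oc adm certificate approve {name}")
```

Printing the commands instead of executing them keeps the approval a deliberate step, which matters because approving a CSR grants the requester a signed certificate.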

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.4.0-0.nightly-2020-01-24-141203
Server Version: 4.4.0-0.nightly-2020-01-24-141203
Kubernetes Version: v1.17.1

How reproducible:
All the time on OCP 4.4

Steps to Reproduce:
1. Install an OCP 4.4 AWS IPI cluster (3 master, 3 worker nodes) with the 4.4.0-0.nightly-2020-01-24-141203 payload.
2. oc get machineset -n openshift-machine-api -o yaml > first_worker_node_machineset.yaml
3. cp first_worker_node_machineset.yaml new_workload_node_machineset.yaml
4. vim new_workload_node_machineset.yaml

   Edit the machineset name and labels, and ensure that the providerSpec value contains publicIp: true.
5. oc create -f new_workload_node_machineset.yaml
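
The manual edit in the steps above can be sketched as a small transformation on a single MachineSet object, for example one loaded from `oc get machineset <name> -o json`. This is a hedged illustration, not a supported tool: the function name and the trimmed sample object are hypothetical, and it assumes the usual MachineSet layout (spec.selector.matchLabels, spec.template.metadata.labels, spec.template.spec.providerSpec.value).

```python
import copy
import json

MACHINESET_LABEL = "machine.openshift.io/cluster-api-machineset"


def derive_public_ip_machineset(machineset, new_name):
    """Copy a MachineSet, rename it, and set publicIp: true in its providerSpec."""
    new = copy.deepcopy(machineset)
    # Drop server-populated fields so the copy can be created as a new object.
    for key in ("uid", "resourceVersion", "creationTimestamp", "selfLink", "generation"):
        new["metadata"].pop(key, None)
    new.pop("status", None)
    new["metadata"]["name"] = new_name
    # The selector and the template label must match each other.
    new["spec"]["selector"]["matchLabels"][MACHINESET_LABEL] = new_name
    template = new["spec"]["template"]
    template["metadata"]["labels"][MACHINESET_LABEL] = new_name
    template["spec"]["providerSpec"]["value"]["publicIp"] = True
    return new


# Hypothetical, heavily trimmed MachineSet used only to exercise the function.
original = {
    "apiVersion": "machine.openshift.io/v1beta1",
    "kind": "MachineSet",
    "metadata": {"name": "worker-a", "namespace": "openshift-machine-api", "uid": "x"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {MACHINESET_LABEL: "worker-a"}},
        "template": {
            "metadata": {"labels": {MACHINESET_LABEL: "worker-a"}},
            "spec": {"providerSpec": {"value": {"instanceType": "m4.large"}}},
        },
    },
    "status": {"replicas": 1},
}

derived = derive_public_ip_machineset(original, "worker-a-publicip")
print(json.dumps(derived, indent=2))
```

The printed JSON could then be saved to a file and passed to `oc create -f`. The original object is left unmodified because the function works on a deep copy.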

Actual results:
Several CSRs remain pending approval, and `oc debug node/<new_workload_node_ip>` fails with:
Error from server: error dialing backend: remote error: tls: internal error

Expected results:
CSRs should be automatically approved when scaling the cluster and adding a new workload node for automation.

Additional info:
Links to the must-gather logs, machineset, and machine-approver logs will be provided in the next comment.

Comment 2 Brad Ison 2020-02-04 15:18:43 UTC
This was a regression introduced by:

It has been fixed with:

Comment 4 Jianwei Hou 2020-02-06 05:52:23 UTC
Verified in 4.4.0-0.nightly-2020-02-05-181112

A machine with publicIp set to true is provisioned and its CSRs are approved.

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: running
  creationTimestamp: "2020-02-06T05:38:39Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: qe-jhou06-f8dr2-worker-us-east-2a-publicip-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: qe-jhou06-f8dr2
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: qe-jhou06-f8dr2-worker-us-east-2a
    machine.openshift.io/instance-type: m4.large
    machine.openshift.io/region: us-east-2
    machine.openshift.io/zone: us-east-2a
  name: qe-jhou06-f8dr2-worker-us-east-2a-publicip-ltrpk
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: qe-jhou06-f8dr2-worker-us-east-2a-publicip
    uid: cbf4296b-948a-43ca-bc74-2e1c69b6ea3a
  resourceVersion: "58548"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/qe-jhou06-f8dr2-worker-us-east-2a-publicip-ltrpk
  uid: b2c395ac-b0e5-45c7-819b-1056b34c8c39
spec:
  metadata:
    creationTimestamp: null
  providerID: aws:///us-east-2a/i-002b70030a8d0af6c
  providerSpec:
    value:
      ami:
        id: ami-0a8ba019bc9d4bd64
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: qe-jhou06-f8dr2-worker-profile
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-2a
        region: us-east-2
      publicIp: true
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - qe-jhou06-f8dr2-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - qe-jhou06-f8dr2-private-us-east-2a
      tags:
      - name: kubernetes.io/cluster/qe-jhou06-f8dr2
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address:
    type: InternalIP
  - address:
    type: ExternalIP
  - address: ip-10-0-131-133.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-131-133.us-east-2.compute.internal
    type: Hostname
  - address: ec2-3-135-218-125.us-east-2.compute.amazonaws.com
    type: ExternalDNS
  lastUpdated: "2020-02-06T05:43:21Z"
  nodeRef:
    kind: Node
    name: ip-10-0-131-133.us-east-2.compute.internal
    uid: cecd777b-8291-46fd-8a43-a41a77b3a24a
  phase: Running
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2020-02-06T05:38:41Z"
      lastTransitionTime: "2020-02-06T05:38:41Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-002b70030a8d0af6c
    instanceState: running
    kind: AWSMachineProviderStatus

oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-hxw7z   9m35s   system:node:ip-10-0-131-133.us-east-2.compute.internal                      Approved,Issued
csr-vmk49   9m48s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued