Bug 1796344 - OCP 4.4: CSRs not automatically approved when scaling machineset with providerSpec publicIP set to true
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-01-30 08:47 UTC by Walid A.
Modified: 2020-05-15 15:55 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-15 15:55:51 UTC
Target Upstream Version:



Description Walid A. 2020-01-30 08:47:52 UTC
Description of problem:
This was not an issue on OCP 4.3. It happens on OCP 4.4 when adding a new workload node for automation by scaling an existing machineset on an AWS IPI cluster, with the providerSpec value for publicIP set to true.

oc debug node/<newly_added_node> fails with:
# oc debug node/ip-10-0-7-124.us-west-2.compute.internal 
Starting pod/ip-10-0-7-124us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.7.124
If you don't see a command prompt, try pressing enter.

Removing debug pod ...
Error from server: error dialing backend: remote error: tls: internal error


# oc logs -n openshift-cluster-machine-approver machine-approver-7b9ffbdbd5-67k4x -c machine-approver-controller | grep "ip-10-0-7-124.us-west-2.compute.internal"
I0128 18:02:41.296779       1 csr_check.go:418] retrieving serving cert from ip-10-0-7-124.us-west-2.compute.internal (10.0.7.124:10250)
I0128 18:02:41.300586       1 csr_check.go:183] Falling back to machine-api authorization for ip-10-0-7-124.us-west-2.compute.internal
I0128 18:02:41.300600       1 main.go:181] CSR csr-7z7f6 not authorized: DNS name 'ec2-54-203-167-77.us-west-2.compute.amazonaws.com' not in machine names: ip-10-0-7-124.us-west-2.compute.internal ip-10-0-7-124.us-west-2.compute.internal
I0128 18:02:41.300609       1 main.go:217] Error syncing csr csr-7z7f6: DNS name 'ec2-54-203-167-77.us-west-2.compute.amazonaws.com' not in machine names: ip-10-0-7-124.us-west-2.compute.internal ip-10-0-7-124.us-west-2.compute.internal
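
The rejection in the log above can be read as a set-membership check: every DNS name requested in the CSR has to appear among the names known for the Machine. The following is only a simplified sketch of that idea, not the machine-approver's actual code. The function name, the address types attached to the two logged names, and the assumption that the CSR also requested the internal name are all illustrative.

```python
def unauthorized_dns_names(csr_dns_names, machine_addresses):
    """Return the CSR DNS names that match no DNS-type address on the Machine."""
    dns_types = {"InternalDNS", "ExternalDNS", "Hostname"}
    known = {a["address"] for a in machine_addresses if a["type"] in dns_types}
    return [name for name in csr_dns_names if name not in known]


# Machine names as printed in the log: the internal name, listed twice.
# The address types are an assumption made for this sketch.
machine_addresses = [
    {"type": "InternalDNS", "address": "ip-10-0-7-124.us-west-2.compute.internal"},
    {"type": "Hostname", "address": "ip-10-0-7-124.us-west-2.compute.internal"},
]

# The public name comes from the log; including the internal name is an assumption.
csr_dns_names = [
    "ip-10-0-7-124.us-west-2.compute.internal",
    "ec2-54-203-167-77.us-west-2.compute.amazonaws.com",
]

rejected = unauthorized_dns_names(csr_dns_names, machine_addresses)
print(rejected)
```

Under these assumptions the public EC2 name is the only one rejected, which matches the "not in machine names" message.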


Manually approving the pending CSRs restores `oc debug node` functionality on that cluster:

oc adm certificate approve certificatesigningrequest.certificates.k8s.io/csr-zhp5l

After the certs are manually approved, `oc debug node/<node_ip>` succeeds.
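
When several CSRs are pending, it may help to list them before approving each one. The sketch below assumes the output of `oc get csr -o json` has already been parsed into a dictionary, and treats a CSR as pending when it carries neither an Approved nor a Denied condition. The helper name and the sample data are hypothetical (only `csr-zhp5l` comes from this report), and the script prints commands rather than running them.

```python
def pending_csr_names(csr_list):
    """Return names of CSRs with no Approved or Denied condition."""
    names = []
    for item in csr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical, heavily trimmed stand-in for `oc get csr -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "csr-zhp5l"}, "status": {}},
        {
            "metadata": {"name": "csr-example-approved"},
            "status": {"conditions": [{"type": "Approved"}]},
        },
    ]
}

for name in pending_csr_names(sample):
    # Print the approval command for review instead of executing it.
    print(f"oc adm certificate approve {name}")
```

Reviewing the printed commands first avoids blindly approving a CSR that was left pending for a legitimate reason.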



Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.4.0-0.nightly-2020-01-24-141203
Server Version: 4.4.0-0.nightly-2020-01-24-141203
Kubernetes Version: v1.17.1


How reproducible:
All the time on OCP 4.4

Steps to Reproduce:
1. Install OCP 4.4 on an AWS IPI cluster (3 master, 3 worker nodes) with the 4.4.0-0.nightly-2020-01-24-141203 payload
2. oc get machineset -n openshift-machine-api -o yaml > first_worker_node_machineset.yaml
3. cp first_worker_node_machineset.yaml new_workload_node_machineset.yaml
4. vim new_workload_node_machineset.yaml

   Edit the machineset name and labels, and ensure that the providerSpec value contains publicIp: true
5. oc create -f new_workload_node_machineset.yaml
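
The manual edit in step 4 can also be expressed as a small transformation, sketched here under several assumptions: the input is a single MachineSet object already loaded into a dictionary (not the List that `oc get machineset -o yaml` returns without a name), and the machineset label key is the one used by the machine API. The function name and the example names are hypothetical.

```python
import copy

MACHINESET_LABEL = "machine.openshift.io/cluster-api-machineset"


def derive_public_ip_machineset(machineset, new_name):
    """Copy a MachineSet dict, rename it, and set publicIp: true."""
    new = copy.deepcopy(machineset)

    # Rename and drop server-populated fields so the copy can be created.
    meta = new["metadata"]
    meta["name"] = new_name
    for key in ("uid", "resourceVersion", "selfLink", "creationTimestamp", "generation"):
        meta.pop(key, None)
    new.pop("status", None)

    # Selector and template labels must agree and point at the new name.
    spec = new["spec"]
    spec["selector"]["matchLabels"][MACHINESET_LABEL] = new_name
    spec["template"]["metadata"]["labels"][MACHINESET_LABEL] = new_name

    # The setting this report is about.
    spec["template"]["spec"]["providerSpec"]["value"]["publicIp"] = True
    return new


# Hypothetical, trimmed MachineSet used only to exercise the function.
original = {
    "metadata": {"name": "example-worker-a", "uid": "1234", "resourceVersion": "1"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {MACHINESET_LABEL: "example-worker-a"}},
        "template": {
            "metadata": {"labels": {MACHINESET_LABEL: "example-worker-a"}},
            "spec": {"providerSpec": {"value": {"instanceType": "m4.large"}}},
        },
    },
    "status": {"replicas": 1},
}

derived = derive_public_ip_machineset(original, "example-workload-a")
print(derived["metadata"]["name"])
```

Working on a deep copy leaves the original definition untouched, mirroring the `cp` in step 3.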


Actual results:
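Several CSRs remain pending approval, and `oc debug node/<new_workload_node_ip>` fails with the error shown in the description above:
Error from server: error dialing backend: remote error: tls: internal error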



Expected results:
CSRs should be automatically approved when scaling the cluster and adding a new workload node for automation.


Additional info:
Links to the must-gather logs, machineset, and machine-approver logs will be provided in the next comment.

Comment 2 Brad Ison 2020-02-04 15:18:43 UTC
This was a regression introduced by:
https://github.com/openshift/cluster-api-provider-aws/pull/285

It has been fixed with:
https://github.com/openshift/cluster-api-provider-aws/pull/288

Comment 4 Jianwei Hou 2020-02-06 05:52:23 UTC
Verified in 4.4.0-0.nightly-2020-02-05-181112

A machine with publicIp set to true is provisioned, and its CSRs are approved automatically.

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: running
  creationTimestamp: "2020-02-06T05:38:39Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: qe-jhou06-f8dr2-worker-us-east-2a-publicip-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: qe-jhou06-f8dr2
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: qe-jhou06-f8dr2-worker-us-east-2a
    machine.openshift.io/instance-type: m4.large
    machine.openshift.io/region: us-east-2
    machine.openshift.io/zone: us-east-2a
  name: qe-jhou06-f8dr2-worker-us-east-2a-publicip-ltrpk
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: qe-jhou06-f8dr2-worker-us-east-2a-publicip
    uid: cbf4296b-948a-43ca-bc74-2e1c69b6ea3a
  resourceVersion: "58548"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/qe-jhou06-f8dr2-worker-us-east-2a-publicip-ltrpk
  uid: b2c395ac-b0e5-45c7-819b-1056b34c8c39
spec:
  metadata:
    creationTimestamp: null
  providerID: aws:///us-east-2a/i-002b70030a8d0af6c
  providerSpec:
    value:
      ami:
        id: ami-0a8ba019bc9d4bd64
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: qe-jhou06-f8dr2-worker-profile
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-2a
        region: us-east-2
      publicIp: true
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - qe-jhou06-f8dr2-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - qe-jhou06-f8dr2-private-us-east-2a
      tags:
      - name: kubernetes.io/cluster/qe-jhou06-f8dr2
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: 10.0.131.133
    type: InternalIP
  - address: 3.135.218.125
    type: ExternalIP
  - address: ip-10-0-131-133.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-131-133.us-east-2.compute.internal
    type: Hostname
  - address: ec2-3-135-218-125.us-east-2.compute.amazonaws.com
    type: ExternalDNS
  lastUpdated: "2020-02-06T05:43:21Z"
  nodeRef:
    kind: Node
    name: ip-10-0-131-133.us-east-2.compute.internal
    uid: cecd777b-8291-46fd-8a43-a41a77b3a24a
  phase: Running
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2020-02-06T05:38:41Z"
      lastTransitionTime: "2020-02-06T05:38:41Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-002b70030a8d0af6c
    instanceState: running
    kind: AWSMachineProviderStatus


oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-hxw7z   9m35s   system:node:ip-10-0-131-133.us-east-2.compute.internal                      Approved,Issued
csr-vmk49   9m48s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

