Bug 1664585

Summary: MachineHealthCheck controller can not find machine annotation for nodes
Product: OpenShift Container Platform
Component: Cloud Compute
Version: 4.1.0
Reporter: Jianwei Hou <jhou>
Assignee: Alberto <agarcial>
QA Contact: Jianwei Hou <jhou>
CC: jchaloup, wsun, zhsun
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:41:38 UTC
Type: Bug

Description Jianwei Hou 2019-01-09 08:53:53 UTC
Description of problem:
The machine-healthcheck controller reports that a node annotated with a machine name has no machine annotation.

Version-Release number of selected component (if applicable):
bin/openshift-install v0.9.0-master-9-g31662509d435d0e94415c3e9b0093a441a5e7563
4.0.0-0.alpha-2019-01-09-045210

How reproducible:
Always

Steps to Reproduce:
1. Create a MachineHealthCheck CR in the openshift-cluster-api namespace.

apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: example
spec:
  selector:
    matchLabels:
      sigs.k8s.io/cluster-api-cluster: jhou
      sigs.k8s.io/cluster-api-machine-role: worker
      sigs.k8s.io/cluster-api-machine-type: worker
      sigs.k8s.io/cluster-api-machineset: jhou-worker-us-east-1b

2. Stop the kubelet on the node that is annotated with machine jhou-worker-us-east-1b-rnlln; the node becomes NotReady:
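The report does not show how the kubelet was stopped. One common way on an RHCOS node is over SSH; a minimal sketch, where the `core` login user and the `kubelet.service` unit name are assumptions, not taken from this report:

```shell
# Hypothetical sketch: build the command that would stop the kubelet on
# the affected node. The "core" user and the systemd unit name
# "kubelet.service" are assumptions; adjust for your environment.
node="ip-10-0-154-187.ec2.internal"
cmd="ssh core@${node} sudo systemctl stop kubelet.service"
echo "${cmd}"  # run this from a host with SSH access to the node
```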

oc get nodes|grep ip-10-0-154-187.ec2.internal                                                  
ip-10-0-154-187.ec2.internal   NotReady   worker    2h        v1.11.0+f67f40dbad

Verify that the node has the machine annotation:

oc get node ip-10-0-154-187.ec2.internal -o yaml|grep 'cluster.k8s.io/machine'                  
    cluster.k8s.io/machine: openshift-cluster-api/jhou-worker-us-east-1b-rnlln

Verify that the machine's labels match the MachineHealthCheck's matchLabels:
oc get machine jhou-worker-us-east-1b-rnlln -o yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  creationTimestamp: 2019-01-09T06:26:17Z
  finalizers:
  - machine.cluster.k8s.io
  generateName: jhou-worker-us-east-1b-
  generation: 1
  labels:
    sigs.k8s.io/cluster-api-cluster: jhou
    sigs.k8s.io/cluster-api-machine-role: worker
    sigs.k8s.io/cluster-api-machine-type: worker
    sigs.k8s.io/cluster-api-machineset: jhou-worker-us-east-1b
  name: jhou-worker-us-east-1b-rnlln
  namespace: openshift-cluster-api
  ownerReferences:
  - apiVersion: cluster.k8s.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: jhou-worker-us-east-1b
    uid: a39c4cb7-13d4-11e9-afab-0a5d67725934
  resourceVersion: "147179"
  selfLink: /apis/cluster.k8s.io/v1alpha1/namespaces/openshift-cluster-api/machines/jhou-worker-us-east-1b-rnlln
  uid: 7862cf0f-13d7-11e9-9f8f-0a609fedb69e
spec:
  metadata:
    creationTimestamp: null
  providerConfig:
    value:
      ami:
        arn: null
        filters: null
        id: ami-0acd9649a24fe3a19
      apiVersion: awsproviderconfig.k8s.io/v1alpha1
      credentialsSecret: null
      deviceIndex: 0
      iamInstanceProfile:
        arn: null
        filters: null
        id: jhou-worker-profile
      instanceType: m4.large
      keyName: null
      kind: AWSMachineProviderConfig
      loadBalancers: null
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-1b
        region: us-east-1
      publicIp: null
      securityGroups:
      - arn: null
        filters:
        - name: tag:Name
          values:
          - jhou_worker_sg
        id: null
      subnet:
        arn: null
        filters:
        - name: tag:Name
          values:
          - jhou-worker-us-east-1b
        id: null
      tags:
      - name: openshiftClusterID
        value: c3223d2a-8b3c-43e5-9d07-e33ed7be6a6d
      - name: kubernetes.io/cluster/jhou
        value: owned
      userDataSecret:
        name: worker-user-data
  providerSpec: {}
  versions:
    kubelet: ""
status:
  addresses:
  - address: 10.0.154.187
    type: InternalIP
  - address: ""
    type: ExternalDNS
  - address: ip-10-0-154-187.ec2.internal
    type: InternalDNS
  lastUpdated: 2019-01-09T07:56:54Z
  nodeRef:
    kind: Node
    name: ip-10-0-154-187.ec2.internal
    uid: a100e30c-13d7-11e9-8395-024c79168cf2
  providerStatus:
    apiVersion: awsproviderconfig.k8s.io/v1alpha1
    conditions:
    - lastProbeTime: 2019-01-09T06:26:20Z
      lastTransitionTime: 2019-01-09T06:26:20Z
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0457b5bdaa3d671df
    instanceState: running
    kind: AWSMachineProviderStatus
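For context, a MachineHealthCheck targets a machine when every key/value pair in its matchLabels also appears in the machine's labels. A minimal shell sketch of that subset check, using the label values from the output above (the controller does this in Go; this is only an illustration):

```shell
# Sketch of the matchLabels subset check: the machine matches when every
# selector pair is present in the machine's labels with the same value.
machine_labels="sigs.k8s.io/cluster-api-cluster=jhou
sigs.k8s.io/cluster-api-machine-role=worker
sigs.k8s.io/cluster-api-machine-type=worker
sigs.k8s.io/cluster-api-machineset=jhou-worker-us-east-1b"

selector="sigs.k8s.io/cluster-api-cluster=jhou
sigs.k8s.io/cluster-api-machine-role=worker
sigs.k8s.io/cluster-api-machine-type=worker
sigs.k8s.io/cluster-api-machineset=jhou-worker-us-east-1b"

match="yes"
while read -r pair; do
  # every selector pair must appear verbatim among the machine's labels
  printf '%s\n' "$machine_labels" | grep -qxF "$pair" || match="no"
done <<EOF
$selector
EOF
echo "$match"  # "yes": this machine is covered by the MachineHealthCheck
```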

3. Monitor the machine-healthcheck container
oc logs -f clusterapi-manager-controllers-b9cc8df7-t9pjh -c machine-healthcheck|grep ip-10-0-154-187


Actual results:

```
I0109 07:53:14.104234       1 machinehealthcheck_controller.go:72] Reconciling MachineHealthCheck triggered by /ip-10-0-154-187.ec2.internal
W0109 07:53:14.104433       1 machinehealthcheck_controller.go:91] No machine annotation for node ip-10-0-154-187.ec2.internal
I0109 07:56:54.433932       1 machinehealthcheck_controller.go:72] Reconciling MachineHealthCheck triggered by /ip-10-0-154-187.ec2.internal
W0109 07:56:54.434053       1 machinehealthcheck_controller.go:91] No machine annotation for node ip-10-0-154-187.ec2.internal
```
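The node annotation stores the machine reference as `<namespace>/<machine-name>` (see the `cluster.k8s.io/machine` value in step 2). A sketch of the split the controller presumably performs before looking up the Machine object; the variable names are illustrative, not from the controller source:

```shell
# Sketch: split the node's machine annotation value into namespace and
# machine name, as the controller must do before fetching the Machine.
ann="openshift-cluster-api/jhou-worker-us-east-1b-rnlln"
namespace="${ann%%/*}"
name="${ann#*/}"
echo "$namespace $name"
```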


Expected results:
The controller detects the machine annotation on node ip-10-0-154-187.ec2.internal instead of logging "No machine annotation".

Additional info:

Comment 1 Jianwei Hou 2019-01-10 07:24:12 UTC
Also reproducible on 4.0.0-0.nightly-2019-01-10-005204

Comment 2 Jan Chaloupka 2019-01-15 09:21:05 UTC
Upstream PR: https://github.com/openshift/machine-api-operator/pull/175

Comment 4 sunzhaohua 2019-01-25 09:21:30 UTC
Verified in version 4.0.0-0.nightly-2019-01-25-034943

$ oc logs -f clusterapi-manager-controllers-595cdd7745-2fdlj -c machine-healthcheck
```
I0125 09:19:49.719644       1 machinehealthcheck_controller.go:135] Machine zhsun-worker-us-east-2c-8f8rt has no MachineHealthCheck associated
I0125 09:19:52.231169       1 machinehealthcheck_controller.go:73] Reconciling MachineHealthCheck triggered by /ip-10-0-134-135.us-east-2.compute.internal
I0125 09:19:52.231401       1 machinehealthcheck_controller.go:96] Node ip-10-0-134-135.us-east-2.compute.internal is annotated with machine openshift-cluster-api/zhsun-worker-us-east-2a-fw2wh
I0125 09:19:52.231589       1 machinehealthcheck_controller.go:153] Initialising remediation logic for machine zhsun-worker-us-east-2a-fw2wh
I0125 09:19:52.231710       1 machinehealthcheck_controller.go:190] No remediaton action was taken. Machine zhsun-worker-us-east-2a-fw2wh with node ip-10-0-134-135.us-east-2.compute.internal is healthy
I0125 09:19:53.981858       1 machinehealthcheck_controller.go:73] Reconciling MachineHealthCheck triggered by /ip-10-0-28-232.us-east-2.compute.internal
I0125 09:19:53.982016       1 machinehealthcheck_controller.go:96] Node ip-10-0-28-232.us-east-2.compute.internal is annotated with machine openshift-cluster-api/zhsun-master-1
I0125 09:19:53.982317       1 machinehealthcheck_controller.go:135] Machine zhsun-master-1 has no MachineHealthCheck associated
I0125 09:19:54.788446       1 machinehealthcheck_controller.go:73] Reconciling MachineHealthCheck triggered by /ip-10-0-34-201.us-east-2.compute.internal
I0125 09:19:54.788507       1 machinehealthcheck_controller.go:96] Node ip-10-0-34-201.us-east-2.compute.internal is annotated with machine openshift-cluster-api/zhsun-master-2
```

Comment 7 errata-xmlrpc 2019-06-04 10:41:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758