Bug 1862180 - [OCP 4.6] [Machine Health Check] remediation on worker: machine not returning "Provisioned as node", rather "Provisioned": remediation.metal3.io/powered-off-for-remediation not being removed: machine not associated with Node: Remaining Unhealthy
Summary: [OCP 4.6] [Machine Health Check] remediation on worker: machine not returning...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nir
QA Contact: mlammon
URL:
Whiteboard:
Depends On: 1866719
Blocks:
 
Reported: 2020-07-30 16:33 UTC by mlammon
Modified: 2021-02-06 07:09 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-06 13:51:27 UTC
Target Upstream Version:
Embargoed:



Description mlammon 2020-07-30 16:33:51 UTC
Description of problem:
Remediation on a worker: the machine does not return to "Provisioned as node" but instead stays in "Provisioned"; the remediation.metal3.io/powered-off-for-remediation annotation is not removed, the machine is not associated with a Node, and it remains Unhealthy.

Version-Release number of selected component (if applicable):
4.6.0-0.ci-2020-07-21-114552

How reproducible:
100% 

Steps to Reproduce:
1. Install OCP 4.6
2. Create a MachineHealthCheck like the one below (a sketch of applying it follows the YAML)

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  maxUnhealthy: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
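
(Assuming the YAML above is saved locally, e.g. as mhc-workers.yaml, it can be applied and verified along these lines:)

# apply the MachineHealthCheck and confirm it was created
oc apply -f mhc-workers.yaml
oc get mhc workers -n openshift-machine-api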


3. Trigger a Ready=Unknown node condition by suspending the virtual worker:
virsh suspend worker-0-0
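
(To confirm the node actually went unhealthy, the Ready condition can be checked; with default node-controller settings it should flip to Unknown within roughly 40s. A sketch:)

# watch the node list, or query the Ready condition directly
oc get nodes -w
oc get node worker-0-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'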

Actual results:

[root@sealusa6 ~]# oc get mhc -n openshift-machine-api -w
NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
workers   1              4                  3

oc describe machine ocp-edge-cluster-rdu1-0-worker-0-tm62r -n openshift-machine-api  |less


Name:         ocp-edge-cluster-rdu1-0-worker-0-tm62r
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=ocp-edge-cluster-rdu1-0
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=ocp-edge-cluster-rdu1-0-worker-0
Annotations:  host.metal3.io/external-remediation:
              metal3.io/BareMetalHost: openshift-machine-api/openshift-worker-0-0
              remediation.metal3.io/powered-off-for-remediation:
API Version:  machine.openshift.io/v1beta1
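
(The same annotations can be queried without paging through describe output; a sketch using the machine name from this cluster:)

# dump only the machine's annotations
oc get machine ocp-edge-cluster-rdu1-0-worker-0-tm62r -n openshift-machine-api \
  -o jsonpath='{.metadata.annotations}'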



[root@sealusa6 ~]# oc get machines -n openshift-machine-api
NAME                                     PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-rdu1-0-master-0         Running                              20h
ocp-edge-cluster-rdu1-0-master-1         Running                              20h
ocp-edge-cluster-rdu1-0-master-2         Running                              20h
ocp-edge-cluster-rdu1-0-worker-0-44ldb   Running                              20h
ocp-edge-cluster-rdu1-0-worker-0-qfc2t   Running                              20h
ocp-edge-cluster-rdu1-0-worker-0-tm62r   Provisioned                          20h
ocp-edge-cluster-rdu1-0-worker-0-z4ptg   Running                              20h
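
(A machine stuck in "Provisioned" suggests status.nodeRef was never set again, which matches the "machine not associated with Node" symptom; this can be checked directly, e.g.:)

# empty output here means the machine is not linked to any Node
oc get machine ocp-edge-cluster-rdu1-0-worker-0-tm62r -n openshift-machine-api \
  -o jsonpath='{.status.nodeRef.name}'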

Expected results:
Full remediation and healthy node restored:

Power off the host
Add poweredOffForRemediation annotation to the unhealthy Machine
Delete the node
Power on the host
Wait for the node to come up (by waiting for the node to be registered in the cluster)
Remove the poweredOffForRemediation annotation and the MAO's machine-unhealthy annotation (a verification sketch follows)
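
(If the flow completes, the annotation removal in the last step can be verified along these lines, e.g.:)

# should print nothing once remediation finishes and the annotation is gone
oc get machine ocp-edge-cluster-rdu1-0-worker-0-tm62r -n openshift-machine-api \
  -o yaml | grep powered-off-for-remediation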



LAST SEEN   TYPE     REASON                    OBJECT                                           MESSAGE
28m         Normal   DetectedUnhealthy         machine/ocp-edge-cluster-rdu1-0-worker-0-tm62r   Machine openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/worker-0-0 has unhealthy node worker-0-0
27m         Normal   ExternalAnnotationAdded   machine/ocp-edge-cluster-rdu1-0-worker-0-tm62r   Requesting external remediation of node associated with machine openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/worker-0-0
3m2s        Normal   DetectedUnhealthy         machine/ocp-edge-cluster-rdu1-0-worker-0-tm62r   Machine openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/ has unhealthy node



I0730 16:10:25.824985       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0730 16:10:25.825167       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0730 16:10:25.825453       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/: health checking
I0730 16:10:25.825559       1 machinehealthcheck_controller.go:292] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/: is likely to go unhealthy in 10m0.174451529s
I0730 16:10:25.825656       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-44ldb/worker-0-2: health checking
I0730 16:10:25.825741       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-z4ptg/worker-0-3: health checking
I0730 16:10:25.825800       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-qfc2t/worker-0-1: health checking
I0730 16:10:25.833595       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 4,  maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0730 16:10:25.833752       1 machinehealthcheck_controller.go:229] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 10m0.174451529s


Manually removing the powered-off-for-remediation annotation from the machine did not resolve the problem.
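
(For reference, removing such an annotation by hand can be done with oc annotate and a trailing dash, e.g.:)

# strip the remediation annotation from the machine (trailing '-' removes the key)
oc annotate machine ocp-edge-cluster-rdu1-0-worker-0-tm62r -n openshift-machine-api \
  remediation.metal3.io/powered-off-for-remediation-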

I was unable to pull must-gather while filing this bug, but the issue is 100% reproducible:
[root@sealusa6 ~]# oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552@sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19
[must-gather      ] OUT namespace/openshift-must-gather-6qpf4 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xnppw created
[must-gather      ] OUT pod for plug-in image registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552@sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19 created
[must-gather-pqtg5] OUT gather did not start: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = Error reading manifest sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19 in registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552: manifest unknown: manifest unknown
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xnppw deleted
[must-gather      ] OUT namespace/openshift-must-gather-6qpf4 deleted
error: gather did not start for pod must-gather-pqtg5: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = Error reading manifest sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19 in registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552: manifest unknown: manifest unknown

Comment 1 mlammon 2020-07-30 17:37:26 UTC
oc get machine -n openshift-machine-api
NAME                                     PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-rdu1-0-master-0         Running                              21h
ocp-edge-cluster-rdu1-0-master-1         Running                              21h
ocp-edge-cluster-rdu1-0-master-2         Running                              21h
ocp-edge-cluster-rdu1-0-worker-0-44ldb   Running                              21h
ocp-edge-cluster-rdu1-0-worker-0-qfc2t   Running                              21h
ocp-edge-cluster-rdu1-0-worker-0-tm62r   Provisioned                          21h
ocp-edge-cluster-rdu1-0-worker-0-z4ptg   Running                              21h


oc get no -owide
NAME         STATUS   ROLES    AGE   VERSION           INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME
master-0-0   Ready    master   21h   v1.18.3+a34fde4   192.168.123.129   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
master-0-1   Ready    master   21h   v1.18.3+a34fde4   192.168.123.142   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
master-0-2   Ready    master   21h   v1.18.3+a34fde4   192.168.123.127   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-0   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.114   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-1   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.137   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-2   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.138   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-3   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.135   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev


oc get bmh  -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                 BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-rdu1-0-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/2c84feb5-8bc8-4fac-81cc-c835def78f31                      true
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-rdu1-0-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/dc7385b7-ba20-4528-803b-3ea4fbc97412                      true
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-rdu1-0-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/dca26a1c-8405-4970-bf53-85eb7707e341                      true
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-tm62r   redfish://192.168.123.1:8000/redfish/v1/Systems/0697c5e3-6bf0-4c2f-8cc3-4b5ea9abccba   unknown            true
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-qfc2t   redfish://192.168.123.1:8000/redfish/v1/Systems/4029924f-2aad-4630-bf68-2046e84c4879   unknown            true
openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-44ldb   redfish://192.168.123.1:8000/redfish/v1/Systems/276cb58c-649c-4552-8f0e-c6183f98ce26   unknown            true
openshift-worker-0-3   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-z4ptg   redfish://192.168.123.1:8000/redfish/v1/Systems/ca12cfa2-0b5f-42c2-9ba6-1555cf58eab3   unknown            true
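
(Since external remediation works by power-cycling the host, the BareMetalHost's reported power state is worth checking too; a sketch:)

# 'true' means the host is reported powered on
oc get bmh openshift-worker-0-0 -n openshift-machine-api -o jsonpath='{.status.poweredOn}'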

oc get po -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-8464b98f6b-n6p2k   2/2     Running   0          21h
machine-api-controllers-644db4b6c5-8nxng       7/7     Running   0          21h
machine-api-operator-67488f57b-q5hlq           2/2     Running   0          21h
metal3-565d56c698-wwjhz                        8/8     Running   0          21h

Comment 2 Nir 2020-08-11 05:29:33 UTC
@mlammon - can you try to reproduce again with an up-to-date build?
I believe this was resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1866719

Comment 3 mlammon 2020-09-16 12:40:12 UTC
No longer seeing the issue; assuming it was fixed along with https://bugzilla.redhat.com/show_bug.cgi?id=1866719

