Bug 2065727 - Scaling down a HyperShift cluster ends with the BMH shut down and in maintenance mode
Summary: Scaling down a HyperShift cluster ends with the BMH shut down and in maintenance mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Dmitry Tantsur
QA Contact: Pedro Amoedo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-18 15:12 UTC by Javi Polo
Modified: 2023-01-17 19:55 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:47:48 UTC
Target Upstream Version:
Embargoed:


Attachments
metal3-ironic-conductor.log (229.88 KB, text/x-matlab)
2022-03-18 15:12 UTC, Javi Polo
no flags Details
metal3-baremetal-operator.log (131.88 KB, text/x-matlab)
2022-03-18 15:13 UTC, Javi Polo
no flags Details
cluster-api-agent-provider.log (562 bytes, text/plain)
2022-03-18 15:14 UTC, Javi Polo
no flags Details
assisted-service.log (1.90 KB, text/plain)
2022-03-18 15:15 UTC, Javi Polo
no flags Details
BareMetalHost.yaml CustomResource (5.23 KB, text/plain)
2022-03-18 15:15 UTC, Javi Polo
no flags Details
full metal3-baremetal-operator.log (518.61 KB, text/plain)
2022-03-18 17:59 UTC, Javi Polo
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github metal3-io baremetal-operator pull 1101 0 None open Handle maintenance more gracefully 2022-03-25 09:34:17 UTC
Github openshift baremetal-operator pull 239 0 None Merged Merge upstream 2022-08-29 14:22:23 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:55:25 UTC

Description Javi Polo 2022-03-18 15:12:53 UTC
Created attachment 1866623 [details]
metal3-ironic-conductor.log

Description of problem:

I'm having problems when scaling down a node in a HyperShift cluster: instead of the BMH being rebooted into the assisted-installer discovery ISO, it is shut down and set into maintenance mode.

Long explanation:
I set up a HUB cluster using dev-scripts and installed assisted-service, hive and hypershift there.
I create an infraenv, and create BMHs with the infraenvs.agent-install.openshift.io: myinfraenv label.
Then I create an agent HyperShift cluster and scale it up.
The problem appears when I try to scale it down.
The expected behaviour would be that a worker node is removed from the cluster, and the BMH is then reprovisioned using assisted's discovery ISO.
What actually happens is that the node is shut down.
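
For reference, this is how I inspect the state on the hub (a rough sketch; whether the baremetal CLI is available inside the metal3 pod, and the exact container names, vary by release):

# BMH state as seen by the baremetal-operator
oc get bmh -n $NAMESPACE
# Ironic's own view of the node, including the maintenance flag
oc exec -n openshift-machine-api deploy/metal3 -c metal3-ironic-conductor -- \
  baremetal node list --fields uuid name provision_state maintenance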


How reproducible:
Always

Steps to Reproduce:
Start a HUB cluster using dev-scripts. This is my config_$USER.sh file:

--- config_$USER.sh ---
export WORKING_DIR=/home/dev-scripts
#Go to https://console-openshift-console.apps.ci.l2s4.p1.openshiftapps.com/, click on your name in the top right, and use the token from the login command
export CI_TOKEN='mysecrettoken'

export CLUSTER_NAME="hub"
export BASE_DOMAIN="redhat.com"

export OPENSHIFT_VERSION=4.9.17
export OPENSHIFT_RELEASE_TYPE=ga
export IP_STACK=v4
export BMC_DRIVER=redfish-virtualmedia
export REDFISH_EMULATOR_IGNORE_BOOT_DEVICE=True

export NUM_WORKERS=0
export MASTER_VCPU=8
export MASTER_MEMORY=20000
export MASTER_DISK=60

export NUM_EXTRA_WORKERS=2
export WORKER_VCPU=8
export WORKER_MEMORY=35000
export WORKER_DISK=150

export VM_EXTRADISKS=true
export VM_EXTRADISKS_LIST="vda vdb"
export VM_EXTRADISKS_SIZE=30G
--- END FILE ---

Steps to reproduce:

NAMESPACE=mynamespace
CLUSTERNAME=mycluster
INFRAENV=myinfraenv
BASEDOMAIN=redhat.com
SSHKEY=~/.ssh/id_rsa.pub
HYPERSHIFT_IMAGE=quay.io/hypershift/hypershift-operator:latest

make all assisted
cp ocp/hub/auth/kubeconfig ~/.kube/config
oc patch storageclass assisted-service -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'                                                              
oc patch provisioning provisioning-configuration --type merge -p '{"spec":{"watchAllNamespaces": true}}'                                                                                      

alias hypershift="podman run --net host --rm --entrypoint /usr/bin/hypershift -e KUBECONFIG=/working_dir/ocp/hub/auth/kubeconfig -v $HOME/.ssh:/root/.ssh -v $(pwd):/working_dir $HYPERSHIFT_IMAGE"
hypershift install --hypershift-image $HYPERSHIFT_IMAGE
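
# Sanity check (not part of the original steps): the operator should be running
oc get pods -n hypershift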

oc create namespace $NAMESPACE

# Create pull secret
cat << EOF | oc apply -n $NAMESPACE -f -
apiVersion: v1
kind: Secret
type: kubernetes.io/dockerconfigjson
metadata:
  name: my-pull-secret
data:
  .dockerconfigjson: $(cat pull_secret.json | base64 -w0)
EOF

# Create infraenv
cat << EOF | oc apply -n $NAMESPACE -f -
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: $INFRAENV
spec:
  pullSecretRef:
    name: my-pull-secret
  sshAuthorizedKey: $(cat $SSHKEY)
EOF

# Create BMH
cat  ocp/hub/saved-assets/assisted-installer-manifests/06-extra-host-manifests.yaml | sed "s/assisted-installer/$NAMESPACE/g; s/myinfraenv/$INFRAENV/g" | oc apply -f -
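
# Once the BMHs boot the discovery ISO and register, Agent CRs should show up.
# Sanity check only, not required for the reproducer:
oc get bmh -n $NAMESPACE
oc get agent -n $NAMESPACE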

# Create hypershift cluster
INFRAID=$(oc get infraenv $INFRAENV -n $NAMESPACE -o yaml| yq -r .status.isoDownloadURL | awk -F/ '{print $NF}'| cut -d \? -f 1)
hypershift create cluster agent --name $CLUSTERNAME --base-domain $BASEDOMAIN --pull-secret /working_dir/pull_secret.json  --ssh-key $SSHKEY --agent-namespace $NAMESPACE --namespace $NAMESPACE --infra-id=$INFRAID --control-plane-operator-image=$HYPERSHIFT_IMAGE
KUBECONFIG=ocp/hub/auth/kubeconfig:<(hypershift create kubeconfig) kubectl config view --flatten > ~/.kube/config

# Scale up
oc scale nodepool/$CLUSTERNAME -n $NAMESPACE --replicas=2

# Force EFI next-boot to be CD
ssh core@extraworker-0 sudo efibootmgr -n1
ssh core@extraworker-1 sudo efibootmgr -n1


# Scale down
oc scale nodepool/$CLUSTERNAME -n $NAMESPACE --replicas=1
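
# To observe the failure, watch the BMH after the scale-down. Expected is a
# round trip back through provisioning of the discovery ISO; instead the host
# just powers off:
oc get bmh -n $NAMESPACE -w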


Actual results:
The scaled-down node is shut down, and in Ironic it appears in maintenance mode.

Expected results:
The scaled-down node is rebooted into the assisted-service discovery ISO.

Additional info:

Related slack threads
https://coreos.slack.com/archives/C02N0RJR170/p1647513139893809
https://coreos.slack.com/archives/CFP6ST0A3/p1647449263684629

I also attach logs from metal3-ironic-conductor, metal3-baremetal-operator, cluster-api-agent-provider and assisted-service

I'll also attach an affected BareMetalHost CR

Comment 1 Javi Polo 2022-03-18 15:13:58 UTC
Created attachment 1866624 [details]
metal3-baremetal-operator.log

Comment 2 Javi Polo 2022-03-18 15:14:43 UTC
Created attachment 1866625 [details]
cluster-api-agent-provider.log

Comment 3 Javi Polo 2022-03-18 15:15:10 UTC
Created attachment 1866626 [details]
assisted-service.log

Comment 4 Javi Polo 2022-03-18 15:15:41 UTC
Created attachment 1866627 [details]
BareMetalHost.yaml CustomResource

Comment 5 Dmitry Tantsur 2022-03-18 15:28:05 UTC
Findings so far: when scaling down, BMO first goes down the Ironic node deletion path. For that, it sets the maintenance flag on the Ironic node and reschedules the reconciliation. On the next iteration, BMO goes down the deprovisioning path (as expected). Since the node is still in maintenance, the provisioning action is not allowed on it, and the process gets stuck.

1) I'm not sure why BMO tries the node deletion (but never completes it). My guess is that it has something to do with the detached annotation removal.
2) We need to enforce the maintenance mode or lack of it on Ironic nodes in BMO.
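
As a stop-gap, clearing the flag by hand should unblock deprovisioning (untested sketch; how you reach the baremetal CLI and obtain the node UUID depends on the deployment):

oc exec -n openshift-machine-api deploy/metal3 -c metal3-ironic-conductor -- \
  baremetal node maintenance unset <node-uuid>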

Comment 6 Javi Polo 2022-03-18 17:59:04 UTC
Created attachment 1866654 [details]
full metal3-baremetal-operator.log

Comment 7 Dmitry Tantsur 2022-03-25 17:12:19 UTC
So far I've been unable to artificially reproduce this problem by deploying a live ISO on a BMH, detaching it and then deprovisioning. Maybe it's something that is only present on 4.9 (I'm on master, i.e. 4.11), or my testing is missing some critical step. Or it's timing-sensitive.

Anyway, https://github.com/metal3-io/baremetal-operator/pull/1101 should make the maintenance mode less of a problem at least. I still hope to get to the root cause eventually, but I'm also leaving for PTO soon.

Comment 8 Javi Polo 2022-03-28 10:33:17 UTC
I'll try to build a custom BMO image with your changes and spawn a new cluster to see if the behaviour improves :)

Anyway, since I can easily reproduce this behaviour, I can give you access to an affected cluster when you're back from PTO, if you want, so you can debug it better.
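
Roughly what I have in mind for swapping the image (sketch only; deployment/container names are from my hub and the image reference is a placeholder for my custom build):

# Stop the operators that would revert a manual image change
oc scale deployment/cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment/cluster-baremetal-operator -n openshift-machine-api --replicas=0
# Point the metal3 deployment at the custom BMO build
oc set image deployment/metal3 -n openshift-machine-api \
  metal3-baremetal-operator=quay.io/example/baremetal-operator:pr1101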

Comment 9 Javi Polo 2022-05-30 13:50:21 UTC
After reproducing the behaviour, I replaced the baremetal-operator image with one including Dmitry's patch, and now the nodes don't get stuck in maintenance mode :)

Comment 10 Javi Polo 2022-06-09 10:28:48 UTC
@dtantsur do you want to keep this open to get to the root issue, or is the workaround you wrote enough to close this?

As a user, I can no longer observe the wrong behaviour.

Comment 13 Pedro Amoedo 2022-11-02 10:53:46 UTC
The workaround that did the trick for this BZ is already present in PR #239[1] via commit d46e118fe5a055d1d84a9b62850e7d1456acee8a[2]; so far so good, moving to VERIFIED.

[1] - https://github.com/openshift/baremetal-operator/pull/239
[2] - https://github.com/openshift/baremetal-operator/pull/239/commits/d46e118fe5a055d1d84a9b62850e7d1456acee8a

Comment 16 errata-xmlrpc 2023-01-17 19:47:48 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

