Bug 2043672

Summary: [MAPO] root volumes not working
Product: OpenShift Container Platform Reporter: rlobillo
Component: Cloud Compute Assignee: Michał Dulko <mdulko>
Cloud Compute sub component: OpenStack Provider QA Contact: Itay Matza <imatza>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: m.andre, mbooth, mdulko, mfedosin, pprinett
Version: 4.10 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:43:12 UTC Type: Bug
Attachments:
Description Flags
must-gather none

Description rlobillo 2022-01-21 17:28:59 UTC
Created attachment 1852582: must-gather

Description of problem:

Deploying a cluster with the following section in install-config.yaml:

compute:
- name: worker
  platform:
    openstack:
      zones: ['AZ-0', 'AZ-0', 'AZ-0']
      additionalNetworkIDs: ['ef087939-421b-4b4e-ba79-08406ec461b1']
      rootVolume:
        size: 25
        type: tripleo
        zones: ['cinderAZ0', 'cinderAZ1', 'cinderAZ0']
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: ['AZ-0', 'AZ-0', 'AZ-0']
      rootVolume:
        size: 25
        type: tripleo
        zones: ['cinderAZ0', 'cinderAZ1', 'cinderAZ0']
  replicas: 3

does not add any workers to the cluster. The worker machines are stuck in the Provisioning phase:

$ oc get machines -n openshift-machine-api
NAME                          PHASE          TYPE        REGION      ZONE   AGE
ostest-5548v-master-0         Running        m4.xlarge   regionOne   AZ-0   63m
ostest-5548v-master-1         Running        m4.xlarge   regionOne   AZ-0   63m
ostest-5548v-master-2         Running        m4.xlarge   regionOne   AZ-0   63m
ostest-5548v-worker-0-f2mmj   Provisioning                                  43m
ostest-5548v-worker-1-pdthb   Provisioning                                  43m
ostest-5548v-worker-2-xwv9d   Provisioning                                  43m
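
A minimal sketch for spotting the stuck machines programmatically, assuming the JSON shape returned by `oc get machines -n openshift-machine-api -o json` (a list with `items[].metadata.name` and `items[].status.phase`). The sample data is a hand-trimmed, hypothetical excerpt mirroring the listing above, and `machines_not_running` is an illustrative helper, not part of any OpenShift tooling.

```python
import json


def machines_not_running(machine_list: dict) -> list:
    """Return (name, phase) for every Machine whose phase is not Running."""
    stuck = []
    for item in machine_list.get("items", []):
        name = item["metadata"]["name"]
        # A Machine without a reported phase is treated as "Unknown" here.
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Running":
            stuck.append((name, phase))
    return stuck


# Hypothetical, trimmed sample; a real run would load the `oc ... -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "ostest-5548v-master-0"},
   "status": {"phase": "Running"}},
  {"metadata": {"name": "ostest-5548v-worker-0-f2mmj"},
   "status": {"phase": "Provisioning"}}
]}
""")

for name, phase in machines_not_running(sample):
    print(f"{name}: {phase}")
```

In practice the input could come from piping the `oc` command's JSON output into `json.load(sys.stdin)`.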


The machine-controller continuously logs the following errors:

$ oc logs -n openshift-machine-api machine-api-controllers-77b5487964-mqpzp machine-controller -f
[...]
E0121 17:20:43.424817       1 instance.go:204] capo-compute "msg"="failed to clean up ports after failure" "error"="Expected HTTP response code [] when accessing [DELETE https://10.46.44.10:13696/v2.0/ports/5b2484b7-6552-4482-876d-f7d2cf1ec279], but got 409 instead\n{\"NeutronError\": {\"type\": \"PortInUseAsTrunkParent\", \"message\": \"Port 5b2484b7-6552-4482-876d-f7d2cf1ec279 is currently a parent port for trunk 395abf86-4b9b-41c9-9e40-def14af36324.\", \"detail\": \"\"}}"  "cluster"="openshift-machine-api-ostest-5548v" "machine"="ostest-5548v-worker-1-pdthb"
I0121 17:20:43.425514       1 logr.go:252] events "msg"="Warning"  "message"="CreateError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-5548v-worker-1-pdthb","uid":"c90eb71e-260d-4358-b077-2583eacb0635","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"18311"} "reason"="FailedCreate"
E0121 17:20:43.595313       1 actuator.go:441] Machine error ostest-5548v-worker-1-pdthb: error creating Openstack instance: error creating Openstack instance: Bad request with: [POST https://10.46.44.10:13774/v2.1/servers], error message: {"badRequest": {"code": 400, "message": "Block Device Mapping is Invalid: failed to get image ostest-5548v-rhcos."}}
W0121 17:20:43.595359       1 controller.go:388] ostest-5548v-worker-1-pdthb: failed to create machine: error creating Openstack instance: error creating Openstack instance: Bad request with: [POST https://10.46.44.10:13774/v2.1/servers], error message: {"badRequest": {"code": 400, "message": "Block Device Mapping is Invalid: failed to get image ostest-5548v-rhcos."}}
E0121 17:20:43.595419       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: error creating Openstack instance: Bad request with: [POST https://10.46.44.10:13774/v2.1/servers], error message: {\"badRequest\": {\"code\": 400, \"message\": \"Block Device Mapping is Invalid: failed to get image ostest-5548v-rhcos.\"}}" "name"="ostest-5548v-worker-1-pdthb" "namespace"="openshift-machine-api" 
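
Because the same error repeats on every reconcile, it can help to pull the API error body out of the log lines. The following is a rough sketch, assuming the log format shown above, where the JSON body follows the literal text `error message: `. The helper name `extract_api_error` is hypothetical. It only handles the unescaped form (as in the `actuator.go` and `controller.go:388` lines), not the backslash-escaped form in the `Reconciler error` line.

```python
import json

MARKER = "error message: "


def extract_api_error(log_line: str):
    """Return the decoded JSON error body that follows MARKER, or None."""
    _, sep, tail = log_line.partition(MARKER)
    if not sep:
        return None  # marker not present in this line
    try:
        # raw_decode parses the first JSON value and ignores any trailing text.
        body, _ = json.JSONDecoder().raw_decode(tail.strip())
    except json.JSONDecodeError:
        return None  # e.g. the escaped variant of the message
    return body


# Log line copied from the machine-controller output above.
line = (
    'E0121 17:20:43.595313       1 actuator.go:441] Machine error '
    'ostest-5548v-worker-1-pdthb: error creating Openstack instance: '
    'error creating Openstack instance: Bad request with: '
    '[POST https://10.46.44.10:13774/v2.1/servers], error message: '
    '{"badRequest": {"code": 400, "message": "Block Device Mapping is '
    'Invalid: failed to get image ostest-5548v-rhcos."}}'
)

error = extract_api_error(line)
if error is not None:
    print(error["badRequest"]["code"], error["badRequest"]["message"])
```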

The same configuration works when the TechPrev featureGate is not enabled. For reference, the providerSpec of one of the stuck workers:

$ oc get machines -n openshift-machine-api ostest-5548v-worker-0-f2mmj -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinderAZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-5548v-rhcos",
  "volumeType": "tripleo"
}

$ oc get machines -n openshift-machine-api ostest-5548v-worker-0-f2mmj -o json | jq .spec.providerSpec.value.image
""
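
One observation from the output above: the `sourceUUID` field holds an image name (`ostest-5548v-rhcos`) rather than a UUID, and the Nova error reports failing to get an image by exactly that string. The sketch below only checks whether the value is UUID-shaped. It does not establish the root cause, which the later comments attribute to an upstream change not yet integrated downstream. The helper `is_uuid` is illustrative.

```python
import uuid


def is_uuid(value: str) -> bool:
    """Return True if value parses as a UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


# rootVolume as reported by `jq .spec.providerSpec.value.rootVolume` above.
root_volume = {
    "availabilityZone": "cinderAZ0",
    "deviceType": "",
    "diskSize": 25,
    "sourceType": "image",
    "sourceUUID": "ostest-5548v-rhcos",
    "volumeType": "tripleo",
}

if not is_uuid(root_volume["sourceUUID"]):
    print(f"sourceUUID {root_volume['sourceUUID']!r} is a name, not a UUID")
```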

Version-Release number of selected component (if applicable):


How reproducible: Always


Steps to Reproduce:
1. Install an OCP cluster with the techPrev features enabled, using an install-config.yaml that includes the section detailed above.

Actual results: Workers not added to the cluster.


Expected results: Installation successful.


Additional info: must-gather attached

Comment 1 rlobillo 2022-01-21 17:30:31 UTC
The volumes are created on OpenStack only for the masters:

$ openstack volume list
+--------------------------------------+-----------------------+--------+------+------------------------------------------------+
| ID                                   | Name                  | Status | Size | Attached to                                    |
+--------------------------------------+-----------------------+--------+------+------------------------------------------------+
| a4c14d0f-c88c-4869-8e7d-cae18dd2ed49 | ostest-5548v-master-0 | in-use |   25 | Attached to ostest-5548v-master-0 on /dev/vda  |
| 76e38ae9-4d1b-4ab9-97ff-dbe438aa3417 | ostest-5548v-master-1 | in-use |   25 | Attached to ostest-5548v-master-1 on /dev/vda  |
| 90375fe7-9847-4d3e-b4b1-d63c4d04c98e | ostest-5548v-master-2 | in-use |   25 | Attached to ostest-5548v-master-2 on /dev/vda  |
+--------------------------------------+-----------------------+--------+------+------------------------------------------------+

Comment 2 rlobillo 2022-01-24 09:28:27 UTC
OCP version: 4.10.0-0.nightly-2022-01-21-074618

Comment 3 Martin André 2022-01-24 09:38:44 UTC
This is likely caused by Matt's latest work on volume AZ [1] missing downstream.

[1] https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1030

Comment 4 Matthew Booth 2022-01-24 09:52:19 UTC
Yes, this is a known limitation. It is already addressed upstream; we just need to integrate it into MAPO now.

Comment 10 Itay Matza 2022-04-05 15:12:39 UTC
Verified with OCP 4.11.0-0.nightly-2022-03-29-152521 on top of RHOS-16.1-RHEL-8-20220315.n.1.
(MAPO is the default for OpenStack deployments on this version)


Verification steps:
Deployed a cluster with availability zones and root volumes configured in the install-config.yaml:
```
apiVersion: v1
baseDomain: "shiftstack.com"
compute:
- name: worker
  platform:
    openstack:
      zones: ['AZhci-0', 'AZhci-1', 'AZhci-2']
      additionalNetworkIDs: ['f8b46595-abf1-43cb-b8ca-fdb2aa531c07']
      rootVolume:
        size: 25
        type: tripleo
        zones: ['cinderAZ0', 'cinderAZ1', 'cinderAZ0']
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: ['AZhci-0', 'AZhci-1', 'AZhci-2']
      rootVolume:
        size: 25
        type: tripleo
        zones: ['cinderAZ0', 'cinderAZ1', 'cinderAZ0']
  replicas: 3

```
The OpenShift installer finished successfully, and the machines are running.

Comment 13 errata-xmlrpc 2022-08-10 10:43:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069