Bug 1916596 - [OSP] Machine-controller does not use the created root volume to reconcile the machine when instance creation hits a quota issue
Summary: [OSP] Machine-controller does not use the created root volume to reconcile the machine ...
Keywords:
Status: CLOSED DUPLICATE of bug 1943378
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: egarcia
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-15 08:34 UTC by weiwei jiang
Modified: 2021-03-30 12:26 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-30 12:26:10 UTC
Target Upstream Version:
Embargoed:



Description weiwei jiang 2021-01-15 08:34:00 UTC
Version:

./openshift-install 4.7.0-0.nightly-2021-01-14-211319
built from commit b3dae7f4736bcd1dbf5a1e0ddafa826ee1738d81
release image registry.ci.openshift.org/ocp/release@sha256:4c4e4e15e7c9cb334c8e1fc49cbf92ce6d620ff5fa2538ea578c53a48fe15b98
Platform:

openstack (IPI)

What happened?
Found many identically named volumes for a single worker machine when the Nova instance quota was limited but the Cinder volume quota was not:
$ openstack volume list | grep wduan-115b-p7fcn-worker-0-xh5qr | wc -l
20

install-config.yaml:
---
apiVersion: v1
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      rootVolume:
        size: 25
        type: tripleo
      type: m1.large
  replicas: 3

So the machine-controller does not reuse the root volume it already created when reconciling a specific machine; it creates a new one on each attempt.

What did you expect to happen?

The machine-controller should reuse the already-created root volume when reconciling the machine on the OSP platform.

How to reproduce it (as minimally and precisely as possible)?
1. Install IPI on OSP with compute machines that have a rootVolume, with quota allowing only 4 Nova servers (1 bootstrap + 3 masters; keep the bootstrap around to reproduce more easily) and an unrestricted volume quota (or at least 7 volumes):
---
apiVersion: v1
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      rootVolume:
        size: 25
        type: tripleo
      type: m1.large
  replicas: 3
2. Check whether the machine-controller creates many identically named volumes for a specific worker machine.



Comment 2 Matthew Booth 2021-01-20 16:25:58 UTC
Weiwei,

Thanks for reporting. Unfortunately, it's not clear to me from the description what issue is being reported. Specifically, I don't understand where quotas come into it, or exactly what actions were taken to reproduce.

Please can you describe what actions you took to reproduce from the beginning, preferably with command line output, and also any errors which were displayed. Please feel free to reach out to me on slack.

Matt

Comment 4 Matthew Booth 2021-01-21 12:26:15 UTC
I can't reproduce this. Here are my reproducer steps:

* Checkout and build installer commit b3dae7f4736bcd1dbf5a1e0ddafa826ee1738d81
* Reduce instances quota of openshift user to 4:
  $ OS_CLOUD=standalone openstack quota set --instances 4 openshift
* Create install-config specifying computes using root volume:
...
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      rootVolume:
        size: 25
        type: tripleo
      type: m1.large
  replicas: 3
...
* Run installer:
  $ ~/src/shiftstack/installer/bin/openshift-install create cluster --dir config/

This fails almost immediately with the error:

FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Quota Check": error(MissingQuota): Instances is not available because the required number of resources (6) is more than the limit of 4

The installer did not create any servers or volumes.

Are you able to see what I have done differently to you?

Comment 5 Matthew Booth 2021-01-22 11:01:40 UTC
After talking to Weiwei: they create multiple clusters simultaneously, racing for the same quota. The pre-flight quota check therefore passes, but the subsequent creation fails due to insufficient quota.

To reproduce, try setting quota to 4 instances while the bootstrap node is waiting for bootstrap-complete.

Comment 7 Matthew Booth 2021-01-22 14:10:01 UTC
Based on comment 5 I think this is not a bug in the installer. The quota checks are not intended to be reliable under these conditions: they are simply a best-effort attempt to give the typical user, under normal conditions, a good error message when the requested installation would fail if it continued. Making this 100% reliable would probably require redesigning the installer.

It might be a bug in CAPO, though. Questions:

* Should CAPO do a quota check before creating a worker?
  - I personally think no.
* Should CAPO immediately clean up the volume if the instance can't be created?
  - Again I think no. We presumably keep trying to create the instance in the reconcile loop,
    so the volume could eventually be used. I think this is OK as long as it's cleaned up with
    the Machine object. (A sketch of what that reuse could look like follows below.)
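
For illustration, a minimal sketch of that reuse, not CAPO's actual code: the helper name and parameters are hypothetical, and it assumes gophercloud v1-style Cinder calls. The idea is to look up an existing volume by name before creating a new one, so a failed instance create does not leak a fresh volume on every reconcile.

package sketch

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// getOrCreateRootVolume is a hypothetical helper, not CAPO's real API:
// it reuses an existing volume with the machine's volume name instead of
// creating a duplicate on every reconcile attempt.
func getOrCreateRootVolume(client *gophercloud.ServiceClient, name string, sizeGB int, volType string) (*volumes.Volume, error) {
	// Look for a volume that a previous reconcile already created.
	pages, err := volumes.List(client, volumes.ListOpts{Name: name}).AllPages()
	if err != nil {
		return nil, fmt.Errorf("listing volumes: %w", err)
	}
	existing, err := volumes.ExtractVolumes(pages)
	if err != nil {
		return nil, err
	}
	if len(existing) > 0 {
		return &existing[0], nil // reuse instead of leaking a duplicate
	}
	// No volume yet: create it exactly once.
	return volumes.Create(client, volumes.CreateOpts{
		Name:       name,
		Size:       sizeGB,
		VolumeType: volType,
	}).Extract()
}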

Martin, what do you think?

Comment 9 Martin André 2021-01-22 14:25:09 UTC
(In reply to Matthew Booth from comment #7)
> Based on comment 5 I think this is not a bug in the installer. The quota
> checks are not intended to be reliable under these conditions: they are
> simply a best effort to provide a good error message to the typical user
> under normal conditions that the requested installation would fail if it
> continued. Making this 100% reliable would probably require redesigning the
> installer.

Agreed that it's not an installer bug. It was clear when we implemented the quota checks that it validates available resources prior to the deployment but not when the installer actually tries to provision them.

> It might be a bug in CAPO, though. Questions:
> 
> * Should CAPO do a quota check before creating a worker?
>   - I personally think no.
> * Should CAPO immediately clean up the volume if the instance can't be
> created?
>   - Again I think no. We presumably keep trying to create the instance in
> the reconcile loop,
>     so I guess it could be used eventually. I think this is ok as long as
> it's cleaned up with
>     the Machine object.
> 
> Martin, what do you think?

Given the many duplicated volumes, there still seems to be an issue in CAPO.

Either we treat the creation of the volume and the instance as an atomic operation and clean up the volume if we fail to provision the instance, or we put the volume and the instance creation in separate retry loops so an existing volume is reused.

Now it seems like we keep creating a new volume each time CAPO tries to create the instance.
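
To illustrate the first option, here is a sketch under the same assumptions as the one above (createInstance is a hypothetical stand-in for CAPO's instance-creation call, and getOrCreateRootVolume is the helper sketched earlier): create the volume, then roll it back if the instance cannot be provisioned.

// createInstanceWithRootVolume treats volume + instance creation as one
// atomic step: if the instance create fails (e.g. on a Nova quota error),
// the just-created volume is deleted so that retries do not accumulate
// identically named volumes.
func createInstanceWithRootVolume(client *gophercloud.ServiceClient, name string, sizeGB int, volType string,
	createInstance func(volumeID string) error) error {
	vol, err := getOrCreateRootVolume(client, name, sizeGB, volType)
	if err != nil {
		return err
	}
	if err := createInstance(vol.ID); err != nil {
		// Best-effort rollback; the reconcile loop will retry from scratch.
		_ = volumes.Delete(client, vol.ID, volumes.DeleteOpts{}).ExtractErr()
		return fmt.Errorf("creating instance: %w", err)
	}
	return nil
}

Either approach avoids the leak; the second option (reusing the volume across retry loops) additionally avoids churning Cinder on every retry.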

Comment 11 Nick Stielau 2021-02-03 19:04:58 UTC
Setting to blocker-

Comment 12 Pierre Prinetti 2021-03-11 16:00:29 UTC
Lowering sev/prio; the leak is there but probably not frequent. The bug is deemed valid. Deferring to an upcoming sprint.

Comment 13 Andrew Collins 2021-03-26 18:15:49 UTC
We are seeing this issue in a customer environment running 4.7.0 on OSP.
It appears that reconcile errors out, and the created boot volume is then not removed.
https://bugzilla.redhat.com/show_bug.cgi?id=1943378

Comment 14 Matthew Booth 2021-03-30 12:26:10 UTC

*** This bug has been marked as a duplicate of bug 1943378 ***

