Version:
./openshift-install 4.7.0-0.nightly-2021-01-14-211319
built from commit b3dae7f4736bcd1dbf5a1e0ddafa826ee1738d81
release image registry.ci.openshift.org/ocp/release@sha256:4c4e4e15e7c9cb334c8e1fc49cbf92ce6d620ff5fa2538ea578c53a48fe15b98

Platform:
OpenStack
* IPI

What happened?
We see many identically named volumes for a single worker machine when the nova instance quota is limited but the cinder volume quota is not:

$ openstack volume list | grep wduan-115b-p7fcn-worker-0-xh5qr | wc -l
20

[02:14:52] INFO> install-config.yaml:
---
apiVersion: v1
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      rootVolume:
        size: 25
        type: tripleo
      type: m1.large
  replicas: 3

This means the machine controller does not reuse the root volume it already created when reconciling a specific machine.

What did you expect to happen?
The already-created root volume should be reused when reconciling a machine on the OSP platform.

How to reproduce it (as minimally and precisely as possible)?
1. Install IPI on OSP with a compute pool that has rootVolume set, on a cloud where the nova quota only allows 4 servers (1 bootstrap + 3 masters; keep the bootstrap around to reproduce more easily) and the volume quota is unlimited (or at least allows 7 volumes):

---
apiVersion: v1
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      rootVolume:
        size: 25
        type: tripleo
      type: m1.large
  replicas: 3

2. Check whether the machine controller creates many identically named volumes for a specific worker machine.
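The leak described above can be spotted by counting how many volumes share the same name. A minimal sketch of such a check, assuming the volume names have already been extracted from `openstack volume list` output (the helper name is hypothetical, not part of any tool mentioned here):

```python
from collections import Counter

def find_duplicate_volumes(volume_names):
    """Return a mapping of volume name -> count for every name that
    appears more than once. In a healthy cluster each machine should
    have exactly one root volume, so any entry here is a leak."""
    counts = Counter(volume_names)
    return {name: n for name, n in counts.items() if n > 1}
```

Feeding it the names from the transcript above would flag `wduan-115b-p7fcn-worker-0-xh5qr` with a count of 20.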
Weiwei, thanks for reporting. Unfortunately it's not clear to me from the description what issue is being reported. Specifically, I don't understand how quotas relate to the reported issue, or exactly what actions were taken to reproduce it. Could you describe the actions you took to reproduce from the beginning, preferably with command-line output and any errors that were displayed? Please feel free to reach out to me on Slack. Matt
I can't reproduce this. Here are my reproducer steps:

* Checkout and build installer commit b3dae7f4736bcd1dbf5a1e0ddafa826ee1738d81
* Reduce the instances quota of the openshift user to 4:

  $ OS_CLOUD=standalone openstack quota set --instances 4 openshift

* Create an install-config specifying computes using a root volume:

  ...
  compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    platform:
      openstack:
        rootVolume:
          size: 25
          type: tripleo
        type: m1.large
    replicas: 3
  ...

* Run the installer:

  $ ~/src/shiftstack/installer/bin/openshift-install create cluster --dir config/

This fails almost immediately with the error:

  FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Quota Check": error(MissingQuota): Instances is not available because the required number of resources (6) is more than the limit of 4

The installer did not create any servers or volumes. Are you able to see what I have done differently to you?
After talking to Weiwei the issue is they're racing to create multiple clusters simultaneously. Therefore the quota check does not fail, but the subsequent creation does fail due to insufficient quota. To reproduce, try setting quota to 4 instances while the bootstrap node is waiting for bootstrap-complete.
Based on comment 5 I think this is not a bug in the installer. The quota checks are not intended to be reliable under these conditions: they are simply a best effort to provide a good error message to the typical user under normal conditions that the requested installation would fail if it continued. Making this 100% reliable would probably require redesigning the installer.

It might be a bug in CAPO, though. Questions:

* Should CAPO do a quota check before creating a worker?
  - I personally think no.
* Should CAPO immediately clean up the volume if the instance can't be created?
  - Again I think no. We presumably keep trying to create the instance in the reconcile loop, so I guess the volume could be used eventually. I think this is ok as long as it's cleaned up with the Machine object.

Martin, what do you think?
(In reply to Matthew Booth from comment #7)
> Based on comment 5 I think this is not a bug in the installer. The quota
> checks are not intended to be reliable under these conditions: they are
> simply a best effort to provide a good error message to the typical user
> under normal conditions that the requested installation would fail if it
> continued. Making this 100% reliable would probably require redesigning the
> installer.

Agreed that it's not an installer bug. It was clear when we implemented the quota checks that they validate available resources prior to the deployment, not at the moment the installer actually tries to provision them.

> It might be a bug in CAPO, though. Questions:
>
> * Should CAPO do a quota check before creating a worker?
>   - I personally think no.
> * Should CAPO immediately clean up the volume if the instance can't be
>   created?
>   - Again I think no. We presumably keep trying to create the instance in
>     the reconcile loop, so I guess the volume could be used eventually.
>     I think this is ok as long as it's cleaned up with the Machine object.
>
> Martin, what do you think?

Given the many duplicated volumes, there still seems to be an issue in CAPO. Either the creation of the volume and the instance should be treated as an atomic operation, with the volume cleaned up if we fail to provision the instance, or the volume and the instance creation should sit in separate retry loops so the existing volume is reused. Right now it seems we create a new volume each time CAPO retries the instance creation.
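The first option above (treat volume + instance creation as atomic) can be sketched as a rollback pattern. This is a hypothetical illustration, not the real CAPO code; the three callables stand in for the actual OpenStack API calls:

```python
def create_machine(create_volume, create_instance, delete_volume):
    """Create a root volume, then the instance. If the instance
    creation fails (e.g. nova quota exceeded), roll back the volume
    so the next reconcile attempt does not leak another one."""
    volume_id = create_volume()
    try:
        return create_instance(volume_id)
    except Exception:
        delete_volume(volume_id)  # roll back before re-raising
        raise
```

Without the rollback (or without reusing the existing volume on retry), every failed reconcile leaves one more volume behind, which matches the 20 duplicates reported above.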
Setting to blocker-
Lowering sev/prio; the leak is there but probably not frequent. The bug is deemed valid. Deferring to an upcoming sprint.
Seeing this issue in a customer environment running 4.7.0 on OSP. It appears that reconcile errors out, and the created boot volume is then not removed. https://bugzilla.redhat.com/show_bug.cgi?id=1943378
*** This bug has been marked as a duplicate of bug 1943378 ***