Bug 1883978
| Summary: | [oVirt] autoscaler detects nodes as unregistered and tries to delete them | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Gal Zaidman <gzaidman> |
| Component: | Cloud Compute | Assignee: | Gal Zaidman <gzaidman> |
| Cloud Compute sub component: | oVirt Provider | QA Contact: | Guilherme Santos <gdeolive> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | high | CC: | aoconnor, apjagtap |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:47:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Gal Zaidman
2020-09-30 16:27:45 UTC
*** Bug 1883979 has been marked as a duplicate of this bug. ***
*** Bug 1881051 has been marked as a duplicate of this bug. ***
*** Bug 1880136 has been marked as a duplicate of this bug. ***
Verified on:
openshift-4.6.0-0.nightly-2020-10-02-065738
Steps:
1. Have OCP with 3 masters and 3 workers
2. # cat cluster_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 9
    cores:
      min: 24
      max: 40
    memory:
      min: 96
      max: 256
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 30s
    unneededTime: 30s
3. # oc create -f cluster_autoscaler.yaml
4. # cat machine_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "primary-jnzvt-worker-0"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 3
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-jnzvt-worker-0
5. # oc create -f machine_autoscaler.yaml
6. # oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: autoscaler-demo
EOF
7. # cat scale-up.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 18
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: origin-base
        image: openshift/origin-base
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
8. # oc apply -n autoscaler-demo -f scale-up.yaml
9. Waited around 1 hr and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
scale-up-5fd5c67f64-24wk5 1/1 Running 0 78m
scale-up-5fd5c67f64-2qd9c 1/1 Running 0 79m
scale-up-5fd5c67f64-58tjg 1/1 Running 0 79m
scale-up-5fd5c67f64-7fh9q 1/1 Running 0 78m
scale-up-5fd5c67f64-7q5bc 1/1 Running 0 79m
scale-up-5fd5c67f64-cdhv5 1/1 Running 0 79m
scale-up-5fd5c67f64-cv4rl 1/1 Running 0 79m
scale-up-5fd5c67f64-fkq9t 1/1 Running 0 79m
scale-up-5fd5c67f64-grzl2 1/1 Running 0 79m
scale-up-5fd5c67f64-jb57z 1/1 Running 0 79m
scale-up-5fd5c67f64-jhnp9 1/1 Running 0 79m
scale-up-5fd5c67f64-rtq82 1/1 Running 0 79m
scale-up-5fd5c67f64-v5pnd 1/1 Running 0 79m
scale-up-5fd5c67f64-vfq6r 1/1 Running 0 79m
scale-up-5fd5c67f64-wv9ld 1/1 Running 0 79m
scale-up-5fd5c67f64-xmgq6 1/1 Running 0 78m
scale-up-5fd5c67f64-z4wtj 1/1 Running 0 79m
scale-up-5fd5c67f64-zm4j9 1/1 Running 0 79m
NAME PHASE TYPE REGION ZONE AGE
primary-8hhkw-master-0 Running 3h51m
primary-8hhkw-master-1 Running 3h51m
primary-8hhkw-master-2 Running 3h51m
primary-8hhkw-worker-0-4xvnf Running 129m
primary-8hhkw-worker-0-pb5kv Running 3h40m
primary-8hhkw-worker-0-pcntp Running 129m
primary-8hhkw-worker-0-sxkl2 Running 78m
10. Changed scale-up.yaml to scale to 24 replicas, waited around 20 min, and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
scale-up-5fd5c67f64-24wk5 1/1 Running 0 97m
scale-up-5fd5c67f64-2qd9c 1/1 Running 0 99m
scale-up-5fd5c67f64-58tjg 1/1 Running 0 99m
scale-up-5fd5c67f64-7fh9q 1/1 Running 0 97m
scale-up-5fd5c67f64-7q5bc 1/1 Running 0 99m
scale-up-5fd5c67f64-8tx9c 1/1 Running 0 28m
scale-up-5fd5c67f64-b9zk9 1/1 Running 0 28m
scale-up-5fd5c67f64-cdhv5 1/1 Running 0 99m
scale-up-5fd5c67f64-cv4rl 1/1 Running 0 99m
scale-up-5fd5c67f64-fkq9t 1/1 Running 0 99m
scale-up-5fd5c67f64-fkxhf 1/1 Running 0 28m
scale-up-5fd5c67f64-grzl2 1/1 Running 0 99m
scale-up-5fd5c67f64-jb57z 1/1 Running 0 99m
scale-up-5fd5c67f64-jhnp9 1/1 Running 0 99m
scale-up-5fd5c67f64-qxnfk 1/1 Running 0 27m
scale-up-5fd5c67f64-rtq82 1/1 Running 0 99m
scale-up-5fd5c67f64-v5pnd 1/1 Running 0 99m
scale-up-5fd5c67f64-vfq6r 1/1 Running 0 99m
scale-up-5fd5c67f64-wv9ld 1/1 Running 0 99m
scale-up-5fd5c67f64-xm72p 1/1 Running 0 27m
scale-up-5fd5c67f64-xmgq6 1/1 Running 0 97m
scale-up-5fd5c67f64-xn6cb 1/1 Running 0 28m
scale-up-5fd5c67f64-z4wtj 1/1 Running 0 99m
scale-up-5fd5c67f64-zm4j9 1/1 Running 0 99m
NAME PHASE TYPE REGION ZONE AGE
primary-8hhkw-master-0 Running 4h10m
primary-8hhkw-master-1 Running 4h10m
primary-8hhkw-master-2 Running 4h10m
primary-8hhkw-worker-0-4xvnf Running 148m
primary-8hhkw-worker-0-pb5kv Running 4h
primary-8hhkw-worker-0-pcntp Running 148m
primary-8hhkw-worker-0-sxkl2 Running 97m
primary-8hhkw-worker-0-zjr6g Running 28m
11. Checked the provider ID in the new machines:
# oc describe {node,machine}/{primary-8hhkw-worker-0-sxkl2,primary-8hhkw-worker-0-zjr6g} -n openshift-machine-api | egrep "ProviderID|Provider ID"
ProviderID: ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
ProviderID: ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
Provider ID: ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
Provider ID: ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
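The comparison in step 11 can also be done mechanically rather than by eye. The following is a minimal sketch, not part of the original verification: it takes the four lines printed above as a string, groups the IDs by the label they were printed under (`ProviderID` versus `Provider ID`), and checks that both groups hold the same set of IDs. The function names are hypothetical, and the sketch deliberately does not assume which label belongs to nodes and which to machines.

```python
import re

# Output copied verbatim from step 11. In practice this text would come
# from the `oc describe ... | egrep` pipeline shown in that step.
DESCRIBE_OUTPUT = """\
ProviderID: ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
ProviderID: ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
Provider ID: ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
Provider ID: ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
"""


def provider_ids_by_label(text):
    """Group provider IDs by the label they were printed under."""
    groups = {}
    for line in text.splitlines():
        match = re.match(r"\s*(Provider ?ID):\s*(\S+)\s*$", line)
        if match:
            groups.setdefault(match.group(1), set()).add(match.group(2))
    return groups


def provider_ids_match(text):
    """True when both labels are present and carry identical ID sets."""
    groups = provider_ids_by_label(text)
    if len(groups) != 2:
        return False
    first, second = groups.values()
    return first == second


if __name__ == "__main__":
    print("provider IDs match:", provider_ids_match(DESCRIBE_OUTPUT))
```

A mismatch between the two sets is the condition this bug was about: the autoscaler would treat such a node as unregistered.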
Results:
Extra machines were created successfully, the provider IDs of the nodes and machines match, and the new machines and pods remained stable over time.
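As a sanity check on these results, the machine list from step 10 can be compared against the limits configured in steps 2 and 4. This is a rough sketch with hypothetical helper names. It assumes every Machine corresponds to exactly one node, and that roles can be read from the `-master-` and `-worker-` parts of the machine names, neither of which the bug report states explicitly.

```python
# Limits copied from the manifests in steps 2 and 4.
MAX_NODES_TOTAL = 9                # ClusterAutoscaler resourceLimits.maxNodesTotal
MIN_REPLICAS, MAX_REPLICAS = 3, 6  # MachineAutoscaler minReplicas / maxReplicas
MEMORY_REQUEST_GIB = 2             # per-pod memory request in scale-up.yaml

# Machine names copied from the `oc get machine` output in step 10.
MACHINES_AFTER_STEP_10 = [
    "primary-8hhkw-master-0",
    "primary-8hhkw-master-1",
    "primary-8hhkw-master-2",
    "primary-8hhkw-worker-0-4xvnf",
    "primary-8hhkw-worker-0-pb5kv",
    "primary-8hhkw-worker-0-pcntp",
    "primary-8hhkw-worker-0-sxkl2",
    "primary-8hhkw-worker-0-zjr6g",
]


def count_roles(machine_names):
    """Return (masters, workers), inferred from the machine names."""
    masters = sum("-master-" in name for name in machine_names)
    workers = sum("-worker-" in name for name in machine_names)
    return masters, workers


def within_limits(machine_names):
    """Check the total node cap and the worker replica bounds."""
    masters, workers = count_roles(machine_names)
    return (masters + workers <= MAX_NODES_TOTAL
            and MIN_REPLICAS <= workers <= MAX_REPLICAS)


def total_requested_gib(replicas):
    """Total memory requested by the scale-up Deployment, in GiB."""
    return replicas * MEMORY_REQUEST_GIB


if __name__ == "__main__":
    masters, workers = count_roles(MACHINES_AFTER_STEP_10)
    print("masters:", masters, "workers:", workers)
    print("within limits:", within_limits(MACHINES_AFTER_STEP_10))
    print("requested GiB at 18 and 24 replicas:",
          total_requested_gib(18), total_requested_gib(24))
```

The memory totals only show how much the Deployment requests in aggregate; how many workers that requires depends on each node's allocatable memory, which is not given here.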
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196