Bug 1847185
| Summary: | API server crash loop on ARO | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jim Minter <jminter> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aos-bugs, dcain, esimard, ffranz, kewang, mfojtik, mjudeiki, mradchuk, nstielau, scuppett, sttts |
| Version: | 4.3.z | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1847368 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 16:07:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1847368 | | |
|
Comment 4
Alberto
2020-06-16 09:24:10 UTC
We'll keep discussing the linked PR and further possible mitigations for how best to initialise the client. I'm going to move this to the api-server component for now, though, so it gets the attention and guidance it deserves from a cluster-recovery point of view on how to deal with the API server crash looping.

Verified with OCP build 4.6.0-0.nightly-2020-06-23-053310; steps below.
- Creating one sc and pvc in a non-zoned region:
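As a quick triage aid for the crash-looping symptom, the RESTARTS column of `oc get pods` output can be filtered to flag unhealthy apiserver pods. A minimal sketch on sample text (pod names are illustrative; in practice pipe `oc get pods -n openshift-kube-apiserver` into the same awk filter):

```shell
# Sketch: print pods whose RESTARTS column is nonzero (sample input; names illustrative).
printf '%s\n' \
  'NAME                      READY   STATUS             RESTARTS   AGE' \
  'kube-apiserver-master-0   5/5     Running            0          40m' \
  'kube-apiserver-master-1   4/5     CrashLoopBackOff   12         46m' |
awk 'NR > 1 && $4 > 0 { print $1, "restarts:", $4 }'
# prints: kube-apiserver-master-1 restarts: 12
```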
$ cat sc-non-zoned.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium-nonzoned
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "false"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer
$ oc apply -f sc-non-zoned.yaml
storageclass.storage.k8s.io/managed-premium-nonzoned created
$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
managed-premium (default) kubernetes.io/azure-disk Delete WaitForFirstConsumer true 52m
managed-premium-nonzoned kubernetes.io/azure-disk Delete WaitForFirstConsumer false 22m
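For contrast only, a hypothetical zoned counterpart of the storage class above would set `zoned: "true"`; this sketch was not part of the verification:

```yaml
# Hypothetical zoned variant of sc-non-zoned.yaml (illustration only).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zoned
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "true"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer
```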
$ cat pvc-nonzoned.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-non
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-nonzoned
  resources:
    requests:
      storage: 5Gi
$ oc apply -f pvc-nonzoned.yaml
persistentvolumeclaim/azure-managed-non created
$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
azure-managed-non Bound pvc-9031e4d4-ca07-4537-bd27-494008ab781d 5Gi RWO managed-premium-nonzoned 22m
$ cat mypod-non-zoned.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: azure-managed-non
$ oc create -f mypod-non-zoned.yaml
pod/mypod created
Checked the created pod's status:
$ oc get pod/mypod
NAME READY STATUS RESTARTS AGE
mypod 1/1 Running 0 22m
$ oc describe pod/mypod
Name:         mypod
Namespace:    default
...
Status:       Running
...
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  azure-managed-non
    ReadOnly:   false
  default-token-vpm6b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-vpm6b
    Optional:    false
...
- Creating one sc and pvc in a zoned region:
Since a default zoned sc already exists, there is no need to create a new one.
$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
managed-premium (default) kubernetes.io/azure-disk Delete WaitForFirstConsumer true 54m
$ cat pvc-zoned.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 5Gi
$ oc apply -f pvc-zoned.yaml
persistentvolumeclaim/azure-managed-disk created
$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
azure-managed-disk Pending managed-premium 12s
...
The claim stays Pending until a consuming pod is scheduled, because the storage class uses volumeBindingMode: WaitForFirstConsumer.
$ cat mypod-zoned.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
spec:
  containers:
  - name: mypod1
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: azure-managed-disk
$ oc apply -f mypod-zoned.yaml
pod/mypod1 created
$ oc get pods
NAME READY STATUS RESTARTS AGE
mypod 1/1 Running 0 25m
mypod1 1/1 Running 0 38s
After waiting a while, checked whether the apiserver pods' status had changed:
$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-apiserver-operator openshift-apiserver-operator-bf6448884-q8hsf 1/1 Running 2 68m
openshift-apiserver apiserver-66c7dd448-2rfw2 1/1 Running 0 55m
openshift-apiserver apiserver-66c7dd448-jw5tq 1/1 Running 0 57m
openshift-apiserver apiserver-66c7dd448-lfv9m 1/1 Running 0 56m
openshift-kube-apiserver-operator kube-apiserver-operator-75844cbf95-kc9lf 1/1 Running 2 68m
openshift-kube-apiserver kube-apiserver-kewang23azure61-w5g28-master-0 5/5 Running 0 40m
openshift-kube-apiserver kube-apiserver-kewang23azure61-w5g28-master-1 5/5 Running 0 46m
openshift-kube-apiserver kube-apiserver-kewang23azure61-w5g28-master-2 5/5 Running 0 44m
From the above test results, whether the sc and pvc are zoned or non-zoned, the kube-apiservers do not crash; moving the bug to verified.
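The pipeline used above to list only the apiserver pods can be exercised on sample text without a cluster. A minimal sketch (pod names abbreviated from the output above; the marketplace line is illustrative):

```shell
# Sketch: same filter as in the verification step, applied to sample `oc get pods -A` text.
printf '%s\n' \
  'NAMESPACE                  NAME                         READY   STATUS      RESTARTS   AGE' \
  'openshift-apiserver        apiserver-66c7dd448-2rfw2    1/1     Running     0          55m' \
  'openshift-kube-apiserver   installer-9-master-0         0/1     Completed   0          41m' \
  'openshift-marketplace      certified-operators-abc      1/1     Running     0          60m' |
grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
# keeps only the header line and the running apiserver line; the installer
# and non-apiserver lines are filtered out.
```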
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196