Bug 1847185

| Summary: | API server crash loop on ARO | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jim Minter <jminter> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Severity: | urgent |
| Priority: | urgent | CC: | aos-bugs, dcain, esimard, ffranz, kewang, mfojtik, mjudeiki, mradchuk, nstielau, scuppett, sttts |
| Version: | 4.3.z | Target Milestone: | --- |
| Target Release: | 4.6.0 | Type: | Bug |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | Story Points: | --- |
| Clone Of: | | Clones: | 1847368 (view as bug list) |
| Last Closed: | 2020-10-27 16:07:13 UTC | | |
| Bug Blocks: | 1847368 | | |
Comment 4
Alberto
2020-06-16 09:24:10 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1847185#c4

We'll keep discussing the linked PR and further possible mitigations for how best to initialise the client. I'm going to move this to the api-server component for now, though, so that it gets the attention and guidance it deserves from a cluster-recovery point of view and on how to deal with the API server crash looping.

Verified with OCP build 4.6.0-0.nightly-2020-06-23-053310; steps are below.

- Create an SC and a PVC in a non-zoned region:

$ cat sc-non-zoned.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium-nonzoned
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "false"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer

$ oc apply -f sc-non-zoned.yaml
storageclass.storage.k8s.io/managed-premium-nonzoned created

$ oc get sc
NAME                         PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-premium (default)    kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   52m
managed-premium-nonzoned     kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   false                  22m

$ cat pvc-nonzoned.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-non
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-nonzoned
  resources:
    requests:
      storage: 5Gi

$ oc apply -f pvc-nonzoned.yaml
persistentvolumeclaim/azure-managed-non created

$ oc get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-non   Bound    pvc-9031e4d4-ca07-4537-bd27-494008ab781d   5Gi        RWO            managed-premium-nonzoned   22m

$ cat mypod-non-zoned.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: azure-managed-non

$ oc create -f mypod-non-zoned.yaml
pod/mypod created

Check the status of the created pod:

$ oc get pod/mypod
NAME    READY   STATUS    RESTARTS   AGE
mypod   1/1     Running   0          22m

$ oc describe pod/mypod
Name:         mypod
Namespace:    default
...
Status:       Running
...
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  azure-managed-non
    ReadOnly:   false
  default-token-vpm6b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-vpm6b
    Optional:    false
...

- Create an SC and a PVC in a zoned region. A default zoned SC already exists, so there is no need to create a new one:

$ oc get sc
NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   54m

$ cat pvc-zoned.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 5Gi

$ oc apply -f pvc-zoned.yaml
persistentvolumeclaim/azure-managed-disk created

$ oc get pvc
NAME                 STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
azure-managed-disk   Pending                                      managed-premium   12s
...
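As an optional extra check (not part of the original steps), one could confirm that the PV provisioned for the non-zoned PVC carries no zone label. A minimal sketch, assuming the in-tree azure-disk provisioner in this release labels zoned PVs with failure-domain.beta.kubernetes.io/zone:

# Optional sketch, not from the original comment: look up the PV bound to the
# non-zoned PVC and print its labels; a zoned disk would typically carry a label
# such as failure-domain.beta.kubernetes.io/zone (label name assumed).
PV=$(oc get pvc azure-managed-non -o jsonpath='{.spec.volumeName}')
oc get pv "$PV" --show-labels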
$ cat mypod-zoned.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
spec:
  containers:
  - name: mypod1
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: azure-managed-disk

$ oc apply -f mypod-zoned.yaml
pod/mypod1 created

$ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
mypod    1/1     Running   0          25m
mypod1   1/1     Running   0          38s

After waiting a while, check whether the status of the API server pods has changed (the restart-count sketch at the end of this report is another way to confirm there is no crash loop):

$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE                           NAME                                            READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator        openshift-apiserver-operator-bf6448884-q8hsf    1/1     Running   2          68m
openshift-apiserver                 apiserver-66c7dd448-2rfw2                       1/1     Running   0          55m
openshift-apiserver                 apiserver-66c7dd448-jw5tq                       1/1     Running   0          57m
openshift-apiserver                 apiserver-66c7dd448-lfv9m                       1/1     Running   0          56m
openshift-kube-apiserver-operator   kube-apiserver-operator-75844cbf95-kc9lf        1/1     Running   2          68m
openshift-kube-apiserver            kube-apiserver-kewang23azure61-w5g28-master-0   5/5     Running   0          40m
openshift-kube-apiserver            kube-apiserver-kewang23azure61-w5g28-master-1   5/5     Running   0          46m
openshift-kube-apiserver            kube-apiserver-kewang23azure61-w5g28-master-2   5/5     Running   0          44m

From the results above, the kube-apiservers do not crash regardless of whether the SC and PVC are zoned or non-zoned, so the bug is moved to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
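For completeness, here is a minimal sketch of how a kube-apiserver crash loop could be spotted from restart counts. It is not part of the original verification and assumes the default static-pod naming kube-apiserver-<node-name> in the openshift-kube-apiserver namespace:

# Sketch only: print the restart counts of the kube-apiserver static pods;
# a steadily climbing count would indicate the crash loop described in this bug.
oc -n openshift-kube-apiserver get pods \
  -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' \
  | grep -E '^NAME|^kube-apiserver-'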