Bug 1847185 - API server crash loop on ARO
Summary: API server crash loop on ARO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1847368
Reported: 2020-06-15 20:45 UTC by Jim Minter
Modified: 2020-10-27 16:07 UTC (History)
11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1847368 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:07:13 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 25121 0 None closed Bug 1847185: fix: GetLabelsForVolume panic issue for azure disk PV 2021-02-08 06:43:59 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:07:38 UTC

Comment 4 Alberto 2020-06-16 09:24:10 UTC
>A persistent volume claim / storage class configured in a certain way 

What is it about that certain way, specifically, that causes the issue?

>however this code path does not run in the API server.

Can you elaborate on why? Would you be able to add a code reference to where this is happening?

Comment 5 Alberto 2020-06-16 09:45:33 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1847185#c4 we'll keep discussing the linked PR and further possible mitigations on how best to initialise the client.

I'm going to move this to the api-server component for now, though, so it gets the attention and guidance it deserves from a cluster-recovery point of view, i.e. how to deal with an API server that is crash looping.
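The linked PR title ("fix: GetLabelsForVolume panic issue for azure disk PV") suggests the crash is a nil-pointer panic when the Azure cloud client is used before it is initialised. A minimal sketch of that guard, with illustrative type and field names that do not match the upstream API:

```go
package main

import (
	"errors"
	"fmt"
)

// DisksClient is a stand-in for the Azure disks client interface;
// the name is illustrative, not the upstream type.
type DisksClient interface{}

// Cloud is a stand-in for the Azure cloud provider object. DisksClient
// may be nil when the cloud config has not been fully initialised
// (e.g. in a process that never completes client setup).
type Cloud struct {
	DisksClient DisksClient
}

// GetLabelsForVolume sketches the shape of the fix: return an error
// instead of dereferencing a nil client and panicking.
func (c *Cloud) GetLabelsForVolume(diskName string) (map[string]string, error) {
	if c.DisksClient == nil {
		return nil, errors.New("azure cloud provider not fully initialized: DisksClient is nil")
	}
	// A real implementation would query the disk and derive
	// zone/region topology labels here.
	return map[string]string{}, nil
}

func main() {
	c := &Cloud{} // uninitialised client, as in the crash-looping process
	_, err := c.GetLabelsForVolume("pvc-example")
	fmt.Println(err != nil) // the call now errors instead of panicking
}
```

The point is only the ordering: the nil check must run before any method call on the client, so an uninitialised provider degrades to an error the caller can handle.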

Comment 10 Ke Wang 2020-06-23 10:30:10 UTC
Verified with OCP build 4.6.0-0.nightly-2020-06-23-053310; steps are below.

- Creating one sc and pvc in a non-zoned region:
$ cat sc-non-zoned.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium-nonzoned
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "false"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer

$ oc apply -f sc-non-zoned.yaml
storageclass.storage.k8s.io/managed-premium-nonzoned created

$ oc get sc
NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   52m
managed-premium-nonzoned    kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   false                  22m

$ cat pvc-nonzoned.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-non
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-nonzoned
  resources:
    requests:
      storage: 5Gi


$ oc apply -f pvc-nonzoned.yaml
persistentvolumeclaim/azure-managed-non created

$ oc get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-non   Bound    pvc-9031e4d4-ca07-4537-bd27-494008ab781d   5Gi        RWO            managed-premium-nonzoned   22m

$ cat mypod-non-zoned.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-non

$ oc create -f mypod-non-zoned.yaml 
pod/mypod created

Checked the created pod status,
$ oc get pod/mypod
NAME    READY   STATUS    RESTARTS   AGE
mypod   1/1     Running   0          22m

$ oc describe pod/mypod
Name:         mypod
Namespace:    default
...
Status:       Running
...
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  azure-managed-non
    ReadOnly:   false
  default-token-vpm6b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-vpm6b
    Optional:    false
...

- Creating one sc and pvc in a zoned region:
Since a default zoned sc already exists, there is no need to create a new one.
$ oc get sc
NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   54m

$ cat pvc-zoned.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 5Gi
      

$ oc apply -f pvc-zoned.yaml 
persistentvolumeclaim/azure-managed-disk created

$ oc get pvc
NAME                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-disk   Pending                                                                        managed-premium            12s
...

$ cat mypod-zoned.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
spec:
  containers:
  - name: mypod1
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-disk
        
$ oc apply -f mypod-zoned.yaml 
pod/mypod1 created

$ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
mypod    1/1     Running   0          25m
mypod1   1/1     Running   0          38s

After waiting for a while, check whether the API server pods' status has changed.

$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE                                          NAME                                                         READY   STATUS      RESTARTS   AGE
openshift-apiserver-operator                       openshift-apiserver-operator-bf6448884-q8hsf                 1/1     Running     2          68m
openshift-apiserver                                apiserver-66c7dd448-2rfw2                                    1/1     Running     0          55m
openshift-apiserver                                apiserver-66c7dd448-jw5tq                                    1/1     Running     0          57m
openshift-apiserver                                apiserver-66c7dd448-lfv9m                                    1/1     Running     0          56m
openshift-kube-apiserver-operator                  kube-apiserver-operator-75844cbf95-kc9lf                     1/1     Running     2          68m
openshift-kube-apiserver                           kube-apiserver-kewang23azure61-w5g28-master-0                5/5     Running     0          40m
openshift-kube-apiserver                           kube-apiserver-kewang23azure61-w5g28-master-1                5/5     Running     0          46m
openshift-kube-apiserver                           kube-apiserver-kewang23azure61-w5g28-master-2                5/5     Running     0          44m

The above test results show that the kube-apiservers do not crash regardless of whether the sc and pvc are zoned or non-zoned, so moving the bug to VERIFIED.

Comment 12 errata-xmlrpc 2020-10-27 16:07:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

