Bug 1847419 - API server crash loop on ARO
Summary: API server crash loop on ARO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.z
Assignee: Alberto
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1847368
Blocks: 1848622
 
Reported: 2020-06-16 11:00 UTC by Alberto
Modified: 2020-06-29 15:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1847368
Clones: 1848622
Environment:
Last Closed: 2020-06-29 15:34:20 UTC
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 25159 None closed Bug 1847419: [release-4.4] UPSTREAM: 92166: fix: GetLabelsForVolume panic issue for azure disk PV 2020-08-07 09:05:34 UTC
Red Hat Product Errata RHBA-2020:2713 None None None 2020-06-29 15:34:40 UTC

Comment 3 Ke Wang 2020-06-24 06:23:21 UTC
Verified with OCP build 4.4.0-0.nightly-2020-06-23-102753. Steps below.

Before verifying the fix, check the current status of the API servers. The kube-apiserver pods had already restarted 3 to 5 times each (see the RESTARTS column below); this is because the fix for bug 1837992 is not backported to 4.4.
$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE                                               NAME                                                              READY   STATUS      RESTARTS   AGE
openshift-apiserver-operator                            openshift-apiserver-operator-7d68cd5574-dl49s                     1/1     Running     2          65m
openshift-apiserver                                     apiserver-6b4776d799-cr5pq                                        1/1     Running     0          55m
openshift-apiserver                                     apiserver-6b4776d799-gwxlw                                        1/1     Running     0          57m
openshift-apiserver                                     apiserver-6b4776d799-jrh8n                                        1/1     Running     0          56m
openshift-kube-apiserver-operator                       kube-apiserver-operator-7c98b4cd9f-6gnzr                          1/1     Running     2          65m
openshift-kube-apiserver                                kube-apiserver-kewang24azure41-bhpqr-master-0                     4/4     Running     3          41m
openshift-kube-apiserver                                kube-apiserver-kewang24azure41-bhpqr-master-1                     4/4     Running     5          46m
openshift-kube-apiserver                                kube-apiserver-kewang24azure41-bhpqr-master-2                     4/4     Running     5          44m

- Create a StorageClass (sc) and a PVC for the non-zoned case:
$ cat sc-non-zoned.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium-nonzoned
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "false"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer

$ oc apply -f sc-non-zoned.yaml
storageclass.storage.k8s.io/managed-premium-nonzoned created

$ oc get sc
NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   1h
managed-premium-nonzoned    kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   false                  56m

$ cat pvc-non-zoned.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-non
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-nonzoned
  resources:
    requests:
      storage: 5Gi


$ oc apply -f pvc-non-zoned.yaml      
persistentvolumeclaim/azure-managed-non created

$ oc get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-non    Bound    pvc-7e772bd1-1ca3-4edb-b443-2dea0f2bb76e   5Gi        RWO            managed-premium-nonzoned   56m

$ cat mypod-non-zoned.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-non

$ oc create -f mypod-non-zoned.yaml 
pod/mypod created

Check the status of the created pod:
$ oc describe pod/mypod
Name:         mypod
Namespace:    default
...
Status:       Pending
...
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  azure-managed-non
    ReadOnly:   false
  default-token-gwhm7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gwhm7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  <unknown>           default-scheduler  0/6 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling  <unknown>           default-scheduler  0/6 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling  16s (x29 over 34m)  default-scheduler  0/6 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
  
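The results above show that on OCP 4.4 the pod using the non-zoned PVC stays Pending: no schedulable node satisfies the volume's NodeSelectorTerms (volume node affinity conflict). The same scenario works fine on OCP 4.5 and 4.6.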
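
To see why the scheduler reports a volume node affinity conflict, the node affinity stored on the bound PV can be compared with the node labels. The following is only a sketch and was not part of the original verification: `pv_node_affinity` is a hypothetical helper name, and the exact output format of `jsonpath` depends on the client version.

```shell
# Hypothetical helper (not from the original report): print the node
# affinity recorded on a PersistentVolume.
pv_node_affinity() {
  # $1: PV name, for example the VOLUME column of `oc get pvc`
  oc get pv "$1" -o jsonpath='{.spec.nodeAffinity}'
}

# Example (requires a logged-in cluster session):
#   pv_node_affinity pvc-7e772bd1-1ca3-4edb-b443-2dea0f2bb76e
#   oc get nodes --show-labels
```

If the NodeSelectorTerms printed for the PV name a label value that none of the worker nodes carry, the pod cannot be scheduled on any of them.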
  
- Create a PVC for the zoned case.
The default StorageClass is already zoned, so no new one is needed.
$ oc get sc
NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   1h

$ cat pvc-zoned.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 5Gi
      

$ oc apply -f pvc-zoned.yaml 
persistentvolumeclaim/azure-managed-disk created

$ oc get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-disk   Bound    pvc-9415a7e6-93e3-4892-9c36-cdd47c16fe02   5Gi        RWO            managed-premium            58m
...

$ cat mypod-zoned.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
spec:
  containers:
  - name: mypod1
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-disk
        
$ oc apply -f mypod-zoned.yaml 
pod/mypod1 created

$ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
mypod    0/1     Pending   0          72m
mypod1   1/1     Running   0          59m

$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE                                               NAME                                                              READY   STATUS      RESTARTS   AGE
openshift-apiserver-operator                            openshift-apiserver-operator-7d68cd5574-dl49s                     1/1     Running     2          14h
openshift-apiserver                                     apiserver-6b4776d799-cr5pq                                        1/1     Running     0          14h
openshift-apiserver                                     apiserver-6b4776d799-gwxlw                                        1/1     Running     0          14h
openshift-apiserver                                     apiserver-6b4776d799-jrh8n                                        1/1     Running     0          14h
openshift-kube-apiserver-operator                       kube-apiserver-operator-7c98b4cd9f-6gnzr                          1/1     Running     2          14h
openshift-kube-apiserver                                kube-apiserver-kewang24azure41-bhpqr-master-0                     4/4     Running     3          13h
openshift-kube-apiserver                                kube-apiserver-kewang24azure41-bhpqr-master-1                     4/4     Running     5          13h
openshift-kube-apiserver                                kube-apiserver-kewang24azure41-bhpqr-master-2                     4/4     Running     5          13h

The results above show that no new crash loop occurred: the kube-apiserver restart counts are unchanged (3, 5, 5) after creating PVCs and pods with both the zoned and the non-zoned StorageClass. The kube-apiservers no longer crash in either case, so moving the bug to VERIFIED.
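
The before/after comparison of restart counts can also be scripted instead of read off the tables. This is a minimal sketch, assuming the default `oc get pods -A` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); `apiserver_restarts` and the file names are hypothetical and not part of the original verification.

```shell
# Hypothetical helper: read `oc get pods -A` output on stdin and print
# "<pod name> <restart count>" for the kube-apiserver pods only
# (installer and revision-pruner pods in the namespace are skipped).
apiserver_restarts() {
  awk '$1 == "openshift-kube-apiserver" && $2 ~ /^kube-apiserver-/ { print $2, $5 }'
}

# Example (requires a logged-in cluster session):
#   oc get pods -A | apiserver_restarts > before.txt
#   ... create the StorageClass, PVCs and pods ...
#   oc get pods -A | apiserver_restarts > after.txt
#   diff before.txt after.txt && echo "no new restarts"
```

An empty diff means that no kube-apiserver pod restarted between the two snapshots.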

Comment 5 errata-xmlrpc 2020-06-29 15:34:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2713

