1848622 – API server crash loop on ARO

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.3.z
Assignee:	Alberto
QA Contact:	Ke Wang
Docs Contact:
URL:
Whiteboard:
Depends On:	1847419
Blocks:

Reported:	2020-06-18 16:02 UTC by Alberto
Modified:	2020-07-01 15:02 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1847419
Environment:
Last Closed:	2020-07-01 15:02:34 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25138	0	None	closed	Bug 1848622: [release-4.3]: UPSTREAM: 92166: fix: GetLabelsForVolume panic issue for azure disk PV	2020-10-29 18:57:51 UTC
Red Hat Product Errata	RHBA-2020:2628	0	None	None	None	2020-07-01 15:02:49 UTC

Comment 3 Ke Wang 2020-06-24 06:06:23 UTC

Verified with OCP build 4.3.0-0.nightly-2020-06-23-231659; steps below.

Before verifying the fix, check the current status of the apiservers. The kube-apiservers have already restarted 4 times; this is because 1837992 is not backported to 4.3.
$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE                                               NAME                                                              READY   STATUS      RESTARTS   AGE
openshift-apiserver-operator                            openshift-apiserver-operator-66977c5c67-qmxpd                     1/1     Running     1          59m
openshift-apiserver                                     apiserver-4l6h9                                                   1/1     Running     0          51m
openshift-apiserver                                     apiserver-6tgpf                                                   1/1     Running     0          51m
openshift-apiserver                                     apiserver-zcgkp                                                   1/1     Running     0          52m
openshift-kube-apiserver-operator                       kube-apiserver-operator-796f4664b7-2pc48                          1/1     Running     1          59m
openshift-kube-apiserver                                kube-apiserver-kewang24azure32-cgf88-master-0                     3/3     Running     4          48m
openshift-kube-apiserver                                kube-apiserver-kewang24azure32-cgf88-master-1                     3/3     Running     4          26m
openshift-kube-apiserver                                kube-apiserver-kewang24azure32-cgf88-master-2                     3/3     Running     4          45m
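To make the later comparison easier, the restart counts in the listing above can be reduced to one line per kube-apiserver pod. This is a minimal sketch, not part of the original verification: the helper name kube_apiserver_restarts and the snapshot file name are hypothetical, and it assumes the default `oc get pods -A` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper (not from the original report): read `oc get pods -A`
# output on stdin and print "<pod> <restarts>" for every kube-apiserver pod
# in the openshift-kube-apiserver namespace. Installer and pruner pods in
# that namespace are skipped because their names do not start with
# "kube-apiserver-".
kube_apiserver_restarts() {
  awk '$1 == "openshift-kube-apiserver" && $2 ~ /^kube-apiserver-/ { print $2, $5 }'
}

# Usage against a live cluster (assumes a logged-in oc session):
#   oc get pods -A | kube_apiserver_restarts > restarts-before.txt
```

Saving the result before creating any sc or pvc gives a baseline to compare against once the test pods are up.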

- Creating a non-zoned sc and pvc:
$ cat sc-non-zoned.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium-nonzoned
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "false"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer

$ oc apply -f sc-non-zoned.yaml
storageclass.storage.k8s.io/managed-premium-nonzoned created

$ oc get sc
NAME                        PROVISIONER                AGE
managed-premium (default)   kubernetes.io/azure-disk   54m
managed-premium-nonzoned    kubernetes.io/azure-disk   7s

$ cat pvc-non-zoned.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-non
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-nonzoned
  resources:
    requests:
      storage: 5Gi


$ oc apply -f pvc-non-zoned.yaml      
persistentvolumeclaim/azure-managed-non created

$ oc get pvc
NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-non   Pending                                      managed-premium-nonzoned   4s

$ cat mypod-non-zoned.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-non

$ oc create -f mypod-non-zoned.yaml 
pod/mypod created

Check the status of the created pod:
$ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
mypod    0/1     Pending   0          7m8s

$ oc describe pod/mypod
Name:         mypod
Namespace:    default
...
Status:       Pending
...
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  azure-managed-non
    ReadOnly:   false
  default-token-tfnhm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-tfnhm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  Failed to bind volumes: pv "pvc-d47bebc8-cfb6-468a-9d21-06123a09621c" node affinity doesn't match node "kewang24azure32-cgf88-worker-westus23-45f5z": No matching NodeSelectorTerms
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 3 node(s) had volume node affinity conflict.
  
From the above results, the PV provisioned for the non-zoned pvc has a node affinity that doesn't match any NodeSelectorTerms on OCP 4.3, so mypod stays Pending. The same setup works fine on OCP 4.5 and 4.6.
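To see why the scheduler reports a volume node affinity conflict, the PV's required node selector terms can be put next to the zone labels on the nodes. The sketch below was not run as part of this verification. The function name show_pv_affinity is hypothetical, and the label key failure-domain.beta.kubernetes.io/zone is an assumption about what nodes carry on a 4.3 cluster.

```shell
# Hypothetical helper (not from the original report): print the node affinity
# terms recorded on a PV, followed by the nodes with their zone label.
# Assumes a logged-in oc session and that nodes carry the
# failure-domain.beta.kubernetes.io/zone label (an assumption for OCP 4.3).
show_pv_affinity() {
  pv="$1"
  oc get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
  echo
  oc get nodes -L failure-domain.beta.kubernetes.io/zone
}

# Usage, with the PV name taken from the FailedScheduling event:
#   show_pv_affinity pvc-d47bebc8-cfb6-468a-9d21-06123a09621c
```

If none of the worker nodes' zone values satisfies the PV's terms, that would be consistent with the "No matching NodeSelectorTerms" event.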

- Creating a zoned sc and pvc:
A default zoned sc already exists, so there is no need to create a new one.
$ oc get sc
NAME                        PROVISIONER                AGE
managed-premium (default)   kubernetes.io/azure-disk   56m
...

$ cat pvc-zoned.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 5Gi
      

$ oc apply -f pvc-zoned.yaml 
persistentvolumeclaim/azure-managed-disk created

$ oc get pvc
NAME                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
azure-managed-disk   Pending                                                                        managed-premium            5s
...

$ cat mypod-zoned.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
spec:
  containers:
  - name: mypod1
    image: nginx:1.15.5
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi
    volumeMounts:
    - mountPath: "/mnt/azure"
      name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-disk
        
$ oc apply -f mypod-zoned.yaml 
pod/mypod1 created

$ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
mypod    0/1     Pending   0          7m8s
mypod1   1/1     Running   0          3m56s

$ oc get pods -A | grep -E 'apiserver|NAME' | grep -vE 'installer|revision|catalog'
NAMESPACE                                               NAME                                                              READY   STATUS      RESTARTS   AGE
openshift-apiserver-operator                            openshift-apiserver-operator-66977c5c67-qmxpd                     1/1     Running     1          143m
openshift-apiserver                                     apiserver-4l6h9                                                   1/1     Running     0          135m
openshift-apiserver                                     apiserver-6tgpf                                                   1/1     Running     0          135m
openshift-apiserver                                     apiserver-zcgkp                                                   1/1     Running     0          136m
openshift-kube-apiserver-operator                       kube-apiserver-operator-796f4664b7-2pc48                          1/1     Running     1          143m
openshift-kube-apiserver                                kube-apiserver-kewang24azure32-cgf88-master-0                     3/3     Running     4          132m
openshift-kube-apiserver                                kube-apiserver-kewang24azure32-cgf88-master-1                     3/3     Running     4          110m
openshift-kube-apiserver                                kube-apiserver-kewang24azure32-cgf88-master-2                     3/3     Running     4          129m

From the above test results, no new crash loop occurred: the kube-apiserver restart counts are still 4 after creating both the zoned and the non-zoned sc and pvc, so the kube-apiservers no longer crash in either case. Moving the bug to VERIFIED.
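The before/after comparison was done by eye here; it could also be scripted. This is a minimal sketch under stated assumptions: the function name no_new_restarts and the snapshot file names are hypothetical, and each snapshot file is assumed to hold lines of the form "<pod> <restarts>".

```shell
# Hypothetical helper (not from the original report): compare two restart
# snapshots, each holding lines of "<pod> <restarts>". Prints every pod whose
# count grew and returns non-zero if any did; returns 0 otherwise.
no_new_restarts() {
  awk 'FILENAME == ARGV[1] { before[$1] = $2; next }
       ($1 in before) && ($2 + 0 > before[$1] + 0) {
         print $1 ": " before[$1] " -> " $2
         bad = 1
       }
       END { exit bad + 0 }' "$1" "$2"
}

# Usage (file names are placeholders):
#   no_new_restarts restarts-before.txt restarts-after.txt && echo "no new restarts"
```

Pods that appear only in the second snapshot are ignored, which is enough for this check because the kube-apiserver pod names did not change between the two listings.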

Comment 5 errata-xmlrpc 2020-07-01 15:02:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2628
