Bug 1851874

Summary: In-tree provisioner doesn't work on GCP
Product:         OpenShift Container Platform
Component:       kube-controller-manager
Version:         4.5
Target Release:  4.6.0
Hardware:        Unspecified
OS:              Unspecified
Severity:        medium
Priority:        medium
Status:          CLOSED ERRATA
Reporter:        Wei Duan <wduan>
Assignee:        Tomáš Nožička <tnozicka>
QA Contact:      zhou ying <yinzhou>
CC:              aos-bugs, jsafrane, maszulik, mfojtik
Type:            Bug
Doc Type:        No Doc Update
Last Closed:     2020-10-27 16:09:46 UTC

Description Wei Duan 2020-06-29 09:59:25 UTC
Description of problem:
After the pod is scheduled to a node, the PVC using the default StorageClass is still not bound.

Version-Release number of selected component (if applicable):
[wduan@MINT config]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.4   True        False         7h48m   Cluster version is 4.5.0-rc.4


How reproducible:
100%

Steps to Reproduce:
1. Install a GCP cluster (disconnected environment, FIPS enabled).
2. Create a PVC using the default StorageClass (a sketch of the manifests is shown below).
3. Create a pod that consumes the PVC.
4. Check the status: $ oc get pod,pvc
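
For reference, the PVC and pod manifests were roughly the following, reconstructed from the dumps below. The access mode and requested size are assumptions (the pending claim does not show them); no storageClassName is set, so the default "standard" class is used:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc02
spec:
  accessModes:
  - ReadWriteOnce        # assumed; not visible while the claim is Pending
  resources:
    requests:
      storage: 1Gi       # assumed
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod02
  labels:
    name: frontendhttp
spec:
  containers:
  - name: myfrontend
    image: quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40
    ports:
    - containerPort: 80
    volumeMounts:
    - name: local
      mountPath: /mnt/local
  volumes:
  - name: local
    persistentVolumeClaim:
      claimName: mypvc02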

Actual results:
[wduan@MINT 01_general]$ oc get pod,pvc
NAME          READY   STATUS              RESTARTS   AGE
pod/mypod02   0/1     ContainerCreating   0          10m

NAME                            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc02   Pending                                      standard       10m


Expected results:
Pod should be Running and PVC should be Bound.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:
[wduan@MINT 01_general]$ oc describe persistentvolumeclaim/mypvc02
Name:          mypvc02
Namespace:     wduan
StorageClass:  standard
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Mounted By:    mypod02
Events:
  Type    Reason                Age                 From                         Message
  ----    ------                ----                ----                         -------
  Normal  WaitForFirstConsumer  81s (x43 over 11m)  persistentvolume-controller  waiting for first consumer to be created before binding

StorageClass Dump (if StorageClass used by PV/PVC):
[wduan@MINT 01_general]$ oc get sc standard -o yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2020-06-29T01:32:01Z"
  name: standard
  ownerReferences:
  - apiVersion: v1
    kind: clusteroperator
    name: storage
    uid: c6ec2c1f-bfbe-470b-8a56-4c24e51df792
  resourceVersion: "10256"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/standard
  uid: 2ef24263-9ebe-451e-b0eb-075f3d73a2d9
parameters:
  replication-type: none
  type: pd-standard
provisioner: kubernetes.io/gce-pd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer


Additional info:
[wduan@MINT 01_general]$ oc describe pod/mypod02
Name:         mypod02
Namespace:    wduan
Priority:     0
Node:         yinzhougcp-k2fqv-worker-c-krtm6.c.openshift-qe.internal/10.0.32.4
Start Time:   Mon, 29 Jun 2020 17:44:19 +0800
Labels:       name=frontendhttp
Annotations:  openshift.io/scc: anyuid
Status:       Pending
IP:           
IPs:          <none>
Containers:
  myfrontend:
    Container ID:   
    Image:          quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/local from local (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pwztl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  local:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mypvc02
    ReadOnly:   false
  default-token-pwztl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pwztl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From                                                              Message
  ----     ------            ----                 ----                                                              -------
  Warning  FailedScheduling  <unknown>            default-scheduler                                                 persistentvolumeclaim "mypvc02" not found
  Warning  FailedScheduling  <unknown>            default-scheduler                                                 persistentvolumeclaim "mypvc02" not found
  Normal   Scheduled         <unknown>            default-scheduler                                                 Successfully assigned wduan/mypod02 to yinzhougcp-k2fqv-worker-c-krtm6.c.openshift-qe.internal
  Warning  FailedMount       6m25s (x7 over 11m)  kubelet, yinzhougcp-k2fqv-worker-c-krtm6.c.openshift-qe.internal  Unable to attach or mount volumes: unmounted volumes=[local], unattached volumes=[default-token-pwztl local]: error processing PVC wduan/mypvc02: PVC is not bound
  Warning  FailedMount       114s (x35 over 11m)  kubelet, yinzhougcp-k2fqv-worker-c-krtm6.c.openshift-qe.internal  Unable to attach or mount volumes: unmounted volumes=[local], unattached volumes=[local default-token-pwztl]: error processing PVC wduan/mypvc02: PVC is not bound

Comment 2 Jan Safranek 2020-06-29 11:30:08 UTC
The reason is that kube-controller-manager is degraded:


$ oc get clusteroperator kube-controller-manager -o yaml
  - lastTransitionTime: "2020-06-29T08:15:44Z"
    message: "ConfigObservationDegraded: .spec.featureSet %!q(*v1.FeatureGateEnabledDisabled=<nil>)
      not found\nStaticPodsDegraded: pod/kube-controller-manager-yinzhougcp-k2fqv-master-1.c.openshift-qe.internal
      container \"cluster-policy-controller\" is not ready: unknown reason\nStaticPodsDegraded:
      pod/kube-controller-manager-yinzhougcp-k2fqv-master-1.c.openshift-qe.internal
      container \"cluster-policy-controller\" is terminated: Error: I0629 11:22:13.896668
      \      1 policy_controller.go:41] Starting controllers on 0.0.0.0:10357 (d88621be)\nStaticPodsDegraded:
      I0629 11:22:13.898539       1 standalone_apiserver.go:103] Started health checks
      at 0.0.0.0:10357\nStaticPodsDegraded: I0629 11:22:13.899177       1 leaderelection.go:242]
      attempting to acquire leader lease  openshift-kube-controller-manager/cluster-policy-controller...\nStaticPodsDegraded:
      F0629 11:22:13.899898       1 standalone_apiserver.go:119] listen tcp 0.0.0.0:10357:
      bind: address already in use\nStaticPodsDegraded: \nStaticPodsDegraded: pod/kube-controller-manager-yinzhougcp-k2fqv-master-1.c.openshift-qe.internal
      container \"kube-controller-manager\" is not ready: unknown reason"
    reason: ConfigObservation_Error::StaticPods_Error
    status: "True"
    type: Degraded


After deleting the kube-controller-manager pods (not the operator), I got this from kube-controller-manager:
  - lastTransitionTime: "2020-06-29T08:15:44Z"
    message: 'ConfigObservationDegraded: .spec.featureSet %!q(*v1.FeatureGateEnabledDisabled=<nil>)
      not found'
    reason: ConfigObservation_Error
    status: "True"
    type: Degraded
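
For reference, the remaining degraded condition can be pulled out directly, and the static pods can be restarted by deleting them so the kubelet recreates them. A rough command sketch (the pod name follows the pattern seen in the condition message above):

$ oc get clusteroperator kube-controller-manager \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
$ oc -n openshift-kube-controller-manager get pods
$ oc -n openshift-kube-controller-manager delete pod kube-controller-manager-<master-node-name>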

Comment 3 Maciej Szulik 2020-07-01 09:31:10 UTC
Can you either provide us with a cluster where this is happening or a must-gather dump from that cluster?
I'm especially interested in the following resources:

oc get featuregates/cluster -oyaml
oc get kubecontrollermanager/cluster -oyaml
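
If the cluster is already gone, a must-gather archive collected with something like the following would also work:

$ oc adm must-gather --dest-dir=./must-gather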

Comment 4 Wei Duan 2020-07-06 03:14:46 UTC
Sorry, I missed the "needinfo" notification and the cluster has already been removed.
I set up a new cluster last Friday with the same flexy template on 4.5.0-0.nightly-2020-07-02-190154, and I did not hit this issue again.

Comment 5 Maciej Szulik 2020-07-06 11:11:01 UTC
I'm lowering the priority based on the previous comment. If you hit the issue again, please let us know.

Comment 6 Wei Duan 2020-07-07 01:16:04 UTC
I removed the TestBlocker tag.

Comment 7 Maciej Szulik 2020-07-07 08:30:28 UTC
It looks like this might have been fixed by https://github.com/openshift/cluster-kube-controller-manager-operator/pull/415; moving to QA for verification.
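
For verification, it should be enough to check that the payload includes the operator fix and that kube-controller-manager is no longer Degraded; a rough sketch (the pull spec is a placeholder):

$ oc adm release info <payload-pullspec> --commits | grep cluster-kube-controller-manager-operator
$ oc get clusteroperator kube-controller-manager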

Comment 11 zhou ying 2020-07-08 15:02:08 UTC
Confirmed with payload: 4.6.0-0.nightly-2020-07-07-141639

[root@dhcp-140-138 ~]# oc get po 
NAME    READY   STATUS         RESTARTS   AGE
mypod   0/1     ErrImagePull   0          4m24s

[zhouying@dhcp-140-138 ~]$ oc get pvc
NAME   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ebs    Bound    pvc-f439d62c-3611-49cb-8cc8-ca4931998394   1Gi        RWO            standard       49s

Comment 14 errata-xmlrpc 2020-10-27 16:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196