Bug 1868646 - With pre-configured pv-pools before OCS upgrade, noobaa-operator pod reports panic and is in CLBO post upgrade to 4.5-rc1
Summary: With pre-configured pv-pools before OCS upgrade, noobaa-operator pod reports ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Jacky Albo
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-13 11:33 UTC by Neha Berry
Modified: 2020-09-23 09:04 UTC (History)
7 users (show)

Fixed In Version: 4.5.0-64.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:18:38 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github noobaa noobaa-operator pull 393 0 None closed Change pod delete handling for pvpool 2020-08-30 07:16:25 UTC
Github noobaa noobaa-operator pull 396 0 None closed change with pod delete handling for pvpool 2020-08-30 07:16:24 UTC
Red Hat Product Errata RHBA-2020:3754 0 None None None 2020-09-15 10:19:02 UTC

Description Neha Berry 2020-08-13 11:33:08 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
-------------------------------------------------------------------------
Created a few pv-pool backingstores, a bucketclass, and an OBC. Started an OCS upgrade from 4.5.0-521.ci to 4.5.0-526.ci (4.5-rc1); post upgrade, the following pods are in CrashLoopBackOff (CLBO):

P.S.: The build 4.5.0-526.ci has a few new fixes for pv-pool (e.g. Bug 1867762); could it be that these changes caused issues during the upgrade?

$ oc get pods -o wide|grep CrashLoopBackOff
bs-pv1-noobaa-pod-27320f1b                                        0/1     CrashLoopBackOff   24         152m   10.129.2.27   compute-2   <none>           <none>
neha-cli-noobaa-pod-a3cdf302                                      0/1     CrashLoopBackOff   24         151m   10.129.2.28   compute-2   <none>           <none>
noobaa-operator-6f4954d46f-qc88c                                  0/1     CrashLoopBackOff   31         143m   10.129.2.31   compute-2   <none>           <none>


>> Both pv-pool backingstores are in Connecting state

======= backingstore ==========
NAME                           TYPE            PHASE        AGE
bs-pv1                         pv-pool         Connecting   39m
neha-cli                       pv-pool         Connecting   38m
noobaa-default-backing-store   s3-compatible   Ready        2d21h

$ oc logs noobaa-operator-6f4954d46f-qc88c

E0813 09:51:08.930413       1 runtime.go:78] Observed a panic: &logrus.Entry{Logger:(*logrus.Logger)(0xc0000e4150), Data:logrus.Fields{}, Time:time.Time{wall:0xbfc561233771e51f, ext:5480383991, loc:(*time.Location)(0x2dfdb80)}, Level:0x0, Caller:(*runtime.Frame)(nil), Message:"☠️  Panic Attack: [Invalid] Pod \"bs-pv1-noobaa-pod-27320f1b\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n\u00a0\u00a0core.PodSpec{\n\u00a0\u00a0\t... // 11 identical fields\n\u00a0\u00a0\tNodeName:         \"compute-2\",\n\u00a0\u00a0\tSecurityContext:  &core.PodSecurityContext{SELinuxOptions: &core.SELinuxOptions{Level: \"s0:c24,c14\"}, FSGroup: &1000580000},\n-\u00a0\tImagePullSecrets: nil,\n+\u00a0\tImagePullSecrets: []core.LocalObjectReference{{Name: \"default-dockercfg-sd5pn\"}},\n\u00a0\u00a0\tHostname:         \"\",\n\u00a0\u00a0\tSubdomain:        \"\",\n\u00a0\u00a0\t... // 13 identical fields\n\u00a0\u00a0}\n", Buffer:(*bytes.Buffer)(nil), Context:(*context.emptyCtx)(0xc000048270), err:""} (&{0xc0000e4150 map[] 2020-08-13 09:51:08.930211103 +0000 UTC m=+5.480383991 panic <nil> ☠️  Panic Attack: [Invalid] Pod "bs-pv1-noobaa-pod-27320f1b" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)
  core.PodSpec{
  	... // 11 identical fields
  	NodeName:         "compute-2",
  	SecurityContext:  &core.PodSecurityContext{SELinuxOptions: &core.SELinuxOptions{Level: "s0:c24,c14"}, FSGroup: &1000580000},
- 	ImagePullSecrets: nil,
+ 	ImagePullSecrets: []core.LocalObjectReference{{Name: "default-dockercfg-sd5pn"}},
  	Hostname:         "",
  	Subdomain:        "",
  	... // 13 identical fields
  }
 <nil> context.TODO })
goroutine 607 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1ba7f40, 0xc000550fc0)
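The panic is not arbitrary: the Kubernetes API server rejects in-place updates to a running pod's spec for all but a handful of fields, exactly as the quoted error text says. A minimal Python sketch of that rule (the field names are simplified stand-ins, not the real Kubernetes API types):

```python
# Minimal sketch of the pod-update rule quoted in the panic message:
# Kubernetes rejects in-place updates to a running pod's spec except for
# container images, activeDeadlineSeconds, and additions to tolerations.

MUTABLE_FIELDS = {"containers_image", "init_containers_image",
                  "active_deadline_seconds", "tolerations"}

def forbidden_changes(old_spec, new_spec):
    """Return the spec fields whose change would be rejected by the API server."""
    changed = []
    for field in sorted(set(old_spec) | set(new_spec)):
        if field in MUTABLE_FIELDS:
            continue  # these fields may be updated on a live pod
        if old_spec.get(field) != new_spec.get(field):
            changed.append(field)
    return changed

# The upgrade attempted exactly this kind of update on the pv-pool pod:
# swap the image (allowed) AND change imagePullSecrets (forbidden).
old = {"containers_image": "mcg-core@sha256:old", "image_pull_secrets": None}
new = {"containers_image": "mcg-core@sha256:new",
       "image_pull_secrets": [{"name": "default-dockercfg-sd5pn"}]}

print(forbidden_changes(old, new))  # ['image_pull_secrets']
```

The operator hit this rejection during reconcile and panicked instead of handling it, which is what left the pod and the operator itself in CLBO.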





Version of all relevant components (if applicable):
-----------------------------------------------------
OCS before upgrade = ocs-operator.v4.5.0-521.ci
OCS after upgrade = ocs-operator.v4.5.0-526.ci
OCP = 4.5.0-0.nightly-2020-08-10-150345

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
----------------------------------------------------
Yes.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------
Fresh deployments of v4.5.0-526.ci have been successful so far.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
--------------------------------------------------
4

Is this issue reproducible?
-----------------------------
Tested once, on a VMware-based cluster with pv-pool backingstores pre-configured before the upgrade.

Can this issue reproduce from the UI?
---------------------------------
NA

If this is a regression, please provide more details to justify this:
----------------------------------------
PV-pool is new in OCS 4.5

Steps to Reproduce:
1. Create an Internal mode cluster before the RC build - e.g. v4.5.0-521.ci
2. Create 1-2 pv-pool based backingstores. Outputs added in Additional Info
3. Edit the catsrc to upgrade the OCS version
$ oc get catsrc -n openshift-marketplace ocs-catalogsource -o yaml|grep -i image
        f:image: {}
  image: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-526.ci

4. Check the progress of the upgrade, the states of the Backingstores and the noobaa pods.

Actual results:
--------------------
The noobaa-operator pod panics if pv-pools were configured before the upgrade.

Expected results:
------------------------
The operator pod should not panic.

Additional info:
----------------------
Before upgrade:
---------------------

======= backingstore ==========
NAME                           TYPE            PHASE   AGE
bs-pv1                         pv-pool         Ready   7m40s
neha-cli                       pv-pool         Ready   6m44s
noobaa-default-backing-store   s3-compatible   Ready   2d21h

====PVC=========
bs-pv1-noobaa-pvc-27320f1b     Bound    pvc-dc81465b-5e41-4624-b3c5-da1f680a9268   50Gi       RWO            ocs-storagecluster-ceph-rbd   7m33s
db-noobaa-db-0                 Bound    pvc-42b94d16-9421-43e3-9809-28161111b53f   50Gi       RWO            ocs-storagecluster-ceph-rbd   2d21h
neha-cli-noobaa-pvc-a3cdf302   Bound    pvc-68fb9d44-15f5-4848-a9d5-0ce4c16cbef5   40Gi       RWO            ocs-storagecluster-ceph-rbd   6m34s

========PODS=====
bs-pv1-noobaa-pod-27320f1b                                        1/1     Running   0          7m31s   10.129.2.27   compute-2   <none>           <none>
neha-cli-noobaa-pod-a3cdf302                                      1/1     Running   0          6m32s   10.129.2.28   compute-2   <none>           <none>


After upgrade
===================

$ oc get bucketclass
NAME                          PLACEMENT                                                        PHASE       AGE
noobaa-default-bucket-class   map[tiers:[map[backingStores:[noobaa-default-backing-store]]]]   Ready       2d23h
pv-pool-bucket                map[tiers:[map[backingStores:[bs-pv1] placement:Spread]]]        Verifying   154m

[nberry@localhost upgrade-521-526_dc3]$ oc get backingstore
NAME                           TYPE            PHASE        AGE
bs-pv1                         pv-pool         Connecting   160m
neha-cli                       pv-pool         Connecting   159m
noobaa-default-backing-store   s3-compatible   Ready        2d23h

[nberry@localhost upgrade-521-526_dc3]$ oc get obc
NAME      STORAGE-CLASS                 PHASE   AGE
nbio      openshift-storage.noobaa.io   Bound   2d2h
nbio1     ocs-storagecluster-ceph-rgw   Bound   2d2h
obc-pv1   openshift-storage.noobaa.io   Bound   154m

Comment 3 Jacky Albo 2020-08-13 12:19:44 UTC
Looks like ImagePullSecrets was changed between before and after the upgrade.
Can you share oc get noobaa -o yaml from before and after the upgrade so we can be sure that this is the cause of the issue?
And if it was indeed changed, can you please verify that a normal upgrade without a change to the secrets works? Thanks
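The comparison requested here can be done mechanically: capture the resource before and after the upgrade and diff its spec. A small illustrative Python sketch, assuming the dumps were saved with `oc get noobaa -o json` (the file-path handling and function name are invented; only the standard library is used):

```python
import difflib
import json

def spec_diff(before_path, after_path):
    """Unified diff of .spec between two 'oc get noobaa -o json' dumps."""
    def spec_lines(path):
        with open(path) as f:
            doc = json.load(f)
        # 'oc get <kind> -o json' wraps results in a List with .items
        spec = doc["items"][0]["spec"]
        return json.dumps(spec, indent=2, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(
        spec_lines(before_path), spec_lines(after_path),
        fromfile="before-upgrade", tofile="after-upgrade"))
```

Any line the diff marks with `-`/`+` (here, the image digest and any pull-secret reference) is a candidate cause for the rejected pod update.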

Comment 4 Neha Berry 2020-08-13 13:08:20 UTC
(In reply to Jacky Albo from comment #3)
> looks like ImagePullSecrets was changed from before and after the upgrade. 
> Can you share oc get noobaa -o yaml from before and after the upgrade so we
> can be sure that's this is the cause of the issue?
> And if indeed it was changed can you please verify that a normal upgrade
> without a change to the secrets works? thanks

@Jacky

I do not have oc get noobaa -o yaml before upgrade. Does it get collected as part of must-gather?

Moreover, I am not sure how the secret got changed, as we had only created a few pv-pools and OBCs... We do not even know how to change the secret.

oc get noobaa -o yaml after upgrade:


$ oc get noobaa -o yaml
apiVersion: v1
items:
- apiVersion: noobaa.io/v1alpha1
  kind: NooBaa
  metadata:
    creationTimestamp: "2020-08-10T11:39:51Z"
    finalizers:
    - noobaa.io/graceful_finalizer
    generation: 3
    labels:
      app: noobaa
    managedFields:
    - apiVersion: noobaa.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
          f:ownerReferences: {}
        f:spec:
          .: {}
          f:affinity:
            .: {}
            f:nodeAffinity:
              .: {}
              f:requiredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:nodeSelectorTerms: {}
          f:coreResources:
            .: {}
            f:limits:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:requests:
              .: {}
              f:cpu: {}
              f:memory: {}
          f:dbImage: {}
          f:dbResources:
            .: {}
            f:limits:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:requests:
              .: {}
              f:cpu: {}
              f:memory: {}
          f:dbStorageClass: {}
          f:dbVolumeResources:
            .: {}
            f:requests:
              .: {}
              f:storage: {}
          f:endpoints:
            .: {}
            f:maxCount: {}
            f:minCount: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
          f:image: {}
          f:pvPoolDefaultStorageClass: {}
          f:tolerations: {}
      manager: ocs-operator
      operation: Update
      time: "2020-08-13T08:54:09Z"
    - apiVersion: noobaa.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers: {}
        f:spec:
          f:cleanupPolicy: {}
        f:status:
          .: {}
          f:accounts:
            .: {}
            f:admin:
              .: {}
              f:secretRef:
                .: {}
                f:name: {}
                f:namespace: {}
          f:actualImage: {}
          f:conditions: {}
          f:endpoints:
            .: {}
            f:readyCount: {}
            f:virtualHosts: {}
          f:observedGeneration: {}
          f:phase: {}
          f:readme: {}
          f:services:
            .: {}
            f:serviceMgmt:
              .: {}
              f:externalDNS: {}
              f:internalDNS: {}
              f:internalIP: {}
              f:nodePorts: {}
              f:podPorts: {}
            f:serviceS3:
              .: {}
              f:externalDNS: {}
              f:internalDNS: {}
              f:internalIP: {}
              f:nodePorts: {}
              f:podPorts: {}
      manager: noobaa-operator
      operation: Update
      time: "2020-08-13T12:42:24Z"
    name: noobaa
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: a8c09b98-373e-4e11-9b5a-08f241de2bc8
    resourceVersion: "33882083"
    selfLink: /apis/noobaa.io/v1alpha1/namespaces/openshift-storage/noobaas/noobaa
    uid: 84ccf3cf-753c-433f-8ece-41bdaa53405a
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cluster.ocs.openshift.io/openshift-storage
              operator: Exists
    coreResources:
      limits:
        cpu: "1"
        memory: 4Gi
      requests:
        cpu: "1"
        memory: 4Gi
    dbImage: registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95
    dbResources:
      limits:
        cpu: 500m
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 500Mi
    dbStorageClass: ocs-storagecluster-ceph-rbd
    dbVolumeResources:
      requests:
        storage: 50Gi
    endpoints:
      maxCount: 1
      minCount: 1
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
    image: quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27
    pvPoolDefaultStorageClass: ocs-storagecluster-ceph-rbd
    tolerations:
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
  status:
    accounts:
      admin:
        secretRef:
          name: noobaa-admin
          namespace: openshift-storage
    actualImage: quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27
    conditions:
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-13T12:42:24Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "True"
      type: Available
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-13T12:42:24Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-10T11:39:52Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-13T12:42:24Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "True"
      type: Upgradeable
    endpoints:
      readyCount: 1
      virtualHosts:
      - s3.openshift-storage.svc
    observedGeneration: 3
    phase: Ready
    readme: "\n\n\tWelcome to NooBaa!\n\t-----------------\n\tNooBaa Core Version:
      \    5.5.0-3ff3e13\n\tNooBaa Operator Version: 2.3.0\n\n\tLets get started:\n\n\t1.
      Connect to Management console:\n\n\t\tRead your mgmt console login information
      (email & password) from secret: \"noobaa-admin\".\n\n\t\t\tkubectl get secret
      noobaa-admin -n openshift-storage -o json | jq '.data|map_values(@base64d)'\n\n\t\tOpen
      the management console service - take External IP/DNS or Node Port or use port
      forwarding:\n\n\t\t\tkubectl port-forward -n openshift-storage service/noobaa-mgmt
      11443:443 &\n\t\t\topen https://localhost:11443\n\n\t2. Test S3 client:\n\n\t\tkubectl
      port-forward -n openshift-storage service/s3 10443:443 &\n\t\tNOOBAA_ACCESS_KEY=$(kubectl
      get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')\n\t\tNOOBAA_SECRET_KEY=$(kubectl
      get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')\n\t\talias
      s3='AWS_ACCESS_KEY_ID=$NOOBAA_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$NOOBAA_SECRET_KEY
      aws --endpoint https://localhost:10443 --no-verify-ssl s3'\n\t\ts3 ls\n\n"
    services:
      serviceMgmt:
        externalDNS:
        - https://noobaa-mgmt-openshift-storage.apps.sagrawal-dc3-ind.qe.rh-ocs.com
        internalDNS:
        - https://noobaa-mgmt.openshift-storage.svc:443
        internalIP:
        - https://172.30.158.249:443
        nodePorts:
        - https://10.70.60.44:30117
        podPorts:
        - https://10.129.2.40:8443
      serviceS3:
        externalDNS:
        - https://s3-openshift-storage.apps.sagrawal-dc3-ind.qe.rh-ocs.com
        internalDNS:
        - https://s3.openshift-storage.svc:443
        internalIP:
        - https://172.30.96.165:443
        nodePorts:
        - https://10.70.60.44:31431
        podPorts:
        - https://10.129.2.38:6443
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 5 Jacky Albo 2020-08-13 16:35:13 UTC
OK, so @nberry provided me with the cluster creds. Thank you.
As I suspected, before the upgrade the system was using a secret to reach the quay registry - default-dockercfg-sd5pn.
After the upgrade it was changed to not use a secret at all.
Removing a secret during an upgrade does not seem to be handled correctly - we will need to think about the right way to attack this.
old image: quay.io/rhceph-dev/mcg-core@sha256:f5fa382c8bcf832d079692e1980b0560ba5a12e155e8bc0715cfd6acc314f602
new image: quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27
For now, it is recommended to test the upgrade without changing secrets.
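The linked PRs ("Change pod delete handling for pvpool") point at the standard remedy for this class of failure: rather than updating immutable fields on a live pod, delete the old pod and create a replacement with the new spec. A hedged Python sketch of that reconcile pattern (the in-memory `cluster` dict and helper names are invented for illustration; the real operator is Go code talking to the Kubernetes API):

```python
# Sketch of delete-and-recreate reconciliation for pv-pool pods: when the
# desired spec differs on an immutable field, delete the old pod and
# create a fresh one instead of issuing a forbidden in-place update.
# 'cluster' is a toy in-memory stand-in for the Kubernetes API.

MUTABLE_FIELDS = {"image"}  # fields an in-place pod update may change

def needs_recreate(actual, desired):
    """True if any field other than a mutable one differs."""
    keys = (set(actual) | set(desired)) - MUTABLE_FIELDS
    return any(actual.get(k) != desired.get(k) for k in keys)

def reconcile_pod(cluster, name, desired):
    actual = cluster.get(name)
    if actual is None:
        cluster[name] = dict(desired)        # pod missing: create it
    elif needs_recreate(actual, desired):
        del cluster[name]                    # immutable field changed: delete...
        cluster[name] = dict(desired)        # ...and recreate with the new spec
    elif actual != desired:
        actual["image"] = desired["image"]   # only the image changed: patch in place

cluster = {"bs-pv1-noobaa-pod": {"image": "mcg-core:old", "image_pull_secrets": None}}
desired = {"image": "mcg-core:new",
           "image_pull_secrets": [{"name": "default-dockercfg-sd5pn"}]}
reconcile_pod(cluster, "bs-pv1-noobaa-pod", desired)
print(cluster["bs-pv1-noobaa-pod"]["image"])  # mcg-core:new
```

The cost of this approach, as noted later in the bug, is a pod restart (and the accompanying I/O disruption) whenever the pull secrets or other immutable fields change across an upgrade.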

Comment 7 Jacky Albo 2020-08-20 09:08:15 UTC
Fixed an issue where the imagePullSecret was being changed in the wrong way.

Comment 14 Jacky Albo 2020-08-30 07:43:54 UTC
1. There will be I/O disruptions on the pods, as they get restarted to pick up the new image.
2. They are restarted as part of the upgrade - the pod running the old image is deleted and a new one running the new version is started instead.
3. This is great :) This is important info for us.

In short, this is the expected behaviour and you can go ahead and close it.

Comment 15 Neha Berry 2020-08-31 11:41:38 UTC
(In reply to Jacky Albo from comment #14)
> 1. There will be I/O disruptions on the pods, as they get restarted to pick
> up the new image.
> 2. They are restarted as part of the upgrade - the pod running the old image
> is deleted and a new one running the new version is started instead.
> 3. This is great :) This is important info for us.
> 
> In short, this is the expected behaviour and you can go ahead and close it.

Thank you, Jacky, for all the confirmations.

Moving the BZ to verified state based on Comment#13 and Comment#14.

Thanks

Comment 18 errata-xmlrpc 2020-09-15 10:18:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

