Description of problem (please be as detailed as possible and provide log snippets):
-------------------------------------------------------------------------
Created a few pv-pool backingstores, a bucketclass and an OBC. Started an OCS upgrade from 4.5.0-521.ci to 4.5.0-526.ci (4.5-rc1), and post-upgrade the following pods are in CrashLoopBackOff (CLBO).

P.S.: The build 4.5.0-526.ci has a few new fixes for pv-pool (e.g. Bug 1867762); could it be that these changes resulted in issues during the upgrade?

$ oc get pods -o wide|grep CrashLoopBackOff
bs-pv1-noobaa-pod-27320f1b         0/1   CrashLoopBackOff   24   152m   10.129.2.27   compute-2   <none>   <none>
neha-cli-noobaa-pod-a3cdf302       0/1   CrashLoopBackOff   24   151m   10.129.2.28   compute-2   <none>   <none>
noobaa-operator-6f4954d46f-qc88c   0/1   CrashLoopBackOff   31   143m   10.129.2.31   compute-2   <none>   <none>

>> Both backingstores are in the Connecting phase

======= backingstore ==========
NAME                           TYPE            PHASE        AGE
bs-pv1                         pv-pool         Connecting   39m
neha-cli                       pv-pool         Connecting   38m
noobaa-default-backing-store   s3-compatible   Ready        2d21h

$ oc logs noobaa-operator-6f4954d46f-qc88c
E0813 09:51:08.930413       1 runtime.go:78] Observed a panic: &logrus.Entry{Logger:(*logrus.Logger)(0xc0000e4150), Data:logrus.Fields{}, Time:time.Time{wall:0xbfc561233771e51f, ext:5480383991, loc:(*time.Location)(0x2dfdb80)}, Level:0x0, Caller:(*runtime.Frame)(nil), Message:"☠️ Panic Attack: [Invalid] Pod \"bs-pv1-noobaa-pod-27320f1b\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n\u00a0\u00a0core.PodSpec{\n\u00a0\u00a0\t... // 11 identical fields\n\u00a0\u00a0\tNodeName: \"compute-2\",\n\u00a0\u00a0\tSecurityContext: &core.PodSecurityContext{SELinuxOptions: &core.SELinuxOptions{Level: \"s0:c24,c14\"}, FSGroup: &1000580000},\n-\u00a0\tImagePullSecrets: nil,\n+\u00a0\tImagePullSecrets: []core.LocalObjectReference{{Name: \"default-dockercfg-sd5pn\"}},\n\u00a0\u00a0\tHostname: \"\",\n\u00a0\u00a0\tSubdomain: \"\",\n\u00a0\u00a0\t... // 13 identical fields\n\u00a0\u00a0}\n", Buffer:(*bytes.Buffer)(nil), Context:(*context.emptyCtx)(0xc000048270), err:""} (&{0xc0000e4150 map[] 2020-08-13 09:51:08.930211103 +0000 UTC m=+5.480383991 panic <nil> ☠️ Panic Attack: [Invalid] Pod "bs-pv1-noobaa-pod-27320f1b" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)
  core.PodSpec{
  	... // 11 identical fields
  	NodeName: "compute-2",
  	SecurityContext: &core.PodSecurityContext{SELinuxOptions: &core.SELinuxOptions{Level: "s0:c24,c14"}, FSGroup: &1000580000},
- 	ImagePullSecrets: nil,
+ 	ImagePullSecrets: []core.LocalObjectReference{{Name: "default-dockercfg-sd5pn"}},
  	Hostname: "",
  	Subdomain: "",
  	... // 13 identical fields
  }
 <nil> context.TODO })
goroutine 607 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1ba7f40, 0xc000550fc0)

Version of all relevant components (if applicable):
-----------------------------------------------------
OCS before upgrade = ocs-operator.v4.5.0-521.ci
OCS after upgrade  = ocs-operator.v4.5.0-526.ci
OCP = 4.5.0-0.nightly-2020-08-10-150345

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
----------------------------------------------------
Yes.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------
Fresh deployments of v4.5.0-526.ci have been successful so far.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
--------------------------------------------------
4

Is this issue reproducible?
-----------------------------
Tested once on a VMware-based cluster with existing pv-pool backingstores pre-configured.

Can this issue be reproduced from the UI?
---------------------------------
NA

If this is a regression, please provide more details to justify this:
----------------------------------------
PV-pool is new in OCS 4.5.

Steps to Reproduce:
1. Create an Internal mode cluster on a build older than the RC build - e.g. v4.5.0-521.ci
2. Create 1-2 pv-pool based backingstores. Outputs are added in Additional info.
3. Edit the catsrc to upgrade the OCS version (a non-interactive patch equivalent is sketched after the Additional info section).
$ oc get catsrc -n openshift-marketplace ocs-catalogsource -o yaml|grep -i image
      f:image: {}
  image: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-526.ci
4. Check the progress of the upgrade, the states of the backingstores and the noobaa pods.

Actual results:
--------------------
The noobaa-operator pod reports a panic if pv-pools were configured before the upgrade.

Expected results:
------------------------
The operator pod should not report a panic.

Additional info:
----------------------
Before upgrade:
---------------------
======= backingstore ==========
NAME                           TYPE            PHASE   AGE
bs-pv1                         pv-pool         Ready   7m40s
neha-cli                       pv-pool         Ready   6m44s
noobaa-default-backing-store   s3-compatible   Ready   2d21h

====PVC=========
bs-pv1-noobaa-pvc-27320f1b     Bound   pvc-dc81465b-5e41-4624-b3c5-da1f680a9268   50Gi   RWO   ocs-storagecluster-ceph-rbd   7m33s
db-noobaa-db-0                 Bound   pvc-42b94d16-9421-43e3-9809-28161111b53f   50Gi   RWO   ocs-storagecluster-ceph-rbd   2d21h
neha-cli-noobaa-pvc-a3cdf302   Bound   pvc-68fb9d44-15f5-4848-a9d5-0ce4c16cbef5   40Gi   RWO   ocs-storagecluster-ceph-rbd   6m34s

========PODS=====
bs-pv1-noobaa-pod-27320f1b     1/1   Running   0   7m31s   10.129.2.27   compute-2   <none>   <none>
neha-cli-noobaa-pod-a3cdf302   1/1   Running   0   6m32s   10.129.2.28   compute-2   <none>   <none>

After upgrade:
===================
$ oc get bucketclass
NAME                          PLACEMENT                                                         PHASE       AGE
noobaa-default-bucket-class   map[tiers:[map[backingStores:[noobaa-default-backing-store]]]]   Ready       2d23h
pv-pool-bucket                map[tiers:[map[backingStores:[bs-pv1] placement:Spread]]]        Verifying   154m

[nberry@localhost upgrade-521-526_dc3]$ oc get backingstore
NAME                           TYPE            PHASE        AGE
bs-pv1                         pv-pool         Connecting   160m
neha-cli                       pv-pool         Connecting   159m
noobaa-default-backing-store   s3-compatible   Ready        2d23h

[nberry@localhost upgrade-521-526_dc3]$ oc get obc
NAME      STORAGE-CLASS                 PHASE   AGE
nbio      openshift-storage.noobaa.io   Bound   2d2h
nbio1     ocs-storagecluster-ceph-rgw   Bound   2d2h
obc-pv1   openshift-storage.noobaa.io   Bound   154m
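(Not part of the original report; a hedged, non-interactive equivalent of step 3 above. The CatalogSource name, namespace and image are taken from this report; verify them before using.)

$ oc patch catsrc ocs-catalogsource -n openshift-marketplace --type merge \
    -p '{"spec":{"image":"quay.io/rhceph-dev/ocs-olm-operator:4.5.0-526.ci"}}'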
It looks like ImagePullSecrets changed between before and after the upgrade.
Can you share oc get noobaa -o yaml from before and after the upgrade so we can be sure that this is the cause of the issue?
And if it was indeed changed, can you please verify that a normal upgrade without a change to the secrets works? Thanks
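(Not part of the original comment; a hedged sketch of how the requested before/after comparison could be captured. Resource and pod names are taken from this report; adjust as needed.)

# Before starting the upgrade
$ oc get noobaa noobaa -n openshift-storage -o yaml > noobaa-before.yaml
$ oc get pod bs-pv1-noobaa-pod-27320f1b -n openshift-storage -o jsonpath='{.spec.imagePullSecrets}' > pullsecrets-before.txt

# After the upgrade, capture the same state again
$ oc get noobaa noobaa -n openshift-storage -o yaml > noobaa-after.yaml
$ oc get pod bs-pv1-noobaa-pod-27320f1b -n openshift-storage -o jsonpath='{.spec.imagePullSecrets}' > pullsecrets-after.txt

$ diff noobaa-before.yaml noobaa-after.yaml
$ diff pullsecrets-before.txt pullsecrets-after.txt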
(In reply to Jacky Albo from comment #3)
> It looks like ImagePullSecrets changed between before and after the upgrade.
> Can you share oc get noobaa -o yaml from before and after the upgrade so we
> can be sure that this is the cause of the issue?
> And if it was indeed changed, can you please verify that a normal upgrade
> without a change to the secrets works? Thanks

@Jacky I do not have oc get noobaa -o yaml from before the upgrade. Does it get collected as part of must-gather?
Moreover, I am not sure how the secret got changed, as we had only created a few pv-pools and OBCs... We do not even know how to change the secret.

oc get noobaa -o yaml after upgrade:

$ oc get noobaa -o yaml
apiVersion: v1
items:
- apiVersion: noobaa.io/v1alpha1
  kind: NooBaa
  metadata:
    creationTimestamp: "2020-08-10T11:39:51Z"
    finalizers:
    - noobaa.io/graceful_finalizer
    generation: 3
    labels:
      app: noobaa
    managedFields:
    - apiVersion: noobaa.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
          f:ownerReferences: {}
        f:spec:
          .: {}
          f:affinity:
            .: {}
            f:nodeAffinity:
              .: {}
              f:requiredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:nodeSelectorTerms: {}
          f:coreResources:
            .: {}
            f:limits:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:requests:
              .: {}
              f:cpu: {}
              f:memory: {}
          f:dbImage: {}
          f:dbResources:
            .: {}
            f:limits:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:requests:
              .: {}
              f:cpu: {}
              f:memory: {}
          f:dbStorageClass: {}
          f:dbVolumeResources:
            .: {}
            f:requests:
              .: {}
              f:storage: {}
          f:endpoints:
            .: {}
            f:maxCount: {}
            f:minCount: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
          f:image: {}
          f:pvPoolDefaultStorageClass: {}
          f:tolerations: {}
      manager: ocs-operator
      operation: Update
      time: "2020-08-13T08:54:09Z"
    - apiVersion: noobaa.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers: {}
        f:spec:
          f:cleanupPolicy: {}
        f:status:
          .: {}
          f:accounts:
            .: {}
            f:admin:
              .: {}
              f:secretRef:
                .: {}
                f:name: {}
                f:namespace: {}
          f:actualImage: {}
          f:conditions: {}
          f:endpoints:
            .: {}
            f:readyCount: {}
            f:virtualHosts: {}
          f:observedGeneration: {}
          f:phase: {}
          f:readme: {}
          f:services:
            .: {}
            f:serviceMgmt:
              .: {}
              f:externalDNS: {}
              f:internalDNS: {}
              f:internalIP: {}
              f:nodePorts: {}
              f:podPorts: {}
            f:serviceS3:
              .: {}
              f:externalDNS: {}
              f:internalDNS: {}
              f:internalIP: {}
              f:nodePorts: {}
              f:podPorts: {}
      manager: noobaa-operator
      operation: Update
      time: "2020-08-13T12:42:24Z"
    name: noobaa
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: a8c09b98-373e-4e11-9b5a-08f241de2bc8
    resourceVersion: "33882083"
    selfLink: /apis/noobaa.io/v1alpha1/namespaces/openshift-storage/noobaas/noobaa
    uid: 84ccf3cf-753c-433f-8ece-41bdaa53405a
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cluster.ocs.openshift.io/openshift-storage
              operator: Exists
    coreResources:
      limits:
        cpu: "1"
        memory: 4Gi
      requests:
        cpu: "1"
        memory: 4Gi
    dbImage: registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95
    dbResources:
      limits:
        cpu: 500m
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 500Mi
    dbStorageClass: ocs-storagecluster-ceph-rbd
    dbVolumeResources:
      requests:
        storage: 50Gi
    endpoints:
      maxCount: 1
      minCount: 1
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
    image: quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27
    pvPoolDefaultStorageClass: ocs-storagecluster-ceph-rbd
    tolerations:
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
  status:
    accounts:
      admin:
        secretRef:
          name: noobaa-admin
          namespace: openshift-storage
    actualImage: quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27
    conditions:
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-13T12:42:24Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "True"
      type: Available
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-13T12:42:24Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-10T11:39:52Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2020-08-10T11:39:52Z"
      lastTransitionTime: "2020-08-13T12:42:24Z"
      message: noobaa operator completed reconcile - system is ready
      reason: SystemPhaseReady
      status: "True"
      type: Upgradeable
    endpoints:
      readyCount: 1
      virtualHosts:
      - s3.openshift-storage.svc
    observedGeneration: 3
    phase: Ready
    readme: "\n\n\tWelcome to NooBaa!\n\t-----------------\n\tNooBaa Core Version: 5.5.0-3ff3e13\n\tNooBaa Operator Version: 2.3.0\n\n\tLets get started:\n\n\t1. Connect to Management console:\n\n\t\tRead your mgmt console login information (email & password) from secret: \"noobaa-admin\".\n\n\t\t\tkubectl get secret noobaa-admin -n openshift-storage -o json | jq '.data|map_values(@base64d)'\n\n\t\tOpen the management console service - take External IP/DNS or Node Port or use port forwarding:\n\n\t\t\tkubectl port-forward -n openshift-storage service/noobaa-mgmt 11443:443 &\n\t\t\topen https://localhost:11443\n\n\t2. Test S3 client:\n\n\t\tkubectl port-forward -n openshift-storage service/s3 10443:443 &\n\t\tNOOBAA_ACCESS_KEY=$(kubectl get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')\n\t\tNOOBAA_SECRET_KEY=$(kubectl get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')\n\t\talias s3='AWS_ACCESS_KEY_ID=$NOOBAA_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$NOOBAA_SECRET_KEY aws --endpoint https://localhost:10443 --no-verify-ssl s3'\n\t\ts3 ls\n\n"
    services:
      serviceMgmt:
        externalDNS:
        - https://noobaa-mgmt-openshift-storage.apps.sagrawal-dc3-ind.qe.rh-ocs.com
        internalDNS:
        - https://noobaa-mgmt.openshift-storage.svc:443
        internalIP:
        - https://172.30.158.249:443
        nodePorts:
        - https://10.70.60.44:30117
        podPorts:
        - https://10.129.2.40:8443
      serviceS3:
        externalDNS:
        - https://s3-openshift-storage.apps.sagrawal-dc3-ind.qe.rh-ocs.com
        internalDNS:
        - https://s3.openshift-storage.svc:443
        internalIP:
        - https://172.30.96.165:443
        nodePorts:
        - https://10.70.60.44:31431
        podPorts:
        - https://10.129.2.38:6443
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
OK, so @nberry provided me with the cluster creds. Thank you.

As I suspected, it seems that before the upgrade the system was using a secret in order to reach the quay registry - default-dockercfg-sd5pn.
After the upgrade it was changed to not use a secret at all.
Removing a secret after an upgrade is apparently not handled correctly - we will need to think of the right way of attacking this.

old image: quay.io/rhceph-dev/mcg-core@sha256:f5fa382c8bcf832d079692e1980b0560ba5a12e155e8bc0715cfd6acc314f602
new image: quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27

Maybe it's recommended to check upgrades for now without changing secrets.
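(Not part of the original comment; a hedged way to see where default-dockercfg-sd5pn comes from and what the running backingstore pod actually carries. It is an assumption that the secret is attached to the default service account in openshift-storage, as its dockercfg naming suggests.)

$ oc get sa default -n openshift-storage -o jsonpath='{.imagePullSecrets}'
$ oc get pod bs-pv1-noobaa-pod-27320f1b -n openshift-storage -o jsonpath='{.spec.imagePullSecrets}'

The panic in the operator log is then the API server rejecting an in-place update of spec.imagePullSecrets on the running pod, since pod updates may only change the image, activeDeadlineSeconds and toleration additions.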
Fixed an issue with changing the imagePullSecret in the wrong way.
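(Illustration only, not the actual fix; a hedged manual way to unblock the stuck pods, assuming the NooBaa operator recreates the pv-pool pods it owns with the current spec. Pod names are taken from this report.)

$ oc delete pod bs-pv1-noobaa-pod-27320f1b neha-cli-noobaa-pod-a3cdf302 -n openshift-storage
$ oc delete pod noobaa-operator-6f4954d46f-qc88c -n openshift-storage    # recreated by its Deployment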
1. There will be IO disruptions on the pods, as they will get restarted to use the new image.
2. They are restarted as part of the upgrade - the pod running the old image is deleted and a new one running the new version is started instead.
3. This is great :) this is important info for us.

In short, this is the expected behaviour and you can go ahead and close it.
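(Not part of the original exchange; one hedged way to confirm the recreated pods picked up the new core image after an upgrade.)

$ oc get pods -n openshift-storage -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep mcg-core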
(In reply to Jacky Albo from comment #14)
> 1. There will be IO disruptions on the pods, as they will get restarted to
> use the new image.
> 2. They are restarted as part of the upgrade - the pod running the old image
> is deleted and a new one running the new version is started instead.
> 3. This is great :) this is important info for us.
>
> In short, this is the expected behaviour and you can go ahead and close it.

Thank you Jacky for all the confirmations. Moving the BZ to the Verified state based on comment #13 and comment #14. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754