Created attachment 1425626 [details]
hawkular-cassandra pod log

Description of problem:
Deploy metrics 3.10 on AWS with a dynamic PV; the hawkular-cassandra-1-sxgwh pod failed to start up. The logs show "Has no permission to create directory /cassandra_data/data".

# oc get po
NAME                            READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-sxgwh      0/1       CrashLoopBackOff   21         1h
hawkular-metrics-mc7h2          0/1       Running            10         1h
hawkular-metrics-schema-ztwnc   1/1       Running            0          1h
heapster-gpf8t                  0/1       Running            10         1h

# oc logs hawkular-cassandra-1-sxgwh
*************************************snipped***********************************************************************
WARN  [main] 2018-04-23 09:19:24,619 StartupChecks.java:275 - Directory /cassandra_data/data doesn't exist
ERROR [main] 2018-04-23 09:19:24,620 CassandraDaemon.java:710 - Has no permission to create directory /cassandra_data/data
*************************************snipped***********************************************************************

# oc get pvc
NAME                  STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
metrics-cassandra-1   Bound     pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28   10Gi       RWO            gp2            1h

# mount | grep pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28
/dev/xvdco on /var/lib/origin/openshift.local.volumes/pods/0bd439c8-46ca-11e8-9a3c-0ef0bdfe7e28/volumes/kubernetes.io~aws-ebs/pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28 type ext4 (rw,relatime,seclabel,data=ordered)

# ls -al /var/lib/origin/openshift.local.volumes/pods/0bd439c8-46ca-11e8-9a3c-0ef0bdfe7e28/volumes/kubernetes.io~aws-ebs/pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28
total 20
drwxr-xr-x. 3 root root  4096 Apr 23 03:44 .
drwxr-x---. 3 root root    54 Apr 23 03:44 ..
drwx------. 2 root root 16384 Apr 23 03:44 lost+found

# oc get sc gp2 -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-04-23T02:24:08Z
  name: gp2
  resourceVersion: "1714"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 6700058b-469d-11e8-9a3c-0ef0bdfe7e28
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate

Version-Release number of selected component (if applicable):
openshift-ansible-3.10.0-0.27.0
Metrics version: v3.10.0-0.27.0.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics 3.10 on AWS with a dynamic PV; see the [Additional info] section for the parameters used.

Actual results:
hawkular-cassandra pod failed to start up

Expected results:
hawkular-cassandra pod should be in Running status

Additional info:
openshift_metrics_install_metrics=true
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.10
openshift_metrics_cassandra_storage_type=dynamic
Same issue with GCE dynamic PV.
Is there a workaround for this? I'm using a fresh EBS volume.
Found in a 3.9 env which used an AWS dynamic PV for metrics:

# ls -al /var/lib/origin/openshift.local.volumes/pods/1d6f3aa9-3210-11e8-bf71-0aa08eeabc12/volumes/kubernetes.io~aws-ebs/pvc-bfb5ae17-fa28-11e7-a195-0a0f36bed07c
total 44
drwxrwsr-x.  6 root       1000030000  4096 Mar 27 22:01 .
drwxr-x---.  3 root       root          54 Mar 27 22:42 ..
-rw-rw-r--.  1 1000030000 1000030000    16 Mar 27 22:43 .cassandra.version
drwxrwsr-x.  2 1000030000 1000030000  4096 May 29 03:01 commitlog
drwxrwsr-x. 10 1000030000 1000030000  4096 Jan 15 19:19 data
drwxrwsr-x.  2 1000030000 1000030000  4096 May 29 03:00 failure_reports
drwxrwS---.  2 root       1000030000 16384 Jan 15 19:18 lost+found
-rw-rw-r--.  1 1000030000 1000030000   150 Mar 27 22:42 .shutdown.drain
-rw-rw-r--.  1 1000030000 1000030000   150 Jan 15 19:19 .upgrade.upgradesstables
I took a look at this and hopefully my discoveries will be helpful in running down the root cause. Here's what I've found:

Pods created in the openshift-infra project do not get any securityContext attributes set. If there is no securityContext.fsGroup attribute on the pod, the volume mounter will not go through the PV and set all the permissions.

I'm not super well versed in SCCs, but it appears that the restricted SCC is not being applied to the pods. I'm not sure why. If I create a pod in an arbitrary namespace, it gets the annotation openshift.io/scc, and it gets a value in its securityContext.fsGroup. If I create the same pod in the openshift-infra namespace, there is no openshift.io/scc annotation and no securityContext.fsGroup.

I don't know what is different about openshift-infra that would prevent it from having the restricted (or some other) SCC applied to its pods.
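For illustration, this is roughly what SCC admission produces on a pod in a normal namespace (a sketch only; the annotation and fsGroup are injected by admission, the numeric value is allocated from the namespace's range, and the names/image here are placeholders):

```yaml
# Sketch: pod after restricted-SCC admission in an ordinary namespace.
# In openshift-infra, the annotation and fsGroup below are missing.
apiVersion: v1
kind: Pod
metadata:
  name: example-cassandra               # placeholder name
  annotations:
    openshift.io/scc: restricted        # absent on pods in openshift-infra
spec:
  securityContext:
    fsGroup: 1000030000                 # example value; allocated per-namespace
  containers:
  - name: cassandra
    image: example/cassandra            # placeholder image
    volumeMounts:
    - name: cassandra-data
      mountPath: /cassandra_data
  volumes:
  - name: cassandra-data
    persistentVolumeClaim:
      claimName: metrics-cassandra-1
```

With fsGroup set, the kubelet makes the volume contents group-owned by that GID and group-writable on mount, which matches the 3.9 listing above (group 1000030000, setgid drwxrwsr-x directories); without it the EBS volume keeps its root-owned defaults and Cassandra cannot create /cassandra_data/data.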
I asked liggitt about this and he and deads told me that SCCs aren't run on pods in namespaces that are used to bring up the OpenShift control plane such as openshift-infra. If I understand correctly, they're saying we shouldn't be running pods in the openshift-infra namespace.
The Hawkular metrics deployment has run in openshift-infra since 3.2.
(In reply to Joel Smith from comment #20)
> I asked liggitt about this and he and deads told me that SCCs aren't run on
> pods in namespaces that are used to bring up the OpenShift control plane
> such as openshift-infra. If I understand correctly, they're saying we
> shouldn't be running pods in the openshift-infra namespace.

Jeff, can you share any additional insights on this? On the one hand I am relieved to have a better understanding of what the problem is and how to resolve it. On the other hand, I am a bit nervous about what seems like a pretty big change at this late cycle.
(In reply to John Sanda from comment #23)
> shouldn't be running pods in the openshift-infra namespace.
>
> Jeff, can you share any additional insights on this? On the one hand I am
> relieved to have a better understanding of what the problem is and how to
> resolve it. On the other hand, I am a bit nervous about what seems like a
> pretty big change at this late cycle.

I believe Hawkular Metrics runs in this namespace because it needs to be in the same namespace as Heapster. I cannot speak to whether there is a specific reason Heapster runs in 'openshift-infra'. As Mike pointed out, nothing has changed from the Hawkular deployment perspective since 3.2. This, IMO, is a regression caused by a change that did not account for Hawkular.
I submitted a PR with the initial changes: https://github.com/openshift/openshift-ansible/pull/8613
Once OCP is upgraded to v3.10 on GCE and AWS, metrics will hit this issue. It blocks testing of the metrics upgrade.
The big challenge with moving components to a new namespace is avoiding data loss. Yesterday I asked on the aos-storage list how I can migrate data from a PV. Here are the detailed steps I was given:

1. Find your PV.

2. Check PV.Spec.PersistentVolumeReclaimPolicy. If it is Delete or Recycle, change it to Retain (`oc edit pv <xyz>` or `oc patch`). Whatever happens now, the worst thing that can happen to your PV is that it can get to the Released phase. Data won't be deleted.

Rebind:

3. Create a new PVC in the new namespace. The new PVC should be the same as the old PVC - storage classes, labels, selectors, ... Explicitly, PVC.Spec.VolumeName *must* be set to PV.Name. This effectively turns off dynamic provisioning for this PVC. The new PVC will be Pending. That's OK; the old PVC is still the one that's bound to the PV.

4. Here comes the tricky part: change PV.Spec.ClaimRef exactly in this way:
   PV.Spec.ClaimRef.Namespace = <new PVC namespace>
   PV.Spec.ClaimRef.Name = <new PVC name>
   PV.Spec.ClaimRef.UID = <new PVC UID>
   The old PVC should get "Lost" in a couple of seconds (and you can safely delete it). The new PVC should be "Bound". The PV should be "Bound" to the new PVC.

5. Restore the original PV.Spec.PersistentVolumeReclaimPolicy, if needed.

Note that this just rebinds the PV. It does not stop pods in the old namespace that use the PV and start them in the new one. Something else must do that. You should delete the deployment first and re-create it in the new namespace when the new PVC is bound.
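As a sketch of step 3, the pre-bound PVC could look like this (names, size, and storage class here are taken from the earlier output in this bug and are illustrative; match your actual PV):

```yaml
# Step 3 (sketch): new PVC in the new namespace, pre-bound to the existing PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: metrics-cassandra-1
  namespace: openshift-metrics          # the new namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp2
  # Setting volumeName to PV.Name turns off dynamic provisioning for this PVC.
  volumeName: pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28
```

Step 4 could then be done with something like `oc patch pv <pv-name> --type merge -p '{"spec":{"claimRef":{"namespace":"openshift-metrics","name":"metrics-cassandra-1","uid":"<new PVC UID>"}}}'`, with the UID copied from the newly created PVC.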
@John, could we provide a playbook to migrate the namespaces? If not, I think we need to provide a document to guide OpenShift administrators.
All of the metrics components with the exception of Cassandra are stateless. For the stateless components it is simply a matter of deleting them from openshift-infra and deploying them into a different namespace. I reached out to the storage team, and Jan Safranek provided me with the detailed steps in comment 27. Implementing these steps will effectively allow us to migrate Cassandra's persistent volume without data loss. Ruben and I have been working on it today and should have a PR ready for review tomorrow.
The current fix deploys metrics pods to the openshift-metrics project (previously they were deployed to openshift-infra). This change makes HPA-related tests fail because they can no longer retrieve the metrics data.
(In reply to Joel Smith from comment #20)
> I asked liggitt about this and he and deads told me that SCCs aren't run on
> pods in namespaces that are used to bring up the OpenShift control plane
> such as openshift-infra. If I understand correctly, they're saying we
> shouldn't be running pods in the openshift-infra namespace.

@Jordan As far as I know, the OpenShift control plane is deployed in the kube-system namespace in v3.10. Must we restrict openshift-infra as well?
The fix has been merged into openshift-ansible v3.10.0-0.67.0.0, and it blocks HPA testing.
Will verify this bug after we have final conclusion on bug 1590748
The default namespace for metrics is now openshift-metrics and the dynamic PV can be attached to the Cassandra pod, but there is another regression, bug 1591077.

# rpm -qa | grep openshift-ansible
openshift-ansible-playbooks-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-roles-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-docs-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
Created attachment 1457910 [details] Attaching log for hitting this on metrics 3.10.15
My inventory is part of my overall cluster install inventory. I will attach it with credential information redacted.
Created attachment 1457959 [details] Inventory I should mention this cluster is an upgrade to 3.10.15 from 3.9.27. Metrics was not installed at the 3.9 level. The upgrade was successful and then I installed metrics with the attached inventory. I'll try with a standalone metrics-only inventory and see if that is successful.
New/compact inventory below. The same issue occurred.

[OSEv3:children]
masters
etcd

[masters]
ip-172-31-18-202

[etcd]
ip-172-31-18-202

[OSEv3:vars]
deployment_type=openshift-enterprise
openshift_docker_additional_registries=registry.reg-aws.openshift.com
openshift_metrics_hawkular_hostname=hawkular-metrics.apps.0710-sis.qe.rhcloud.com
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.10.15
openshift_metrics_install_metrics=true
openshift_metrics_cassandra_replicas=1
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=20Gi
openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/compute": "true"}
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/compute": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/compute": "true"}
@juzhao @rvargasp Apologies, openshift-ansible was backlevel and did not match the rest of the env. After upgrading to openshift-ansible 3.10.15, all is working fine. Sorry for the scare.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816