Bug 1570583 - Has no permission to create directory /cassandra_data/data on dynamic pv
Summary: Has no permission to create directory /cassandra_data/data on dynamic pv
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Ruben Vargas Palma
QA Contact: Junqi Zhao
Docs Contact: Vikram Goyal
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks: 1590748
Reported: 2018-04-23 09:40 UTC by Junqi Zhao
Modified: 2018-07-30 19:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:13:42 UTC
Target Upstream Version:
sjenning: needinfo-


Attachments
hawkular-cassandra pod log (17.95 KB, text/plain)
2018-04-23 09:40 UTC, Junqi Zhao
Attaching log for hitting this on metrics 3.10.15 (18.06 KB, text/plain)
2018-07-10 20:47 UTC, Mike Fiedler
Inventory (10.83 KB, text/plain)
2018-07-10 23:55 UTC, Mike Fiedler


Links
System                  ID              Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2018:1816  None      None    None     2018-07-30 19:14:17 UTC

Description Junqi Zhao 2018-04-23 09:40:00 UTC
Created attachment 1425626
hawkular-cassandra pod log

Description of problem:
Deployed metrics 3.10 on AWS with dynamic PV. The hawkular-cassandra-1-sxgwh pod failed to start up; its log
shows "Has no permission to create directory /cassandra_data/data".

# oc get po
NAME                            READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-sxgwh      0/1       CrashLoopBackOff   21         1h
hawkular-metrics-mc7h2          0/1       Running            10         1h
hawkular-metrics-schema-ztwnc   1/1       Running            0          1h
heapster-gpf8t                  0/1       Running            10         1h

# oc logs hawkular-cassandra-1-sxgwh
*************************************snipped***********************************************************************
WARN  [main] 2018-04-23 09:19:24,619 StartupChecks.java:275 - Directory /cassandra_data/data doesn't exist
ERROR [main] 2018-04-23 09:19:24,620 CassandraDaemon.java:710 - Has no permission to create directory /cassandra_data/data
*************************************snipped***********************************************************************


# oc get pvc
NAME                  STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
metrics-cassandra-1   Bound     pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28   10Gi       RWO            gp2            1h

# mount | grep pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28
/dev/xvdco on /var/lib/origin/openshift.local.volumes/pods/0bd439c8-46ca-11e8-9a3c-0ef0bdfe7e28/volumes/kubernetes.io~aws-ebs/pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28 type ext4 (rw,relatime,seclabel,data=ordered)

# ls -al /var/lib/origin/openshift.local.volumes/pods/0bd439c8-46ca-11e8-9a3c-0ef0bdfe7e28/volumes/kubernetes.io~aws-ebs/pvc-03c79a0c-46ca-11e8-9a3c-0ef0bdfe7e28
total 20
drwxr-xr-x. 3 root root  4096 Apr 23 03:44 .
drwxr-x---. 3 root root    54 Apr 23 03:44 ..
drwx------. 2 root root 16384 Apr 23 03:44 lost+found

# oc get sc gp2 -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-04-23T02:24:08Z
  name: gp2
  resourceVersion: "1714"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 6700058b-469d-11e8-9a3c-0ef0bdfe7e28
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate


Version-Release number of selected component (if applicable):
openshift-ansible-3.10.0-0.27.0

Metrics version: v3.10.0-0.27.0.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics 3.10 on AWS with dynamic PV; see the [Additional info] section for the parameters.

Actual results:
hawkular-cassandra pod failed to start up

Expected results:
hawkular-cassandra pod should be in running status

Additional info:
openshift_metrics_install_metrics=true
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.10
openshift_metrics_cassandra_storage_type=dynamic

Comment 4 Junqi Zhao 2018-05-11 03:33:11 UTC
Same issue with GCE dynamic PV.

Comment 5 Mike Fiedler 2018-05-11 17:27:34 UTC
Is there a workaround for this? I'm using a fresh EBS volume.

Comment 10 Junqi Zhao 2018-05-29 03:13:41 UTC
Found in a 3.9 environment that used AWS dynamic PV for metrics:

# ls -al /var/lib/origin/openshift.local.volumes/pods/1d6f3aa9-3210-11e8-bf71-0aa08eeabc12/volumes/kubernetes.io~aws-ebs/pvc-bfb5ae17-fa28-11e7-a195-0a0f36bed07c

total 44
drwxrwsr-x.  6 root       1000030000  4096 Mar 27 22:01 .
drwxr-x---.  3 root       root          54 Mar 27 22:42 ..
-rw-rw-r--.  1 1000030000 1000030000    16 Mar 27 22:43 .cassandra.version
drwxrwsr-x.  2 1000030000 1000030000  4096 May 29 03:01 commitlog
drwxrwsr-x. 10 1000030000 1000030000  4096 Jan 15 19:19 data
drwxrwsr-x.  2 1000030000 1000030000  4096 May 29 03:00 failure_reports
drwxrwS---.  2 root       1000030000 16384 Jan 15 19:18 lost+found
-rw-rw-r--.  1 1000030000 1000030000   150 Mar 27 22:42 .shutdown.drain
-rw-rw-r--.  1 1000030000 1000030000   150 Jan 15 19:19 .upgrade.upgradesstables

Comment 17 Joel Smith 2018-05-30 19:17:07 UTC
I took a look at this and hopefully my discoveries will be helpful in running down the root cause. Here's what I've found:

Pods created in the openshift-infra project do not get any securityContext attributes set.  If there is no securityContext.fsGroup attribute on the pod, the volume mounter will not go through the PV and set all the permissions.

I'm not super well versed in SCCs, but it appears that the restricted scc is not being applied to the pods. I'm not sure why.

If I create a pod in an arbitrary namespace, it gets the annotation openshift.io/scc, and it gets a value in its securityContext.fsGroup. If I create the same pod in the openshift-infra namespace, there is no openshift.io/scc annotation and no securityContext.fsGroup.

I don't know what is different about openshift-infra that would prevent it from having the restricted (or some other) scc applied to its pods.
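
The comparison described above can be sketched as a small shell helper. This is only an illustration, assuming the `oc` client is logged in to the cluster; the function name `pod_scc_info` is hypothetical and not part of any OpenShift tooling. It reads the `openshift.io/scc` annotation and `securityContext.fsGroup` from a pod and reports `<none>` when either is unset.

```shell
# Hypothetical helper: report the SCC annotation and fsGroup of a pod.
# Usage: pod_scc_info <pod-name> <namespace>
pod_scc_info() {
  pod="$1"
  ns="$2"
  # Dots inside the annotation key must be escaped in a jsonpath expression.
  scc=$(oc get pod "$pod" -n "$ns" -o 'jsonpath={.metadata.annotations.openshift\.io/scc}')
  fsgroup=$(oc get pod "$pod" -n "$ns" -o 'jsonpath={.spec.securityContext.fsGroup}')
  # Empty values mean the field was never set on the pod.
  echo "scc=${scc:-<none>} fsGroup=${fsgroup:-<none>}"
}
```

Running it against the same pod spec created in an arbitrary namespace and in openshift-infra should make the difference visible: per the observation above, the openshift-infra pod would be expected to report `<none>` for both values.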

Comment 20 Joel Smith 2018-05-30 23:24:38 UTC
I asked liggitt about this and he and deads told me that SCCs aren't run on pods in namespaces that are used to bring up the OpenShift control plane such as openshift-infra. If I understand correctly, they're saying we shouldn't be running pods in the openshift-infra namespace.

Comment 21 Mike Fiedler 2018-05-31 02:06:33 UTC
The Hawkular metrics deployment has run in openshift-infra since 3.2.

Comment 22 John Sanda 2018-05-31 02:14:52 UTC
(In reply to Joel Smith from comment #20)
> I asked liggitt about this and he and deads told me that SCCs aren't run on
> pods in namespaces that are used to bring up the OpenShift control plane
> such as openshift-infra. If I understand correctly, they're saying we
> shouldn't be running pods in the openshift-infra namespace.

Jeff, can you share any additional insights on this? On the one hand I am relieved to have a better understanding of what the problem is and how to resolve it. On the other hand, I am a bit nervous about what seems like a pretty big change at this late cycle.

Comment 23 John Sanda 2018-05-31 02:14:53 UTC
(In reply to Joel Smith from comment #20)
> I asked liggitt about this and he and deads told me that SCCs aren't run on
> pods in namespaces that are used to bring up the OpenShift control plane
> such as openshift-infra. If I understand correctly, they're saying we
> shouldn't be running pods in the openshift-infra namespace.

Jeff, can you share any additional insights on this? On the one hand I am relieved to have a better understanding of what the problem is and how to resolve it. On the other hand, I am a bit nervous about what seems like a pretty big change at this late cycle.

Comment 24 Jeff Cantrill 2018-05-31 12:43:22 UTC
(In reply to John Sanda from comment #23)
> > shouldn't be running pods in the openshift-infra namespace.
> 
> Jeff, can you share any additional insights on this? On the one hand I am
> relieved to have a better understanding of what the problem is and how to
> resolve it. On the other hand, I am a bit nervous about what seems like a
> pretty big change at this late cycle.

I believe Hawkular Metrics runs in this namespace because it is required to be in the same one as Heapster. I cannot speak to whether there is a specific reason Heapster runs in 'openshift-infra'. As Mike pointed out, nothing has changed from the Hawkular deployment perspective since 3.2. This, IMO, is a regression caused by a change that did not account for Hawkular.

Comment 25 John Sanda 2018-06-01 22:03:44 UTC
I submitted a PR with the initial changes:

https://github.com/openshift/openshift-ansible/pull/8613

Comment 26 Anping Li 2018-06-05 06:43:20 UTC
Once OCP is upgraded to v3.10 on GCE and AWS, metrics will hit this issue. It blocks testing of the metrics upgrade.

Comment 27 John Sanda 2018-06-05 14:46:12 UTC
The big challenge with moving components to a new namespace is avoiding data loss. Yesterday I asked on the aos-storage list how I can migrate data from a PV. Here are the detailed steps I was provided:

1. Find your PV.
2. Check PV.Spec.PersistentVolumeReclaimPolicy. If it is Delete or Recycle,
change it to Retain (`oc edit pv <xyz>` or `oc patch`).

Whatever happens now, the worst thing that can happen to your PV is that
it can get to Released phase. Data won't be deleted.

Rebind:
3. Create a new PVC in the new namespace. The new PVC should be the same
as the old PVC - storage classes, labels, selectors, ... Explicitly,
PVC.Spec.VolumeName *must* be set to PV.Name. This effectively turns off
dynamic provisioning for this PVC. The new PVC will be Pending. That's
OK, the old PVC is still the one that's bound to the PV.

4. Here comes the tricky part: change PV.Spec.ClaimRef exactly in this way:
  PV.Spec.ClaimRef.Namespace = <new PVC namespace>
  PV.Spec.ClaimRef.Name = <new PVC name>
  PV.Spec.ClaimRef.UID = <new PVC UID>

The old PVC should get "Lost" in a couple of seconds (and you can safely
delete it). The new PVC should be "Bound". The PV should be "Bound" to the new PVC.

5. Restore original PV.Spec.PersistentVolumeReclaimPolicy, if needed.

Note that this just rebinds the PV. It does not stop pods in the old
namespace that use the PV and start them in the new one. Something else
must do that. You should delete the deployment first and re-create it in
the new namespace when the new PVC is bound.
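
The steps above can be sketched as shell helpers around `oc`. This is a rough illustration under stated assumptions, not the playbook that was eventually merged: the function names are hypothetical, the access mode, storage class and size in the manifest are copied from the `oc get pvc` output in the description, and the default strategic merge patch is assumed to be enough to overwrite the three ClaimRef fields.

```shell
# Steps 2 and 5: set the reclaim policy of a PV (for example Retain, then back to Delete).
# Usage: set_reclaim_policy <pv-name> <policy>
set_reclaim_policy() {
  oc patch pv "$1" -p "{\"spec\":{\"persistentVolumeReclaimPolicy\":\"$2\"}}"
}

# Step 3: print a PVC manifest pinned to an existing PV through spec.volumeName.
# Usage: pvc_manifest <new-namespace> <pvc-name> <pv-name>
# accessModes, storageClassName and storage must match the old PVC; the values
# below are the ones shown in this report and are assumptions for any other cluster.
pvc_manifest() {
cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $2
  namespace: $1
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp2
  volumeName: $3
  resources:
    requests:
      storage: 10Gi
EOF
}

# Step 4: point the PV's claimRef at the new PVC (namespace, name, UID).
# Usage: rebind_pv <pv-name> <new-namespace> <pvc-name>
rebind_pv() {
  pv="$1"
  ns="$2"
  pvc="$3"
  uid=$(oc get pvc "$pvc" -n "$ns" -o 'jsonpath={.metadata.uid}') || return 1
  if [ -z "$uid" ]; then
    echo "could not read UID of PVC $ns/$pvc" >&2
    return 1
  fi
  oc patch pv "$pv" -p "{\"spec\":{\"claimRef\":{\"namespace\":\"$ns\",\"name\":\"$pvc\",\"uid\":\"$uid\"}}}"
}
```

A possible order of calls, after the deployment that uses the PV has been deleted: `set_reclaim_policy <pv> Retain`, then `pvc_manifest <new-namespace> <pvc> <pv> | oc create -f -`, then `rebind_pv <pv> <new-namespace> <pvc>`, and finally `set_reclaim_policy <pv> Delete` if the original policy should be restored.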

Comment 28 Anping Li 2018-06-06 01:40:40 UTC
@John, could we provide a playbook to migrate the namespaces? If not, I think we have to provide a document to guide the OpenShift administrator.

Comment 29 John Sanda 2018-06-06 02:28:14 UTC
All of the metrics components with the exception of Cassandra are stateless. For the stateless components it is simply a matter of deleting them from openshift-infra and deploying them into a different namespace.

I reached out to the storage team, and Jan Safranek provided me with the detailed steps in comment 27. Implementing these steps will effectively allow us to migrate Cassandra's persistent volume without data loss. Ruben and I have been working on it today and should have a PR ready for review tomorrow.

Comment 31 Weinan Liu 2018-06-13 09:31:18 UTC
The current fix deploys the metrics pods to the openshift-metrics project
(previously they were deployed to openshift-infra). This change makes HPA-related tests fail because they cannot retrieve the metrics data.

Comment 32 Anping Li 2018-06-13 09:56:46 UTC
(In reply to Joel Smith from comment #20)
> I asked liggitt about this and he and deads told me that SCCs aren't run on
> pods in namespaces that are used to bring up the OpenShift control plane
> such as openshift-infra. If I understand correctly, they're saying we
> shouldn't be running pods in the openshift-infra namespace.

@Jordan
As far as I know, the OpenShift control plane is deployed in the kube-system namespace in v3.10. Must we restrict openshift-infra?

Comment 33 Anping Li 2018-06-13 10:22:08 UTC
The fix has been merged into openshift-ansible:v3.10.0-0.67.0.0, and it blocks HPA testing.

Comment 36 Junqi Zhao 2018-06-14 00:30:05 UTC
Will verify this bug after we have a final conclusion on bug 1590748.

Comment 37 Junqi Zhao 2018-06-14 04:31:59 UTC
The default namespace for metrics is now openshift-metrics, and the dynamic PV can be attached to the cassandra pod, but there is another regression: bug 1591077.

# rpm -qa | grep openshift-ansible
openshift-ansible-playbooks-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-roles-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-docs-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

Comment 39 Mike Fiedler 2018-07-10 20:47:23 UTC
Created attachment 1457910
Attaching log for hitting this on metrics 3.10.15

Comment 41 Mike Fiedler 2018-07-10 23:52:10 UTC
My inventory is part of my overall cluster install inventory.   I will attach it with credential information redacted.

Comment 42 Mike Fiedler 2018-07-10 23:55:53 UTC
Created attachment 1457959
Inventory

I should mention this cluster is an upgrade to 3.10.15 from 3.9.27.  Metrics was not installed at the 3.9 level.   The upgrade was successful and then I installed metrics with the attached inventory.

I'll try with a standalone metrics-only inventory and see if that is successful.

Comment 43 Mike Fiedler 2018-07-11 00:08:15 UTC
New/compact inventory below.  The same issue occurred.

[OSEv3:children]
masters
etcd

[masters]
ip-172-31-18-202

[etcd]
ip-172-31-18-202

[OSEv3:vars]
deployment_type=openshift-enterprise
openshift_docker_additional_registries=registry.reg-aws.openshift.com


openshift_metrics_hawkular_hostname=hawkular-metrics.apps.0710-sis.qe.rhcloud.com
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.10.15

openshift_metrics_install_metrics=true
openshift_metrics_cassandra_replicas=1
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=20Gi

openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/compute": "true"}
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/compute": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/compute": "true"}

Comment 46 Mike Fiedler 2018-07-11 12:23:37 UTC
@juzhao @rvargasp Apologies, openshift-ansible was backlevel and did not match the rest of the env. After upgrading to openshift-ansible 3.10.15, all is working fine. Sorry for the scare.

Comment 48 errata-xmlrpc 2018-07-30 19:13:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
