Bug 1381336 - [DOCS] Using deployer MODE=refresh to upgrade metrics failed, pod updates forbidden to change certain fields
Summary: [DOCS] Using deployer MODE=refresh to upgrade metrics failed, pod updates for...
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation   
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: ---
Assignee: Ashley Hardin
QA Contact: Peng Li
Vikram Goyal
URL:
Whiteboard:
Keywords: Unconfirmed
Duplicates: 1382783 1383881 1387130
Depends On:
Blocks:
 
Reported: 2016-10-03 18:13 UTC by Eric Jones
Modified: 2017-04-18 02:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Creating the metrics-deployer pod as a privileged user can result in a pod that cannot be updated by the deployer process. Consequence: The metrics-deployer process fails. Fix: Run the metrics deployer as the service account (--as=system:serviceaccount:openshift-infra:metrics-deployer). Result: The metrics-deployer pod can complete successfully.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-07 17:38:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
oc get pod metrics-deployer-r8w4b -o yaml (3.78 KB, text/plain)
2016-10-05 17:40 UTC, dlbewley

Description Eric Jones 2016-10-03 18:13:06 UTC
Description of problem:
The customer upgraded their cluster from 3.2 to 3.3 and, while updating metrics per the documentation, ran [0], which eventually failed. The deployer logs ended with [1].

[0] # oc new-app -f \
>         /usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml \
>     -p HAWKULAR_METRICS_HOSTNAME=metrics.test.os.example.com,MODE=refresh

[1] PREFLIGHT CHECK SUCCEEDED
validate_master_accessible: ok
validate_hostname: The HAWKULAR_METRICS_HOSTNAME value is deemed acceptable.
validate_deployer_secret: ok
Deleting any previous deployment (leaving route and PVCs)
POD_NAME metrics-deployer-r8w4b
The Pod "metrics-deployer-r8w4b" is invalid.
spec: Forbidden: pod updates may not change fields other than `containers[*].image` or `spec.activeDeadlineSeconds`

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.3.0.32
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

Additional info:
I will shortly attach the full logs from the deployer pod, the full output from running the deployer command, and the template (just to make sure it was properly updated).

Comment 3 Matt Wringe 2016-10-03 20:49:36 UTC
I am reassigning this to the command line interface group.

The commands that we are running are 'oc label pod ${POD_NAME} metrics-infra-' and 'oc label pod ${POD_NAME} metrics-infra=deployer' and yet the error message we are getting is:

"spec: Forbidden: pod updates may not change fields other than `containers[*].image` or `spec.activeDeadlineSeconds`"

Comment 5 Michail Kargakis 2016-10-04 16:53:29 UTC
Matt, is this a problem only when the deployer pod runs `oc label`? Can you label pods if you run `oc label` as a normal user?

Comment 8 Matt Wringe 2016-10-04 21:22:01 UTC
It looks like this might be related to a permission issue, but I don't understand exactly how it's manifesting in this case.

The error message in this case also needs to be updated to reflect that it's a permission problem.

See: https://botbot.me/freenode/openshift-dev/2016-06-16/?msg=68042321&page=5

Even figuring out the scenario that triggers this would be a good first step. I have never seen this issue before.

Comment 9 dlbewley 2016-10-05 04:03:35 UTC
Let me know if I can provide any more helpful info from my environment.

# oc debug metrics-deployer-r8w4b
Debugging with pod/metrics-deployer-r8w4b-debug, original command: <image entrypoint>
Waiting for pod to start ...
Pod IP: 10.1.0.3
If you don't see a command prompt, try pressing enter.
sh-4.2$ export POD_name=metrics-deployer-r8w4b-debug
sh-4.2$ oc get pod $POD_NAME --show-labels
NAME                           READY     STATUS    RESTARTS   AGE       LABELS
metrics-deployer-r8w4b-debug   1/1       Running   0          1m        <none>
sh-4.2$ oc label pod ${POD_NAME} metrics-infra-
label "metrics-infra" not found.
The Pod "metrics-deployer-r8w4b-debug" is invalid.
spec: Forbidden: pod updates may not change fields other than `containers[*].image` or `spec.activeDeadlineSeconds`
sh-4.2$ oc label pod ${POD_NAME} metrics-infra=foo
The Pod "metrics-deployer-r8w4b-debug" is invalid.
spec: Forbidden: pod updates may not change fields other than `containers[*].image` or `spec.activeDeadlineSeconds`
sh-4.2$ oc whoami
system:serviceaccount:openshift-infra:metrics-deployer

Comment 10 Matt Wringe 2016-10-05 15:50:57 UTC
Setting this to the security component as it may be security related.

It's a weird issue and one that I have not been able to reproduce. But it looks like it might be a permission issue, based on: https://botbot.me/freenode/openshift-dev/2016-06-16/?msg=68042321&page=5

Comment 11 Matt Wringe 2016-10-05 15:52:03 UTC
Oops, this should have been sent to the 'Auth' component, not security.

Comment 12 Jordan Liggitt 2016-10-05 17:05:28 UTC
can you include the yaml of the pod that is being labeled?

Comment 13 dlbewley 2016-10-05 17:40 UTC
Created attachment 1207659
oc get pod metrics-deployer-r8w4b -o yaml

Yaml for deployer pod being labeled.

Comment 14 Jordan Liggitt 2016-10-07 13:59:06 UTC
The pod was admitted with the 'anyuid' SCC, and has no uid set in the pod spec.

I doubt the system:serviceaccount:openshift-infra:metrics-deployer user has access to that SCC, so when it tries to update the pod, it is likely the 'restricted' SCC is used during admission, which sets a uid, which then fails validation (since pod spec is immutable on update)

In the SCC admission plugin, we should not select a SCC that requires mutating the pod.

That would provide a better message here.

It would also fix a bug that can occur if a user has access to an SCC that would allow the update without mutating the spec, but a higher priority SCC is selected that does mutate.
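
To check which SCC admitted a pod, the `openshift.io/scc` annotation can be read from the pod's YAML. A minimal sketch, assuming the annotation appears unquoted on its own line, as in `oc get pod -o yaml` output; `pod_scc` is a hypothetical helper name, not part of `oc`:

```shell
# pod_scc: print the SCC recorded in a pod's "openshift.io/scc" annotation.
# Reads a pod manifest on stdin, e.g.:
#   oc get pod metrics-deployer-r8w4b -o yaml | pod_scc
# Assumption: the annotation value is unquoted and sits on a single line.
pod_scc() {
  sed -n 's/^[[:space:]]*openshift\.io\/scc:[[:space:]]*//p' | head -n 1
}
```

If this prints `anyuid` for the deployer pod while the service account that updates the pod only has access to `restricted`, the situation described above applies.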

Comment 15 Matt Wringe 2016-10-07 18:18:54 UTC
*** Bug 1382783 has been marked as a duplicate of this bug. ***

Comment 16 Matt Wringe 2016-10-07 18:21:27 UTC
@liggitt

How exactly can we reproduce this problem?

The way that this is currently run is that a cluster-admin creates the pod, and the heapster SA modifies it. This is the way that I have always run this, it's how QE runs this, and it's how the docs describe how to run it.

I have only seen this issue reported by two other people and never myself, so I suspect there is some other step that I am missing.

Comment 17 Jordan Liggitt 2016-10-07 18:50:27 UTC
to recreate:

# create a project as a normal user
oc login -u bob -p password
oc new-project bob-project

# create a pod as a privileged user, verify it was admitted with "openshift.io/scc: anyuid"
oc login -u system:admin
oc create -f pod.yaml
oc get pod/testpod -o yaml

# label the pod as a normal user
oc login -u bob -p password
oc label pod/testpod foo=bar



pod.yaml:
---
apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
  - command: ["sh","-c","sleep 100000"]
    image: busybox
    imagePullPolicy: Always
    name: busybox

Comment 18 Jordan Liggitt 2016-10-07 18:53:59 UTC
initial prototype of omitting mutating sccs when validating a pod update at https://github.com/liggitt/origin/commits/scc-error

Comment 19 Matt Wringe 2016-10-07 19:24:50 UTC
Ok, so I am trying to figure out how exactly this is all functioning.

If I create the pod as a normal user with 'cluster-admin' privileges, and my deployer service account has edit permissions for the project, then everything works fine. The deployer is able to edit the pod's configuration without issue.

This is expected and desired behaviour.

If I add the 'anyuid' scc to my admin user, then any pod that user creates can only be modified by another user with the 'anyuid' scc. This means that users with the edit permission can no longer edit the pod unless they also have the 'anyuid' scc.

Since our deployer doesn't have the 'anyuid' scc, its 'edit' permission for the project is no longer sufficient to modify the pod, which is what is causing the problem here.

But my pod doesn't need to run as an arbitrary uid, and it doesn't need a user with the 'anyuid' scc in order to function. There is no reason for it to require the 'anyuid' scc.

So the next question is:

How do we specify that our pod doesn't need the 'anyuid' permission, so that a user who has 'anyuid' can create the pod without poisoning it, and a user with edit permissions can continue to modify it?

Comment 21 Matt Wringe 2016-10-07 21:07:12 UTC
Possible workaround for people wondering is to grant the 'anyuid' permission to the deployer service account:

oc adm policy add-scc-to-user anyuid system:serviceaccount:openshift-infra:metrics-deployer

Once the deployer is finished, they can remove the permission via:

oc adm policy remove-scc-from-user anyuid system:serviceaccount:openshift-infra:metrics-deployer

Granting the 'anyuid' permission to the deployer is not ideal as it gives the deployer way more power and authorization than the deployer actually needs.
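
To keep the extra permission in place only for as long as it is needed, the grant and the removal can be paired in one helper. This is a rough sketch rather than a tested procedure: `with_temporary_anyuid` and the `OC` override are hypothetical names, and the wrapped command is assumed to block until the deployer pod has actually finished (`oc new-app` on its own returns before that).

```shell
# OC can be overridden (for example OC=echo) to preview the commands
# without touching a cluster. This override is an illustration only.
OC="${OC:-oc}"
DEPLOYER_SA="system:serviceaccount:openshift-infra:metrics-deployer"

# with_temporary_anyuid CMD [ARGS...]
# Grant anyuid to the deployer service account, run CMD, then remove the
# grant again whether or not CMD succeeded. Returns CMD's exit status.
with_temporary_anyuid() {
  "$OC" adm policy add-scc-to-user anyuid "$DEPLOYER_SA" || return 1
  status=0
  "$@" || status=$?
  "$OC" adm policy remove-scc-from-user anyuid "$DEPLOYER_SA"
  return "$status"
}
```

Running it with `OC=echo` first shows the two `adm policy` commands that would be issued.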

Comment 22 dlbewley 2016-10-07 21:09:54 UTC
I had just done that and redeploying did move on. Cassandra is still not in ready state, but I think that is probably unrelated.

Comment 23 Matt Wringe 2016-10-07 21:22:23 UTC
Actually, a better workaround, one that doesn't require adding extra permissions to the deployer service account, would be to just run the command as a user that has cluster admin privileges but doesn't have the extra 'anyuid' permission.

It's a bit weird, since the pod doesn't need this extra permission and we can't tell the pod to run without it. So whether the command works depends on the permissions of whoever runs it.

We are also working to update metrics so that it will work in a different fashion if it encounters this situation.

Comment 24 dlbewley 2016-10-07 21:42:57 UTC
I'm OK with the first workaround for the moment. A previous cassandra pod created by a failed deployer is stuck in Terminating. A new deployer pod (run with the workaround) is fixated on the fact that this defunct cassandra pod is not yet ready, so the deployment is stalled. Meanwhile the cassandra pod that this deployer pod actually created is 'ready'.

Comment 25 Matt Wringe 2016-10-07 21:47:06 UTC
After chatting with Jordan, the best option here would be to just run the deployer pod as the deployer service account.

Eg:

oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer -f metrics.yaml -p HAWKULAR_METRICS_HOSTNAME=....

This will just require a change to the docs and should always work when done this way. No extra steps would be needed, no updated containers are needed.
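
Combining this with the refresh command from the original description gives something like the following. It is a sketch under assumptions: `refresh_metrics_as_deployer` and the `OC`/`TEMPLATE` overrides are hypothetical names, the template path is the one quoted in the description, and the current project is assumed to be the one metrics are deployed in.

```shell
# OC and TEMPLATE can be overridden; OC=echo previews the command.
OC="${OC:-oc}"
TEMPLATE="${TEMPLATE:-/usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml}"
DEPLOYER_SA="system:serviceaccount:openshift-infra:metrics-deployer"

# refresh_metrics_as_deployer HAWKULAR_METRICS_HOSTNAME
# Run the metrics deployer in MODE=refresh while impersonating the
# deployer service account, so the pod is created with that account's SCC.
refresh_metrics_as_deployer() {
  metrics_hostname="${1:-}"
  if [ -z "$metrics_hostname" ]; then
    echo "usage: refresh_metrics_as_deployer HAWKULAR_METRICS_HOSTNAME" >&2
    return 2
  fi
  "$OC" new-app --as="$DEPLOYER_SA" -f "$TEMPLATE" \
    -p "HAWKULAR_METRICS_HOSTNAME=${metrics_hostname},MODE=refresh"
}
```

For example, `refresh_metrics_as_deployer metrics.test.os.example.com` issues the same command as in the description, with `--as` added.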

Comment 26 dlbewley 2016-10-07 22:25:28 UTC
The `--as` workaround also worked after scaling all the RCs down and removing the `anyuid` scc from the metrics-deployer. 

Not to pollute this BZ too much, but although I assume upgrades have been tested, there is a cassalog error that causes hawkular-metrics to enter CrashLoopBackOff. The deployer stays stuck in `validate_deployment_artifacts`.

```
18:13:26,663 INFO  [org.cassalog.core.CassalogImpl] (metricsservice-lifecycle-thread) Executing [script:vfs:/content/hawkular-metrics-api-jaxrs.war/WEB-INF/lib/hawkular-metrics-schema-0.18.4.Final-redhat-1.jar/org/hawkular/schema/cassalog.groovy, tags:[0.15.x, 0.18.x], vars:[keyspace:hawkular_metrics, reset:false, session:com.datastax.driver.core.SessionManager@7c4a5c29]]
18:13:27,218 INFO  [org.cassalog.core.CassalogImpl] (metricsservice-lifecycle-thread) Applying ChangeSet
-- version: set-keyspace
USE hawkular_metrics
--
18:13:27,240 INFO  [org.cassalog.core.CassalogImpl] (metricsservice-lifecycle-thread) Applying ChangeSet
-- version: 1.2
ALTER TABLE data ADD tags map<text,text>
--
18:13:27,261 FATAL [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200006: An error occurred trying to connect to the Cassandra cluster: org.cassalog.core.ChangeSetException: com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name tags because it conflicts with an existing column
```

Comment 27 Matt Wringe 2016-10-11 13:15:54 UTC
Would it be possible to attach the full Hawkular Metrics and Cassandra logs?

Comment 28 Matt Wringe 2016-10-11 13:16:43 UTC
Actually, could you please open a new bugzilla for this Cassalog issue? It will make tracking of it easier.

Comment 29 Matt Wringe 2016-10-11 13:17:57 UTC
The fix for this is to update the docs to specify to use the deployer sa when deploying. Please see https://github.com/openshift/openshift-docs/pull/3018

Comment 30 dlbewley 2016-10-11 15:12:18 UTC
The fix in comment 25 worked for me when I updated my production environment.

The Cassalog issue did not recur. I imagine it is an artifact of the failed deploys in my test environment. I'm planning to deploy from scratch as a workaround.

Comment 31 Peng Li 2016-10-12 04:55:53 UTC
*** Bug 1383881 has been marked as a duplicate of this bug. ***

Comment 35 Matt Wringe 2016-10-20 14:19:28 UTC
*** Bug 1387130 has been marked as a duplicate of this bug. ***

Comment 37 openshift-github-bot 2016-11-03 19:51:28 UTC
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/ee1f05198c489be30fd03c1644c1bcd3b939fcfd
Merge pull request #3018 from mwringe/bz-1381336

Bug 1381336, deploy metrics as the deployer service account.

