Bug 1847318 - FIPS enabled cluster, upgrade from 4.5 to 4.6 is failed for calling webhook "prometheusrules.openshift.io"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-16 08:11 UTC by Junqi Zhao
Modified: 2022-12-30 08:36 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:07:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 826 0 None closed Bug 1847318: set TLS min version to 1.2 2021-02-03 03:30:30 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:08:00 UTC

Description Junqi Zhao 2020-06-16 08:11:59 UTC
Description of problem:
Upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-16-005956 on vSphere fails with "remote error: tls: protocol version not supported"
# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-06-16T03:43:10Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus-k8s
      failed: reconciling Prometheus rules PrometheusRule failed: updating PrometheusRule
      object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io":
      Post https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s:
      remote error: tls: protocol version not supported'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
...
# oc get ValidatingWebhookConfiguration prometheusrules.openshift.io -oyaml
...
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURVVENDQWptZ0F3SUJBZ0lJWGdoeFpGdkZKSTR3RFFZSktvWklodmNOQVFFTEJRQXdOakUwTURJR0ExVUUKQXd3cmIzQmxibk5vYVdaMExYTmxjblpwWTJVdGMyVnlkbWx1WnkxemFXZHVaWEpBTVRVNU1qSTNNamN6T0RBZQpGdzB5TURBMk1UWXdNVFU0TlRkYUZ3MHlNakE0TVRVd01UVTROVGhhTURZeE5EQXlCZ05WQkFNTUsyOXdaVzV6CmFHbG1kQzF6WlhKMmFXTmxMWE5sY25acGJtY3RjMmxuYm1WeVFERTFPVEl5TnpJM016Z3dnZ0VpTUEwR0NTcUcKU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRQ1hBNFhsTEg5K3lsTmQ1dE1vNFh5VXFQRmtXaDJGd3J3VwpieVdMMEJOOHJPQTlua3g0NWJEVE5XRndQQVZNTjV0OUlqY05UeWgvTkRRLzlmYmtiajJEUXFGQXdSSVRhL1V1ClVHVTA3M05pYXYyS3pFRXBwR3llTVEvV24xZlpXRUZ3VTJpSklmVkxyQXQ0eUZxYnFWK3VORVliY1l3bzJJZm4Kdlc5b3hWMmRqdTZtSVUwYzF2L3MrLzQ2S01yZWRpSWJpUjdmMWhFV09OQnV0T01MaXN0R20reE9nYlVjZi90MAozeTBNazY1WGJiVHBFNjNZZzlsN1NQTEFkLzUrQkg0dkZjUG9xSmZtdXpxVjgzL0hjTFJ6alpuMnNTS0M0QzhsClpUa0w5UURhL2JnVnlRcjd4L0tjUWRhWE1xc2VDZnVLZHlhRjZYSUU0RVI3T3pBYUw4d2JBZ01CQUFHall6QmgKTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQQmdOVkhSTUJBZjhFQlRBREFRSC9NQjBHQTFVZERnUVdCQlJneEFkWApYUkUydGlqTFNFOTh4OVZPWEk4eDZEQWZCZ05WSFNNRUdEQVdnQlJneEFkWFhSRTJ0aWpMU0U5OHg5Vk9YSTh4CjZEQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUFQYWlsb0dqK28rUGdQR29XMWhiaksvRjFzcm5QdTFXUmJPOGcKbVQ0VEhOeUsycldtN0hnRWMwSjErVTJkaHpjQ21YMDVadW0xYks1YjhxOUNVQlJSdkp4YTFnYVJRQVJJR1FZQQpIOHp0eCtyUmVyUEYybVJLV243VjNMdWM3SE1tQnBWVkEzZ2ZMRHdXOURpMFRydDhOejYrZWJTRGJsRHdkNS9nCjk2c1Z1M1E1ZHlMMC9kTGVGOE1ndGdBc0dxY3BBSFJTVjJwVUcybkdhdWNsUlQ3ZVh4Z2ptaWJrVGtLMGFXTy8KeFFRZUFEVitvVkxrNnRhdDdubGtZK3dtYUdiclRpeXoyTkJzV3IxS1UxZDdKemJGTk1kUStPQ05weXR4Wk1rNApmbTY2bTh3azJTdllkZmVEQUZ5RkY4K0k5NlZXV0tHTU9qRWR3NnVYK3NXdDkyUW45UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
    service:
      name: prometheus-operator
      namespace: openshift-monitoring
      path: /admission-prometheusrules/validate
      port: 8080
  failurePolicy: Fail
  matchPolicy: Equivalent
...

Possibly caused by:
https://github.com/brancz/kube-rbac-proxy/blob/281be3a9475138da375ca7d0986d333b9cb70e8c/main.go#L269-L275


Version-Release number of selected component (if applicable):
upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-16-005956

How reproducible:
not sure

Steps to Reproduce:
1. upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-16-005956 in vsphere
2.
3.

Actual results:
upgrade failed

Expected results:
no issue

Additional info:

Comment 2 Standa Laznicka 2020-06-17 13:53:06 UTC
Not sure you noticed, but the POST goes to 8080 port, whereas the service publishes the 8443 port, and, at least in my 4.5-nightly cluster, the service targets the "https" port, which is set to be the 8443 port on the kube-rbac-proxy container. So, possibly, this could be a case of a misconfigured webhook, since it apparently sets the port to 8080 as well.
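
The port check described in this comment can be sketched as follows. This is only an illustration: the port values (8080 for the webhook, 8443 named "https" for the service) come from this report, while the function name and the dictionary layout, which loosely mirrors a Service's spec.ports list, are assumptions made for the example.

```python
def webhook_port_is_published(webhook_port, service_ports):
    """Return True if the webhook's clientConfig port matches a published service port."""
    return any(entry["port"] == webhook_port for entry in service_ports)


# Values taken from this report: the webhook posts to 8080, while the
# service on the 4.5-nightly cluster publishes 8443 under the name "https".
# The dict shape is an assumption modelled on Service spec.ports.
service_ports = [{"name": "https", "port": 8443, "targetPort": "https"}]

if __name__ == "__main__":
    print(webhook_port_is_published(8080, service_ports))
    print(webhook_port_is_published(8443, service_ports))
```

A mismatch found this way only shows that the webhook and the service disagree on that cluster; it does not by itself explain a TLS protocol version error.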

Comment 3 Paul Gier 2020-06-17 20:34:03 UTC
For 4.6, the 8080 port was changed from http to https.  It's somewhat against convention to use 8080 for https, but the kube-rbac-proxy was already listening on 8443 and requires authentication information which the API server doesn't provide.  So for the new webhook, the prometheus-operator listens for https connections on port 8080.  But maybe we should change this port number to avoid confusion.

As far as I can tell the cipher suite list is the same for kube-rbac-proxy, the prometheus-operator, and the settings from cluster-monitoring-operator.
https://github.com/openshift/cluster-monitoring-operator/blob/783927973303f78bbf162c777330418357db6957/assets/prometheus-operator/deployment.yaml#L36

Does this issue only occur on specific platforms, or does it happen for all platforms?

Comment 6 Junqi Zhao 2020-06-18 06:51:56 UTC
Tested with a FIPS-enabled AWS cluster and hit the same error as in Comment 5, whereas the regular upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-17-180933 on AWS succeeded:
https://openshift-release.svc.ci.openshift.org/releasestream/4.6.0-0.nightly/release/4.6.0-0.nightly-2020-06-17-180933

I think the only difference is FIPS, so that is the cause of the failure.

Comment 7 Junqi Zhao 2020-06-18 10:19:41 UTC
Confirmed on an AWS cluster without FIPS enabled: the upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-17-180933 succeeds.

Comment 8 Paul Gier 2020-06-24 14:19:20 UTC
I'm not sure what is going on with this yet.  It seems that we are setting the TLS cipher suites correctly in prometheus-operator via the web.tls-cipher-suites setting, AFAICT.  Unfortunately the error message doesn't say which cipher suites are expected by the API server.  I need to check with the API server team whether there are any changes to the client settings for accepted cipher suites.

Comment 9 Maru Newby 2020-06-24 19:39:22 UTC
Can you please provide must-gather for the fips cluster after the error is reported? I'm hoping to see a more explicit error message in the logs and see what configuration is involved.

Comment 11 Paul Gier 2020-06-25 16:41:20 UTC
Looking at the other validating webhook configs currently in openshift (autoscaling.openshift.io, machine-api, multus.openshift.io) it appears that none of them set a specific list of tls cipher suites, so they are just using the default golang cipher suites.  So maybe we shouldn't be setting this for the prometheus operator webhook?

Comment 12 Junqi Zhao 2020-06-28 02:46:19 UTC
(In reply to Maru Newby from comment #9)
> Can you please provide must-gather for the fips cluster after the error is
> reported? I'm hoping to see a more explicit error message in the logs and
> see what configuration is involved.

See the URL provided by Standa Laznicka; logs are available in many of the failed upgrade runs.

Comment 13 Paul Gier 2020-06-29 18:09:32 UTC
Prometheus operator by default sets the TLS min version to 1.3, but it seems this is not compatible with FIPS.  Setting the min version to 1.3 seems to resolve the issue.  https://github.com/openshift/cluster-monitoring-operator/pull/826

Comment 14 Junqi Zhao 2020-07-01 01:49:41 UTC
(In reply to Paul Gier from comment #13)
> Setting the min version to 1.3 seems to
> resolve the issue. 

should be:
Setting the min version to 1.2
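
The incompatibility discussed in comments 13 and 14 can be modelled as a simple overlap check between the version ranges of the two peers. The sketch below is not the handshake code of Go or of any component named in this bug; the function and class names are illustrative, and the assumption that the FIPS-mode client offers at most TLS 1.2 is inferred from comment 13 rather than confirmed by logs.

```python
from enum import IntEnum
from typing import Optional


class TLSVersion(IntEnum):
    """TLS protocol versions, using their on-the-wire version numbers."""
    TLS10 = 0x0301
    TLS11 = 0x0302
    TLS12 = 0x0303
    TLS13 = 0x0304


def negotiate(server_min: TLSVersion, server_max: TLSVersion,
              client_min: TLSVersion, client_max: TLSVersion) -> Optional[TLSVersion]:
    """Return the highest version both peers accept, or None if the ranges do not overlap."""
    low = max(server_min, client_min)
    high = min(server_max, client_max)
    if low > high:
        return None  # corresponds to a "protocol version not supported" failure
    return TLSVersion(high)


if __name__ == "__main__":
    # Assumed client range in FIPS mode: TLS 1.0 up to TLS 1.2.
    # Server min 1.3 (the operator default per comment 13): no overlap.
    print(negotiate(TLSVersion.TLS13, TLSVersion.TLS13,
                    TLSVersion.TLS10, TLSVersion.TLS12))
    # Server min lowered to 1.2 (the fix): TLS 1.2 is agreed on.
    print(negotiate(TLSVersion.TLS12, TLSVersion.TLS13,
                    TLSVersion.TLS10, TLSVersion.TLS12))
```

Under these assumptions, lowering the server minimum to 1.2 is what makes the two ranges overlap again.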

Comment 18 Junqi Zhao 2020-07-08 12:16:03 UTC
Upgrade from 4.5.0-rc.6 to 4.6.0-0.nightly-2020-07-07-233934 with FIPS enabled succeeds.
before upgrade
*********************************************
# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         64m     Cluster version is 4.5.0-rc.6

# oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-137-131.us-east-2.compute.internal   Ready    master   82m   v1.18.3+6025c28
...

# oc debug node/ip-10-0-137-131.us-east-2.compute.internal
Starting pod/ip-10-0-137-131us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.131
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /proc/sys/crypto/fips_enabled
1
*********************************************
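
The manual FIPS check above (reading /proc/sys/crypto/fips_enabled on the node) could be scripted along these lines. This is a minimal sketch meant to run on the host itself; the function name is made up for the example, and treating a missing file as "not enabled" is an assumption.

```python
from pathlib import Path


def fips_enabled(path: str = "/proc/sys/crypto/fips_enabled") -> bool:
    """Return True if the kernel reports FIPS mode, i.e. the file contains "1"."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # Assumption: a missing or unreadable file is treated as FIPS disabled.
        return False


if __name__ == "__main__":
    print("FIPS enabled:", fips_enabled())
```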

after upgrade
*********************************************
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-233934   True        False         97m     Cluster version is 4.6.0-0.nightly-2020-07-07-233934

# oc get co/monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.6.0-0.nightly-2020-07-07-233934   True        False         False      105m


# oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep "web.tls-min-version"
        - --web.tls-min-version=VersionTLS12
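
The grep above can also be expressed as a small check over the container arguments, for instance in an automated post-upgrade test. The sketch below is hedged: the helper name is hypothetical, and the argument list contains only the single flag shown in the output above, not the full deployment spec.

```python
from typing import Iterable, Optional


def flag_value(args: Iterable[str], name: str) -> Optional[str]:
    """Return the value of a --name=value style argument, or None if it is absent."""
    prefix = f"--{name}="
    for arg in args:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


# Only the flag visible in the grep output; other arguments are omitted.
container_args = ["--web.tls-min-version=VersionTLS12"]

if __name__ == "__main__":
    print(flag_value(container_args, "web.tls-min-version"))
```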

Comment 21 errata-xmlrpc 2020-10-27 16:07:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

