Description of problem:
upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-16-005956 in vsphere failed with "remote error: tls: protocol version not supported"

# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-06-16T03:43:10Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s: remote error: tls: protocol version not supported'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
...

# oc get ValidatingWebhookConfiguration prometheusrules.openshift.io -oyaml
...
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURVVENDQWptZ0F3SUJBZ0lJWGdoeFpGdkZKSTR3RFFZSktvWklodmNOQVFFTEJRQXdOakUwTURJR0ExVUUKQXd3cmIzQmxibk5vYVdaMExYTmxjblpwWTJVdGMyVnlkbWx1WnkxemFXZHVaWEpBTVRVNU1qSTNNamN6T0RBZQpGdzB5TURBMk1UWXdNVFU0TlRkYUZ3MHlNakE0TVRVd01UVTROVGhhTURZeE5EQXlCZ05WQkFNTUsyOXdaVzV6CmFHbG1kQzF6WlhKMmFXTmxMWE5sY25acGJtY3RjMmxuYm1WeVFERTFPVEl5TnpJM016Z3dnZ0VpTUEwR0NTcUcKU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRQ1hBNFhsTEg5K3lsTmQ1dE1vNFh5VXFQRmtXaDJGd3J3VwpieVdMMEJOOHJPQTlua3g0NWJEVE5XRndQQVZNTjV0OUlqY05UeWgvTkRRLzlmYmtiajJEUXFGQXdSSVRhL1V1ClVHVTA3M05pYXYyS3pFRXBwR3llTVEvV24xZlpXRUZ3VTJpSklmVkxyQXQ0eUZxYnFWK3VORVliY1l3bzJJZm4Kdlc5b3hWMmRqdTZtSVUwYzF2L3MrLzQ2S01yZWRpSWJpUjdmMWhFV09OQnV0T01MaXN0R20reE9nYlVjZi90MAozeTBNazY1WGJiVHBFNjNZZzlsN1NQTEFkLzUrQkg0dkZjUG9xSmZtdXpxVjgzL0hjTFJ6alpuMnNTS0M0QzhsClpUa0w5UURhL2JnVnlRcjd4L0tjUWRhWE1xc2VDZnVLZHlhRjZYSUU0RVI3T3pBYUw4d2JBZ01CQUFHall6QmgKTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQQmdOVkhSTUJBZjhFQlRBREFRSC9NQjBHQTFVZERnUVdCQlJneEFkWApYUkUydGlqTFNFOTh4OVZPWEk4eDZEQWZCZ05WSFNNRUdEQVdnQlJneEFkWFhSRTJ0aWpMU0U5OHg5Vk9YSTh4CjZEQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUFQYWlsb0dqK28rUGdQR29XMWhiaksvRjFzcm5QdTFXUmJPOGcKbVQ0VEhOeUsycldtN0hnRWMwSjErVTJkaHpjQ21YMDVadW0xYks1YjhxOUNVQlJSdkp4YTFnYVJRQVJJR1FZQQpIOHp0eCtyUmVyUEYybVJLV243VjNMdWM3SE1tQnBWVkEzZ2ZMRHdXOURpMFRydDhOejYrZWJTRGJsRHdkNS9nCjk2c1Z1M1E1ZHlMMC9kTGVGOE1ndGdBc0dxY3BBSFJTVjJwVUcybkdhdWNsUlQ3ZVh4Z2ptaWJrVGtLMGFXTy8KeFFRZUFEVitvVkxrNnRhdDdubGtZK3dtYUdiclRpeXoyTkJzV3IxS1UxZDdKemJGTk1kUStPQ05weXR4Wk1rNApmbTY2bTh3azJTdllkZmVEQUZ5RkY4K0k5NlZXV0tHTU9qRWR3NnVYK3NXdDkyUW45UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
    service:
      name: prometheus-operator
      namespace: openshift-monitoring
      path: /admission-prometheusrules/validate
      port: 8080
  failurePolicy: Fail
  matchPolicy: Equivalent
...

maybe caused by https://github.com/brancz/kube-rbac-proxy/blob/281be3a9475138da375ca7d0986d333b9cb70e8c/main.go#L269-L275

Version-Release number of selected component (if applicable):
upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-16-005956

How reproducible:
not sure

Steps to Reproduce:
1. upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-16-005956 in vsphere
2.
3.

Actual results:
upgrade failed

Expected results:
no issue

Additional info:
Not sure you noticed, but the POST goes to port 8080, whereas the service publishes port 8443, and, at least in my 4.5-nightly cluster, the service targets the "https" port, which is set to 8443 on the kube-rbac-proxy container. So this could be a case of a misconfigured webhook, since it apparently sets the port to 8080 as well.
For 4.6, the 8080 port was changed from http to https. It's somewhat against convention to use 8080 for https, but kube-rbac-proxy was already listening on 8443 and requires authentication information which the API server doesn't provide. So for the new webhook, the prometheus-operator listens for https connections on port 8080. Maybe we should change this port number to avoid confusion.

As far as I can tell, the cipher suite list is the same for kube-rbac-proxy, the prometheus-operator, and the settings from cluster-monitoring-operator.
https://github.com/openshift/cluster-monitoring-operator/blob/783927973303f78bbf162c777330418357db6957/assets/prometheus-operator/deployment.yaml#L36

Does this issue only occur on specific platforms, or does it happen on all platforms?
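[Editorial note] Matching cipher-suite lists don't rule out this handshake failure: in TLS, the protocol-version window is negotiated before cipher suites are compared, so non-overlapping version ranges fail first. A minimal, self-contained Go sketch (assumed values, standing in for the real components, not their actual code) that reproduces the exact alert from the description:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Test server standing in for the webhook endpoint; it insists on TLS 1.3.
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {}))
	srv.TLS = &tls.Config{MinVersion: tls.VersionTLS13}
	srv.StartTLS()
	defer srv.Close()

	// Client standing in for the API server, capped at TLS 1.2.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, // the test server's cert is self-signed
			MaxVersion:         tls.VersionTLS12,
		},
	}}
	_, err := client.Post(srv.URL, "application/json", nil)
	fmt.Println(err)
	// Post "https://127.0.0.1:...": remote error: tls: protocol version not supported
}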
Tested with an AWS FIPS-enabled cluster: same error as in Comment 5. Since the upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-17-180933 in aws (without FIPS) is successful
https://openshift-release.svc.ci.openshift.org/releasestream/4.6.0-0.nightly/release/4.6.0-0.nightly-2020-06-17-180933
I think the only difference is FIPS, so that is likely the failure cause.
Confirmed on one AWS cluster without FIPS enabled: the upgrade from 4.5.0-rc.1 to 4.6.0-0.nightly-2020-06-17-180933 in aws is successful.
I'm not sure what is going on with this yet. It seems that we are setting the TLS cipher suites correctly in prometheus-operator via the web.tls-cipher-suites setting, AFAICT. Unfortunately the error message doesn't say which cipher suites are expected by the API server. I need to check with the API server team whether there are any changes to the client settings for accepted cipher suites.
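[Editorial note] The terse client-side error hides what was actually offered, but the server side can log it. A hedged diagnostic sketch using crypto/tls's GetConfigForClient hook (illustrative only; the port, handler, and cert paths are placeholders, not the operator's actual code):

package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080", // matches the webhook port from the description
		Handler: http.NotFoundHandler(),
		TLSConfig: &tls.Config{
			// Log each ClientHello before the handshake is decided; returning
			// (nil, nil) means "keep using the original config".
			GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
				log.Printf("client offered versions=%x suites=%x",
					hello.SupportedVersions, hello.CipherSuites)
				return nil, nil
			},
		},
	}
	// "tls.crt"/"tls.key" stand in for the real serving certificate.
	log.Fatal(srv.ListenAndServeTLS("tls.crt", "tls.key"))
}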
Can you please provide must-gather for the fips cluster after the error is reported? I'm hoping to see a more explicit error message in the logs and see what configuration is involved.
Up for grabs at: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-fips-4.6&sort-by-flakiness=
Looking at the other validating webhook configs currently in openshift (autoscaling.openshift.io, machine-api, multus.openshift.io), it appears that none of them set a specific list of TLS cipher suites, so they are just using the default golang cipher suites (enumerated in the sketch below). So maybe we shouldn't be setting this for the prometheus-operator webhook?
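[Editorial note] A quick way to see what "default golang cipher suites" means in practice: a tls.Config with CipherSuites left nil negotiates from the standard library's default list, and tls.CipherSuites() (Go 1.14+) enumerates the suites crypto/tls considers secure, which approximates that default set:

package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Enumerate the non-insecure suites shipped with the Go standard library;
	// a tls.Config with CipherSuites == nil negotiates from (roughly) this set.
	for _, cs := range tls.CipherSuites() {
		fmt.Printf("%-45s versions=%x\n", cs.Name, cs.SupportedVersions)
	}
}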
(In reply to Maru Newby from comment #9)
> Can you please provide must-gather for the fips cluster after the error is
> reported? I'm hoping to see a more explicit error message in the logs and
> see what configuration is involved.

See the URL provided by Standa Laznicka; you can find logs in many failed upgrade runs.
Prometheus operator by default sets the TLS min version to 1.3, but it seems this is not compatible with FIPS. Setting the min version to 1.3 seems to resolve the issue.
https://github.com/openshift/cluster-monitoring-operator/pull/826
(In reply to Paul Gier from comment #13)
> Setting the min version to 1.3 seems to
> resolve the issue.

should be: Setting the min version to 1.2
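[Editorial note] In crypto/tls terms, the fix amounts to lowering the server's minimum accepted version so that FIPS-constrained clients, which at the time topped out at TLS 1.2, can complete the handshake. A minimal hedged sketch mirroring what --web.tls-min-version=VersionTLS12 configures (the address, handler, and cert paths are placeholders, not the operator's actual code):

package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: http.NotFoundHandler(),
		TLSConfig: &tls.Config{
			// The fix: accept TLS 1.2 and later instead of requiring TLS 1.3.
			MinVersion: tls.VersionTLS12,
		},
	}
	// "tls.crt"/"tls.key" stand in for the real serving certificate.
	log.Fatal(srv.ListenAndServeTLS("tls.crt", "tls.key"))
}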
upgrade from 4.5.0-rc.6 to 4.6.0-0.nightly-2020-07-07-233934 with FIPS enabled is successful

before upgrade
*********************************************
# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         64m     Cluster version is 4.5.0-rc.6

# oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-137-131.us-east-2.compute.internal   Ready    master   82m   v1.18.3+6025c28
...

# oc debug node/ip-10-0-137-131.us-east-2.compute.internal
Starting pod/ip-10-0-137-131us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.131
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /proc/sys/crypto/fips_enabled
1
*********************************************

after upgrade
*********************************************
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-233934   True        False         97m     Cluster version is 4.6.0-0.nightly-2020-07-07-233934

# oc get co/monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.6.0-0.nightly-2020-07-07-233934   True        False         False      105m

# oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep "web.tls-min-version"
        - --web.tls-min-version=VersionTLS12
*********************************************
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196