Description of problem:
Installing knative via Operator Hub appears to install an istio-gateway.
The istio-ingressgateway deployed in the istio-system namespace has size=1, and the istio-ingressgateway PDB has minAvailable=1. This setup prevents node drain from completing during a cluster upgrade and blocks maintenance on OpenShift.
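To illustrate why this combination stalls a drain, here is a minimal sketch of the PodDisruptionBudget arithmetic. It is not the actual controller code; the function names are hypothetical, and it assumes minAvailable is given as an absolute number rather than a percentage.

```python
def allowed_disruptions(current_healthy: int, min_available: int) -> int:
    """Voluntary evictions a PDB permits right now (never negative)."""
    return max(0, current_healthy - min_available)


def can_drain(current_healthy: int, min_available: int) -> bool:
    """A drain must evict the pod, so at least one disruption has to be allowed."""
    return allowed_disruptions(current_healthy, min_available) >= 1


# The configuration from this report: one replica, minAvailable=1.
print(allowed_disruptions(1, 1))  # 0
print(can_drain(1, 1))            # False
# Hypothetical comparison: a second healthy replica leaves room for one eviction.
print(can_drain(2, 1))            # True
```

With a single replica and minAvailable=1, the budget never allows an eviction, so the drain waits indefinitely.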
Version-Release number of selected component (if applicable):
Knative Serving Operator
0.7.1 provided by Red Hat
Steps to Reproduce:
1. Install knative serving operator and operand
2. Upgrade from 4.1.z to 4.1.z+1
Actual results:
Cluster is unable to upgrade
Expected results:
The installation of this component should not block maintenance on the underlying cluster.
Thanks for the bug report, and for creating our first bugzilla in the OpenShift Serverless product! While our team works out the proper workflow to triage bugs coming in via Bugzilla instead of JIRA, I just wanted to comment that we're aware of your bug report. In this community operator we automatically install Istio via the Maistra operator and we'll need to see if there's a way to influence that operator to deploy Istio in a way that doesn't disrupt cluster upgrades.
In the meantime, to upgrade this cluster, you should be able to delete the Maistra ControlPlane which will delete all those deployments in istio-system. This will break Knative until you recreate that ControlPlane, but it should let your cluster upgrade.
To delete the Maistra ControlPlane:
oc delete controlplane minimal-istio -n istio-system
After the upgrade, if you want to get Knative / Istio working again:
oc apply -f https://raw.githubusercontent.com/openshift-knative/knative-serving-operator/0.7.0/deploy/resources/maistra/maistra-controlplane-0.10.0.yaml
The same problem is there when Service Mesh CR2 is installed:
[root@us-west-2-983 ~]# oc get PodDisruptionBudget -n istio-system
NAME                    MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
cluster-local-gateway   1               N/A               0                     18m
istio-galley            1               N/A               0                     19m
istio-ingressgateway    1               N/A               0                     18m
istio-pilot             1               N/A               0                     18m
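For anyone who wants to spot these budgets before starting an upgrade, the following is a rough sketch that filters the JSON form of the listing above (`oc get pdb -n istio-system -o json`) for budgets that currently allow zero disruptions. The helper name and the trimmed sample data are hypothetical; `status.disruptionsAllowed` is the PDB status field behind the ALLOWED DISRUPTIONS column.

```python
def blocking_pdbs(pdb_list: dict) -> list:
    """Return names of PDBs whose status currently allows no evictions."""
    names = []
    for item in pdb_list.get("items", []):
        # Treat a missing status as "no disruptions allowed" to stay conservative.
        allowed = item.get("status", {}).get("disruptionsAllowed", 0)
        if allowed == 0:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical, trimmed sample mirroring the table above.
SAMPLE = {
    "items": [
        {"metadata": {"name": "cluster-local-gateway"}, "status": {"disruptionsAllowed": 0}},
        {"metadata": {"name": "istio-galley"}, "status": {"disruptionsAllowed": 0}},
        {"metadata": {"name": "istio-ingressgateway"}, "status": {"disruptionsAllowed": 0}},
        {"metadata": {"name": "istio-pilot"}, "status": {"disruptionsAllowed": 0}},
    ]
}

for name in blocking_pdbs(SAMPLE):
    print(name)
```

In practice the input would come from `json.load` on the command output rather than from an inline sample.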
Listing clusterversion after triggering the update:
[root@us-west-2-983 ~]# oc get clusterversion -o json | jq ".items[].status.history"
"state": "Partial", <-----------------------------this shows the failed upgrade
[root@us-west-2-983 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         6m52s   Cluster version is 4.1.13
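The same check can be scripted. Below is a small sketch that scans the JSON from `oc get clusterversion -o json` for history entries whose state is "Partial". The function name is hypothetical, and the sample data is an illustrative reconstruction of the situation above, not the actual output from this cluster.

```python
def partial_updates(cluster_version_list: dict) -> list:
    """Return the versions of history entries that did not complete."""
    versions = []
    for cv in cluster_version_list.get("items", []):
        for entry in cv.get("status", {}).get("history", []):
            if entry.get("state") == "Partial":
                versions.append(entry.get("version"))
    return versions


# Illustrative sample only: the stalled update plus the previously completed one.
SAMPLE = {
    "items": [
        {
            "status": {
                "history": [
                    {"state": "Partial", "version": "4.1.13"},
                    {"state": "Completed", "version": "4.1.11"},
                ]
            }
        }
    ]
}

print(partial_updates(SAMPLE))  # ['4.1.13']
```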
I believe this is a bug that belongs in Service Mesh, and I also believe it impacts their latest RC2 / GA builds. If a user creates a ServiceMeshControlPlane using their default YAML template, I believe it will block the cluster from upgrading. I was unable to find Service Mesh as a product in Bugzilla, so I will have to track down where they keep product bugs.
This bug is tracked in OpenShift Service Mesh as https://issues.jboss.org/browse/OSSM-39 . To insulate OpenShift Serverless customers from this bug, we provide an example ServiceMeshControlPlane that has global.defaultPodDisruptionBudget.enabled set to false, which works around this issue - https://docs.openshift.com/container-platform/4.1/serverless/installing-openshift-serverless.html#installing-service-mesh-control-plane_installing-openshift-serverless .
If OpenShift Serverless customers choose not to use our example ServiceMeshControlPlane and create their own then they can still hit this issue until a fix for the referenced Service Mesh bug is released.
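For customers who write their own ServiceMeshControlPlane, a quick way to check for this setting might look like the sketch below. It assumes the resource has been loaded into a dict (for example from `oc get servicemeshcontrolplane -o json`), that the setting lives under `spec.istio.global` (the `spec.istio` prefix is an assumption, not something stated in this bug), and that a missing value means the PDBs are enabled. The helper name is hypothetical.

```python
def pdb_enabled(smcp: dict) -> bool:
    """Return True unless defaultPodDisruptionBudget.enabled is explicitly false."""
    cursor = smcp
    # Assumed path: spec.istio.global.defaultPodDisruptionBudget
    for key in ("spec", "istio", "global", "defaultPodDisruptionBudget"):
        cursor = cursor.get(key) or {}
    # Assumption: when the key is absent, the PDBs are created.
    return bool(cursor.get("enabled", True))


# A control plane with the workaround applied.
patched = {
    "spec": {"istio": {"global": {"defaultPodDisruptionBudget": {"enabled": False}}}}
}
# A control plane that never mentions the setting.
untouched = {"spec": {"istio": {"global": {}}}}

print(pdb_enabled(patched))    # False
print(pdb_enabled(untouched))  # True
```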
Should this bug be blocked on: https://bugzilla.redhat.com/show_bug.cgi?id=1762888
(In reply to Eric Rich from comment #6)
> Should this bug be blocked on:
No. OCP/PDBs/MCO is working as designed. The application told us that it could not go down and we did the right thing not taking it down.
I have re-purposed 1762888 to track making this situation obvious and clear to users. But there is no behavior change expected from OCP. The problem needs to be fixed by serverless. We're just going to make the problem easier to run down when a customer makes this same mistake.
The specific issue reported in this bug is fixed by our Serverless operator to the best of our ability. Our documentation now guides the user through installing a ServiceMeshControlPlane that does not set PDBs to avoid hitting this issue. If the user ignores our docs and installs a misconfigured ServiceMeshControlPlane, then they'll still have this problem.
I believe the Service Mesh team is working on making it harder for users to misconfigure their ServiceMeshControlPlane this way in their next release.
Closing as this has been worked around for several releases of OpenShift Serverless. And, Service Mesh has also fixed the underlying bug on their side in the meantime.