Bug 1747472 - Installing Knative on an OpenShift 4 cluster blocks cluster upgrades [NEEDINFO]
Summary: Installing Knative on an OpenShift 4 cluster blocks cluster upgrades
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Serverless
Classification: Red Hat
Component: serving
Version: 1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Ben Browning
QA Contact: Nobody
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-30 14:06 UTC by Derek Carr
Modified: 2020-01-24 22:06 UTC (History)
6 users

Fixed In Version: 1.1.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-24 22:06:02 UTC
Target Upstream Version:
inecas: needinfo? (decarr)


Attachments


Links
System ID Priority Status Summary Last Updated
JBoss Issue Tracker OSSM-39 Major Awaiting Release Default ServiceMeshControlPlane Breaks OCP Cluster Upgrades 2020-01-24 15:25:26 UTC

Description Derek Carr 2019-08-30 14:06:16 UTC
Description of problem:
Installing knative via Operator Hub appears to install an istio-gateway.

The istio-ingressgateway Deployment in the istio-system namespace runs with a single replica (size=1), and the istio-ingressgateway PDB has minAvailable=1. This configuration prevents node drain during a cluster upgrade and blocks maintenance on OpenShift.
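For illustration, a PDB of roughly this shape, paired with a single-replica Deployment, is what blocks the drain. This is a sketch, not the exact manifest Maistra deploys; the labels and API version are assumptions based on the OCP 4.1 era.

```yaml
# Illustrative sketch only. With one replica and minAvailable: 1,
# allowed disruptions is 0, so `oc adm drain` can never evict the pod
# and the node drain during an upgrade stalls.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: istio-ingressgateway   # hypothetical selector label
```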

Version-Release number of selected component (if applicable):

Knative Serving Operator
0.7.1 provided by Red Hat

How reproducible:
Always

Steps to Reproduce:
1. Install knative serving operator and operand
2. Upgrade from 4.1.z to 4.1.z+1

Actual results:
Cluster is unable to upgrade

Expected results:
The installation of this component should not block maintenance on the underlying cluster.

Additional info:

Comment 1 Ben Browning 2019-08-30 18:38:58 UTC
Thanks for the bug report, and for creating the first Bugzilla in the OpenShift Serverless product! While our team works out the proper workflow for triaging bugs that come in via Bugzilla instead of JIRA, I wanted to note that we're aware of your report. This community operator automatically installs Istio via the Maistra operator, and we'll need to see whether there's a way to influence that operator to deploy Istio in a way that doesn't disrupt cluster upgrades.

In the meantime, to upgrade this cluster, you should be able to delete the Maistra ControlPlane, which will delete all of those deployments in istio-system. This will break Knative until you recreate the ControlPlane, but it should let your cluster upgrade.

To delete the Maistra ControlPlane:
oc delete controlplane minimal-istio -n istio-system

After the upgrade, if you want to get Knative / Istio working again:
oc apply -f https://raw.githubusercontent.com/openshift-knative/knative-serving-operator/0.7.0/deploy/resources/maistra/maistra-controlplane-0.10.0.yaml

Comment 2 Martin Gencur 2019-09-02 07:41:42 UTC
The same problem occurs when Service Mesh RC2 is installed:

[root@us-west-2-983 ~]# oc get PodDisruptionBudget -n istio-system
NAME                    MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
cluster-local-gateway   1               N/A               0                     18m
istio-galley            1               N/A               0                     19m
istio-ingressgateway    1               N/A               0                     18m
istio-pilot             1               N/A               0                     18m

Listing clusterversion after triggering the update:

[root@us-west-2-983 ~]# oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "version": "4.1.13",
    "verified": true,
    "state": "Partial",  <-----------------------------this shows the failed upgrade
    "startedTime": "2019-09-02T07:24:47Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94",
    "completionTime": null
  },
  {
    "version": "4.1.11",
    "verified": false,
    "state": "Completed",
    "startedTime": "2019-09-02T06:34:00Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:bfca31dbb518b35f312cc67516fa18aa40df9925dc84fdbcd15f8bbca425d7ff",
    "completionTime": "2019-09-02T07:24:47Z"
  }
]

[root@us-west-2-983 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.11    True        False         6m52s   Cluster version is 4.1.13

Comment 3 Ben Browning 2019-09-04 11:23:47 UTC
I believe this is a bug that belongs in Service Mesh, and I also believe it impacts their latest RC2 / GA builds. If a user creates a ServiceMeshControlPlane using their default YAML template, I believe it will block the cluster from being able to upgrade. I was unable to find Service Mesh as a product in Bugzilla, so I will have to track down where they keep product bugs.

Comment 5 Ben Browning 2019-10-17 10:28:11 UTC
This bug is tracked in OpenShift Service Mesh as https://issues.jboss.org/browse/OSSM-39 . To insulate OpenShift Serverless customers from this bug, we provide an example ServiceMeshControlPlane that sets global.defaultPodDisruptionBudget.enabled to false, which works around the issue - https://docs.openshift.com/container-platform/4.1/serverless/installing-openshift-serverless.html#installing-service-mesh-control-plane_installing-openshift-serverless .
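A ServiceMeshControlPlane carrying that setting looks roughly like the following. This is a sketch based on the Maistra v1 API of that era; the resource name is illustrative, and the authoritative example is the linked documentation.

```yaml
# Sketch of the documented workaround: disabling Istio's default
# PodDisruptionBudgets means node drains during cluster upgrades
# are no longer blocked by zero-disruption PDBs.
apiVersion: maistra.io/v1
kind: ServiceMeshControlPlane
metadata:
  name: basic-install        # hypothetical name
  namespace: istio-system
spec:
  istio:
    global:
      defaultPodDisruptionBudget:
        enabled: false
```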

If OpenShift Serverless customers choose not to use our example ServiceMeshControlPlane and instead create their own, they can still hit this issue until a fix for the referenced Service Mesh bug is released.

Comment 6 Eric Rich 2019-10-17 19:22:02 UTC
Should this bug be blocked on: https://bugzilla.redhat.com/show_bug.cgi?id=1762888

Comment 7 Eric Paris 2019-10-17 21:34:09 UTC
(In reply to Eric Rich from comment #6)
> Should this bug be blocked on:
> https://bugzilla.redhat.com/show_bug.cgi?id=1762888

No. OCP/PDBs/MCO is working as designed. The application told us that it could not go down, and we did the right thing by not taking it down.

I have re-purposed 1762888 to track making this situation obvious and clear to users, but there is no behavior change expected from OCP. The problem needs to be fixed by Serverless; we're just going to make it easier to run down when a customer makes this same mistake.

Comment 8 Ben Browning 2019-10-17 21:41:51 UTC
The specific issue reported in this bug is fixed by our Serverless operator to the best of our ability. Our documentation now guides the user through installing a ServiceMeshControlPlane that does not set PDBs to avoid hitting this issue. If the user ignores our docs and installs a misconfigured ServiceMeshControlPlane, then they'll still have this problem.

I believe the Service Mesh team is working on making it harder for users to misconfigure their ServiceMeshControlPlane this way in their next release.

Comment 9 Eric Paris 2019-10-17 21:43:26 UTC
Awesome Ben!

Comment 12 Ben Browning 2020-01-24 22:06:02 UTC
Closing, as this has been worked around for several releases of OpenShift Serverless, and Service Mesh has since fixed the underlying bug on their side.

