Bug 1881481 - CVO hotloops on some service manifests
Summary: CVO hotloops on some service manifests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-22 13:51 UTC by Stefan Schimanski
Modified: 2021-07-27 22:33 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:33:27 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 558 | 0 | None | closed | Bug 1881481: Only compare ServiceType when set in manifest | 2021-05-14 17:34:47 UTC
Github | openshift cluster-version-operator pull 563 | 0 | None | closed | Bug 1881481: TargetPort should default to port in ServicePort if unset | 2021-05-17 20:47:58 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:33:45 UTC

Description Stefan Schimanski 2020-09-22 13:51:33 UTC
CVO keeps updating services that lack spec.sessionAffinity:

{"count":77,"path":"/api/v1/namespaces/openshift-cluster-samples-operator/services/metrics"}
{"count":78,"path":"/api/v1/namespaces/openshift-cluster-version/services/cluster-version-operator"}
{"count":50,"path":"/api/v1/namespaces/openshift-config-managed/configmaps/signatures-managed"}
{"count":77,"path":"/api/v1/namespaces/openshift-console/services/downloads"}
{"count":75,"path":"/api/v1/namespaces/openshift-image-registry/services/image-registry-operator"}
{"count":78,"path":"/api/v1/namespaces/openshift-machine-config-operator/services/machine-config-daemon"}
{"count":77,"path":"/api/v1/namespaces/openshift-marketplace/services/marketplace-operator-metrics"}

Comment 1 W. Trevor King 2020-09-22 21:36:34 UTC
Stefan suggests possibly waiting until API-server support for server-side apply [1] goes GA and rerolling the CVO's apply logic to use that instead of client-side merging, which might help here.  And bug 1879184 might end up with a [Late] CI guard based on the audit logs.  But whatever is going on here is unlikely to be new in 4.6, so punting to 4.7.

[1]: https://kubernetes.io/blog/2020/04/01/kubernetes-1.18-feature-server-side-apply-beta-2/

Comment 2 Stefan Schimanski 2020-09-23 13:13:27 UTC
I don't think #c1 reflects what I meant. I meant that writing perfect client-side merging funcs for all types is an infinite game with many chances for mistakes and failure. Instead, the right solution is to triage these bugs, fix the manifests for now, and add an e2e test that uncovers the issues before new manifests merge.

Comment 3 Stefan Schimanski 2020-09-23 13:14:47 UTC
I fear that rewriting services puts load on the endpoints controllers and possibly on the networking stack as it updates iptables. Increasing severity until proven otherwise.

Comment 4 Stefan Schimanski 2020-09-23 13:35:00 UTC
I double checked and it seems that kube-apiserver notices for this one that nothing changed and omits updating the object. So this BZ is purely about unnecessary load on the apiserver and etcd.

Comment 5 W. Trevor King 2020-09-23 17:03:09 UTC
> So this BZ is purely about unnecessary load on the apiserver and etcd.

Punting back to 4.7, because I don't see any new-in-4.6 regressions here, and it's really late in the 4.6 cycle to make new 4.6 blockers unless we have a solid story around why this is a critical issue.

> ... and add an e2e test that uncovers the issues before new manifests merge.

This is bug 1879184, right?

Comment 6 W. Trevor King 2020-10-02 23:12:15 UTC
It's end of sprint, and this is not going to get fixed in the next few hours.  Hopefully we will at least get the Late audit guard from bug 1879184 in next sprint, and then we'll see which team should fix this issue.

Comment 7 Jack Ottofaro 2020-10-23 18:59:45 UTC
Adding UpcomingSprint as we have reached the end of the current sprint and pushing this bug to the next sprint.

Comment 8 Lalatendu Mohanty 2020-12-11 15:19:44 UTC
Marking this as not a blocker, as this is not a regression and we have had this design for a long time.

Comment 10 Jack Ottofaro 2021-05-12 14:23:51 UTC
CVO's Service validation does not check sessionAffinity; however, it does check spec.type. The spec.type check is incorrect: it should compare the field only if it was set in the manifest. Otherwise, CVO continuously tries to clear the server-set default of "ClusterIP".
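
The distinction described in this comment, together with the targetPort defaulting named in the title of pull 563, can be sketched in Go as follows. This is an illustrative sketch, not the CVO's actual code: ServiceSpec, ServicePort, normalizePorts, and needsUpdate are hypothetical, simplified stand-ins (in the real Kubernetes API, targetPort is an int-or-string value), and the port number is made up.

```go
package main

import "fmt"

// ServicePort and ServiceSpec are hypothetical, simplified stand-ins for the
// Kubernetes core/v1 types; only the fields relevant to this bug are modeled.
type ServicePort struct {
	Name       string
	Port       int32
	TargetPort int32 // 0 means "unset in the manifest"
}

type ServiceSpec struct {
	Type  string // "" means "unset in the manifest"
	Ports []ServicePort
}

// normalizePorts copies the manifest ports and fills an unset TargetPort
// with Port, mirroring the default the API server applies.
func normalizePorts(ports []ServicePort) []ServicePort {
	out := make([]ServicePort, len(ports))
	for i, p := range ports {
		if p.TargetPort == 0 {
			p.TargetPort = p.Port
		}
		out[i] = p
	}
	return out
}

// needsUpdate reports whether the in-cluster spec differs from the manifest
// in a field that the manifest actually sets (after defaulting).
func needsUpdate(existing, required ServiceSpec) bool {
	// Compare Type only when the manifest sets it; otherwise the
	// server-set default ("ClusterIP") would always look like drift.
	if required.Type != "" && required.Type != existing.Type {
		return true
	}
	want := normalizePorts(required.Ports)
	if len(want) != len(existing.Ports) {
		return true
	}
	for i := range want {
		if want[i] != existing.Ports[i] {
			return true
		}
	}
	return false
}

func main() {
	// Manifest leaves type and targetPort unset; the server has defaulted both.
	manifest := ServiceSpec{Ports: []ServicePort{{Name: "metrics", Port: 8443}}}
	live := ServiceSpec{
		Type:  "ClusterIP",
		Ports: []ServicePort{{Name: "metrics", Port: 8443, TargetPort: 8443}},
	}
	fmt.Println("needs update:", needsUpdate(live, manifest))
}
```

A comparison that treated the unset manifest fields as required values would report a difference on every sync and issue a PUT each time, which is the hotloop this bug tracks.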

Comment 11 W. Trevor King 2021-05-12 16:40:04 UTC
Made the title more generic, because the issue is the presence of hotlooping, not a particular property.  For example, if we had two separate properties that both triggered Service hotlooping, we'd want to fix both of them to close out this bug.

Comment 13 liujia 2021-05-24 09:08:22 UTC
Reproduced with 4.8.0-fc.3-x86_64 (taking openshift-cluster-version/services/cluster-version-operator and openshift-marketplace/services/marketplace-operator-metrics as examples).

# cat manifests/0000_50_operator-marketplace_08_service.yaml|grep -A15 "spec:"
spec:
  selector:
    name: marketplace-operator
  ports:
  - name: metrics
    port: 8383
    protocol: TCP
    targetPort: 8383
  - name: https-metrics
    port: 8081
    protocol: TCP
    targetPort: 8081
# curl -s https://raw.githubusercontent.com/openshift/cluster-version-operator/master/install/0001_00_cluster-version-operator_03_service.yaml|grep -A7 "spec:"
spec:
  type: ClusterIP
  selector:
    k8s-app: cluster-version-operator
  ports:
  - name: metrics
    port: 9099 # chosen to be in the internal open range

The CVO logs show a Service update (PUT) request roughly every 3 minutes for each of the two resources above.
# ./oc logs cluster-version-operator-6b5f889866-d9t88|grep -E "request: PUT.*services.*marketplace-operator-metrics|request: PUT.*services.*cluster-version-operator"|head -n8
I0524 07:26:12.785812       1 request.go:591] Throttling request took 95.823647ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-marketplace/services/marketplace-operator-metrics
I0524 07:26:14.986352       1 request.go:591] Throttling request took 96.07034ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/services/cluster-version-operator
I0524 07:29:31.949862       1 request.go:591] Throttling request took 96.718612ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-marketplace/services/marketplace-operator-metrics
I0524 07:29:34.100938       1 request.go:591] Throttling request took 97.068142ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/services/cluster-version-operator
I0524 07:32:50.843349       1 request.go:591] Throttling request took 95.999508ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-marketplace/services/marketplace-operator-metrics
I0524 07:32:53.043228       1 request.go:591] Throttling request took 94.302534ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/services/cluster-version-operator
I0524 07:36:10.015879       1 request.go:591] Throttling request took 95.76657ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-marketplace/services/marketplace-operator-metrics
I0524 07:36:12.215440       1 request.go:591] Throttling request took 95.327531ms, request: PUT:https://api-int.jliu-48.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/services/cluster-version-operator

# ./oc logs cluster-version-operator-6b5f889866-d9t88|grep "request.go.*request: PUT.*services.*marketplace-operator-metrics"|wc -l
29
# ./oc logs cluster-version-operator-6b5f889866-d9t88|grep "request.go.*request: PUT.*services.*cluster-version-operator"|wc -l
31

Comment 14 liujia 2021-05-25 08:49:48 UTC
Verified on 4.8.0-0.nightly-2021-05-21-233425

# ./oc logs cluster-version-operator-69b746856b-6v6nt|grep "request.go.*request: PUT.*services"|wc -l
0

Comment 17 errata-xmlrpc 2021-07-27 22:33:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

