Bug 1916865
| Summary: | Some operators are not getting updated, stalling OpenShift upgrade | | |
|---|---|---|---|
| Product: | Red Hat OpenShift Pipelines | Reporter: | Shivkumar Ople <sople> |
| Component: | pipelines | Assignee: | Vincent Demeester <vdemeest> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Ruchir Garg <rgarg> |
| Severity: | urgent | Docs Contact: | Robert Krátký <rkratky> |
| Priority: | urgent | | |
| Version: | unspecified | CC: | aos-bugs, cboudjna, deads, jmencak, lszaszki, maszulik, mfojtik, nagrawal, nikthoma, openshift-bugs-escalate, ppitonak, rbaumgar, rcarrier, sejug, sgreene, shsaxena, skraft, vdemeest, vpagar, vrutkovs, wking, zkosic |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 09:35:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hello,
Yes, removing the lock helped to upgrade the operator to the latest version; both the node-tuning operator and the cluster-storage-operator got updated with the same steps.
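For reference, "removing the lock" here means deleting the operator's leader-election lock ConfigMap so that a running operator pod can acquire leadership again. A minimal sketch of that step, assuming the node-tuning-operator-lock ConfigMap in the openshift-cluster-node-tuning-operator namespace that is mentioned later in this bug (adjust the names for other operators):
# Inspect the lock ConfigMap and the pod UID it references (names are assumptions based on later comments)
oc get configmap node-tuning-operator-lock -n openshift-cluster-node-tuning-operator -o yaml
# Delete the stale lock so a live operator pod can re-acquire leadership
oc delete configmap node-tuning-operator-lock -n openshift-cluster-node-tuning-operator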
Now there are three more operators that are still showing the older version, even though their status shows as ready:
dns 4.5.16 True False False 180d
machine-config 4.5.16 True False False 6d21h
network 4.5.16 True False False 460d
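For context, a listing of this shape is typically produced by oc get clusteroperators, where the columns are NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED and SINCE. A minimal sketch of re-checking just the operators that are still on the old version:
# List all ClusterOperators and their Available/Progressing/Degraded status
oc get clusteroperators
# Narrow down to the three operators still reporting the old version
oc get clusteroperators dns machine-config network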
There are no errors reported for the dns operator:
[vpagar@supportshell 02815381]$ cat 0130-dns.txt
Name: dns
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2019-10-16T13:43:10Z
Generation: 1
Resource Version: 394446855
Self Link: /apis/config.openshift.io/v1/clusteroperators/dns
UID: e4c08dc8-f01a-11e9-af10-005056915271
Spec:
Status:
Conditions:
Last Transition Time: 2021-01-03T16:36:06Z
Message: All desired DNS DaemonSets available and operand Namespace exists
Reason: AsExpected
Status: False
Type: Degraded
Last Transition Time: 2021-01-11T17:57:24Z
Message: Desired and available number of DNS DaemonSets are equal
Reason: AsExpected
Status: False
Type: Progressing
Last Transition Time: 2020-07-22T07:38:46Z
Message: At least 1 DNS DaemonSet available
Reason: AsExpected
Status: True
Type: Available
Extension: <nil>
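The output above was read from a saved file (0130-dns.txt); assuming it was captured with oc describe, the same information can be pulled live with something like:
# Describe the dns ClusterOperator to see its Degraded/Progressing/Available conditions
oc describe clusteroperator dns
# Or dump the full object, including status conditions, as YAML
oc get clusteroperator dns -o yaml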
I can't seem to connect to that server via SSH; can you place the must-gather in a more accessible location? The must-gather I got from Ashish doesn't appear to match the description here. In that must-gather, configmap/node-tuning-operator-lock references a pod with uid bce757b6-f3fd-4d0f-b206-0d08efbc74f5, which exists. To debug this issue, the first step will be a must-gather that shows the issue happening, so we can investigate the set of resources present and compare against various logs.

Hi, I have downloaded the attached must-gather. I checked the KCM (kube-controller-manager) logs and it looks like the configmap wasn't deleted because the GC wasn't able to construct its dependency graph:
2021-01-15T15:57:54.482245366Z I0115 15:57:54.482215 1 event.go:278] Event(v1.ObjectReference{Kind:"CronJob", Namespace:"dev-parcelcp", Name:"parcel-integration-cronjob-tracking-usps", UID:"b8107a0d-4cde-478e-9f23-89d8423dc64f", APIVersion:"batch/v1beta1", ResourceVersion:"394122549", FieldPath:""}): type: 'Warning' reason: 'FailedNeedsStart' Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
2021-01-15T15:58:03.424505482Z E0115 15:58:03.424444 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: conversion webhook for tekton.dev/v1alpha1, Kind=ClusterTask failed: Post https://tekton-pipelines-webhook.openshift-pipelines.svc:443/?timeout=30s: x509: certificate signed by unknown authority
2021-01-15T15:58:04.151463253Z I0115 15:58:04.151341 1 shared_informer.go:249] stop requested
2021-01-15T15:58:04.151609141Z E0115 15:58:04.151576 1 shared_informer.go:226] unable to sync caches for garbage collector
2021-01-15T15:58:04.151676792Z E0115 15:58:04.151661 1 garbagecollector.go:228] timed out waiting for dependency graph builder sync during GC sync (attempt 19802)
These errors suggest a certificate issue (possibly a time skew issue): the controller wasn't able to connect to the Kube API. I will assign the issue to the appropriate person for further investigation.
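To narrow down the "x509: certificate signed by unknown authority" error above, one approach (a sketch, not a confirmed procedure from this case) is to compare the CA bundle registered on the Tekton CRD's conversion webhook with what is actually running in the openshift-pipelines namespace:
# Show the conversion webhook config, including the caBundle the kube-apiserver trusts
# (field path assumes an apiextensions.k8s.io/v1 CRD; older CRDs expose .spec.conversion.webhookClientConfig)
oc get crd clustertasks.tekton.dev -o jsonpath='{.spec.conversion.webhook.clientConfig}'
# Check the webhook pods backing tekton-pipelines-webhook.openshift-pipelines.svc
oc get pods -n openshift-pipelines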
Based on:

2021-01-15T15:58:03.424505482Z E0115 15:58:03.424444 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: conversion webhook for tekton.dev/v1alpha1, Kind=ClusterTask failed: Post https://tekton-pipelines-webhook.openshift-pipelines.svc:443/?timeout=30s: x509: certificate signed by unknown authority

the problem is not with GC, which can't proceed without processing the entire dependency tree, but with the Tekton conversion webhook, which apparently has a problem with its certificate. I'm moving this to the build team to deal with.

The build team does not handle Tekton bugs. Reached out to that team, and am fixing the product / component accordingly.

Dear colleagues, I'm Solution Sales in Austria, responsible for iLogistics. I have been following the case since the beginning (1st Dec 2020), and the customer is still facing a situation where his operation is degraded because the upgrade hasn't run through properly. We are now getting heavy push-back from senior customer management, who have asked us for clarification and a clear road-map for how we are going to address this unpleasant situation. I understand that it is difficult to commit to any resolution dates, but I would kindly ask you to:
- engage directly with the customer in order to avoid delays through ping-pong (e.g. a regular video conference until the problem is resolved)
- let us know WHO is working on the resolution
- share your plans about a potential Plan B if the bug can't be resolved
Thanks & kind regards, Stephan

Hi team, I tried the above steps, but one CRD, 'clustertasks.tekton.dev', was not deleting, so we removed its finalizers as per https://access.redhat.com/solutions/4165791 (sketched below), after which the CRD got deleted, the Pipelines operator uninstalled, and the cluster is upgrading further. Will keep you posted if we come across any issues.

*** Bug 1958885 has been marked as a duplicate of this bug. ***
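Regarding the finalizer removal mentioned above: the KCS article describes clearing finalizers on a resource that is stuck in deletion, which for this CRD would look roughly like the following (a sketch of that kind of patch, not a transcript of what was actually run; removing finalizers is a last resort):
# Confirm the CRD is stuck terminating because of finalizers
oc get crd clustertasks.tekton.dev -o jsonpath='{.metadata.finalizers}'
# Clear the finalizers so the deletion can complete
oc patch crd clustertasks.tekton.dev --type=merge -p '{"metadata":{"finalizers":[]}}'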
A brief look at the must-gather reveals:

conditions:
  - lastTransitionTime: "2020-11-03T14:30:33Z"
    message: Cluster has deployed "4.5.16"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-11-03T14:35:17Z"
    message: Cluster version is "4.5.16"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-10-16T13:46:48Z"
    message: DaemonSet "tuned" available
    reason: AsExpected
    status: "False"
    type: Degraded

NTO is working as expected and it is not degraded, which means NTO is not failing at upgrading and is not "blocking" any upgrade.
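For completeness, those node-tuning conditions can also be read straight from a live cluster rather than from the must-gather, assuming the usual ClusterOperator name node-tuning:
# Print the node-tuning ClusterOperator conditions (type, status, message) shown above
oc get clusteroperator node-tuning -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'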