Bug 1916865
| Summary: | Some operators are not getting updated, stalling OpenShift upgrade | | |
|---|---|---|---|
| Product: | Red Hat OpenShift Pipelines | Reporter: | Shivkumar Ople <sople> |
| Component: | pipelines | Assignee: | Vincent Demeester <vdemeest> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Ruchir Garg <rgarg> |
| Severity: | urgent | Docs Contact: | Robert Krátký <rkratky> |
| Priority: | urgent | | |
| Version: | unspecified | CC: | aos-bugs, cboudjna, deads, jmencak, lszaszki, maszulik, mfojtik, nagrawal, nikthoma, openshift-bugs-escalate, ppitonak, rbaumgar, rcarrier, sejug, sgreene, shsaxena, skraft, vdemeest, vpagar, vrutkovs, wking, zkosic |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 09:35:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hello,
Yes, removing the lock helped to upgrade the operator to the latest version; both the node-tuning operator and the cluster-storage-operator got updated with the same steps.
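For reference, "removing the lock" here means deleting the operator's leader-election lock ConfigMap so that a running operator pod can acquire leadership again. A minimal sketch of that step, assuming the node-tuning-operator-lock ConfigMap in the openshift-cluster-node-tuning-operator namespace that is mentioned later in this bug (adjust the names for other operators):
# Inspect the lock ConfigMap and the pod UID it references (names are assumptions based on later comments)
oc get configmap node-tuning-operator-lock -n openshift-cluster-node-tuning-operator -o yaml
# Delete the stale lock so a live operator pod can re-acquire leadership
oc delete configmap node-tuning-operator-lock -n openshift-cluster-node-tuning-operator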
Now there are three more operators that are still showing the older version, even though their status shows as ready:
dns 4.5.16 True False False 180d
machine-config 4.5.16 True False False 6d21h
network 4.5.16 True False False 460d
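For context, a listing of this shape is typically produced by oc get clusteroperators, where the columns are NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED and SINCE. A minimal sketch of re-checking just the operators that are still on the old version:
# List all ClusterOperators and their Available/Progressing/Degraded status
oc get clusteroperators
# Narrow down to the three operators still reporting the old version
oc get clusteroperators dns machine-config network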
There are no errors reported for the dns operator:
[vpagar@supportshell 02815381]$ cat 0130-dns.txt
Name: dns
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2019-10-16T13:43:10Z
Generation: 1
Resource Version: 394446855
Self Link: /apis/config.openshift.io/v1/clusteroperators/dns
UID: e4c08dc8-f01a-11e9-af10-005056915271
Spec:
Status:
Conditions:
Last Transition Time: 2021-01-03T16:36:06Z
Message: All desired DNS DaemonSets available and operand Namespace exists
Reason: AsExpected
Status: False
Type: Degraded
Last Transition Time: 2021-01-11T17:57:24Z
Message: Desired and available number of DNS DaemonSets are equal
Reason: AsExpected
Status: False
Type: Progressing
Last Transition Time: 2020-07-22T07:38:46Z
Message: At least 1 DNS DaemonSet available
Reason: AsExpected
Status: True
Type: Available
Extension: <nil>
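The output above was read from a saved file (0130-dns.txt); assuming it was captured with oc describe, the same information can be pulled live with something like:
# Describe the dns ClusterOperator to see its Degraded/Progressing/Available conditions
oc describe clusteroperator dns
# Or dump the full object, including status conditions, as YAML
oc get clusteroperator dns -o yaml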
I can't seem to connect to that server via SSH; can you place the must-gather in a more accessible location? The must-gather I got from Ashish doesn't appear to match the description here. In that must-gather, configmap/node-tuning-operator-lock references a pod with uid bce757b6-f3fd-4d0f-b206-0d08efbc74f5, which exists. To debug this issue, the first step will be a must-gather that shows the issue happening, so we can investigate the set of resources present and compare against various logs.

Hi, I have downloaded the attached must-gather. I checked the KCM (kube-controller-manager) logs and it looks like the configmap wasn't deleted because the GC wasn't able to construct its dependency graph:
2021-01-15T15:57:54.482245366Z I0115 15:57:54.482215 1 event.go:278] Event(v1.ObjectReference{Kind:"CronJob", Namespace:"dev-parcelcp", Name:"parcel-integration-cronjob-tracking-usps", UID:"b8107a0d-4cde-478e-9f23-89d8423dc64f", APIVersion:"batch/v1beta1", ResourceVersion:"394122549", FieldPath:""}): type: 'Warning' reason: 'FailedNeedsStart' Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
2021-01-15T15:58:03.424505482Z E0115 15:58:03.424444 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: conversion webhook for tekton.dev/v1alpha1, Kind=ClusterTask failed: Post https://tekton-pipelines-webhook.openshift-pipelines.svc:443/?timeout=30s: x509: certificate signed by unknown authority
2021-01-15T15:58:04.151463253Z I0115 15:58:04.151341 1 shared_informer.go:249] stop requested
2021-01-15T15:58:04.151609141Z E0115 15:58:04.151576 1 shared_informer.go:226] unable to sync caches for garbage collector
2021-01-15T15:58:04.151676792Z E0115 15:58:04.151661 1 garbagecollector.go:228] timed out waiting for dependency graph builder sync during GC sync (attempt 19802)
These errors suggest a certificate issue (possibly a time skew issue): the controller wasn't able to connect to the Kube API. I will assign the issue to the appropriate person for further investigation.
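To narrow down the "x509: certificate signed by unknown authority" error above, one approach (a sketch, not a confirmed procedure from this case) is to compare the CA bundle registered on the Tekton CRD's conversion webhook with what is actually running in the openshift-pipelines namespace:
# Show the conversion webhook config, including the caBundle the kube-apiserver trusts
# (field path assumes an apiextensions.k8s.io/v1 CRD; older CRDs expose .spec.conversion.webhookClientConfig)
oc get crd clustertasks.tekton.dev -o jsonpath='{.spec.conversion.webhook.clientConfig}'
# Check the webhook pods backing tekton-pipelines-webhook.openshift-pipelines.svc
oc get pods -n openshift-pipelines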
Based on:

2021-01-15T15:58:03.424505482Z E0115 15:58:03.424444 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: conversion webhook for tekton.dev/v1alpha1, Kind=ClusterTask failed: Post https://tekton-pipelines-webhook.openshift-pipelines.svc:443/?timeout=30s: x509: certificate signed by unknown authority

the problem is not with GC, which can't proceed without processing the entire dependency tree, but with the Tekton conversion webhook, which apparently has a problem with its certificate. I'm moving this to the build team to deal with.

The build team does not handle Tekton bugs. Reached out to that team, and am fixing the product / component accordingly.

Dear colleagues, I'm Solution Sales in Austria, responsible for iLogistics. I have been following the case since the beginning (1st Dec 2020), and the customer is still facing a situation where his operation is degraded because the upgrade hasn't run through properly. We are now getting heavy push-back from senior customer management, who have asked us for clarification and a clear road-map for how we are going to address this unpleasant situation. I understand that it is difficult to commit to any resolution dates, but I would kindly ask you to:
- engage directly with the customer in order to avoid delays through ping-pong (e.g. a regular video conference until the problem is resolved)
- let us know WHO is working on the resolution
- share your plans about a potential Plan B if the bug can't be resolved
Thanks & kind regards, Stephan

Hi team, I tried the above steps, but one CRD, 'clustertasks.tekton.dev', was not deleting, so we removed its finalizers as per https://access.redhat.com/solutions/4165791 (sketched below), after which the CRD got deleted, the Pipelines operator uninstalled, and the cluster is upgrading further. Will keep you posted if we come across any issues.

*** Bug 1958885 has been marked as a duplicate of this bug. ***
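Regarding the finalizer removal mentioned above: the KCS article describes clearing finalizers on a resource that is stuck in deletion, which for this CRD would look roughly like the following (a sketch of that kind of patch, not a transcript of what was actually run; removing finalizers is a last resort):
# Confirm the CRD is stuck terminating because of finalizers
oc get crd clustertasks.tekton.dev -o jsonpath='{.metadata.finalizers}'
# Clear the finalizers so the deletion can complete
oc patch crd clustertasks.tekton.dev --type=merge -p '{"metadata":{"finalizers":[]}}'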
A brief look at the must-gather reveals:

conditions:
  - lastTransitionTime: "2020-11-03T14:30:33Z"
    message: Cluster has deployed "4.5.16"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-11-03T14:35:17Z"
    message: Cluster version is "4.5.16"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-10-16T13:46:48Z"
    message: DaemonSet "tuned" available
    reason: AsExpected
    status: "False"
    type: Degraded

NTO is working as expected and it is not degraded, which means NTO is not failing at upgrading and is not "blocking" any upgrade.
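For completeness, those node-tuning conditions can also be read straight from a live cluster rather than from the must-gather, assuming the usual ClusterOperator name node-tuning:
# Print the node-tuning ClusterOperator conditions (type, status, message) shown above
oc get clusteroperator node-tuning -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'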