A brief look at the must-gather reveals:

  conditions:
  - lastTransitionTime: "2020-11-03T14:30:33Z"
    message: Cluster has deployed "4.5.16"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-11-03T14:35:17Z"
    message: Cluster version is "4.5.16"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-10-16T13:46:48Z"
    message: DaemonSet "tuned" available
    reason: AsExpected
    status: "False"
    type: Degraded

NTO is working as expected and is not degraded, which means NTO is not failing to upgrade and is not "blocking" the upgrade.
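For reference, a minimal sketch of how to pull the same conditions straight from a live cluster rather than the must-gather (standard ClusterOperator API, no assumptions beyond the operator name used above):

  # Dump the node-tuning ClusterOperator conditions as type/status/message rows
  oc get clusteroperator node-tuning \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'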
Hello,

Yes, removing the lock helped to upgrade the operator to the latest version; the node-tuning operator and cluster-storage-operator got updated with the same steps. Now there are three more operators showing the older version while their status shows as ready:

  dns              4.5.16   True   False   False   180d
  machine-config   4.5.16   True   False   False   6d21h
  network          4.5.16   True   False   False   460d

There are no errors in the dns operator pod:

  [vpagar@supportshell 02815381]$ cat 0130-dns.txt
  Name:         dns
  Namespace:
  Labels:       <none>
  Annotations:  <none>
  API Version:  config.openshift.io/v1
  Kind:         ClusterOperator
  Metadata:
    Creation Timestamp:  2019-10-16T13:43:10Z
    Generation:          1
    Resource Version:    394446855
    Self Link:           /apis/config.openshift.io/v1/clusteroperators/dns
    UID:                 e4c08dc8-f01a-11e9-af10-005056915271
  Spec:
  Status:
    Conditions:
      Last Transition Time:  2021-01-03T16:36:06Z
      Message:               All desired DNS DaemonSets available and operand Namespace exists
      Reason:                AsExpected
      Status:                False
      Type:                  Degraded
      Last Transition Time:  2021-01-11T17:57:24Z
      Message:               Desired and available number of DNS DaemonSets are equal
      Reason:                AsExpected
      Status:                False
      Type:                  Progressing
      Last Transition Time:  2020-07-22T07:38:46Z
      Message:               At least 1 DNS DaemonSet available
      Reason:                AsExpected
      Status:                True
      Type:                  Available
    Extension:               <nil>
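A hedged sketch of how the remaining stragglers can be spotted and re-checked on the cluster (the version string "4.5.16" is taken from the output above; dns is just the example operator quoted here):

  # List ClusterOperators still reporting the pre-upgrade version
  oc get clusteroperators | awk 'NR==1 || $2=="4.5.16"'

  # Same conditions dump as in 0130-dns.txt, pulled live from the cluster
  oc describe clusteroperator dns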
I can't seem to connect to that server via SSH. Can you place the must-gather in a more accessible location?
The must-gather I got from Ashish doesn't appear to match the description here. In that must-gather, configmap/node-tuning-operator-lock references a pod with UID bce757b6-f3fd-4d0f-b206-0d08efbc74f5, which exists. To debug this issue, the first step will be a must-gather that shows the issue happening, so we can investigate the set of resources present and compare against various logs.
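For anyone reproducing the check described above, a sketch (the namespace is assumed to be the default NTO namespace, openshift-cluster-node-tuning-operator):

  # Compare the owner UID recorded on the lock ConfigMap against the pods that actually exist
  NS=openshift-cluster-node-tuning-operator
  oc get configmap node-tuning-operator-lock -n "$NS" \
    -o jsonpath='{.metadata.ownerReferences[0].uid}{"\n"}'
  oc get pods -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\n"}{end}'

If the UID from the first command does not match any pod in the second, the lock ConfigMap is orphaned and should be garbage-collected; if it matches a live pod (as in the must-gather from Ashish), the lock is legitimate.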
Hi, I have downloaded the attached must-gather. I checked the KCM's (kube-controller-manager's) logs, and it looks like the configmap wasn't deleted because the GC wasn't able to construct its dependency graph:

  2021-01-15T15:57:54.482245366Z I0115 15:57:54.482215 1 event.go:278] Event(v1.ObjectReference{Kind:"CronJob", Namespace:"dev-parcelcp", Name:"parcel-integration-cronjob-tracking-usps", UID:"b8107a0d-4cde-478e-9f23-89d8423dc64f", APIVersion:"batch/v1beta1", ResourceVersion:"394122549", FieldPath:""}): type: 'Warning' reason: 'FailedNeedsStart' Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
  2021-01-15T15:58:03.424505482Z E0115 15:58:03.424444 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: conversion webhook for tekton.dev/v1alpha1, Kind=ClusterTask failed: Post https://tekton-pipelines-webhook.openshift-pipelines.svc:443/?timeout=30s: x509: certificate signed by unknown authority
  2021-01-15T15:58:04.151463253Z I0115 15:58:04.151341 1 shared_informer.go:249] stop requested
  2021-01-15T15:58:04.151609141Z E0115 15:58:04.151576 1 shared_informer.go:226] unable to sync caches for garbage collector
  2021-01-15T15:58:04.151676792Z E0115 15:58:04.151661 1 garbagecollector.go:228] timed out waiting for dependency graph builder sync during GC sync (attempt 19802)

These messages suggest a certificate issue (possibly time skew) - the controller wasn't able to connect to the Kube API. I will assign the issue to the appropriate person for further investigation.
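A sketch of how the same garbage-collector errors can be pulled live from the KCM pods, assuming the usual OpenShift pod/namespace/container layout for kube-controller-manager:

  # Grep the KCM logs for GC dependency-graph and conversion-webhook failures
  for pod in $(oc get pods -n openshift-kube-controller-manager -o name | grep kube-controller-manager); do
    oc logs -n openshift-kube-controller-manager "$pod" -c kube-controller-manager \
      | grep -E 'garbagecollector|dependency graph|conversion webhook'
  done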
Based on

  2021-01-15T15:58:03.424505482Z E0115 15:58:03.424444 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: conversion webhook for tekton.dev/v1alpha1, Kind=ClusterTask failed: Post https://tekton-pipelines-webhook.openshift-pipelines.svc:443/?timeout=30s: x509: certificate signed by unknown authority

the problem is not with the GC, which can't proceed without processing the entire dependency tree, but with the tekton conversion webhook, which apparently has a certificate problem. I'm moving this to the build team to deal with.
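To narrow down the webhook certificate problem, a hedged sketch of the checks one could run (field paths assume an apiextensions.k8s.io/v1 CRD; older v1beta1 CRDs keep this under .spec.conversion.webhookClientConfig instead, and the namespace/service names are taken from the log URL above):

  # Which CA bundle and service does the ClusterTask CRD trust for conversion?
  oc get crd clustertasks.tekton.dev \
    -o jsonpath='{.spec.conversion.webhook.clientConfig}{"\n"}'

  # The webhook lives behind this Service; its serving secret can be compared
  # against the caBundle above (the exact secret name is an assumption)
  oc get service tekton-pipelines-webhook -n openshift-pipelines
  oc get secret -n openshift-pipelines | grep -i webhook

A mismatch between the caBundle on the CRD and the certificate actually served by tekton-pipelines-webhook would explain the "certificate signed by unknown authority" error seen by the GC's metadata informer.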
The build team does not handle tekton bugs. I reached out to that team and am fixing the product/component accordingly.
Dear colleagues,

I'm the Solution Sales in Austria responsible for iLogistics. I have been following the case since the beginning (1st Dec 2020), and the customer is still facing a situation where his operation is degraded because the upgrade hasn't run through properly. We are now getting heavy push-back from senior customer management, who have asked us for clarification and a clear road-map for how we are going to address this unpleasant situation. I understand that it is difficult to commit to any resolution dates, but I would kindly ask you to:
- engage directly with the customer in order to avoid delays through ping-pong (e.g. a regular video conference until the problem is resolved)
- let us know WHO is working on the resolution
- share your plans for a potential Plan B if the bug can't be resolved

Thanks & kind regards,
Stephan
Hi team, I tried the above steps, but one CRD, 'clustertasks.tekton.dev', was not deleting, so we removed the finalizers as per https://access.redhat.com/solutions/4165791, after which the CRD got deleted, the pipeline operator uninstalled, and the cluster is upgrading further. Will keep you posted if we come across any issues.
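For the record, a hedged sketch of the finalizer removal applied here (consistent with the KCS article referenced above; clearing metadata.finalizers allows the stuck CRD deletion to complete):

  # Clear the finalizers on the stuck CRD so its pending deletion can finish
  oc patch crd clustertasks.tekton.dev --type=merge -p '{"metadata":{"finalizers":[]}}'

  # Alternatively, remove the finalizer entries by hand
  oc edit crd clustertasks.tekton.dev

Note that removing finalizers skips whatever cleanup the controller owning them would normally perform, so it is a last resort for objects stuck in Terminating, as in this case.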
*** Bug 1958885 has been marked as a duplicate of this bug. ***