Bug 1845242 - daemonsets fail to rollout during upgrade
Summary: daemonsets fail to rollout during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Tomáš Nožička
QA Contact: zhou ying
URL:
Whiteboard:
Depends On: 1843319
Blocks: 1851407
 
Reported: 2020-06-08 18:33 UTC by Tomáš Nožička
Modified: 2020-07-16 16:12 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1843319
Clones: 1851407
Environment:
Last Closed: 2020-07-16 16:12:24 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 25209 None closed [release-4.5] Bug 1845242: Fix DS expectations on recreate 2020-08-11 20:47:30 UTC
Red Hat Product Errata RHBA-2020:2909 None None None 2020-07-16 16:12:45 UTC

Description Tomáš Nožička 2020-06-08 18:33:29 UTC
+++ This bug was initially created as a clone of Bug #1843319 +++

Description of problem:

Upgrading from 4.4 to 4.5, monitoring failed to achieve the new level because its node-exporter daemonset failed to roll out:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/56


The clusteroperator reports:

"message": "Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 0, unavailable: 6)",
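The status in that message is what the monitoring operator gates its rollout wait on. A minimal sketch of the condition implied by the message (the type and function names here are illustrative stand-ins; the real fields live on appsv1.DaemonSetStatus and the real wait loop is in the cluster-monitoring-operator):

```go
package main

import "fmt"

// DaemonSetStatus mirrors the fields reported in the error message above.
// Illustrative stand-in for appsv1.DaemonSetStatus.
type DaemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
	NumberUnavailable      int32
}

// rolloutDone sketches the condition the operator waits on: every desired
// pod has been updated to the new template and is ready.
func rolloutDone(s DaemonSetStatus) bool {
	return s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	// The status from the failure above: desired 6, updated 6, ready 0.
	stuck := DaemonSetStatus{
		DesiredNumberScheduled: 6,
		UpdatedNumberScheduled: 6,
		NumberReady:            0,
		NumberUnavailable:      6,
	}
	fmt.Println(rolloutDone(stuck)) // false: all pods updated, none ready
}
```

With updated == desired but ready == 0, the wait never completes, which is why the clusteroperator reports the rollout as failed rather than slow.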


David (who asked that this bug be opened) referenced:
https://github.com/kubernetes/kubernetes/pull/91008

as a possible cause, and also indicated there might be a GC issue related to the pods being reaped when the daemonset was deleted.

Hopefully he can provide more details about what he saw that made him point this toward workloads rather than the monitoring component itself.

--- Additional comment from Maciej Szulik on 2020-06-03 10:54:58 UTC ---

Yeah, David already opened bug 1843187, so I'll close this as a duplicate.

--- Additional comment from Tomáš Nožička on 2020-06-03 12:22:32 UTC ---

FYI, I've confirmed the pods are actually ready, so this is likely the expectations bug we are tracking.

--- Additional comment from David Eads on 2020-06-05 13:09:03 UTC ---

This is a different bug. The GC controller is not cleaning up six pods in openshift-monitoring that do not have valid owner references.
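The GC rule David is describing can be sketched as follows (heavily simplified and with illustrative names; the real logic is the garbagecollector controller, which also honors foreground and orphan deletion policies):

```go
package main

import "fmt"

// ownerRef is a stripped-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	UID string
}

// shouldReap sketches the garbage collector's core rule: a dependent object
// whose owner UIDs no longer resolve to any live object is eligible for
// deletion. liveUIDs stands in for the GC's graph of existing owners.
func shouldReap(refs []ownerRef, liveUIDs map[string]bool) bool {
	if len(refs) == 0 {
		return false // no owner reference: not a GC dependent
	}
	for _, r := range refs {
		if liveUIDs[r.UID] {
			return false // at least one owner still exists
		}
	}
	return true
}

func main() {
	// The recreated DaemonSet has a new UID; pods from the deleted one
	// still point at the old UID, so they should be reaped.
	live := map[string]bool{"ds-new-uid": true}
	stale := []ownerRef{{UID: "ds-old-uid"}}
	fmt.Println(shouldReap(stale, live)) // true
}
```

In this bug the observation is the opposite of the sketch's expected behavior: the six openshift-monitoring pods with dangling owner references were not being cleaned up.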

--- Additional comment from Ben Parees on 2020-06-05 13:17:24 UTC ---

reopening based on https://bugzilla.redhat.com/show_bug.cgi?id=1843319#c3

--- Additional comment from Ben Parees on 2020-06-05 14:59:54 UTC ---

This is causing upgrade failures in 4.5; what is the basis for deferring it?

(In general, when a bug is deferred, a comment explaining the deferral should be added to the bug.)

--- Additional comment from Maciej Szulik on 2020-06-08 08:23:44 UTC ---

(In reply to Ben Parees from comment #5)
> This is causing upgrade failures in 4.5; what is the basis for deferring it?
> 
> (In general, when a bug is deferred, a comment explaining the deferral
> should be added to the bug.)

After bug 1843187 is fixed, Tomas will need to dig through the logs and identify what is causing the actual problem.
The fix has to land in 4.6 first and only then be back-ported (through a clone of this BZ) to 4.5. It's not that
we are deferring this bug; we're following the process, but it takes a bit of time to nail down the root cause and
find a fix.

--- Additional comment from Tomáš Nožička on 2020-06-08 09:32:49 UTC ---

I think the DS expectations didn't get cleared in the re-create case; working on a fix upstream.
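The failure mode Tomáš describes can be sketched with a toy version of the controller's expectations cache (names here are illustrative; the real cache is ControllerExpectations in pkg/controller, keyed by namespace/name, and the fix landed as TestExpectationsOnRecreate's subject in the linked PR):

```go
package main

import "fmt"

// expectations is a toy stand-in for ControllerExpectations, keyed by
// namespace/name like the real cache. The value is the number of pod
// creations the controller is still waiting to observe.
type expectations map[string]int

func (e expectations) ExpectCreations(key string, n int) { e[key] = n }

// SatisfiedExpectations: the controller only runs a full sync for a key
// once no creations are outstanding.
func (e expectations) SatisfiedExpectations(key string) bool { return e[key] <= 0 }

// DeleteExpectations models the fix for the re-create case: clear the entry
// when the DaemonSet is deleted, so a new object with the same key starts clean.
func (e expectations) DeleteExpectations(key string) { delete(e, key) }

func main() {
	e := expectations{}
	key := "openshift-monitoring/node-exporter"

	e.ExpectCreations(key, 6) // old DaemonSet is awaiting 6 pod creations

	// The DaemonSet is deleted and recreated under the same key. Without
	// clearing on delete, the new object inherits the stale, unsatisfied
	// expectations and the controller skips syncing it:
	fmt.Println(e.SatisfiedExpectations(key)) // false: sync is skipped

	e.DeleteExpectations(key) // what the fix does on delete
	fmt.Println(e.SatisfiedExpectations(key)) // true: the new DS can sync
}
```

Because the cache key is namespace/name rather than UID, a recreated DaemonSet is indistinguishable from the old one unless the delete handler clears the entry, which matches the stuck rollout seen during the upgrade.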

--- Additional comment from Ben Parees on 2020-06-08 13:55:15 UTC ---

> The fix has to land in 4.6 first and only then be back-ported (through clone of this BZ) to 4.5. It's not that
we are deferring thos bug, we're following the process, but it takes a bit of time to nail down the root cause and
find a fix.

You can still open the 4.5 clone now so we have a complete view of our blocker list for 4.5. Otherwise 4.5 risks going out the door without this being addressed (because no one except you and me is aware it affects 4.5; it doesn't show up on any 4.5 lists).

So by not opening the 4.5 BZ now, you are implicitly saying you're OK with shipping as is / deferring this bug.

Comment 2 Tomáš Nožička 2020-06-18 09:15:48 UTC
This bug is actively worked on.

Comment 7 zhou ying 2020-07-14 07:44:56 UTC
Checked with the unit test code; no issues found:

[root@dhcp-140-138 origin]# git branch
  master
  release-3.9
* release-4.5
[root@dhcp-140-138 origin]# cd vendor/k8s.io/kubernetes/pkg/controller/daemon

[root@dhcp-140-138 daemon]# go test -v -run TestExpectationsOnRecreate
=== RUN   TestExpectationsOnRecreate
I0714 15:42:38.226397   11985 shared_informer.go:223] Waiting for caches to sync for test dsc
I0714 15:42:38.326668   11985 shared_informer.go:230] Caches are synced for test dsc 
--- PASS: TestExpectationsOnRecreate (0.41s)
PASS
ok  	k8s.io/kubernetes/pkg/controller/daemon	0.446s

Comment 9 errata-xmlrpc 2020-07-16 16:12:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2909

