Description of problem:
This is a rare but very real bug I noticed while testing new functionality that creates MachineConfigs. The node becomes unschedulable after applying a MachineConfig, the MCP never updates, and the MCD reports:

failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-16-000734   True        False         6h37m   Cluster version is 4.6.0-0.nightly-2020-09-16-000734

How reproducible:
Rare. I have noticed these issues in the past, but this is the first time I caught it in time, before making further changes to the cluster.

Steps to Reproduce:
1. Create a MachineConfig for a custom MCP that causes a node reboot (a sketch of such a manifest is included after this report).

Actual results:
On rare occasions (roughly one in 20 attempts) the node becomes unschedulable and the custom MCP never updates. The MCD reports:

failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

Expected results:
The node remains schedulable and the MCP updates.

Additional info:
Other pods on the affected node show the same issue as the MCD pod. The must-gather below was taken "after" I had made the node schedulable again and deleted the MCD pod.
http://file.rdu.redhat.com/jmencak/bugzilla/2020-09-19/must-gather.local.7417114535902373082.tar.xz
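For reference, a minimal sketch of the kind of custom-pool MachineConfig described in step 1. The report does not include the actual manifests used in the test, so the pool name (worker-custom), MachineConfig name, and file path below are placeholders, not the real ones; any MachineConfig targeting a custom pool that writes a file will trigger a drain and reboot of the pool's nodes.

# Custom MachineConfigPool that selects nodes labeled worker-custom
# and picks up MachineConfigs with role "worker" or "worker-custom".
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-custom
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-custom]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-custom: ""
---
# MachineConfig for the custom pool; writing a file under /etc makes
# the MCD drain the node, apply the config, and reboot it.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-test
  labels:
    machineconfiguration.openshift.io/role: worker-custom
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/test-reboot-trigger
          mode: 0644
          contents:
            source: data:,reboot%20trigger

To reproduce with a sketch like this, label one worker node into the pool (oc label node <node-name> node-role.kubernetes.io/worker-custom=) and apply the manifests with oc apply -f; the node should drain, reboot, and rejoin, except on the rare occasions described above.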
*** This bug has been marked as a duplicate of bug 1874696 ***