Bug 1879524

Summary:	Node becomes unschedulable after an MC is applied and the node is rebooted.
Product:	OpenShift Container Platform	Reporter:	Jiří Mencák <jmencak>
Component:	Networking	Assignee:	Ben Bennett <bbennett>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	aos-bugs, jokerman
Version:	4.6
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-09-17 13:12:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jiří Mencák 2020-09-16 13:08:24 UTC

Description of problem:
This is a rare but very real bug that I noticed while testing new functionality that creates MachineConfigs. A node becomes unschedulable after an MC is applied, the MCP never updates, and the MCD reports errors such as "failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout".

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-09-16-000734 True False 6h37m Cluster version is 4.6.0-0.nightly-2020-09-16-000734

How reproducible:
Rare. I have noticed these issues before, but this is the first time I caught one before making further changes to the cluster.

Steps to Reproduce:
1. Create a MachineConfig that targets a custom MCP and causes a node reboot.

Actual results:
On rare occasions (roughly one in 20 attempts?), a node becomes unschedulable and the custom MCP never updates. The MCD reports:
failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
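
The error above is a TCP dial timeout to the service IP 172.30.0.1:443, so one way to narrow it down is to check plain TCP reachability from the affected node. The following is a minimal sketch only: it assumes Python 3 is available wherever it is run (for example in a debug pod on the node, which is an assumption and not something done in this report), and `probe_tcp` is a hypothetical helper, not part of any OpenShift tooling.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a single TCP connection and classify the outcome.

    Returns "ok" if the connection is established, "timeout" if the
    attempt does not complete within `timeout` seconds, and
    "error: <reason>" for any other socket-level failure.
    """
    try:
        # create_connection performs name resolution and the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # Mirrors the "i/o timeout" symptom: no reply to the connection attempt.
        return "timeout"
    except OSError as exc:
        # For example connection refused or no route to host.
        return f"error: {exc}"
```

Calling `probe_tcp("172.30.0.1", 443)` on the affected node and on a healthy node would show whether the timeout is specific to the node; a "timeout" result on only the affected node would be consistent with the other pods there showing the same issue.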

Expected results:
Node is schedulable and the MCP is updated.

Additional info:
Other pods on the affected node have the same issue as the MCD pod. The must-gather below was taken only after I had set the node back to schedulable and deleted the mcp pod.

http://file.rdu.redhat.com/jmencak/bugzilla/2020-09-19/must-gather.local.7417114535902373082.tar.xz

Comment 4 Ben Bennett 2020-09-17 13:12:01 UTC


*** This bug has been marked as a duplicate of bug 1874696 ***