Bug 1879524 - Node goes into unschedulable after a MC is applied and node rebooted.
Summary: Node goes into unschedulable after a MC is applied and node rebooted.
Keywords:
Status: CLOSED DUPLICATE of bug 1874696
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-16 13:08 UTC by Jiří Mencák
Modified: 2020-09-17 13:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-17 13:12:01 UTC
Target Upstream Version:
Embargoed:



Description Jiří Mencák 2020-09-16 13:08:24 UTC
Description of problem:
This is a rare but very real bug that I noticed while testing new functionality that creates MachineConfigs. After a MC is applied and the node reboots, the node becomes unschedulable, the MCP never updates, and the MCD reports: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-16-000734   True        False         6h37m   Cluster version is 4.6.0-0.nightly-2020-09-16-000734


How reproducible:
Rare. I have seen these issues before, but this is the first time I caught one before making further changes to the cluster.

Steps to Reproduce:
1. Create a machineconfig using a custom MCP causing a node reboot.
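
A minimal sketch of what step 1 could look like. The report does not include the actual manifests, so the pool name, the node label, and the file written by the MachineConfig are all hypothetical placeholders; the Ignition version 3.1.0 is an assumption based on the 4.6 release.

```shell
#!/bin/sh
# Sketch only: POOL, the node label, and the file written by the MachineConfig
# are hypothetical placeholders, not values from the original report.
POOL="worker-custom"

# Print a custom MachineConfigPool plus a MachineConfig that targets it.
render_manifests() {
cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: ${POOL}
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, ${POOL}]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/${POOL}: ""
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-${POOL}-example
  labels:
    machineconfiguration.openshift.io/role: ${POOL}
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/example-mc-test
          mode: 420
          overwrite: true
          contents:
            source: data:,example
EOF
}

# Apply only when oc is available and NODE is set; otherwise just print the manifests.
if command -v oc >/dev/null 2>&1 && [ -n "${NODE:-}" ]; then
  oc label node "$NODE" "node-role.kubernetes.io/${POOL}="
  render_manifests | oc apply -f -
else
  render_manifests
fi
```

Run with NODE set to a worker node name to label that node into the custom pool and apply both objects; the MCD should then drain and reboot that node.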

Actual results:
On rare occasions (roughly one in 20?), a node becomes unschedulable and the custom MCP never updates. The MCD reports:
failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

Expected results:
Node is schedulable, MCP updated.

Additional info:
Other pods on the affected node have the same issue as the MCD pod. The must-gather below was taken only after I had set the node back to schedulable and deleted the mcp pod.

http://file.rdu.redhat.com/jmencak/bugzilla/2020-09-19/must-gather.local.7417114535902373082.tar.xz
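
The report does not show the commands used to inspect the node or to set it back to schedulable. One possible sequence is sketched below; the node name is a hypothetical placeholder, and the `k8s-app=machine-config-daemon` pod label is assumed to match the MCD pods on this release.

```shell
#!/bin/sh
# Sketch only: NODE is a hypothetical placeholder for the affected node's name.
NODE="${NODE:-worker-0.example.com}"
NS="openshift-machine-config-operator"

# Print each command; execute it only when oc is installed.
run() {
  echo "+ $*"
  if command -v oc >/dev/null 2>&1; then "$@"; fi
}

inspect_and_recover() {
  # STATUS includes SchedulingDisabled while the node is cordoned.
  run oc get node "$NODE"
  # Shows whether the custom pool is still stuck updating.
  run oc get mcp
  # Find the MCD pod running on the affected node.
  run oc -n "$NS" get pods -l k8s-app=machine-config-daemon \
      --field-selector "spec.nodeName=$NODE" -o name
  # Mark the node schedulable again.
  run oc adm uncordon "$NODE"
}

inspect_and_recover
```

Uncordoning only clears the scheduling flag; it does not address the underlying timeout to 172.30.0.1:443 that the pods on the node report.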

Comment 4 Ben Bennett 2020-09-17 13:12:01 UTC

*** This bug has been marked as a duplicate of bug 1874696 ***

