Bug 1882490 - Azure installer misses hyphen in master NIC names
Summary: Azure installer misses hyphen in master NIC names
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.9.0
Assignee: Russell Teague
QA Contact: To Hung Sze
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-24 17:47 UTC by Jim Minter
Modified: 2021-10-18 17:29 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The network interface on master nodes on Azure are missing a hyphen in the interface name which is inconsistent with other platforms. Consequence: The inconsistency may cause issues in unexpected ways. Fix: Add a hyphen to the interface name to be consistent with other platforms. Result: Mater node interfaces are named the same regardless of the platform.
Clone Of:
Environment:
Last Closed: 2021-10-18 17:28:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 5082 0 None open Bug 1882490: data/azure/master: Add dash to nic name 2021-07-14 18:01:46 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:29:06 UTC

Description Jim Minter 2020-09-24 17:47:18 UTC
On Azure, the installer:

* names master Machine objects like <foo>-master-{0,1,2}. [1]
* creates NICs for the Master VMs named like <foo>-master{0,1,2}-nic. [2]
* adds those NICs into the appropriate load balancer(s) as a one-off activity.

If for whatever reason a master Machine object is removed and an identical one is recreated, the Azure cluster-api-provider uses a *different* naming format for the NIC on the new master VM.  Its format is <foo>-master-{0,1,2}-nic [3].  Note the additional hyphen.

This has a couple of consequences:

1. The new master will come up on a different IP address and fail to join etcd automatically.
2. The new master will silently not be a member of the load balancer group; cluster resilience is silently degraded.

I think that:
* the naming difference needs to be fixed for new clusters
* there is a strong argument for automation which detects and at the very least warns about clusters in this state
* I would like to see both of the above backported as far as 4.4

[1] https://github.com/openshift/installer/blob/master/pkg/asset/machines/azure/machines.go#L61
[2] https://github.com/openshift/installer/blob/master/data/data/azure/master/master.tf#L12
[3] https://github.com/openshift/cluster-api-provider-azure/blob/master/pkg/cloud/azure/defaults.go#L129

Comment 1 Abhinav Dahiya 2020-09-24 17:55:24 UTC
I don't really understand how it is important for installer to match the nic name to what the cluster-api names the nics it creates?

If a new machine is created by the machine-api, expecting to get the NIC to be part of the LB just because the names are same are previous is IMO incorrect. Someone should take explicit action to add the machine to LB by looking at the NICs of the latest machine and not depend on some _coincidence_ that the name matches the previous one.

There is no expectation that this is supposed to work as it is described in the description of the bug.

Also it looks like you are depending on machine deletion using machine-api to leave around the NIC since it seems like you are trying to re-use it. I think the machine-api is leaking the NIC here.

Comment 2 Abhinav Dahiya 2020-09-29 18:15:43 UTC
Brought it up with the team during one the standups and

1. The expected behaviour mentioned in the description is not correct, or not how it's supposed to work.
  since these LBs are not managed by the cluster, in DR scenarios the user is expected to update the backend with the new machines.
  so this is not be fixed by the installer team to match the names so that users can _coincidently_ depend on this behaviour.

2. We gave a thought to fixing this for new clusters to maybe allow this to work for ARO, backporting or changing existing clusters was rejected.
   But as in (1) we decided that changing new clusters has capacity to introduce dependencies that we do not intend to support. So even this was dropped.

3. If the user would the machine-api to add the machine to the LB because I think they have some fields, that could be an RFE

4. But I think the best long term solution would be make these API lbs managed by cluster itself using maybe SLBs are mentioned in https://github.com/openshift/enhancements/pull/459

(3) or (4) should be tracked in a JIRA RFE and not a bug.

Comment 3 Patrick Dillon 2021-07-12 14:51:09 UTC
Can we get some context of why this was reopened? Particularly a response to the first (or second in the case of ARO) point in https://bugzilla.redhat.com/show_bug.cgi?id=1882490#c2 would be helpful.

Comment 7 Russell Teague 2021-07-14 18:01:32 UTC
Opened a PR to address code consistency issue.

> $ git grep '${var.cluster_id}-master-${count.index}' | wc -l
21

> $ git grep '${var.cluster_id}-master${count.index}' | wc -l 
1

Comment 8 To Hung Sze 2021-07-14 18:41:26 UTC
4.8.0
tszeaz071421d-t9pvr-master-0
tszeaz071421d-t9pvr-master0-nic

Comment 9 Scott Dodson 2021-07-16 18:20:57 UTC
I'm resetting severity and priority to medium because an explanation for why this bug is difficult to fix was given in comment 2, then a solution was proposed to align things and that was rejected. I don't want this affecting SLO numbers and taking priority over other bugs which may more readily be addressed.

Comment 11 To Hung Sze 2021-08-19 15:17:40 UTC
openshift-install 4.9.0-0.nightly-2021-08-18-144658

gpei-shared-0819-08190939-master-0-nic

Comment 16 errata-xmlrpc 2021-10-18 17:28:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759


Note You need to log in before you can comment on or make changes to this bug.