Bug 1894539
Summary: [on-prem] Unable to deploy additional machinesets on separate subnets
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.7
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Martin André <m.andre>
Assignee: Martin André <m.andre>
QA Contact: weiwei jiang <wjiang>
CC: dgautam, dporter, ekasprzy, emarquez, javier.ordax, mkrejci, simore
Doc Type: Bug Fix
Doc Text:
Cause: For all on-premise platforms, when discovering the node IP, baremetal-runtimecfg wrongly assumed that nodes are always attached to a subnet containing the VIP, and looked for an address in that IP range.
Consequence: Nodes on separate subnets failed to generate configuration files with baremetal-runtimecfg.
Fix: Fall back to the IP address associated with the default route.
Result: New compute nodes can join the cluster when deployed on separate subnets.
Story Points: ---
Last Closed: 2021-02-24 15:30:03 UTC
Type: Bug
Regression: ---
Bug Blocks: 1932967
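The fix described in the Doc Text can be illustrated with a short sketch. This is purely illustrative (the actual baremetal-runtimecfg implementation is in Go, and the function and variable names here are hypothetical): prefer a node address whose subnet contains the VIP, and otherwise fall back to the address associated with the default route.

```python
import ipaddress

def choose_node_ip(node_cidrs, vip, default_route_ip):
    """Pick the node IP used for kubelet/CRI-O configuration.

    Hypothetical sketch, not the real baremetal-runtimecfg code.
    Old behaviour: only accept an address whose subnet contains the
    VIP. Fixed behaviour: fall back to the default-route address.
    """
    vip_addr = ipaddress.ip_address(vip)
    for cidr in node_cidrs:  # node addresses, e.g. "192.168.0.79/24"
        iface = ipaddress.ip_interface(cidr)
        if vip_addr in iface.network:
            return str(iface.ip)  # original logic: VIP-subnet match
    return default_route_ip       # the fix: default-route fallback

# A node on the VIP subnet keeps the old behaviour:
print(choose_node_ip(["192.168.0.79/24"], "192.168.0.5", "10.0.0.1"))
# → 192.168.0.79
# A compute node on a separate subnet now falls back instead of failing:
print(choose_node_ip(["192.168.66.124/24"], "192.168.0.5", "192.168.66.124"))
# → 192.168.66.124
```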
Description
Martin André
2020-11-04 13:21:27 UTC
*** Bug 1905134 has been marked as a duplicate of this bug. ***

Hi, I'm facing this issue in 4.6 IPI on VMware. Is this flagged as an error to be solved? If yes, since 4.6 is EUS, will it be fixed in 4.6? Thank you. A further comment: this is not limited to two subnets; it could be one subnet for masters, one subnet for infra, and several subnets for workers. Perhaps it should be configurable which nodes are used for the VIPs.

Hi Javier, there are currently no plans to backport these changes to 4.6 and earlier versions unless there is a strong business case, as it would require a lot of testing. There is also work going on to make this architecture more flexible: https://github.com/openshift/enhancements/pull/524.

Hi Martin, I see this as a bug in the current version. As 4.6 is still in full support, I assumed it would be fixed in 4.6, also taking into account that 4.6 is EUS. Being able to install in different VLANs is a feature of OpenShift, and it is not working with IPI. I have no business case other than this: customers with the requirement of using different VLANs want to use 4.6 to have a period of stability due to the EUS, and this error prevents them from using IPI.

Hi, I have created the following MachineConfig. It overwrites the files related to the keepalived static pod and its configuration, and also disables the nodeip-configuration.service unit.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: masters-chrony-configuration
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Writes IP address configuration so that kubelet and crio services select a valid node IP
          # This only applies to VIP managing environments where the kubelet and crio IP
          # address picking logic is flawed and may end up selecting an address from a
          # different subnet or a deprecated address
          Wants=network-online.target
          After=network-online.target ignition-firstboot-complete.service
          Before=kubelet.service crio.service

          [Service]
          # Need oneshot to delay kubelet
          Type=oneshot
          # Would prefer to do Restart=on-failure instead of this bash retry loop, but
          # the version of systemd we have right now doesn't support it. It should be
          # available in systemd v244 and higher.
          ExecStart=/bin/bash -c " \
            until \
            /usr/bin/podman run --rm \
            --authfile /var/lib/kubelet/config.json \
            --net=host \
            --volume /etc/systemd/system:/etc/systemd/system:z \
            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:44dffacfd1b61252df317adcdd1b549c06dfd9d98436adce60fd9a6ec72c7f97 \
            node-ip \
            set --retry-on-failure \
            192.168.1.247; \
            do \
            sleep 5; \
            done"
          ExecStart=/bin/systemctl daemon-reload

          [Install]
          WantedBy=multi-user.target
        enabled: false
        name: nodeip-configuration.service
    storage:
      files:
      - filesystem: root
        overwrite: true
        path: "/etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl"
        contents:
          source: data:,foo
        mode: 420
      - filesystem: root
        overwrite: true
        path: "/etc/kubernetes/manifests/keepalived.yaml"
        contents:
          source: data:,kind%3A%20Pod%0AapiVersion%3A%20v1%0Ametadata%3A%0A%20%20name%3A%20foo-keepalived%0A%20%20namespace%3A%20openshift-vsphere-infra%20%0A%20%20labels%3A%0A%20%20%20%20app%3A%20vsphere-infra-vrrp%0Aspec%3A%0A%20%20containers%3A%0A%20%20-%20name%3A%20foo-keepalived%20%20%20%20%0A%20%20%20%20image%3A%20docker.io%2Fbusybox%20%20%0A%20%20hostNetwork%3A%20true%0A%20%20tolerations%3A%0A%20%20-%20operator%3A%20Exists%0A%20%20priorityClassName%3A%20system-node-critical
        mode: 420
  osImageURL: ""

I overwrite the files because I didn't find how to remove a file using an Ignition config. With this MachineConfig the worker nodes start fine. There is no VIP for workers, but that is not a problem: a load balancer is needed anyway to be productive, and the Ingress VIP is only "temporary" for the installation. What do you think, could this be a valid/supported workaround for OCP 4.6? Thank you.

Apart from some typos, I forgot to mention that I tested it by creating a cluster with 0 workers and master scheduling disabled. After the masters start, I deploy the MachineConfig and create machinesets using a network different from the masters' network. It could also be possible to add it to the <install folder>/openshift folder, to deploy it during cluster creation. In case an Ingress VIP is mandatory, one could avoid applying this MachineConfig to a pair of worker nodes deployed in the same VLAN as the masters, and apply it only to the workers deployed in different VLANs.

Checked with 4.7.0-0.nightly-2020-12-17-201522, and it works well now.
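The keepalived.yaml replacement in the workaround is injected through an Ignition data: URL, with the file body percent-encoded. A minimal sketch of producing such a value (assuming Python; the manifest content here is just the stub pod from the comment):

```python
from urllib.parse import quote

# Stub pod manifest, as in the workaround MachineConfig (illustrative content).
pod_manifest = """\
kind: Pod
apiVersion: v1
metadata:
  name: foo-keepalived
  namespace: openshift-vsphere-infra
"""

# Ignition file sources accept RFC 2397 data URLs; percent-encode the body.
# quote() leaves '/' unescaped by default, which data: URLs tolerate.
data_url = "data:," + quote(pod_manifest)
print(data_url)
# → data:,kind%3A%20Pod%0AapiVersion%3A%20v1%0A...
```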
$ oc get machineset -A
NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   wj47ios1218a-rnntt-addit-0    1         1         1       1           8m9s
openshift-machine-api   wj47ios1218a-rnntt-worker-0   3         3         3       3           96m

$ oc get nodes -o wide
NAME                                STATUS   ROLES    AGE   VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios1218a-rnntt-addit-0-drz5s    Ready    worker   59s   v1.20.0+87544c5   192.168.66.124   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-master-0         Ready    master   94m   v1.20.0+87544c5   192.168.0.79     <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-master-1         Ready    master   94m   v1.20.0+87544c5   192.168.3.90     <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-master-2         Ready    master   94m   v1.20.0+87544c5   192.168.3.194    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-worker-0-ftt9z   Ready    worker   80m   v1.20.0+87544c5   192.168.0.250    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-worker-0-gtsns   Ready    worker   80m   v1.20.0+87544c5   192.168.0.108    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios1218a-rnntt-worker-0-zxcbd   Ready    worker   80m   v1.20.0+87544c5   192.168.1.103    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012171642-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633