Bug 2002495

Summary: Deploying OCP on RHOS with assisted-installer: failing validation blocks deployment
Product: OpenShift Container Platform Reporter: Alexander Chuzhoy <sasha>
Component: assisted-installer Assignee: Eran Cohen <ercohen>
assisted-installer sub component: assisted-service QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: urgent    
Priority: medium CC: ercohen, jkilzi, jtomasek, mfilanov, tjelinek, venkatasubramanian.b
Version: 4.9 Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-28 08:45:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alexander Chuzhoy 2021-09-09 02:08:59 UTC
Version:
Release tag
stable
Assisted Installer UI version
quay.io/ocpmetal/ocp-metal-ui:5f73c3c37938163c99d5559a27accd027eba3e40
Assisted Installer UI library version
1.5.35
Assisted Installer
quay.io/ocpmetal/assisted-installer:3673218609bec42b6cf64e2d81152e2cb25ced91
Assisted Installer Controller
quay.io/ocpmetal/assisted-installer-controller:3673218609bec42b6cf64e2d81152e2cb25ced91
assistedInstallerService
quay.io/ocpmetal/assisted-service:ae1fe9b323a2ba70e32cde08bde87aa93d707897
Discovery Agent
quay.io/ocpmetal/assisted-installer-agent:60ac74ef05e45fd612222f3bd17f0b148d346d98

OCP: 4.9.0-rc.0


Trying to deploy OCP on RHOS.

The discovered instances have "insufficient" status.
Platform: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.

The network configuration step, where user-managed networking could be selected, comes after this step, and the wizard requires the nodes not to be in the "insufficient" state before it allows moving on.

Comment 2 Michael Filanov 2021-09-09 05:55:53 UTC
cc: @jtomasek @tjelinek, can you please take a look?
We enabled installing on OpenStack when users set the platform to none, but it looks like this collides with the UI wizard steps.

Comment 3 Jonathan Kilzi 2021-09-09 10:12:25 UTC
@mfilanov, if the `valid-platform` host validation is not in one of the states `disabled` or `success`, and it has not been explicitly marked in the UI as a `softValidation`, it fails the validation check and prevents the user from moving to the next wizard step.
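
The rule described in this comment can be sketched roughly as follows. This is only an illustration in Python, not the actual UI code (which is not shown in this bug); the function name `blocks_next_step` and the `soft_validation_ids` parameter are hypothetical.

```python
# Hypothetical sketch of the gating rule described in this comment.
# Statuses that never block the wizard, according to the comment.
NON_BLOCKING_STATUSES = {"disabled", "success"}


def blocks_next_step(validation, soft_validation_ids=frozenset()):
    """Return True if a single validation should keep the user on the current step."""
    if validation["status"] in NON_BLOCKING_STATUSES:
        return False
    # Validations marked as soft are still reported but do not block.
    return validation["id"] not in soft_validation_ids


# The failing validation reported in this bug.
valid_platform = {
    "id": "valid-platform",
    "status": "failure",
    "message": "Platform OpenStack Compute is allowed only for "
               "Single Node OpenShift or user-managed networking",
}

print(blocks_next_step(valid_platform))                      # True
print(blocks_next_step(valid_platform, {"valid-platform"}))  # False
```

Under this reading, any status other than `disabled` or `success` (including `pending`) blocks unless the validation is treated as soft.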

@sasha, do you have an environment where we can reproduce this? I would like to see what we receive from the backend when polling /v1/clusters/:cluster_id

Comment 4 Eran Cohen 2021-09-09 13:05:30 UTC
This is the cluster validation info:

{'configuration': [{'id': 'pull-secret-set',
                    'message': 'The pull secret is set.',
                    'status': 'success'}],
 'hosts-data': [{'id': 'all-hosts-are-ready-to-install',
                 'message': 'The cluster has hosts that are not ready to '
                            'install.',
                 'status': 'failure'},
                {'id': 'sufficient-masters-count',
                 'message': 'The cluster has a sufficient number of master '
                            'candidates.',
                 'status': 'success'}],
 'network': [{'id': 'api-vip-defined',
              'message': 'The API virtual IP is undefined; IP allocation from '
                         'the DHCP server timed out.',
              'status': 'failure'},
             {'id': 'api-vip-valid',
              'message': 'The API virtual IP is undefined.',
              'status': 'pending'},
             {'id': 'cluster-cidr-defined',
              'message': 'The Cluster Network CIDR is defined.',
              'status': 'success'},
             {'id': 'dns-domain-defined',
              'message': 'The base domain is defined.',
              'status': 'success'},
             {'id': 'ingress-vip-defined',
              'message': 'The Ingress virtual IP is undefined; IP allocation '
                         'from the DHCP server timed out.',
              'status': 'failure'},
             {'id': 'ingress-vip-valid',
              'message': 'The Ingress virtual IP is undefined.',
              'status': 'pending'},
             {'id': 'machine-cidr-defined',
              'message': 'The Machine Network CIDR is defined.',
              'status': 'success'},
             {'id': 'machine-cidr-equals-to-calculated-cidr',
              'message': 'The Machine Network CIDR, API virtual IP, or Ingress '
                         'virtual IP is undefined.',
              'status': 'pending'},
             {'id': 'network-prefix-valid',
              'message': 'The Cluster Network prefix is valid.',
              'status': 'success'},
             {'id': 'network-type-valid',
              'message': 'The cluster has a valid network type',
              'status': 'success'},
             {'id': 'no-cidrs-overlapping',
              'message': 'No CIDRS are overlapping.',
              'status': 'success'},
             {'id': 'ntp-server-configured',
              'message': 'No ntp problems found',
              'status': 'success'},
             {'id': 'service-cidr-defined',
              'message': 'The Service Network CIDR is defined.',
              'status': 'success'}],
 'operators': [{'id': 'cnv-requirements-satisfied',
                'message': 'cnv is disabled',
                'status': 'success'},
               {'id': 'lso-requirements-satisfied',
                'message': 'lso is disabled',
                'status': 'success'},
               {'id': 'ocs-requirements-satisfied',
                'message': 'ocs is disabled',
                'status': 'success'}]}


And this is the host validation info:
{'hardware': [{'id': 'has-inventory',
               'message': 'Valid inventory exists for the host',
               'status': 'success'},
              {'id': 'has-min-cpu-cores',
               'message': 'Sufficient CPU cores',
               'status': 'success'},
              {'id': 'has-min-memory',
               'message': 'Sufficient minimum RAM',
               'status': 'success'},
              {'id': 'has-min-valid-disks',
               'message': 'Sufficient disk capacity',
               'status': 'success'},
              {'id': 'has-cpu-cores-for-role',
               'message': 'Sufficient CPU cores for role master',
               'status': 'success'},
              {'id': 'has-memory-for-role',
               'message': 'Sufficient RAM for role master',
               'status': 'success'},
              {'id': 'hostname-unique',
               'message': 'Hostname '
                          'ci-vm-10-0-97-34.hosted.upshift.rdu2.redhat.com is '
                          'unique in cluster',
               'status': 'success'},
              {'id': 'hostname-valid',
               'message': 'Hostname '
                          'ci-vm-10-0-97-34.hosted.upshift.rdu2.redhat.com is '
                          'allowed',
               'status': 'success'},
              {'id': 'valid-platform',
               'message': 'Platform OpenStack Compute is allowed only for '
                          'Single Node OpenShift or user-managed networking',
               'status': 'failure'},
              {'id': 'sufficient-installation-disk-speed',
               'message': 'Speed of installation disk has not yet been '
                          'measured',
               'status': 'success'},
              {'id': 'compatible-with-cluster-platform',
               'message': 'Host is compatible with cluster platform baremetal',
               'status': 'success'}],
 'network': [{'id': 'connected',
              'message': 'Host is connected',
              'status': 'success'},
             {'id': 'machine-cidr-defined',
              'message': 'Machine Network CIDR is defined',
              'status': 'success'},
             {'id': 'belongs-to-machine-cidr',
              'message': 'Host belongs to all machine network CIDRs',
              'status': 'success'},
             {'id': 'belongs-to-majority-group',
              'message': 'Host has connectivity to the majority of hosts in '
                         'the cluster',
              'status': 'success'},
             {'id': 'ntp-synced',
              'message': "Host couldn't synchronize with any NTP server",
              'status': 'failure'},
             {'id': 'container-images-available',
              'message': 'All required container images were either pulled '
                         'successfully or no attempt was made to pull them',
              'status': 'success'},
             {'id': 'sufficient-network-latency-requirement-for-role',
              'message': 'Network latency requirement has been satisfied.',
              'status': 'success'},
             {'id': 'sufficient-packet-loss-requirement-for-role',
              'message': 'Packet loss requirement has been satisfied.',
              'status': 'success'},
             {'id': 'has-default-route',
              'message': 'Host has been configured with at least one default '
                         'route.',
              'status': 'success'},
             {'id': 'api-domain-name-resolved-correctly',
              'message': 'Domain name resolution is not required (managed '
                         'networking)',
              'status': 'success'},
             {'id': 'api-int-domain-name-resolved-correctly',
              'message': 'Domain name resolution is not required (managed '
                         'networking)',
              'status': 'success'},
             {'id': 'apps-domain-name-resolved-correctly',
              'message': 'Domain name resolution is not required (managed '
                         'networking)',
              'status': 'success'},
             {'id': 'dns-wildcard-not-configured',
              'message': 'DNS wildcard check was successful',
              'status': 'success'}],
 'operators': [{'id': 'cnv-requirements-satisfied',
                'message': 'cnv is disabled',
                'status': 'success'},
               {'id': 'lso-requirements-satisfied',
                'message': 'lso is disabled',
                'status': 'success'},
               {'id': 'ocs-requirements-satisfied',
                'message': 'ocs is disabled',
                'status': 'success'}]}
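
For readers skimming the dumps above, a small helper along these lines can pull out only the failing entries. It is a sketch that assumes the validation info has already been loaded into a Python dict shaped like the output above (group name mapped to a list of `id`/`message`/`status` entries); the function name `failing_validations` is made up for illustration, and `host_info` is a trimmed copy of the host data, not the full dump.

```python
def failing_validations(validation_info):
    """Yield (group, id, message) for every validation whose status is 'failure'."""
    for group, validations in validation_info.items():
        for v in validations:
            if v["status"] == "failure":
                yield group, v["id"], v["message"]


# Trimmed copy of the host validation info shown above.
host_info = {
    "hardware": [
        {"id": "has-inventory",
         "message": "Valid inventory exists for the host",
         "status": "success"},
        {"id": "valid-platform",
         "message": "Platform OpenStack Compute is allowed only for "
                    "Single Node OpenShift or user-managed networking",
         "status": "failure"},
    ],
    "network": [
        {"id": "connected",
         "message": "Host is connected",
         "status": "success"},
        {"id": "ntp-synced",
         "message": "Host couldn't synchronize with any NTP server",
         "status": "failure"},
    ],
}

for group, validation_id, message in failing_validations(host_info):
    print(f"{group}/{validation_id}: {message}")
```

For the host above this reports `valid-platform` in the "hardware" group and `ntp-synced` in the "network" group.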

@jkilzi if we move the platform validation from "hardware" to "network" (technically, we will split the logic and add a new validation), will the user be able to get to the network part in order to set "user managed networking"?

Comment 5 Jonathan Kilzi 2021-09-09 14:06:09 UTC
(In reply to Eran Cohen from comment #4)
> @jkilzi if we move (technically we will split the logic and add a
> new validation) the platform validation from "hardware" to "network" will
> the user be able to get to the network part in order to set the "user managed
> networking"?

Yes. As I mentioned before, at the hosts-discovery step we only pay attention to the 'hardware' group. The 'network' group is evaluated at the networking step.
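
To make the effect of the proposed move concrete, here is a small Python sketch. The step-to-group mapping only encodes what is stated in this comment (host discovery looks at 'hardware', networking looks at 'network'); the names `STEP_GROUPS` and `can_leave_step` and the step labels are made up for illustration. The real change would add a new validation whose id is not given in this bug, so the sketch simply reports `valid-platform` under a different group, and it ignores soft validations for brevity.

```python
# Which validation groups each wizard step evaluates, as stated in this comment.
# The step names are labels chosen for this sketch.
STEP_GROUPS = {
    "host-discovery": ["hardware"],
    "networking": ["network"],
}

# Statuses that do not block the wizard.
NON_BLOCKING_STATUSES = ("success", "disabled")


def can_leave_step(step, validation_info):
    """True if every validation in the groups for `step` is non-blocking."""
    return all(
        v["status"] in NON_BLOCKING_STATUSES
        for group in STEP_GROUPS[step]
        for v in validation_info.get(group, [])
    )


platform_check = {"id": "valid-platform", "status": "failure"}

# Current behaviour: the platform check is in 'hardware' and blocks host discovery.
before = {"hardware": [platform_check], "network": []}
# Proposal: the platform check is reported under 'network' instead.
after = {"hardware": [], "network": [platform_check]}

print(can_leave_step("host-discovery", before))  # False
print(can_leave_step("host-discovery", after))   # True
print(can_leave_step("networking", after))       # False
```

With the check in the 'network' group, the user can reach the networking step, where the failure is still reported until user-managed networking is set.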

Comment 10 Alexander Chuzhoy 2021-09-30 17:11:48 UTC
FailedQA

Tried to verify with Assisted-ui-lib version:  1.5.36-2

The issue still persists.

Comment 11 Eran Cohen 2021-10-06 12:52:00 UTC
I am not sure the assisted-ui-lib has anything to do with it.
Did it fail QA on staging?
Which assisted-service version did you test?

Comment 12 Alexander Chuzhoy 2021-10-07 16:38:44 UTC
Re-tested on staging and it works as expected now.
Assisted-ui-lib version:  1.5.37

Comment 14 Venkat B 2022-05-10 15:53:44 UTC
Hi,

In our Assisted Installer (AI) based cluster deployment with 3 control plane nodes and 2 workers, the hosts do not reach the Ready state, so we cannot proceed with the installation. The host status remains Insufficient even after NTP sync succeeds.

Output from one of the cluster nodes
==============================================================================
$ chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 192.168.10.4                   3  10   377   947   -292us[ -295us] +/-   92ms

$ timedatectl
               Local time: Tue 2022-05-10 15:51:51 UTC
           Universal time: Tue 2022-05-10 15:51:51 UTC
                 RTC time: Tue 2022-05-10 15:51:51
                Time zone: UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no


Cluster events :
============================================================================================

5/10/2022, 7:21:04 PM	Updated status of the cluster to insufficient
5/10/2022, 7:21:04 PM	Cluster validation 'api-vip-defined' is now fixed
5/10/2022, 7:07:40 PM	Cluster validation 'ntp-server-configured' is now fixed
5/10/2022, 7:07:38 PM	Host sl12345.net: validation 'ntp-synced' is now fixed
5/10/2022, 7:07:14 PM	Host sl12346.net: validation 'ntp-synced' is now fixed
5/10/2022, 7:06:58 PM	Host sl12347.net: validation 'ntp-synced' is now fixed
5/10/2022, 7:06:34 PM	Host sl12348.net: validation 'ntp-synced' is now fixed
5/10/2022, 7:05:38 PM	Cluster validation 'sufficient-masters-count' is now fixed
5/10/2022, 7:05:38 PM	warning: Host sl12345.net: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server ; No connectivity to the majority of hosts in the cluster)
5/10/2022, 7:05:38 PM	Host sl12349.net: validation 'ntp-synced' is now fixed
5/10/2022, 7:05:14 PM	warning: Host sl12346.net: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server ; No connectivity to the majority of hosts in the cluster)
5/10/2022, 7:04:58 PM	warning: Host sl12347.net: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server ; No connectivity to the majority of hosts in the cluster)
5/10/2022, 7:04:54 PM	Host 67bb1e80-ccb6-2902-bebd-c722391c6b27: Successfully registered
5/10/2022, 7:04:39 PM	warning: Cluster validation 'ntp-server-configured' that used to succeed is now failing
5/10/2022, 7:04:34 PM	warning: Host sl12348.net: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server)
5/10/2022, 7:04:30 PM	Host 90570f72-619d-2327-e948-7a1ac68387b6: Successfully registered
5/10/2022, 7:03:38 PM	warning: Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing
5/10/2022, 7:03:38 PM	warning: Host sl12349.net: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server)


Validation info from the cluster :
=================================================================================================================

{
  "configuration": [
    {
      "id": "pull-secret-set",
      "status": "success",
      "message": "The pull secret is set."
    }
  ],
  "hosts-data": [
    {
      "id": "all-hosts-are-ready-to-install",
      "status": "failure",
      "message": "The cluster has hosts that are not ready to install."
    },
    {
      "id": "sufficient-masters-count",
      "status": "success",
      "message": "The cluster has a sufficient number of master candidates."
    }
  ],
  "network": [
    {
      "id": "api-vip-defined",
      "status": "success",
      "message": "The API virtual IP is defined."
    },
    {
      "id": "api-vip-valid",
      "status": "success",
      "message": "api vip 192.168.10.40 belongs to the Machine CIDR and is not in use."
    },
    {
      "id": "cluster-cidr-defined",
      "status": "success",
      "message": "The Cluster Network CIDR is defined."
    },
    {
      "id": "dns-domain-defined",
      "status": "success",
      "message": "The base domain is defined."
    },
    {
      "id": "ingress-vip-defined",
      "status": "success",
      "message": "The Ingress virtual IP is defined."
    },
    {
      "id": "ingress-vip-valid",
      "status": "success",
      "message": "ingress vip 192.168.10.41 belongs to the Machine CIDR and is not in use."
    },
    {
      "id": "machine-cidr-defined",
      "status": "success",
      "message": "The Machine Network CIDR is defined."
    },
    {
      "id": "machine-cidr-equals-to-calculated-cidr",
      "status": "success",
      "message": "The Cluster Machine CIDR is equivalent to the calculated CIDR."
    },
    {
      "id": "network-prefix-valid",
      "status": "success",
      "message": "The Cluster Network prefix is valid."
    },
    {
      "id": "network-type-valid",
      "status": "success",
      "message": "The cluster has a valid network type"
    },
    {
      "id": "networks-same-address-families",
      "status": "success",
      "message": "Same address families for all networks."
    },
    {
      "id": "no-cidrs-overlapping",
      "status": "success",
      "message": "No CIDRS are overlapping."
    },
    {
      "id": "ntp-server-configured",
      "status": "success",
      "message": "No ntp problems found"
    },
    {
      "id": "service-cidr-defined",
      "status": "success",
      "message": "The Service Network CIDR is defined."
    }
  ],
  "operators": [
    {
      "id": "cnv-requirements-satisfied",
      "status": "success",
      "message": "cnv is disabled"
    },
    {
      "id": "lso-requirements-satisfied",
      "status": "success",
      "message": "lso is disabled"
    },
    {
      "id": "odf-requirements-satisfied",
      "status": "success",
      "message": "odf is disabled"
    }
  ]
}


Used following images :
==============================================================================================

quay.io/edge-infrastructure/postgresql-12-centos7:0.3.25
quay.io/edge-infrastructure/assisted-service:v2.3.1
quay.io/edge-infrastructure/assisted-installer-ui:v2.3.9
quay.io/edge-infrastructure/assisted-image-service:v2.3.1
quay.io/edge-infrastructure/assisted-installer-agent:v2.3.1
quay.io/edge-infrastructure/assisted-installer:v2.3.1
quay.io/edge-infrastructure/assisted-installer-controller:v2.3.1

With this we are unable to proceed with the installation. How should we proceed?