Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1920867

Summary: OCP 4.7 on Z fails to install for KVM when specifying networkType OVNKubernetes in the install-config.yaml file
Product: OpenShift Container Platform
Reporter: krmoser
Component: Multi-Arch
Assignee: Dennis Gilmore <dgilmore>
Status: CLOSED NOTABUG
QA Contact: Barry Donahue <bdonahue>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.7
CC: amccrae, cbaus, chanphil, christian.lapolt, Holger.Wolf, krmoser, psundara, tdale, wvoesch
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-29 13:06:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1903544    
Attachments:
1. OCP 4.7 install with networkType "OpenShiftSDN" (Flags: none)
2. OCP 4.7 install with networkType "OVNKubernetes" (Flags: none)

Description krmoser 2021-01-27 07:16:26 UTC
Description of problem:
1. OCP 4.7 on Z fails to install for KVM when specifying networkType "OVNKubernetes" in the install-config.yaml file.

2. OCP 4.7 on Z successfully installs for zVM when specifying networkType "OVNKubernetes" in the install-config.yaml file.

3. The exact same OCP 4.7 on Z builds successfully install for KVM when specifying (the default) networkType "OpenShiftSDN" value in the install-config.yaml file.


Version-Release number of selected component (if applicable):
4.7.0-0.nightly-s390x-2021-01-24-004935; three to four additional recent OCP 4.7 builds were also tested.

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Update the networkType value to OVNKubernetes in the install-config.yaml file.
2. Proceed with OCP 4.7 on Z KVM cluster installation.
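The edit in step 1 is a one-line change in the networking stanza of install-config.yaml, excerpted here from the full file quoted later in this report:

```yaml
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes   # changed from the default "OpenShiftSDN"
  serviceNetwork:
  - 172.30.0.0/16
```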


Actual results:
1. The OCP 4.7 on Z build fails to complete the installation, even after 10+ hours of observation.

2. The authentication, console, and kube-apiserver cluster operators do not achieve AVAILABLE status.

3. The kube-controller-manager cluster operator is in a degraded state.


4. Actual OCP 4.7 on Z cluster install information after 10+ hours:

#  ssh 192.168.79.1 oc get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
master-0.pok-243.ocptest.pok.stglabs.ibm.com   Ready    master   10h   v1.20.0+70dd98e
master-1.pok-243.ocptest.pok.stglabs.ibm.com   Ready    master   10h   v1.20.0+70dd98e
master-2.pok-243.ocptest.pok.stglabs.ibm.com   Ready    master   10h   v1.20.0+70dd98e
worker-0.pok-243.ocptest.pok.stglabs.ibm.com   Ready    worker   10h   v1.20.0+70dd98e
worker-1.pok-243.ocptest.pok.stglabs.ibm.com   Ready    worker   10h   v1.20.0+70dd98e
[root@t90ocp3 ~]#  ssh 192.168.79.1 oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          10h     Unable to apply 4.7.0-0.nightly-s390x-2021-01-24-004935: an unknown error has occurred: MultipleErrors
#  ssh 192.168.79.1 oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-s390x-2021-01-24-004935   False       False         True       10h
baremetal                                  4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
cloud-credential                           4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
cluster-autoscaler                         4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
config-operator                            4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
console                                    4.7.0-0.nightly-s390x-2021-01-24-004935   False       True          True       10h
csi-snapshot-controller                    4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
dns                                        4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
etcd                                       4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
image-registry                             4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
ingress                                    4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
insights                                   4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
kube-apiserver                             4.7.0-0.nightly-s390x-2021-01-24-004935   False       True          True       10h
kube-controller-manager                    4.7.0-0.nightly-s390x-2021-01-24-004935   True        True          True       10h
kube-scheduler                             4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
kube-storage-version-migrator              4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
machine-api                                4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
machine-approver                           4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
machine-config                             4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
marketplace                                4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
monitoring                                 4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
network                                    4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
node-tuning                                4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
openshift-apiserver                        4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
openshift-controller-manager               4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
openshift-samples                          4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
operator-lifecycle-manager                 4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
service-ca                                 4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
storage                                    4.7.0-0.nightly-s390x-2021-01-24-004935   True        False         False      10h
#

5. [root@bastion ocp4-workdir]# cat install-config.copy.yaml
apiVersion: v1
baseDomain: "ocptest.pok.stglabs.ibm.com"
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: "pok-243"
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: <not included>


6. # virsh list
 Id   Name          State
-----------------------------
 2    bastion       running
 3    bootstrap-0   running
 4    master-0      running
 5    master-1      running
 6    master-2      running
 7    infnod-0      running
 8    infnod-1      running






Expected results:
1. OCP 4.7 on Z should install successfully for KVM when networkType "OVNKubernetes" is specified in the install-config.yaml file.


Additional info:



Thank you.

Comment 1 Carvel Baus 2021-01-27 13:01:35 UTC
Can you provide the oc adm must-gather output?

Comment 2 krmoser 2021-01-28 08:04:17 UTC
Carvel,

Please find attached 2 oc adm must-gather tar.gz files:

1. 4.7.0-0.nightly-s390x-2021-01-27-164108.must-gather.local.2636577120531980503.tar.gz for OCP build 4.7.0-0.nightly-s390x-2021-01-27-164108 with the networkType "OpenShiftSDN" value.  This installation completes successfully.

2. 4.7.0-0.nightly-s390x-2021-01-27-164108.must-gather.local.5719890522243309048.OVNKubenetes.tar.gz for OCP build 4.7.0-0.nightly-s390x-2021-01-27-164108 with the networkType "OVNKubernetes" value.  This installation fails with the issue described in my initial post.


Thank you,
Kyle

Comment 3 krmoser 2021-01-28 08:14:34 UTC
Created attachment 1751605 [details]
OCP 4.7 install with networkType "OpenShiftSDN"

oc adm must-gather for networkType "OpenShiftSDN": installs successfully

Comment 4 krmoser 2021-01-28 08:41:54 UTC
Created attachment 1751611 [details]
OCP 4.7 install with networkType "OVNKubernetes"

oc adm must-gather for networkType "OVNKubernetes": fails to complete installation

Comment 5 Prashanth Sundararaman 2021-01-28 22:07:00 UTC
Summarizing the discussion with Kyle today:

This problem appears to be caused by the master nodes being allocated only 8 GB of memory. We were able to reproduce the issue, and it did not occur once the memory was raised to 16 GB, which is the recommended default memory setting for OCP masters. The symptom pointing to insufficient memory was log entries like this one:

Jan 28 15:10:34 root-ctlplane-2 hyperkube[1791]: W0128 15:10:34.934196    1791 predicate.go:113] Failed to admit pod kube-apiserver-root-ctlplane-2_openshift-kube-apiserver(dd3c4e54-b16e-4625-911d-a6fcb30e888d) - Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: memory, q: 124293632),

This indicates that Kubernetes tried to evict lower-priority pods to make room for higher-priority ones; in this case it found no running pods it could evict, so the apiserver never started on some of the masters.
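The remediation described above can be sketched as a pre-install memory check. This is an illustrative sketch, not part of the reported procedure: the 8 GiB value is hard-coded to mirror the failing setup, whereas on a real KVM host it would come from virsh dominfo for each master domain.

```shell
# Sketch (POSIX sh): check a master guest's memory against the
# recommended 16 GiB minimum before an OVNKubernetes install.
MIN_KIB=$((16 * 1024 * 1024))      # recommended 16 GiB, expressed in KiB
current_kib=$((8 * 1024 * 1024))   # 8 GiB, mirroring the failing masters;
                                   # in practice, read from: virsh dominfo <domain>

if [ "$current_kib" -lt "$MIN_KIB" ]; then
  echo "memory ${current_kib} KiB is below the recommended ${MIN_KIB} KiB; resize the guest"
fi
```

On a libvirt host the undersized guest would then be resized persistently with `virsh setmaxmem <domain> 16G --config` and `virsh setmem <domain> 16G --config`, followed by a guest restart so the new allocation takes effect.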


Kyle,

Do you agree that this bug can be closed? 

Thanks
Prashanth

Comment 6 krmoser 2021-01-29 12:06:50 UTC
Prashanth,

Thank you again to Andy and you for your assistance.  Yes, please close this bug.

FYI: using the appropriate master node memory size (16 GB), we've successfully tested the following 10 OCP 4.7 on Z builds with the networkType: OVNKubernetes install option.
 1. 4.7.0-0.nightly-s390x-2021-01-27-164108
 2. 4.7.0-0.nightly-s390x-2021-01-28-005008
 3. 4.7.0-0.nightly-s390x-2021-01-28-023813
 4. 4.7.0-0.nightly-s390x-2021-01-28-052030
 5. 4.7.0-0.nightly-s390x-2021-01-28-064716
 6. 4.7.0-0.nightly-s390x-2021-01-28-084706
 7. 4.7.0-0.nightly-s390x-2021-01-28-113553
 8. 4.7.0-0.nightly-s390x-2021-01-28-140116
 9. 4.7.0-0.nightly-s390x-2021-01-28-192809
10. 4.7.0-0.nightly-s390x-2021-01-28-220317

Thank you,
Kyle