Bug 1856418 - ppc64le API installer pods oom-killed on deployment
Summary: ppc64le API installer pods oom-killed on deployment
Keywords:
Status: CLOSED DUPLICATE of bug 1812126
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.4
Hardware: ppc64le
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dennis Gilmore
QA Contact: Barry Donahue
URL:
Whiteboard:
Depends On: 1809593
Blocks:
Reported: 2020-07-13 15:07 UTC by Rob Gregory
Modified: 2023-12-15 18:26 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1809593
Environment:
Last Closed: 2020-07-31 19:08:12 UTC
Target Upstream Version:
Embargoed:


Description Rob Gregory 2020-07-13 15:07:33 UTC
+++ This bug was initially created as a clone of Bug #1809593 +++

Potentially a revision of BZ#1809593

Description of problem:
Many pods are OOM-killed, pending, or in an error state on deployment.


Version-Release number of selected component (if applicable):
OCP 4.3
OCP 4.4.9
cri-o-1.17.4-17.dev.rhaos4.4.gitf0cfdfc.el8.ppc64le

How reproducible:
Every deployment attempt

Steps to Reproduce:
1. Deploy following https://docs.openshift.com/container-platform/4.4/installing/installing_ibm_power/installing-ibm-power.html#machine-requirements_installing-ibm-power
2. Run: oc get pods -A | grep -v Running | grep -v Completed

Actual results:
API installer pods are oom-killed; many other pods stuck in pending or error state

Expected results:
Successful deployment 

Additional info:

oc get po -A | grep -v Running | grep -v Completed
===================================================
NAMESPACE                                               NAME                                                              READY   STATUS      RESTARTS   AGE
openshift-apiserver                                     apiserver-577c8bffdb-8vj8n                                        0/1     Pending     0          10m
openshift-apiserver                                     apiserver-577c8bffdb-b7b5n                                        0/1     Pending     0          10m
openshift-apiserver                                     apiserver-577c8bffdb-k29l8                                        0/1     Pending     0          10m
openshift-cluster-node-tuning-operator                  tuned-8hfdx                                                       0/1     Pending     0          2m17s
openshift-cluster-node-tuning-operator                  tuned-qfs94                                                       0/1     Pending     0          2m14s
openshift-cluster-storage-operator                      csi-snapshot-controller-operator-7d76c9d7cf-dgzvs                 0/1     Pending     0          20m
openshift-etcd                                          installer-2-master3                                               0/1     Error       0          9m24s
openshift-kube-apiserver                                installer-2-master3                                               0/1     Error       0          9m53s
openshift-kube-apiserver                                installer-3-master3                                               0/1     OOMKilled   0          9m17s
openshift-kube-apiserver                                revision-pruner-2-master2                                         0/1     OOMKilled   0          9m22s
openshift-kube-apiserver                                revision-pruner-2-master3                                         0/1     Error       0          9m18s
openshift-kube-controller-manager                       installer-4-master3                                               0/1     Error       0          9m54s
openshift-kube-controller-manager                       installer-5-master2                                               0/1     OOMKilled   0          7m11s
openshift-kube-controller-manager                       revision-pruner-4-master1                                         0/1     OOMKilled   0          9m34s
openshift-kube-controller-manager                       revision-pruner-4-master2                                         0/1     OOMKilled   0          9m32s
openshift-kube-controller-manager                       revision-pruner-5-master2                                         0/1     Error       0          5m43s
openshift-kube-controller-manager                       revision-pruner-5-master3                                         0/1     OOMKilled   0          8m48s
openshift-kube-scheduler                                installer-5-master3                                               0/1     Error       0          11m
openshift-kube-scheduler                                installer-6-master3                                               0/1     Error       0          9m31s
openshift-kube-scheduler                                revision-pruner-4-master2                                         0/1     Error       0          11m
openshift-kube-scheduler                                revision-pruner-5-master3                                         0/1     Error       0          10m
openshift-kube-scheduler                                revision-pruner-6-master3                                         0/1     Error       0          9m21s
openshift-kube-storage-version-migrator                 migrator-f8f8bd9d8-tqrmg                                          0/1     Pending     0          12m
openshift-multus                                        multus-brwdx                                                      0/1     Pending     0          2m17s
openshift-multus                                        multus-kk95s                                                      0/1     Pending     0          2m14s
openshift-sdn                                           ovs-4pdr9                                                         0/1     Pending     0          2m14s
openshift-sdn                                           ovs-cncw7                                                         0/1     Pending     0          2m17s
openshift-sdn                                           sdn-vzd9b                                                         0/1     Pending     0          2m14s
openshift-sdn                                           sdn-zljhn                                                         0/1     Pending     0          2m17s
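
To summarize a listing like the one above, the sketch below tallies pods by STATUS. It is only an illustration: the function name `tally_unhealthy` and the shortened sample text are assumptions, not part of the original report, and it expects the default six-column layout of `oc get pods -A`. Unlike the `grep -v` pipeline, which drops any line containing "Running" or "Completed" anywhere, it checks only the STATUS column.

```python
from collections import Counter


def tally_unhealthy(listing: str) -> Counter:
    """Count pods per STATUS, skipping Running and Completed pods.

    Hypothetical helper: assumes the default column order
    NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, separator rules, and the header row.
        if len(fields) < 6 or fields[0] == "NAMESPACE":
            continue
        status = fields[3]
        if status not in ("Running", "Completed"):
            counts[status] += 1
    return counts


# Shortened sample taken from the listing in this report.
sample = """\
NAMESPACE                  NAME                         READY   STATUS      RESTARTS   AGE
openshift-apiserver        apiserver-577c8bffdb-8vj8n   0/1     Pending     0          10m
openshift-etcd             installer-2-master3          0/1     Error       0          9m24s
openshift-kube-apiserver   installer-3-master3          0/1     OOMKilled   0          9m17s
openshift-kube-apiserver   revision-pruner-2-master2    0/1     OOMKilled   0          9m22s
"""

for status, count in sorted(tally_unhealthy(sample).items()):
    print(f"{status}: {count}")
```

Applied to the full listing above, this would count 13 Pending, 10 Error, and 6 OOMKilled pods.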

Comment 1 Michael Nelson 2020-07-15 08:22:22 UTC
I don't have access to any internal posts on this bugzilla, but I just wanted to mention that there is an external customer who is unable to deploy OCP on his IBM POWER9 system. He opened his original support case on 7 July. If there is any additional data to collect, or any workaround which might help, I'm sure he would appreciate an update sooner rather than later.

Comment 2 Dylan Orzel 2020-07-15 15:04:04 UTC
Hi Rob,

We may need additional clarification here. Are you unable to install any 4.3/4.4 release on ppc64le at all? What specific release version of OCP are you using? Which method are you using to deploy OCP?

Comment 3 Adrian Sorge 2020-07-15 15:44:47 UTC
Hey, 

we tried to install 4.3.18 and 4.4.9 (both latest versions at the time of installation).

We used this specific version from the installer (for 4.4.9 and I ):
openshift-install 4.4.9 built from commit 1541bf917973186bbab6a5f895f08db4334a5d9a release image quay.io/openshift-release-dev/ocp-release@sha256:ae48474c6fcd0666a672ce2a449736a2549693c04186ea588e37477635c976a6

We followed the standard installation guide (https://docs.openshift.com/container-platform/4.4/installing/installing_ibm_power/installing-ibm-power.html) with IPs via DHCP.

Comment 6 Manoj Kumar 2020-07-31 18:59:59 UTC
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1812126

Comment 7 Manoj Kumar 2020-07-31 19:02:12 UTC
This is resolved in 4.5.

If you would like the installation to succeed on 4.3/4.4, the number of cores assigned to each master node should be 2, which with SMT=8 gives 16 threads.
The master nodes can be resized after the initial install.
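
The thread count in the workaround above is simply cores multiplied by the SMT level. A minimal sketch of that arithmetic, with `hardware_threads` as a hypothetical helper name that does not come from any OpenShift tooling:

```python
def hardware_threads(cores: int, smt: int) -> int:
    """Return the number of hardware threads (virtual CPUs) for a node.

    Hypothetical helper: each core exposes `smt` threads, so the total
    is cores * smt.
    """
    if cores < 1 or smt < 1:
        raise ValueError("cores and smt must be positive integers")
    return cores * smt


# Workaround sizing from this comment: 2 cores at SMT=8.
print(hardware_threads(cores=2, smt=8))  # 16
```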

Comment 8 Jeremy Poulin 2020-07-31 19:07:08 UTC
Hi team! This is a duplicate of an issue that was discovered in the 4.3 timeframe, and ultimately fixed in 4.5.

Some additional context from Manoj Kumar:
[This issue is related to] large page size and the kernel/cgroup caches being scaled to the number of virtual CPUs/threads.

The workaround is to install with the default/minimum size of the master node (as in the documentation): 2 cores, on SMT=8.

This information should already be present in the documentation.

Comment 9 Jeremy Poulin 2020-07-31 19:08:12 UTC

*** This bug has been marked as a duplicate of bug 1812126 ***

