Bug 1841414 - OCP 4.4: Installation issues with worker nodes on System z zVM environment
Summary: OCP 4.4: Installation issues with worker nodes on System z zVM environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.4
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Andy McCrae
QA Contact: Barry Donahue
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-29 05:56 UTC by krmoser
Modified: 2020-07-28 12:35 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 12:35:43 UTC
Target Upstream Version:
Embargoed:


Attachments
worker-0 kubelet log part 1 of 2 (13.13 MB, application/gzip), 2020-06-05 18:56 UTC, krmoser
worker-0 kubelet log part 2 of 2 (11.79 MB, application/gzip), 2020-06-05 18:57 UTC, krmoser
worker-1 kubelet log part 1 of 3 (13.34 MB, application/gzip), 2020-06-05 18:58 UTC, krmoser
worker-1 kubelet log part 2 of 3 (13.37 MB, application/gzip), 2020-06-05 18:59 UTC, krmoser
worker-1 kubelet log part 3 of 3 (4.68 MB, application/gzip), 2020-06-05 19:00 UTC, krmoser
master-0 kubelet log part 1 of 5 (17.93 MB, application/gzip), 2020-06-12 05:59 UTC, krmoser
master-0 kubelet log part 2 of 5 (11.79 MB, application/gzip), 2020-06-12 06:02 UTC, krmoser
master-0 kubelet log part 3 of 5 (18.10 MB, application/gzip), 2020-06-12 06:09 UTC, krmoser
master-0 kubelet log part 4 of 5 (18.08 MB, application/gzip), 2020-06-12 06:13 UTC, krmoser
master-0 kubelet log part 5 of 5 (7.60 MB, application/gzip), 2020-06-12 06:15 UTC, krmoser
master-0 kubelet log part 2 of 5 (18.01 MB, application/gzip), 2020-06-12 06:18 UTC, krmoser
bootstrap-0 bootkube service log (2.76 KB, text/plain), 2020-06-12 06:22 UTC, krmoser
bootstrap-0 bootstrap-control-plane logs (1.19 MB, application/x-tar), 2020-06-12 06:23 UTC, krmoser

Description krmoser 2020-05-29 05:56:54 UTC
Description of problem:

When attempting to install an OCP 4.4 on Z zVM-based cluster using the 05-18-2020 or 05-25-2020 OCP 4.4 nightly builds, a serious installation issue occurs when the zVM worker nodes are xautolog booted from their zVM reader kernel.img files within a few seconds after the master nodes are xautolog booted from theirs.  The worker nodes do not properly integrate into the OCP 4.4 cluster: there tend to be pending CSRs and/or "Internal Server Error" responses to GET operations against the bastion server (which can even loop indefinitely).  Even if these CSRs are subsequently approved, the worker nodes' integration into the cluster does not complete, or does not complete properly.

1. The rhcos-44.81.202005180840-0-installer-kernel-s390x and rhcos-44.81.202005180840-0-installer-initramfs.s390x.img components are used.

2. Our team has encountered this issue multiple times, and it is easily reproducible across multiple zVM environments (and a KVM POC environment) on System z.  We are testing on an IBM z15 server.

3. When this sequence of bootstrap, master, and then worker node zVM xautolog reader file boots are performed with only a few seconds between the master and worker nodes, the worker nodes will not properly integrate into the OCP 4.4 cluster, or sometimes not integrate at all without manual intervention.  One condition encountered with the worker nodes is an infinite loop when attempting a GET from the repo server.

4. With OCP 4.2 and 4.3, including OCP 4.3.21, using the same zVM environments, the zVM xautolog sequence of booting the bootstrap, master, and worker nodes in quick succession does not cause any issue with the cluster installation: all master and worker nodes successfully join the cluster, and all cluster operators start successfully with an AVAILABLE status of "True".

5. With OCP 4.4 (unless a potential workaround is employed), this xautolog sequence of booting the bootstrap, master, and worker nodes in quick succession causes installation issues that can be apparent at install time, or later when running workload.  Specifically, when one or both worker nodes in a two-worker-node configuration do not properly join the cluster (i.e., the "oc get nodes" command shows only one worker node with a STATUS of "Ready"), the authentication and console cluster operators may never achieve an AVAILABLE status of "True".

    The authentication cluster operator AVAILABLE status may be at "Unknown" forever.
    The console cluster operator AVAILABLE status may remain at "False" forever.



6. With OCP 4.4, one workaround that has worked relatively well for us has been to xautolog boot the bootstrap and master nodes in quick succession, and then wait approximately 500 seconds before continuing with the xautolog of the worker nodes (see the sketch after this list).

7. This worker node workaround seems to be necessary because the 3 master nodes do not achieve a STATUS of "Ready" for close to 8 minutes (roughly 480 seconds).  With OCP 4.3, the master nodes reaching "Ready" is not a prerequisite for the worker nodes to begin integrating into the cluster, but with OCP 4.4 it does appear to be.  The underlying issue on the master nodes seems to be etcd readiness, which gates proper cluster integration of the worker nodes.

8. In testing this OCP 4.4 cluster installation workaround across 17+ installations, all but possibly 1-2 tests had all cluster operators achieve an AVAILABLE state of "True" and remain in that state after running workload.
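
For illustration, a minimal sketch of this workaround issued from a Linux guest with sufficient CP privilege on the same z/VM system (the guest names are hypothetical; vmcp issues CP commands such as XAUTOLOG, and the fixed ~500 second sleep can instead be a poll of "oc get nodes"):

# boot the bootstrap and master nodes in quick succession (guest names are hypothetical)
vmcp xautolog BOOTSTR0
for guest in MASTER0 MASTER1 MASTER2; do vmcp xautolog "$guest"; done

# wait until all 3 masters report STATUS "Ready" (assumes KUBECONFIG points at the new cluster)
until [ "$(oc get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l)" -ge 3 ]; do
    sleep 30
done

# only then boot the worker nodes
for guest in WORKER0 WORKER1; do vmcp xautolog "$guest"; done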


Thank you.






Version-Release number of selected component (if applicable):
OCP 4.4 44.81.202005180840-0

How reproducible:
Can be reproduced consistently

Steps to Reproduce: Please see the "Description of problem" section above.

Actual results:
The OCP 4.4 zVM cluster does not install properly when xautolog booting the nodes' kernel.img files in quick succession.

Expected results:
The OCP 4.4 zVM cluster installs properly, using the same method that succeeds with OCP 4.3.

Additional info:

Comment 1 Adam Kaplan 2020-05-29 11:57:50 UTC
Moving this to the Multi-Arch team for further investigation.

"Build" is reserved for the OpenShift image build APIs and underlying components. See https://docs.openshift.com/container-platform/4.4/builds/understanding-image-builds.html

Comment 3 wvoesch 2020-06-02 11:58:49 UTC
I can confirm that the workers are not added automatically. After approving the pending CSRs manually, the worker nodes were added to the cluster, but it took another 32 minutes until the workers became Ready. Once the workers were in the Ready state, it took another 12 minutes until all cluster operators were available. The total duration of the installation was 1h 7 min.

system: z13, z/VM, 3 masters, 3 workers, FCP, HiperSockets; nodes were IPLed in quick succession.
Client Version: 4.4.0-0.nightly-s390x-2020-06-01-021037
Server Version: 4.4.0-0.nightly-s390x-2020-06-01-021037
Kubernetes Version: v1.17.1+f5fb168
Kernel Version:                         4.18.0-147.8.1.el8_1.s390x
OS Image:                               Red Hat Enterprise Linux CoreOS 44.81.202005250840-0 (Ootpa)
Operating System:                       linux
Architecture:                           s390x
Container Runtime Version:              cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8

Comment 4 Dan Li 2020-06-03 21:02:38 UTC
Re-assigning this to Andy.

Comment 5 Prashanth Sundararaman 2020-06-03 21:23:46 UTC
An `oc adm must-gather` would be helpful

Comment 6 krmoser 2020-06-04 13:34:17 UTC
Requested "oc adm must-gather" tar file from OCP 4.4 cluster is 322MB (exceeding the 19.5MB limit).  Would an email attachment to your email account be acceptable?

Here is some basic information from one of the clusters we are seeing the issue with:

[root@OSPBMGR1 ~]# oc version
Client Version: 4.4.0-0.nightly-s390x-2020-05-25-145353
Server Version: 4.4.0-0.nightly-s390x-2020-05-25-145353
Kubernetes Version: v1.17.1
[root@OSPBMGR1 ~]#

[root@OSPBMGR1 ~]# oc get clusterversion
NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-s390x-2020-05-25-145353   True        False         20h     Cluster version is 4.4.0-0.nightly-s390x-2020-05-25-145353
[root@OSPBMGR1 ~]#


[root@OSPBMGR1 ~]#  oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}'
ac97dfd6-d9d8-48ba-b1a8-80ce7b76226a
[root@OSPBMGR1 ~]#


Thank you.

Comment 7 Prashanth Sundararaman 2020-06-04 15:10:02 UTC
Could you try removing the audit_logs directory from the must-gather, then re-create the .tar.gz and check the size?
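
For reference, a minimal sketch of that cleanup (the must-gather directory layout shown is typical, but the paths here are illustrative):

# remove the audit logs from the extracted must-gather output
find must-gather.local.* -type d -name audit_logs -prune -exec rm -rf {} +
# re-create the compressed archive and check its size
tar czf must-gather-no-audit.tar.gz must-gather.local.*
du -h must-gather-no-audit.tar.gz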

Comment 8 krmoser 2020-06-04 15:40:10 UTC
Yes, removing the audit_logs directory reduces the size of the must-gather tar file to approximately 139 MB.

Here are the approximate sizes of the remaining must gather component directories:
[root@OSPBMGR1 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b1d72669d15ae0d9727a9201c225f458b9d6ce09a6d67a0edc43469774d74cbb]# du -ms *
4       cluster-scoped-resources
1015    host_service_logs
242     namespaces
1       version
[root@OSPBMGR1 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b1d72669d15ae0d9727a9201c225f458b9d6ce09a6d67a0edc43469774d74cbb]#


Thank you.

Comment 9 Prashanth Sundararaman 2020-06-04 18:46:17 UTC
OK. Can you just attach the kubelet logs from the workers that had trouble joining?

Comment 10 krmoser 2020-06-05 18:56:16 UTC
Created attachment 1695510 [details]
worker-0 kubelet log part 1 of 2

Comment 11 krmoser 2020-06-05 18:57:25 UTC
Created attachment 1695511 [details]
worker-0 kubelet log part 2 of 2

Comment 12 krmoser 2020-06-05 18:58:28 UTC
Created attachment 1695512 [details]
worker-1 kubelet log part 1 of 3

Comment 13 krmoser 2020-06-05 18:59:23 UTC
Created attachment 1695513 [details]
worker-1 kubelet log part 2 of 3

Comment 14 krmoser 2020-06-05 19:00:27 UTC
Created attachment 1695514 [details]
worker-1 kubelet log part 3 of 3

Comment 15 krmoser 2020-06-05 19:02:17 UTC
Prashanth,

Thanks.   Given their sizes, I needed to divide the worker-0 and worker-1 kubelet log files into 2 and 3 tar.gz parts, respectively.

Please let us know if you need any additional information.

Thank you,
Kyle
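
For reference, one way such kubelet logs can be captured and split into attachment-sized pieces (a sketch; the size threshold and file names are illustrative):

# on the node: dump and compress the kubelet journal
journalctl -u kubelet --no-pager | gzip > worker-0-kubelet.log.gz
# split into ~13 MB chunks to stay under the attachment size limit
split -b 13m -d worker-0-kubelet.log.gz worker-0-kubelet.log.gz.part
# the pieces can be reassembled with: cat worker-0-kubelet.log.gz.part* > worker-0-kubelet.log.gz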

Comment 16 Prashanth Sundararaman 2020-06-05 20:33:25 UTC
OK. At first glance, it looks like the kubelet was able to register the node early on:

Jun 03 16:21:35 worker-1.pok-43-may27.pok.stglabs.ibm.com hyperkube[1395]: I0603 16:21:35.287828    1395 kubelet_node_status.go:73] Successfully registered node worker-1.pok-43-may27.pok.stglabs.ibm.com
Jun 03 16:22:15 worker-1.pok-43-may27.pok.stglabs.ibm.com hyperkube[1395]: I0603 16:22:15.368873    1395 kubelet_node_status.go:486] Recording NodeReady event message for node worker-1.pok-43-may27.pok.stglabs.ibm.com

But it looks like the node was rebooted an hour and a half later:

Jun 03 17:48:15 worker-1.pok-43-may27.pok.stglabs.ibm.com systemd[1]: Stopped Kubernetes Kubelet.
Jun 03 17:48:15 worker-1.pok-43-may27.pok.stglabs.ibm.com systemd[1]: kubelet.service: Consumed 4min 31.230s CPU time
-- Reboot --
Jun 03 17:48:36 worker-1.pok-43-may27.pok.stglabs.ibm.com systemd[1]: Starting Kubernetes Kubelet...


And then there were some errors, probably because the apiserver was down:

Jun 03 17:48:57 worker-1.pok-43-may27.pok.stglabs.ibm.com hyperkube[1391]: E0603 17:48:57.952655    1391 kubelet_node_status.go:92] Unable to register node "worker-1.pok-43-may27.pok.stglabs.ibm.com" with API server: Post https://api-int.pok-43-may27.pok.stglabs.ibm.com:6443/api/v1/nodes: EOF

And then 3 minutes later it succeeds:

Jun 03 17:51:02 worker-1.pok-43-may27.pok.stglabs.ibm.com hyperkube[1391]: I0603 17:51:02.841651    1391 kubelet_node_status.go:73] Successfully registered node worker-1.pok-43-may27.pok.stglabs.ibm.com

Was the cluster rebooted after a while for a particular reason? The kubelet logs above say the node registration succeeded at 16:21:35, which is pretty early on. Did you see the node when you did an `oc get nodes`?

Comment 17 krmoser 2020-06-08 20:28:02 UTC
Prashanth,

Thanks for your analysis and my apologies for any misunderstanding.  Here's some additional background information on the OCP 4.4 zVM install issues we are experiencing, and how this relates to the data submitted to this bug.

There are at least 3 different OCP 4.4 zVM install process scenarios with issues that are not present with the OCP 4.3 zVM install process.  In these scenarios we used a minimal configuration of 3 master nodes and 2 worker nodes.  Similar issues/conditions occur when using additional worker nodes.

When installing OCP 4.4, depending on when the 2 worker nodes are booted after the 3 master nodes are booted/installed, these 3 different scenarios can occur.  The issue seems to be the readiness of the master nodes to then integrate the worker nodes (possibly etcd).  With OCP 4.4, the master nodes seem to require approximately 8 minutes from boot/install until they achieve the "Ready" state (as reported by the "oc get nodes" command).

1. Boot/install the 3 master nodes and the 2 worker nodes in quick succession, as is the normal install procedure with OCP 4.3.
===============================================================================================================================  
Unless the worker nodes' pending CSRs are approved within a few minutes of the master nodes achieving their "Ready" state (as reported by the "oc get nodes" command), the console and authentication operators never achieve an "AVAILABLE" state of True (as reported by the "oc get co" command).  Each worker node has 1 or more pending CSRs: 1 each prior to joining the cluster, and possibly at least 1 more each after joining the cluster.  (A CSR-approval sketch follows the scenarios below.)


2. Boot/install the 3 master nodes first, wait approximately 500 seconds, and then boot/install the 2 worker nodes.
===================================================================================================================
This OCP 4.4 zVM install process seems to work consistently: the master nodes install successfully, both workers install and join the cluster, all cluster operators achieve an "AVAILABLE" state of True (as reported by the "oc get co" command), and there are no pending CSRs.


3. Boot/install the 3 master nodes and 1 worker node, wait approximately 500 seconds, and then boot/install the 2nd worker node.  
================================================================================================================================
The first worker node does not integrate into the cluster without first manually approving its pending CSR(s), while the second worker node integrates into the cluster without manual intervention, without pending CSRs, and without issues.  Because there is at least one worker node in the "Ready" state without intervention (as reported by the "oc get nodes" command), all cluster operators achieve "AVAILABLE" states of True (as reported by the "oc get co" command) within normal time frames.
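
For reference, the generic way to approve all pending CSRs in one pass (this is the standard step from the OpenShift bare-metal install documentation, not a fix specific to this bug):

# approve every pending CSR; run it again a minute or two later, since a second
# (kubelet serving) CSR appears for each node after the first one is approved
oc get csr -o name | xargs -r oc adm certificate approve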



The worker kubelet data I previously submitted was for install scenario #2 above, with both worker nodes integrating into the cluster without intervention.  However, these 2 worker nodes would not have integrated successfully if their boot/install had not been delayed approximately 500 seconds after the master nodes' boot/install, by which point the masters had reached their "Ready" state.  We're curious about the root cause of this "readiness" condition on the master nodes.  We manually (and intentionally) rebooted the cluster after approximately 90 minutes as part of debug and error recovery testing.

For OCP 4.4 zVM install scenarios #1 and #3 listed above, we can recreate and provide any master and/or worker kubelet data that you would like.


Thank you,
Kyle

Comment 18 Prashanth Sundararaman 2020-06-09 17:31:32 UTC
Thank you, Kyle, for this detailed explanation! It certainly makes things clearer. So if I understand correctly, there are two issues:
      - Master nodes take 8 minutes to reach the Ready state, from the point when they show up in `oc get nodes`?
      - Worker nodes do not join automatically unless their CSRs are approved?

The second one might be the default behavior, because we are also seeing that the CSRs do not get approved automatically; there is a manual step required to approve them. After that, things seem fine. This might be how it is designed to work, as the docs say you need to approve the CSRs if they are not approved automatically: https://docs.openshift.com/container-platform/4.3/installing/installing_bare_metal/installing-bare-metal.html#installation-approve-csrs_installing-bare-metal. But I will follow up on this to confirm.

As for the master nodes taking 8 minutes to reach the Ready state, that is weird. In our zVM setups we see all masters reach the Ready state in less than a minute or two from the time they are up.
Could you provide us with the bootkube logs from the bootstrap node? Is there any slowness in the network that could lead to this? I am also assuming the masters have 16G or more of memory.

Comment 19 krmoser 2020-06-10 13:37:02 UTC
Prashanth,

Thank you for your assistance.  Here is some information to help with your questions.

1. The master nodes require approximately 8 minutes from zVM xautolog boot from their zVM readers until they attain STATUS "Ready" as reported by the "oc get nodes" command.  Using the "oc get nodes" command, the 3 master nodes actually start to display with STATUS "NotReady" approximately 3 minutes after they perform their zVM xautolog boot from their zVM readers.  The 3 master nodes then require approximately 5 additional minutes before they transition from STATUS "NotReady" to "Ready".

2. The worker nodes can automatically join the cluster if their zVM xautolog boot from their zVM readers is timed to occur within a short time after the 3 master nodes attain STATUS "Ready" as reported by the "oc get nodes" command.  If this window is missed, then the pending CSRs for these worker nodes must be approved within 1-2 minutes.  If the pending worker node CSRs are approved only after that 1-2 minute window, the authentication and console operators may not achieve an AVAILABLE status of True, as indicated by the "oc get co" command.

3. All nodes in a few OCP 4.4 clusters we have seen this issue with have 32GB real memory, including the bastion, bootstrap, master, and worker nodes.

4. This OCP 4.4 install issue does not seem to be a network issue as:
  1. Both the IBM Germany and USA Solution Test teams see the same/similar behavior in their separate lab environments.
  2. Installing OCP 4.3 (the 4.3.23 build) on the same clusters does not show these issues: there are no pending CSRs and no 8-minute master readiness delays.

5. Yes, we will work to provide the bootkube logs from the bootstrap node.

Thank you,
Kyle

Comment 20 Prashanth Sundararaman 2020-06-10 13:56:23 UTC
Thanks again for the succinct explanation. While we are waiting for the bootstrap logs, could you also check whether your masters are scheduled as workers too? i.e.,

[root@rock-zvm-3-1 ~]# ./oc get nodes
NAME                        STATUS   ROLES           AGE     VERSION
master-0.test.example.com   Ready    master,worker   2m57s   v1.18.3+1635e9d
master-1.test.example.com   Ready    master,worker   3m15s   v1.18.3+1635e9d
master-2.test.example.com   Ready    master,worker   2m52s   v1.18.3+1635e9d


If this happens, then as soon as the masters are visible through `oc get nodes` (even if NotReady), could you set mastersSchedulable to false like this:

oc patch schedulers.config.openshift.io cluster --type merge --patch '{"spec":{"mastersSchedulable": false}}'

We believe this could be causing the authentication operator to have problems when the worker nodes come up.
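
If useful, a quick way to verify that the patch took effect (a sketch against the same scheduler config object):

oc get schedulers.config.openshift.io cluster -o jsonpath='{.spec.mastersSchedulable}{"\n"}'
# expected output after the patch: false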

Could you also send the kubelet log for one of the masters along with the bootkube logs?

Thanks

Comment 21 krmoser 2020-06-11 10:55:26 UTC
Prashanth,

Thanks for the information.  I've reinstalled a few times to test some scenarios and setting mastersSchedulable to false, and will work to upload the kubelet log for one of the masters along with the bootkube logs today.

1. For the 4.4.0-0.nightly-s390x-2020-05-25-145353 build I've been testing with, all 3 master nodes are also scheduled as workers.

[root@ospbmgr2 /]# oc get nodes
NAME                                        STATUS   ROLES           AGE    VERSION
master-0.pok-90-may25.pok.stglabs.ibm.com   Ready    master,worker   132m   v1.17.1
master-1.pok-90-may25.pok.stglabs.ibm.com   Ready    master,worker   131m   v1.17.1
master-2.pok-90-may25.pok.stglabs.ibm.com   Ready    master,worker   131m   v1.17.1
worker-0.pok-90-may25.pok.stglabs.ibm.com   Ready    worker          11m    v1.17.1
worker-1.pok-90-may25.pok.stglabs.ibm.com   Ready    worker          126m   v1.17.1
[root@ospbmgr2 /]#


2. One key difference between your and our configuration seems to be the Kubernetes version.  Your kubernetes version is v1.18.3+1635e9d, while our kubernetes version is v1.17.1.  

[root@ospbmgr2 /]# oc version
Client Version: 4.4.0-0.nightly-s390x-2020-05-25-145353
Server Version: 4.4.0-0.nightly-s390x-2020-05-25-145353
Kubernetes Version: v1.17.1
[root@ospbmgr2 /]#

The release.txt files for the 05-18-2020, 05-25-2020, 06-01-2020, and 06-08-2020 builds all indicate: Kubernetes 1.17.1


3. Setting the mastersSchedulable value to false seems to resolve the authentication operator issue in my initial tests, and I'll be conducting a few more tests today.


Thank you,
Kyle

Comment 22 Prashanth Sundararaman 2020-06-11 11:46:18 UTC
Thank you, Kyle! Please don't worry about the Kubernetes version, as I was testing with some 4.6 builds at that time. The Kubernetes version of 4.4 is v1.17.1, as you pointed out.

Setting mastersSchedulable to false is a recommended step; leaving the masters schedulable could have caused race conditions in where the ingress pods were placed, which might have caused the issue you saw. Please make sure this step is part of your installation procedure.

Comment 23 krmoser 2020-06-12 05:59:51 UTC
Created attachment 1696923 [details]
master-0 kubelet log part 1 of 5

Comment 24 krmoser 2020-06-12 06:02:11 UTC
Created attachment 1696924 [details]
master-0 kubelet log part 2 of 5

Comment 25 krmoser 2020-06-12 06:09:32 UTC
Created attachment 1696925 [details]
master-0 kubelet log part 3 of 5

Comment 26 krmoser 2020-06-12 06:13:16 UTC
Created attachment 1696926 [details]
master-0 kubelet log part 4 of 5

Comment 27 krmoser 2020-06-12 06:15:16 UTC
Created attachment 1696927 [details]
master-0 kubelet log part 5 of 5

Comment 28 krmoser 2020-06-12 06:18:13 UTC
Created attachment 1696928 [details]
master-0 kubelet log part 2 of 5

Comment 29 krmoser 2020-06-12 06:22:28 UTC
Created attachment 1696929 [details]
bootstrap-0 bootkube service log

Comment 30 krmoser 2020-06-12 06:23:30 UTC
Created attachment 1696930 [details]
bootstrap-0 bootstrap-control-plane logs

Comment 31 krmoser 2020-06-12 06:30:49 UTC
Prashanth,

Thanks for the information.  I've uploaded the requested bootstrap-0 and master-0 logs.

1. My apologies for not getting these logs to you yesterday -- two colleagues and I were trying to install the OCP 4.4 nightly 06-08-2020 build on multiple clusters and were unsuccessful; only the bootstrap-0 node installs successfully.  Would you know if anyone on your team has successfully installed the OCP 4.4 nightly 06-08-2020 build?
 
2. Given the size of the master-0 node kubelet log, I needed to split it into 5 tar.gz parts.

Please let us know if you need any additional information.

Thank you,
Kyle

Comment 32 krmoser 2020-06-14 21:29:33 UTC
Prashanth,

Just a quick update: as we continue to test with the June 12, 2020 (06-08-2020) OCP 4.4 nightly build, starting on Friday and over the weekend, our initial results indicate that most of the OCP 4.4 install problems we have encountered to date have been corrected in this build.

As an example, instead of requiring several minutes, the 3 masters now achieve "Ready" STATUS (as reported by the "oc get nodes" command) within 60-65 seconds.

We'll provide additional information/updates on Monday.

Thank you,
Kyle

Comment 33 krmoser 2020-06-16 05:43:17 UTC
Prashanth,

We're continuing to test with the June 12, 2020 (06-08-2020) OCP 4.4 nightly build, and the build consistently installs without needing to sleep between the master node and worker node boots, as long as the worker nodes' pending CSRs are approved and the mastersSchedulable option is set to false.  There does seem to be a longer delay between when the master nodes become Ready and when the worker nodes install and become Ready.  We are continuing to investigate.

Thank you,
Kyle

Comment 34 Prashanth Sundararaman 2020-06-16 12:33:03 UTC
That's great news Kyle. Please keep us updated and let me know if this issue can be closed once you confirm.

Comment 35 krmoser 2020-06-17 16:01:44 UTC
Prashanth,

We've been continuing to test with the June 12, 2020 OCP 4.4 build (06-12-2020; please forgive the typo in my previous post where I indicated 06-08-2020), and our zVM installation tests have been consistently successful, provided we approve the pending worker node CSRs and set mastersSchedulable to false.  Our subsequent tests of the 06-14-2020, 06-15-2020, and 06-17-2020 builds show basic install issues (the master nodes do not install consistently), and we're continuing to work on the debug effort.  We encountered similar master install issues with the 06-01-2020 and 06-08-2020 builds.

At this point, the most stable OCP 4.4 build to date seems to be the June 12, 2020 (06-12-2020) build, which also does not require the approximately 7-8 minute sleep between the boot/install of the master nodes and the boot/install of the worker nodes.  We encountered the issue where the worker nodes' boot/install must wait 7-8 minutes after the master nodes' boot/install with the 05-18-2020 and 05-25-2020 OCP 4.4 builds.

Thank you,
Kyle

Comment 36 Prashanth Sundararaman 2020-06-18 13:18:38 UTC
Kyle,

I just tested the 06-18-2020 build and saw no issues. I approved the CSRs, the workers got added, and the cluster came up fine with all operators running. As before, please provide bootkube and kubelet logs from the bootstrap and the problematic nodes.

I believe the issue with the worker nodes having to wait 7-8 minutes is resolved in the newer builds, and the issue you are seeing now is something new? When you say the master nodes do not install - do they never reach the Ready state? Do they not boot up? Some details would be good.

Prashanth

Comment 37 krmoser 2020-06-18 14:30:33 UTC
Prashanth,

Thanks for the update.  We just tested with the 06-18-2020 build, which consists of:

1. rhcos-44.81.202006171550-0-dasd.s390x.raw.gz
2. rhcos-44.81.202006171550-0-installer-initramfs.s390x.img
3. rhcos-44.81.202006171550-0-installer-kernel-s390x
4. rhcos-44.81.202006171550-0-installer.s390x.iso
5. rhcos-44.81.202006171550-0-metal.s390x.raw.gz
6. openshift-client-linux-4.4.0-0.nightly-s390x-2020-06-18-112815.tar.gz
7. openshift-install-linux-4.4.0-0.nightly-s390x-2020-06-18-112815.tar.gz

and it installs without issue.

When trying to install with the latest 06-17-2020 openshift-client and openshift-install, the install fails with continual master and worker node HTTP GET requests that are never fulfilled.

Thank you,
Kyle

Comment 38 krmoser 2020-06-18 17:26:49 UTC
Prashanth,

When running openshift-install 4.4.0-0.nightly-s390x-2020-06-18-112815 and client version 4.4.0-0.nightly-s390x-2020-06-18-112815, we are continually getting the following journalctl -xe entries on our bootstrap node:

Jun 18 17:19:59 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1688]: Error: error pulling image "registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:1>
Jun 18 17:19:59 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1688]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:1620bc3>
Jun 18 17:20:00 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1688]: Error: error pulling image "registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:1>
Jun 18 17:20:00 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1688]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:1620bc3>
Jun 18 17:20:00 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1688]: Error: error pulling image "registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:1>
Jun 18 17:20:00 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1688]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:1620bc3>
lines 969-1013/1013 (END)


Thank you,
Kyle

Comment 39 Christian LaPolt 2020-06-18 17:31:36 UTC
I noticed that the image pull from the 6-12 installer is trying to get to 
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:139c0faf3422db2d106ff2d1f5fd44cc06f69a6a772ca7083adcf460a3e88c45

The 6-17 installer is trying 

release image registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:ac208c137611e808414b0a9b6321f5983ce8933c6e17d50521b24cabc7ef2c78

I am not sure if this is the issue, but it is the only real difference I see at this point.

I did the check after seeing this from journalctl -xe 

Jun 18 17:21:09 bootstrap-0.ospamgr2-jun17.zvmocp.notld release-image-download.sh[1718]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp.....

Thanks,
Christian
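
For reference, one way to compare what each installer binary references (the "release image" line is printed by openshift-install itself; inspecting the payload assumes the bastion has registry access and a valid pull secret):

# print the release image pullspec embedded in each installer binary
./openshift-install version
# verify that the referenced release payload is actually reachable
oc adm release info <release-image-pullspec>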

Comment 40 krmoser 2020-06-18 17:51:34 UTC
Prashanth,

We get the same issue when using client version 4.4.0-0.nightly-s390x-2020-06-17-185805, openshift-install 4.4.0-0.nightly-s390x-2020-06-17-185805, and bootstrap RHCOS level Red Hat Enterprise Linux CoreOS 44.81.202006171550-0 (Ootpa) 4.4.

We also see this intermittently on the bootstrap:
[systemd]
Failed Units: 1
  sssd.service


We are currently seeing the following on the bootstrap:

Jun 18 17:48:26 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1717]: Error: error pulling image "registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:a>
Jun 18 17:48:26 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1717]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:ac208c1>
Jun 18 17:48:26 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1717]: Error: error pulling image "registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:a>
Jun 18 17:48:26 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1717]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:ac208c1>
Jun 18 17:48:27 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1717]: Error: error pulling image "registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:a>
Jun 18 17:48:27 bootstrap-0.pok-90-jun15.pok.stglabs.ibm.com release-image-download.sh[1717]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp-s390x/release-s390x@sha256:ac208c1>


[root@bootstrap-0 core]# oc version
Client Version: 4.4.0-202006132207-d038424
[root@bootstrap-0 core]#



Thank you,
Kyle

Comment 41 krmoser 2020-06-19 07:54:03 UTC
Prashanth,

Thank you again to you and your colleagues for all the assistance with the dual build-stream issue.

We have been able to successfully install multiple zVM clusters with the OCP 4.4.0-0.nightly-s390x-2020-06-17-185805 build without issue.  We are continuing to test and will let you know if we encounter any issues.

Thank you,
Kyle

Comment 42 Dan Li 2020-07-27 18:04:12 UTC
Hi Kyle, can this bug be closed or is there additional work to be done? Prashanth is OOTO at the moment so any work to be done will likely be deferred until the next sprint.

Comment 43 krmoser 2020-07-28 04:10:05 UTC
Thank you for all your assistance.  Please close this issue.

Comment 45 Dan Li 2020-07-28 12:35:43 UTC
Closing per confirmation from reporter

