Bug 1887007 - Creating cluster with realtime kernel (both masters and workers) almost always fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Chris White
QA Contact: Network QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-10 03:21 UTC by To Hung Sze
Modified: 2021-03-16 13:30 UTC
CC: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-16 13:30:00 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
gather bootstrap log for attempt with 4.7 (5.67 MB, application/gzip), attached 2020-12-17 20:08 UTC by To Hung Sze

Description To Hung Sze 2020-10-10 03:21:44 UTC
Description of problem:
It is very difficult to get a cluster up and working properly when specifying the realtime kernel.

Version-Release number of selected component (if applicable):
4.6.0-rc.1

How reproducible:
Always (see below)

Steps to Reproduce:
1. ./openshift-install create install-config --dir <dir>
2. ./openshift-install create manifests --dir <dir>
3. Create MachineConfig manifests for the master and worker roles (here <dir> is gcp100920g):
cat > gcp100920g/openshift/99-master-kerneltype.yaml <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: 99-master-kerneltype
spec:
  kernelType: realtime
EOF

cat > gcp100920g/openshift/99-worker-kerneltype.yaml <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime
EOF

4. ./openshift-install create cluster --dir <dir>
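
As a quick sanity check after step 4, the custom MachineConfigs and pools can be inspected (a minimal sketch; it assumes the kubeconfig from the install directory and the manifest names created in step 3):

# Confirm the custom kernelType MachineConfigs were ingested
./oc get machineconfig 99-master-kerneltype 99-worker-kerneltype
# The pools should eventually report the rendered configs as updated
./oc get machineconfigpool master worker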

Actual results:
Most of the time (7 out of the 9 attempts I tried), the cluster fails installation.
The kubeconfig is created, but the cluster is not healthy enough to allow ./oc login.

In the remaining two attempts,
oc --kubeconfig <installation_directory>/auth/kubeconfig debug node/<master_node> (or <worker_node>) fails completely, or fails for at least one of the nodes, with:

$ ./oc debug node/tszegcp100920g-bxgsp-worker-b-rc7qn.c.openshift-qe.internal
Creating debug namespace/openshift-debug-node-8g5wb ...
Removing debug namespace/openshift-debug-node-8g5wb ...
Error from server (Forbidden): pods "tszegcp100920g-bxgsp-worker-b-rc7qncopenshift-qeinternal-debug" is forbidden: error looking up service account openshift-debug-node-8g5wb2rg4l/default: serviceaccount "default" not found

5. ./oc get co shows some operators degraded:
$ ./oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-rc.1   False       False         True       8m3s
monitoring                                 4.6.0-rc.1   False       True          True       3m8s

6. $ ./oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fa63640328598f72567027e9cd0d50f00d4ec058dacc61f3be3c6cca7fbefac5
[must-gather      ] OUT namespace/openshift-must-gather-lzjp9 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-g744n created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-g744n deleted
[must-gather      ] OUT namespace/openshift-must-gather-lzjp9 deleted
Error from server (Forbidden): pods "must-gather-" is forbidden: error looking up service account openshift-must-gather-lzjp9/default: serviceaccount "default" not found

7. On nodes where the debug pod could be created, uname does show the realtime kernel in use:
sh-4.4# uname -a
Linux tszegcp100920g-bxgsp-worker-c-2qfk2.c.openshift-qe.internal 4.18.0-193.24.1.rt13.74.el8_2.dt1.x86_64 #1 SMP PREEMPT RT Fri Sep 25 12:29:06 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux



Expected results:
Creating a cluster with the realtime kernel always works; all nodes are available and use the realtime kernel.
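
A sketch of one way to check that across every node at once (assuming the ./oc binary and a reachable cluster; debug pods must be creatable, which is exactly what fails above):

# Print the running kernel on each node via a debug pod
for node in $(./oc get nodes -o name); do
  ./oc debug "$node" -- chroot /host uname -r
done
# RT nodes should report an *.rt* kernel, e.g. 4.18.0-193.24.1.rt13.74.el8_2.dt1.x86_64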

Comment 1 Wenjing Zheng 2020-10-10 08:17:06 UTC
With version 4.6.0-rc.2, one node is not ready and some operators are degraded: authentication, machine-config, and openshift-apiserver.
$ oc get nodes
NAME                               STATUS                     ROLES    AGE    VERSION
wzheng1010-rxzcs-compute-0         Ready                      worker   87m    v1.19.0+d59ce34
wzheng1010-rxzcs-compute-1         Ready                      worker   87m    v1.19.0+d59ce34
wzheng1010-rxzcs-compute-2         Ready                      worker   86m    v1.19.0+d59ce34
wzheng1010-rxzcs-control-plane-0   Ready                      master   102m   v1.19.0+d59ce34
wzheng1010-rxzcs-control-plane-1   Ready,SchedulingDisabled   master   102m   v1.19.0+d59ce34
wzheng1010-rxzcs-control-plane-2   Ready                      master   102m   v1.19.0+d59ce34
$ oc describe co machine-config
  Extension:
    Last Sync Error:  error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
    Master:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node wzheng1010-rxzcs-control-plane-1 is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod \\\"etcd-quorum-guard-644f5747b8-hk9t9\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"cluster-monitoring-operator-5469cc87fd-4gtw7\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"packageserver-58f65dcf7d-ccn74\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"insights-operator-79c6587597-w299g\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"cloud-credential-operator-5d456d584f-vvjxz\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"authentication-operator-555fbb4869-h8j4v\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"csi-snapshot-controller-operator-859f7b9bfb-ks68b\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"catalog-operator-59546d8c85-927q6\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"apiserver-6dd8df54f5-8jt4t\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"service-ca-operator-55985d6c6c-rnqx8\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"revision-pruner-3-wzheng1010-rxzcs-control-plane-1\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"olm-operator-6984b748cf-z9gkz\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"marketplace-operator-6d54fcd68c-hd4cw\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"machine-config-operator-67f479947-k8xh4\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"console-7495f84648-sdtm8\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"cluster-image-registry-operator-744bcc5b4c-lgrjf\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"kube-apiserver-operator-884f95bc8-dkxnb\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"cluster-st\""
    Worker:           all 3 nodes are at latest configuration rendered-worker-7188b96a970ce5ee43dac510dfd21b58
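
For anyone triaging a pool stuck like this, the MachineConfigPool status and the node's MCO annotations usually identify where the drain is blocked (a hedged sketch using stock oc commands; the node name is the one reported degraded above):

# Pool-level counts of ready/updated/degraded machines
oc get machineconfigpool
# Per-node MCO state (Working/Degraded/Done) lives in the node annotations
oc describe node wzheng1010-rxzcs-control-plane-1 | grep machineconfiguration.openshift.io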

Comment 3 Wei Sun 2020-10-10 08:23:56 UTC
Adding the testblocker keyword; this is blocking testing with the realtime kernel.

Comment 5 Ryan Phillips 2020-10-12 16:05:45 UTC
A fix just went in for this. Could you retest? If the stars aligned, the fix should be in the rc2 candidate.

https://github.com/openshift/kubernetes/pull/400

Comment 8 To Hung Sze 2020-10-12 21:04:21 UTC
Tried with rc2.
It doesn't seem to improve.
One attempt via Flexy: the cluster failed to finish installation.
One attempt manually: the cluster finished installation, and nodes have the RT kernel, but I can't create a debug pod:

$ uname -a
Linux tszegcp101220c-kww2b-master-0.c.openshift-qe.internal 4.18.0-193.24.1.rt13.74.el8_2.dt1.x86_64 #1 SMP PREEMPT RT Fri Sep 25 12:29:06 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

$ ./oc debug node/tszegcp101220c-kww2b-master-0.c.openshift-qe.internal
Creating debug namespace/openshift-debug-node-xrpff ...
Removing debug namespace/openshift-debug-node-xrpff ...
Error from server (Forbidden): pods "tszegcp101220c-kww2b-master-0copenshift-qeinternal-debug" is forbidden: error looking up service account openshift-debug-node-xrpff992cd/default: serviceaccount "default" not found

Comment 9 To Hung Sze 2020-10-12 21:05:52 UTC
Also can't collect must-gather:
$ ./oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fa63640328598f72567027e9cd0d50f00d4ec058dacc61f3be3c6cca7fbefac5
[must-gather      ] OUT namespace/openshift-must-gather-g7mnm created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-t5wzx created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-t5wzx deleted
[must-gather      ] OUT namespace/openshift-must-gather-g7mnm deleted
Error from server (Forbidden): pods "must-gather-" is forbidden: error looking up service account openshift-must-gather-g7mnm/default: serviceaccount "default" not found

Comment 10 Seth Jennings 2020-10-12 21:23:00 UTC
These errors have nothing to do with the Node component.  In this particular case the KCM (kube-controller-manager) doesn't seem to be creating the default serviceaccount for the namespace, but it is more likely that the cluster is broken in some fundamental way.
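
(A sketch of how that theory could be checked; the default serviceaccount in each namespace is created by a controller that runs inside the KCM, and the namespace name below is taken from the failed debug attempt earlier in this bug:)

# Is the default serviceaccount missing from the newly created namespace?
oc get serviceaccount default -n openshift-debug-node-8g5wb2rg4l
# Is the kube-controller-manager itself healthy?
oc get pods -n openshift-kube-controller-manager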

Comment 11 Ryan Phillips 2020-10-12 21:37:41 UTC
The kubelet log shows a crash in the controller-manager:

From kubelet_service.log:

Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: I1010 06:39:16.476467    1972 status_manager.go:572] Status for pod "controller-manager-xs988_openshift-controller-manager(bc62fcc0-9c79-4433-96f3-9f992540de35)" updated successfully: (3, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-10 06:37:05 +0000 UT
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         github.com/openshift/client-go.0-20200722173614-5a1b0aaeff15/apps/clientset/versioned/typed/apps/v1/deploymentconfig.go:82 +0x1db
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: github.com/openshift/client-go/apps/informers/externalversions/apps/v1.NewFilteredDeploymentConfigInformer.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x23bad37, ...)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         github.com/openshift/client-go.0-20200722173614-5a1b0aaeff15/apps/informers/externalversions/apps/v1/deploymentconfig.go:49 +0x1bc
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: k8s.io/client-go/tools/cache.(*ListWatch).List(0xc00053bda0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/cache/listwatch.go:106 +0x78
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1.1.2(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x23bad37, ...)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/cache/reflector.go:265 +0x75
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: k8s.io/client-go/tools/pager.SimplePageFunc.func1(0x274d720, 0xc000130010, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/pager/pager.go:40 +0x64
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: k8s.io/client-go/tools/pager.(*ListPager).List(0xc00246fe60, 0x274d720, 0xc000130010, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/pager/pager.go:91 +0x179
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1.1(0xc00063b020, 0xc001110ea0, 0xc001f16900, 0xc002260d50, 0xc00239bd90, 0xc002260d60, 0xc0014d74a0)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/cache/reflector.go:290 +0x1a5
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/cache/reflector.go:256 +0x295
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: goroutine 8015 [runnable]:
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func2(0xc000e42340, 0xc00059f740, 0xc002593a40, 0xc000b0df20)
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/cache/reflector.go:361 +0x16f
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]: created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch
Oct 10 06:39:16.476658 wzheng1010-rxzcs-control-plane-2 hyperkube[1972]:         k8s.io/client-go.0-rc.3/tools/cache/reflector.go:355 +0x2a5

Comment 12 Ryan Phillips 2020-10-12 21:41:00 UTC
It looks like the networking for the pod is ok.

Comment 14 Tomáš Nožička 2020-10-13 10:24:31 UTC
KCM is down because of KAS (kube-apiserver).


KAS is down because of etcd:
Trace[810146594]: [19.522779599s] [19.522779599s] END
I1013 09:54:18.445020     198 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out


# etcdctl endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |        ERRORS         |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|  https://10.0.182.25:2379 | af37c4e1b6d29ada |   3.4.9 |   61 MB |     false |      false |      1359 |      22706 |              22706 | etcdserver: no leader |
| https://10.0.235.116:2379 | ab9304c9dd492523 |   3.4.9 |   61 MB |     false |      false |      1359 |      22706 |              22706 | etcdserver: no leader |
| https://10.0.147.156:2379 | 4eb0b1b9793bf50f |   3.4.9 |   61 MB |     false |      false |      1359 |      22706 |              22706 |                       |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+

Sending to the etcd team to have a look at why etcd is broken:

2020-10-13 10:23:30.771777 I | etcdserver/api/etcdhttp: /health OK (status code 200)
raft2020/10/13 10:23:35 INFO: 4eb0b1b9793bf50f is starting a new election at term 1417
raft2020/10/13 10:23:35 INFO: 4eb0b1b9793bf50f became candidate at term 1418
raft2020/10/13 10:23:35 INFO: 4eb0b1b9793bf50f received MsgVoteResp from 4eb0b1b9793bf50f at term 1418
raft2020/10/13 10:23:35 INFO: 4eb0b1b9793bf50f [logterm: 1417, index: 23010] sent MsgVote request to ab9304c9dd492523 at term 1418
raft2020/10/13 10:23:35 INFO: 4eb0b1b9793bf50f [logterm: 1417, index: 23010] sent MsgVote request to af37c4e1b6d29ada at term 1418
raft2020/10/13 10:23:35 INFO: raft.node: 4eb0b1b9793bf50f lost leader af37c4e1b6d29ada at term 1418
2020-10-13 10:23:36.754155 W | etcdserver/api/etcdhttp: /health error; QGET failed etcdserver: request timed out (status code 503)
raft2020/10/13 10:23:36 INFO: 4eb0b1b9793bf50f is starting a new election at term 1418
raft2020/10/13 10:23:36 INFO: 4eb0b1b9793bf50f became candidate at term 1419
raft2020/10/13 10:23:36 INFO: 4eb0b1b9793bf50f received MsgVoteResp from 4eb0b1b9793bf50f at term 1419
raft2020/10/13 10:23:36 INFO: 4eb0b1b9793bf50f [logterm: 1417, index: 23010] sent MsgVote request to ab9304c9dd492523 at term 1419
raft2020/10/13 10:23:36 INFO: 4eb0b1b9793bf50f [logterm: 1417, index: 23010] sent MsgVote request to af37c4e1b6d29ada at term 1419
raft2020/10/13 10:23:37 INFO: 4eb0b1b9793bf50f [term: 1419] ignored a MsgReadIndexResp message with lower term from af37c4e1b6d29ada [term: 1417]
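
(A hedged sketch of the usual follow-up checks when etcd cycles through leader elections like this; etcdctl is run from inside an etcd pod, and none of these commands are specific to this cluster:)

# Per-endpoint health; a flapping leader usually shows intermittent failures
etcdctl endpoint health -w table
# Confirm all three peers are still members of the cluster
etcdctl member list -w table
# Slow disk or CPU starvation (plausible with an RT kernel) shows up as
# "took too long" warnings in the etcd container logs
oc logs -n openshift-etcd -l app=etcd -c etcd | grep -i "took too long"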

Comment 20 Ben Bennett 2020-10-13 14:34:05 UTC
Sam, can you please keep the cluster up in case the RHEL team needs to do further investigation?  Since this is purely host networking, it is unrelated to the SDN, so we reassigned to the RHEL realtime team to see if they can identify the problem.

Comment 24 Sam Batschelet 2020-10-13 14:54:04 UTC
FTR the cluster has a 48hr TTL and was created by Tomáš ~2020-10-13 10:44:51 UTC.

Comment 29 Yang Yang 2020-10-14 03:43:47 UTC
Cluster deployment with 3 "normal" master nodes and 3 "RT" worker nodes passed on 4.6.0-rc.3.

1. # oc get node  -- All nodes are Ready
NAME                                                      STATUS   ROLES    AGE   VERSION
yanyangrt2-xk2tq-master-0.c.openshift-qe.internal         Ready    master   43m   v1.19.0+d59ce34
yanyangrt2-xk2tq-master-1.c.openshift-qe.internal         Ready    master   43m   v1.19.0+d59ce34
yanyangrt2-xk2tq-master-2.c.openshift-qe.internal         Ready    master   43m   v1.19.0+d59ce34
yanyangrt2-xk2tq-worker-a-fdvbg.c.openshift-qe.internal   Ready    worker   23m   v1.19.0+d59ce34
yanyangrt2-xk2tq-worker-b-p7w7b.c.openshift-qe.internal   Ready    worker   23m   v1.19.0+d59ce34
yanyangrt2-xk2tq-worker-c-x6rfj.c.openshift-qe.internal   Ready    worker   23m   v1.19.0+d59ce34

2. # oc get co  -- No degraded operators
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-rc.3   True        False         False      5m15s
cloud-credential                           4.6.0-rc.3   True        False         False      53m
cluster-autoscaler                         4.6.0-rc.3   True        False         False      40m
config-operator                            4.6.0-rc.3   True        False         False      41m
console                                    4.6.0-rc.3   True        False         False      11m
csi-snapshot-controller                    4.6.0-rc.3   True        False         False      8m33s
dns                                        4.6.0-rc.3   True        False         False      39m
etcd                                       4.6.0-rc.3   True        False         False      39m
image-registry                             4.6.0-rc.3   True        False         False      21m
ingress                                    4.6.0-rc.3   True        False         False      22m
insights                                   4.6.0-rc.3   True        False         False      41m
kube-apiserver                             4.6.0-rc.3   True        False         False      38m
kube-controller-manager                    4.6.0-rc.3   True        False         False      37m
kube-scheduler                             4.6.0-rc.3   True        False         False      38m
kube-storage-version-migrator              4.6.0-rc.3   True        False         False      6m52s
machine-api                                4.6.0-rc.3   True        False         False      33m
machine-approver                           4.6.0-rc.3   True        False         False      40m
machine-config                             4.6.0-rc.3   True        False         False      40m
marketplace                                4.6.0-rc.3   True        False         False      10m
monitoring                                 4.6.0-rc.3   True        False         False      8m15s
network                                    4.6.0-rc.3   True        False         False      42m
node-tuning                                4.6.0-rc.3   True        False         False      41m
openshift-apiserver                        4.6.0-rc.3   True        False         False      7m41s
openshift-controller-manager               4.6.0-rc.3   True        False         False      31m
openshift-samples                          4.6.0-rc.3   True        False         False      32m
operator-lifecycle-manager                 4.6.0-rc.3   True        False         False      40m
operator-lifecycle-manager-catalog         4.6.0-rc.3   True        False         False      40m
operator-lifecycle-manager-packageserver   4.6.0-rc.3   True        False         False      8m25s
service-ca                                 4.6.0-rc.3   True        False         False      41m
storage                                    4.6.0-rc.3   True        False         False      41m

3. # oc debug node/yanyangrt2-xk2tq-master-0.c.openshift-qe.internal  -- Master nodes are running the normal kernel

Starting pod/yanyangrt2-xk2tq-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.3
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# uname -a
Linux yanyangrt2-xk2tq-master-0.c.openshift-qe.internal 4.18.0-193.24.1.el8_2.dt1.x86_64 #1 SMP Thu Sep 24 14:57:05 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

4. # oc debug node/yanyangrt2-xk2tq-worker-a-fdvbg.c.openshift-qe.internal  -- Worker nodes are running the RT kernel

Starting pod/yanyangrt2-xk2tq-worker-a-fdvbgcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# uname -a
Linux yanyangrt2-xk2tq-worker-a-fdvbg.c.openshift-qe.internal 4.18.0-193.24.1.rt13.74.el8_2.dt1.x86_64 #1 SMP PREEMPT RT Fri Sep 25 12:29:06 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 32 Marius Cornea 2020-10-14 09:24:44 UTC
For an additional datapoint from testing: I was able to get a passing deployment with the RT kernel enabled for both master and worker nodes on a baremetal IPI environment simulated on VMs, with 4.6.0-rc.3.

Comment 40 Yang Yang 2020-10-15 06:11:35 UTC
I am seeing degraded operators on clusters with RT enabled on both master and worker nodes, on IPI GCP and UPI baremetal, with 4.6.0-rc.4-x86_64.

# oc get node
NAME                                                      STATUS   ROLES    AGE     VERSION
yanyangrt4-wtdkj-master-0.c.openshift-qe.internal         Ready    master   3h20m   v1.19.0+d59ce34
yanyangrt4-wtdkj-master-1.c.openshift-qe.internal         Ready    master   3h19m   v1.19.0+d59ce34
yanyangrt4-wtdkj-master-2.c.openshift-qe.internal         Ready    master   3h19m   v1.19.0+d59ce34
yanyangrt4-wtdkj-worker-a-qmghv.c.openshift-qe.internal   Ready    worker   3h8m    v1.19.0+d59ce34
yanyangrt4-wtdkj-worker-b-mlpkj.c.openshift-qe.internal   Ready    worker   3h8m    v1.19.0+d59ce34
yanyangrt4-wtdkj-worker-c-sgmmx.c.openshift-qe.internal   Ready    worker   3h8m    v1.19.0+d59ce34


# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-rc.4   False       False         True       59s
cloud-credential                           4.6.0-rc.4   True        False         False      171m
cluster-autoscaler                         4.6.0-rc.4   True        False         False      163m
config-operator                            4.6.0-rc.4   True        False         False      164m
console                                    4.6.0-rc.4   True        False         False      137m
csi-snapshot-controller                    4.6.0-rc.4   False       True          False      14m
dns                                        4.6.0-rc.4   True        False         False      163m
etcd                                       4.6.0-rc.4   True        False         False      162m
image-registry                             4.6.0-rc.4   True        False         False      152m
ingress                                    4.6.0-rc.4   True        False         False      152m
insights                                   4.6.0-rc.4   True        False         False      165m
kube-apiserver                             4.6.0-rc.4   True        False         False      161m
kube-controller-manager                    4.6.0-rc.4   True        False         True       161m
kube-scheduler                             4.6.0-rc.4   True        False         True       161m
kube-storage-version-migrator              4.6.0-rc.4   True        False         False      152m
machine-api                                4.6.0-rc.4   True        False         False      156m
machine-approver                           4.6.0-rc.4   True        False         False      164m
machine-config                             4.6.0-rc.4   True        False         False      78m
marketplace                                4.6.0-rc.4   True        False         False      162m
monitoring                                 4.6.0-rc.4   False       True          True       22m
network                                    4.6.0-rc.4   True        False         False      165m
node-tuning                                4.6.0-rc.4   True        False         False      165m
openshift-apiserver                        4.6.0-rc.4   True        False         False      44m
openshift-controller-manager               4.6.0-rc.4   True        False         False      152m
openshift-samples                          4.6.0-rc.4   True        False         False      154m
operator-lifecycle-manager                 4.6.0-rc.4   True        False         False      163m
operator-lifecycle-manager-catalog         4.6.0-rc.4   True        False         False      163m
operator-lifecycle-manager-packageserver   4.6.0-rc.4   True        False         False      15m
service-ca                                 4.6.0-rc.4   True        False         False      164m
storage                                    4.6.0-rc.4   True        False         False      165m

Comment 42 To Hung Sze 2020-10-21 18:26:48 UTC
I tried three times today using 4.6.0-rc.4, and all three times the cluster (with RT kernels) came up.

Comment 43 To Hung Sze 2020-10-22 20:03:10 UTC
Tried three times again today with rc.4; one out of the three failed.

Comment 44 To Hung Sze 2020-12-17 20:08:40 UTC
Created attachment 1740058 [details]
gather bootstrap log for attempt with 4.7

Comment 45 To Hung Sze 2020-12-18 15:08:32 UTC
It still fails with openshift-install-linux-4.7.0-0.nightly-2020-12-14-080124 on GCP.

Comment 48 To Hung Sze 2021-01-25 19:10:25 UTC
kcarcia
I have a cluster with realtime kernel (both master and worker) available for debugging.
Sent an email.
Please feel free to ping me if someone from your team wants to access it for debugging

Comment 62 To Hung Sze 2021-03-16 13:15:25 UTC
This ticket can be closed, as the RT kernel is now working in OpenShift 4.7.1 / 4.7.2 with the updated kernel (4.18.0-240.15.1.rt7.69.el8_3.x86_64).
Thanks.

