Bug 1779811 - e2e-aws-scaleup-rhel7 constantly failing
Summary: e2e-aws-scaleup-rhel7 constantly failing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Russell Teague
QA Contact: Russell Teague
URL:
Whiteboard:
Duplicates: 1878832
Depends On: 1781575 1824285 1826094 1827783 1832332 1836284
Blocks: 1823916 1824150
 
Reported: 2019-12-04 19:09 UTC by Kirsten Garrison
Modified: 2020-09-24 17:53 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1823916
Environment:
Last Closed: 2020-09-24 17:53:52 UTC
Target Upstream Version:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-ansible pull 12099 0 None closed Bug 1779811: Gather debug data on task failure 2020-10-19 13:24:28 UTC
Github openshift openshift-ansible pull 12225 0 None closed Bug 1779811: Install pre-release kernel (for scaleup-rhel7 CI) 2020-10-19 13:24:28 UTC

Description Kirsten Garrison 2019-12-04 19:09:37 UTC
This bug was initially created as a copy of Bug #1766792

I am copying this bug because: 
The job is still not stable on master and has failed roughly 70 times in a row. The last time it passed was 11/21/19. I'm really unsure of the value of this job.


Description of problem:
This job seems to be basically broken. Looking at the history, it has passed only 17 of the last 168 runs. Can we eliminate these tests until the job is actually able to run correctly? It doesn't seem like an efficient use of our resources right now.

For ref: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=


How reproducible:
Look at most of its ci runs

Actual results:
It always fails

Expected results:
It should generally be passing unless there is a reason for it to fail.

Comment 6 Kirsten Garrison 2020-04-14 19:00:46 UTC
There's something going on in this job; it's been failing consistently across releases. If it doesn't work reliably, why is there a CI job?

See prior report as well: https://bugzilla.redhat.com/show_bug.cgi?id=1766792

Comment 9 Russell Teague 2020-04-15 12:24:51 UTC
Top 15 failing tests for rhel7:

Failed	% of 210	Test (started between 2020-04-06T15:01:06 and 2020-04-15T08:28:56 UTC)
114	54	[sig-storage] PersistentVolumes-local  [Volume type: tmpfs] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
110	52	[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]
109	51	[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
107	50	[sig-storage] PersistentVolumes-local  [Volume type: dir-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
106	50	[sig-storage] PersistentVolumes-local  [Volume type: blockfswithformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
103	49	[sig-storage] PersistentVolumes-local  [Volume type: blockfswithformat] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102	48	[sig-storage] PersistentVolumes-local  [Volume type: dir-link-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102	48	[sig-storage] PersistentVolumes-local  [Volume type: dir-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
101	48	[sig-storage] PersistentVolumes-local  [Volume type: dir-link-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98	46	[sig-storage] PersistentVolumes-local  [Volume type: tmpfs] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98	46	[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]
95	45	[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directory [Suite:openshift/conformance/parallel] [Suite:k8s]
94	44	[sig-storage] PersistentVolumes-local  [Volume type: tmpfs] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
94	44	[sig-storage] PersistentVolumes-local  [Volume type: dir-link-bindmounted] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
92	43	[sig-storage] PersistentVolumes-local  [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 10 Russell Teague 2020-04-15 12:35:25 UTC
Affecting all release-branch jobs:
release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.3
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.5
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6

Affecting all branches (master, release-4.3, release-4.4) for the following repos:
openshift-installer
openshift-machine-api-operator
openshift-machine-config-operator
openshift-openshift-ansible

Comment 11 Russell Teague 2020-04-15 13:53:33 UTC
When running e2e tests, a node was observed reporting Not Ready:

            {
                "lastHeartbeatTime": "2020-04-15T13:51:42Z",
                "lastTransitionTime": "2020-04-15T13:45:11Z",
                "message": "[container runtime is down, PLEG is not healthy: pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
                "reason": "KubeletNotReady",
                "status": "False",
                "type": "Ready"
            }
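The condition above can be checked programmatically. Below is a minimal Python sketch, assuming the condition dict has been pulled from a node's `.status.conditions` (e.g. via `oc get node <name> -o json`); the field names are standard Kubernetes NodeCondition fields, and the JSON here is the failing node's condition copied from above.

```python
import json

# Node "Ready" condition as reported by the kubelet (copied from the
# status above); in a real check this would come from
# `oc get node <name> -o json` under .status.conditions.
condition_json = '''
{
    "lastHeartbeatTime": "2020-04-15T13:51:42Z",
    "lastTransitionTime": "2020-04-15T13:45:11Z",
    "message": "[container runtime is down, PLEG is not healthy: pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
    "reason": "KubeletNotReady",
    "status": "False",
    "type": "Ready"
}
'''

def node_is_ready(cond: dict) -> bool:
    # A node is Ready only when the Ready condition's status is "True".
    return cond.get("type") == "Ready" and cond.get("status") == "True"

cond = json.loads(condition_json)
print(node_is_ready(cond))                       # False for the failing node
print("PLEG is not healthy" in cond["message"])  # True: the PLEG stall is the reported cause
```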

Comment 12 Russell Teague 2020-04-15 14:52:19 UTC
Possible fix: https://github.com/cri-o/cri-o/pull/3583
Backporting a fix to 1.17


Panic in crio:
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic: attempted to update last-writer in lockfile without the write lock
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: goroutine 135 [running]:
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic(0x560e8b171000, 0x560e8b4bc160)

Comment 13 Kirsten Garrison 2020-05-05 00:27:50 UTC
@Russell Did https://github.com/cri-o/cri-o/pull/3583 solve the issue? AFAIK, the 1.17 PRs were pulled into 4.4.

Comment 14 Russell Teague 2020-05-05 12:19:17 UTC
@Kirsten The crio issue was resolved by that fix; there were updated builds for each branch:
4.3 - cri-o-1.16.6-5.dev.rhaos4.3.git5fb6738.el7
4.4 - cri-o-1.17.4-2.dev.rhaos4.4.gitfe61deb.el7
4.5 - cri-o-1.18.0-3.dev.rhaos4.5.git2cef6c3.el7

The runc issue was also resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1823374 with new builds:
4.2 - runc-1.0.0-67.rc10.rhaos4.2.el7_8
4.3 - runc-1.0.0-67.rc10.rhaos4.3.el7
4.4 - runc-1.0.0-68.rc10.rhaos4.4.el7_8
4.5 - runc-1.0.0-68.rc10.rhaos4.5.el7

Still waiting on a resolution for the other issues identified in the dependent bugs linked above: Prometheus alert tests, Service endpoint tests, and worker config rollout.

Comment 15 Russell Teague 2020-05-22 17:58:52 UTC
The issues associated with this bug are being tracked in the dependent bugs and https://issues.redhat.com/browse/CORS-1429.

Comment 16 Russell Teague 2020-05-26 17:29:16 UTC
This is not blocking the 4.5 release.

Comment 17 Russell Teague 2020-06-19 15:55:38 UTC
Still waiting on blocking issues to be resolved.

Comment 18 Clayton Coleman 2020-06-26 19:08:46 UTC
CI has been broken for two months, raising this up one more level of severity.

Comment 20 Russell Teague 2020-07-10 18:41:07 UTC
Still waiting for kernel update in this errata, https://errata.devel.redhat.com/advisory/56015

Comment 21 Russell Teague 2020-07-29 17:33:35 UTC
The required kernel update was not added to the errata referenced in comment 20.  The issue related to "Services should be rejected when no endpoints exist" will not be addressed until RHEL 7.9 is shipped and available in AWS.

Comment 22 Russell Teague 2020-08-11 19:32:42 UTC
The errata [1] for dependent bug [2] was pushed out and therefore the needed kernel update will not be available for at least three more weeks.

[1] https://errata.devel.redhat.com/advisory/50704
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1832332

Comment 23 Russell Teague 2020-08-20 14:15:07 UTC
A change in the ssh-bastion deployment script [1] has caused the script to fail in CI.  Working on a PR [2] to use a newer version of oc for the ssh-bastion step.

[1] https://github.com/eparis/ssh-bastion/pull/26
[2] https://github.com/openshift/release/pull/11083

Comment 24 Russell Teague 2020-08-24 19:12:05 UTC
The RHEL 7.9 errata [1] was pushed out.  In order to fix the failing test until 7.9 ships, a pre-release kernel will be installed for 4.6+ jobs [2].

[1] https://errata.devel.redhat.com/advisory/50704
[2] https://github.com/openshift/release/pull/11188

Comment 25 Russell Teague 2020-09-01 20:10:30 UTC
Opened a PR to install the pre-release kernel for scaleup-rhel7 jobs as well.

Comment 28 Abhinav Dahiya 2020-09-14 16:39:39 UTC
*** Bug 1878832 has been marked as a duplicate of this bug. ***

Comment 29 Russell Teague 2020-09-23 19:20:57 UTC
With the manual upgrade of the kernel for CI jobs, the 'workers-rhel7' CI jobs are succeeding regularly, with occasional failures unrelated to RHEL usage. Still waiting for RHEL 7.9, which contains the updated kernel, to ship.

RHEL 7.9 errata [1] pushed to 2020-Sep-29.

[1] https://errata.devel.redhat.com/advisory/50704

Comment 30 Russell Teague 2020-09-24 17:53:52 UTC
e2e-aws-scaleup-rhel7 has been replaced by e2e-aws-workers-rhel7 which uses the new kernel.

e2e-aws-workers-rhel7 has been succeeding consistently.

