Bug 1779811

Summary: e2e-aws-scaleup-rhel7 constantly failing
Product: OpenShift Container Platform
Reporter: Kirsten Garrison <kgarriso>
Component: Installer
Installer sub component: openshift-ansible
Assignee: Russell Teague <rteague>
QA Contact: Russell Teague <rteague>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
CC: adahiya, ccoleman, deads, rteague
Version: 4.4
Keywords: Reopened
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clones: 1823916
Last Closed: 2020-09-24 17:53:52 UTC
Bug Depends On: 1781575, 1824285, 1826094, 1827783, 1832332, 1836284
Bug Blocks: 1823916, 1824150

Description Kirsten Garrison 2019-12-04 19:09:37 UTC
This bug was initially created as a copy of Bug #1766792

I am copying this bug because: 
The job is still not stable in master and has failed ~70 times in a row. The last time it passed was on 11/21/19. I'm really unsure of the value of this job.


Description of problem:
This job seems to be basically broken. Looking at its history, it has only passed 17 of the last 168 runs. Can we disable these tests until the job is actually able to run correctly? It doesn't seem like an efficient use of our resources right now.

For ref: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=


How reproducible:
Look at most of its CI runs

Actual results:
It always fails

Expected results:
It should generally be passing unless there is a reason for it to fail.

Comment 6 Kirsten Garrison 2020-04-14 19:00:46 UTC
There's something going on with this job; it's been failing consistently across releases. If it doesn't work reliably, why is there a CI job?

See prior report as well: https://bugzilla.redhat.com/show_bug.cgi?id=1766792

Comment 9 Russell Teague 2020-04-15 12:24:51 UTC
Top 15 failing tests for rhel7:

Failed	% of 210 runs	Test (started between 2020-04-06T15:01:06 and 2020-04-15T08:28:56 UTC)
114	54	[sig-storage] PersistentVolumes-local  [Volume type: tmpfs] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
110	52	[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]
109	51	[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
107	50	[sig-storage] PersistentVolumes-local  [Volume type: dir-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
106	50	[sig-storage] PersistentVolumes-local  [Volume type: blockfswithformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
103	49	[sig-storage] PersistentVolumes-local  [Volume type: blockfswithformat] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102	48	[sig-storage] PersistentVolumes-local  [Volume type: dir-link-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102	48	[sig-storage] PersistentVolumes-local  [Volume type: dir-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
101	48	[sig-storage] PersistentVolumes-local  [Volume type: dir-link-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98	46	[sig-storage] PersistentVolumes-local  [Volume type: tmpfs] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98	46	[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]
95	45	[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directory [Suite:openshift/conformance/parallel] [Suite:k8s]
94	44	[sig-storage] PersistentVolumes-local  [Volume type: tmpfs] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
94	44	[sig-storage] PersistentVolumes-local  [Volume type: dir-link-bindmounted] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
92	43	[sig-storage] PersistentVolumes-local  [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 10 Russell Teague 2020-04-15 12:35:25 UTC
Affecting all branch release jobs:
release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.3
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.5
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6

Affecting all branches (master, release-4.3, release-4.4) for the following repos:
openshift-installer
openshift-machine-api-operator
openshift-machine-config-operator
openshift-openshift-ansible

Comment 11 Russell Teague 2020-04-15 13:53:33 UTC
When running e2e tests, a node was observed reporting Not Ready:

            {
                "lastHeartbeatTime": "2020-04-15T13:51:42Z",
                "lastTransitionTime": "2020-04-15T13:45:11Z",
                "message": "[container runtime is down, PLEG is not healthy: pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
                "reason": "KubeletNotReady",
                "status": "False",
                "type": "Ready"
            }
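
For reference, a minimal sketch of how the Ready condition above can be pulled for every node programmatically. Assumptions: a recent client-go and a kubeconfig at the default path; this is not part of the CI job itself, just an illustration.

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig location (assumption).
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady {
				// These are the same fields shown in the condition above.
				fmt.Printf("%s Ready=%s reason=%s message=%q\n",
					n.Name, c.Status, c.Reason, c.Message)
			}
		}
	}
}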

Comment 12 Russell Teague 2020-04-15 14:52:19 UTC
Possible fix: https://github.com/cri-o/cri-o/pull/3583
The fix is being backported to 1.17.


Panic in crio:
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic: attempted to update last-writer in lockfile without the write lock
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: goroutine 135 [running]:
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic(0x560e8b171000, 0x560e8b4bc160)
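
For context, the panic message indicates that the lockfile's last-writer bookkeeping was updated without the write lock being held. An illustrative Go sketch of that class of bug and the corrected pattern (this is not the cri-o/containers-storage code, just a hypothetical example):

package main

import (
	"fmt"
	"sync"
)

type lockFile struct {
	mu         sync.RWMutex
	lastWriter string
}

// Buggy pattern: the caller holds only the read lock but then mutates
// lastWriter; this is the misuse the "without the write lock" check panics on.
func (l *lockFile) buggyTouch(id string) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	l.lastWriter = id // write performed under a read lock
}

// Correct pattern: take the write lock before updating lastWriter.
func (l *lockFile) touch(id string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lastWriter = id
}

func main() {
	l := &lockFile{}
	l.touch("writer-1")
	fmt.Println(l.lastWriter)
}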

Comment 13 Kirsten Garrison 2020-05-05 00:27:50 UTC
@Russell Did https://github.com/cri-o/cri-o/pull/3583 solve the issue? AFAIK, the 1.17 PRs were pulled into 4.4.

Comment 14 Russell Teague 2020-05-05 12:19:17 UTC
@Kirsten The crio issue was resolved by that fix; there were updated builds for each branch:
4.3 - cri-o-1.16.6-5.dev.rhaos4.3.git5fb6738.el7
4.4 - cri-o-1.17.4-2.dev.rhaos4.4.gitfe61deb.el7
4.5 - cri-o-1.18.0-3.dev.rhaos4.5.git2cef6c3.el7

The runc issue was also resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1823374 with new builds:
4.2 - runc-1.0.0-67.rc10.rhaos4.2.el7_8
4.3 - runc-1.0.0-67.rc10.rhaos4.3.el7
4.4 - runc-1.0.0-68.rc10.rhaos4.4.el7_8
4.5 - runc-1.0.0-68.rc10.rhaos4.5.el7

Still waiting on a resolution for the other issues identified in the dependent bugs linked above: Prometheus alert tests, Service endpoint tests, and worker config rollout.

Comment 15 Russell Teague 2020-05-22 17:58:52 UTC
The issues associated with this bug are being tracked in the dependent bugs and https://issues.redhat.com/browse/CORS-1429.

Comment 16 Russell Teague 2020-05-26 17:29:16 UTC
This is not blocking the 4.5 release.

Comment 17 Russell Teague 2020-06-19 15:55:38 UTC
Still waiting on blocking issues to be resolved.

Comment 18 Clayton Coleman 2020-06-26 19:08:46 UTC
CI has been broken for two months, raising this up one more level of severity.

Comment 20 Russell Teague 2020-07-10 18:41:07 UTC
Still waiting for the kernel update in this errata: https://errata.devel.redhat.com/advisory/56015

Comment 21 Russell Teague 2020-07-29 17:33:35 UTC
The required kernel update was not added to the errata referenced in comment 20.  The issue related to "Services should be rejected when no endpoints exist" will not be addressed until RHEL 7.9 is shipped and available in AWS.

Comment 22 Russell Teague 2020-08-11 19:32:42 UTC
The errata [1] for dependent bug [2] was pushed out and therefore the needed kernel update will not be available for at least three more weeks.

[1] https://errata.devel.redhat.com/advisory/50704
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1832332

Comment 23 Russell Teague 2020-08-20 14:15:07 UTC
A change in the ssh-bastion deployment script [1] has caused the script to fail in CI.  Working on a PR [2] to use a newer version of oc for the ssh-bastion step.

[1] https://github.com/eparis/ssh-bastion/pull/26
[2] https://github.com/openshift/release/pull/11083

Comment 24 Russell Teague 2020-08-24 19:12:05 UTC
The RHEL 7.9 errata [1] was pushed out.  In order to fix the failing test until 7.9 ships, a pre-release kernel will be installed for 4.6+ jobs [2].

[1] https://errata.devel.redhat.com/advisory/50704
[2] https://github.com/openshift/release/pull/11188

Comment 25 Russell Teague 2020-09-01 20:10:30 UTC
Opened a PR to install the pre-release kernel for scaleup-rhel7 jobs as well.

Comment 28 Abhinav Dahiya 2020-09-14 16:39:39 UTC
*** Bug 1878832 has been marked as a duplicate of this bug. ***

Comment 29 Russell Teague 2020-09-23 19:20:57 UTC
With the manual upgrade of the kernel for CI jobs, the 'workers-rhel7' CI jobs are succeeding regularly, with occasional failures unrelated to RHEL usage. Still waiting for RHEL 7.9, which contains the updated kernel, to ship.

The RHEL 7.9 errata [1] has been pushed out to 2020-Sep-29.

[1] https://errata.devel.redhat.com/advisory/50704

Comment 30 Russell Teague 2020-09-24 17:53:52 UTC
e2e-aws-scaleup-rhel7 has been replaced by e2e-aws-workers-rhel7 which uses the new kernel.

e2e-aws-workers-rhel7 has been succeeding consistently.