This bug was initially created as a copy of Bug #1766792. I am copying this bug because the job is still not stable in master and has failed ~70 times in a row. The last time it passed was on 11/21/19. Really unsure of the value of this job.

Description of problem:
This job appears to be fundamentally broken. Looking at the history, it has passed only 17 of the last 168 runs. Can we eliminate these tests until the job is actually able to run correctly? It does not seem like an efficient use of our resources right now. For reference: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=

How reproducible:
Look at most of its CI runs.

Actual results:
It always fails.

Expected results:
It should generally pass unless there is a specific reason for it to fail.
This job is still failing: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=
There's something going on in this job; it has been failing consistently across releases. If it doesn't work reliably, why is there a CI job? See the prior report as well: https://bugzilla.redhat.com/show_bug.cgi?id=1766792
This is also failing on 4.4: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.4-e2e-aws-scaleup-rhel7?buildId=
Top 15 failing tests for rhel7 (tests started between 2020-04-06T15:01:06 and 2020-04-15T08:28:56 UTC), listed as failed count and % of 210 runs:

114 (54%) [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
110 (52%) [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]
109 (51%) [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
107 (50%) [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
106 (50%) [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
103 (49%) [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102 (48%) [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102 (48%) [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
101 (48%) [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98 (46%) [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98 (46%) [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]
95 (45%) [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directory [Suite:openshift/conformance/parallel] [Suite:k8s]
94 (44%) [sig-storage] PersistentVolumes-local [Volume type: tmpfs] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
94 (44%) [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
92 (43%) [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
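For readers checking the numbers: the percentage column above is consistent with the failure count floor-divided over the 210 runs in the window. A minimal sketch (the function name is illustrative, not part of any CI tooling):

```python
def failure_pct(failed: int, total_runs: int = 210) -> int:
    """Failure rate as a whole percentage, rounded down (matches the table)."""
    return failed * 100 // total_runs

print(failure_pct(114))  # 54
print(failure_pct(92))   # 43
```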
Affecting all branch release jobs:
release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.3
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.5
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6

Affecting all branches (master, release-4.3, release-4.4) for the following repos:
openshift-installer
openshift-machine-api-operator
openshift-machine-config-operator
openshift-openshift-ansible
When running e2e tests, observing the node report NotReady:

{
  "lastHeartbeatTime": "2020-04-15T13:51:42Z",
  "lastTransitionTime": "2020-04-15T13:45:11Z",
  "message": "[container runtime is down, PLEG is not healthy: pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
  "reason": "KubeletNotReady",
  "status": "False",
  "type": "Ready"
}
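This condition can be detected programmatically when triaging runs. A minimal sketch, using the condition JSON quoted in this comment; the helper names are illustrative and not part of any OpenShift tooling, but the field names follow the Kubernetes NodeCondition schema:

```python
import json

# The Ready condition observed in the failing CI runs (copied from this bug).
condition_json = """
{
  "lastHeartbeatTime": "2020-04-15T13:51:42Z",
  "lastTransitionTime": "2020-04-15T13:45:11Z",
  "message": "[container runtime is down, PLEG is not healthy: pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
  "reason": "KubeletNotReady",
  "status": "False",
  "type": "Ready"
}
"""

def node_not_ready(cond: dict) -> bool:
    """True when the node's Ready condition reports anything other than True."""
    return cond.get("type") == "Ready" and cond.get("status") != "True"

def pleg_unhealthy(cond: dict) -> bool:
    """True when the kubelet message mentions an unhealthy PLEG."""
    return "PLEG is not healthy" in cond.get("message", "")

cond = json.loads(condition_json)
print(node_not_ready(cond))   # True
print(pleg_unhealthy(cond))   # True
```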
Possible fix: https://github.com/cri-o/cri-o/pull/3583 (backporting a fix to 1.17).

Panic in cri-o:

Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic: attempted to update last-writer in lockfile without the write lock
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: goroutine 135 [running]:
Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic(0x560e8b171000, 0x560e8b4bc160)
@Russell Did https://github.com/cri-o/cri-o/pull/3583 solve the issue? AFAIK, the 1.17 PRs were pulled into 4.4.
@Kirsten The cri-o issue was resolved by that fix; there were updated builds for each branch:
4.3 - cri-o-1.16.6-5.dev.rhaos4.3.git5fb6738.el7
4.4 - cri-o-1.17.4-2.dev.rhaos4.4.gitfe61deb.el7
4.5 - cri-o-1.18.0-3.dev.rhaos4.5.git2cef6c3.el7

The runc issue was also resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1823374 with new builds:
4.2 - runc-1.0.0-67.rc10.rhaos4.2.el7_8
4.3 - runc-1.0.0-67.rc10.rhaos4.3.el7
4.4 - runc-1.0.0-68.rc10.rhaos4.4.el7_8
4.5 - runc-1.0.0-68.rc10.rhaos4.5.el7

Still waiting on a resolution for the other issues identified in the dependent bugs linked above: Prometheus alert tests, service endpoint tests, and worker config rollout.
The issues associated with this bug are being tracked in the dependent bugs and https://issues.redhat.com/browse/CORS-1429.
This is not blocking the 4.5 release.
Still waiting on blocking issues to be resolved.
CI has been broken for two months; raising the severity one more level.
Still waiting for the kernel update in this errata: https://errata.devel.redhat.com/advisory/56015
The required kernel update was not added to the errata referenced in comment 20. The issue related to "Services should be rejected when no endpoints exist" will not be addressed until RHEL 7.9 is shipped and available in AWS.
The errata [1] for dependent bug [2] was pushed out and therefore the needed kernel update will not be available for at least three more weeks. [1] https://errata.devel.redhat.com/advisory/50704 [2] https://bugzilla.redhat.com/show_bug.cgi?id=1832332
A change in the ssh-bastion deployment script [1] has caused the script to fail in CI. Working on a PR [2] to use a newer version of oc for the ssh-bastion step. [1] https://github.com/eparis/ssh-bastion/pull/26 [2] https://github.com/openshift/release/pull/11083
The RHEL 7.9 errata [1] was pushed out. In order to fix the failing test until 7.9 ships, a pre-release kernel will be installed for 4.6+ jobs [2]. [1] https://errata.devel.redhat.com/advisory/50704 [2] https://github.com/openshift/release/pull/11188
Opened a PR to install the pre-release kernel for scaleup-rhel7 jobs as well.
*** Bug 1878832 has been marked as a duplicate of this bug. ***
With the manual upgrade of the kernel for CI jobs, the 'workers-rhel7' CI jobs are succeeding regularly, with occasional failures unrelated to RHEL usage. Still waiting for RHEL 7.9 to ship, which contains the updated kernel. The RHEL 7.9 errata [1] was pushed to 2020-Sep-29. [1] https://errata.devel.redhat.com/advisory/50704
e2e-aws-scaleup-rhel7 has been replaced by e2e-aws-workers-rhel7 which uses the new kernel. e2e-aws-workers-rhel7 has been succeeding consistently.