Bug 1779811
Summary: | e2e-aws-scaleup-rhel7 constantly failing
---|---
Product: | OpenShift Container Platform
Component: | Installer
Installer sub component: | openshift-ansible
Status: | CLOSED CURRENTRELEASE
Severity: | urgent
Priority: | urgent
Version: | 4.4
Target Release: | 4.6.0
Hardware: | Unspecified
OS: | Unspecified
Reporter: | Kirsten Garrison <kgarriso>
Assignee: | Russell Teague <rteague>
QA Contact: | Russell Teague <rteague>
CC: | adahiya, ccoleman, deads, rteague
Keywords: | Reopened
Doc Type: | No Doc Update
: | 1823916 (view as bug list)
Last Closed: | 2020-09-24 17:53:52 UTC
Bug Depends On: | 1781575, 1824285, 1826094, 1827783, 1832332, 1836284
Bug Blocks: | 1823916, 1824150
Description
Kirsten Garrison
2019-12-04 19:09:37 UTC
This job is still failing: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=

Something is going on in this job; it has been failing consistently across releases. If it doesn't work reliably, why is there a CI job? See the prior report as well: https://bugzilla.redhat.com/show_bug.cgi?id=1766792

This is also failing on 4.4: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.4-e2e-aws-scaleup-rhel7?buildId=

Top 15 failing tests for rhel7 (started between 2020-04-06T15:01:06 and 2020-04-15T08:28:56 UTC):

Failed | % of 210 | Test
---|---|---
114 | 54 | [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
110 | 52 | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]
109 | 51 | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
107 | 50 | [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
106 | 50 | [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
103 | 49 | [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102 | 48 | [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
102 | 48 | [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
101 | 48 | [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98 | 46 | [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
98 | 46 | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]
95 | 45 | [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directory [Suite:openshift/conformance/parallel] [Suite:k8s]
94 | 44 | [sig-storage] PersistentVolumes-local [Volume type: tmpfs] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
94 | 44 | [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
92 | 43 | [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]

Affecting all branch release jobs:
- release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2
- release-openshift-ocp-e2e-aws-scaleup-rhel7-4.3
- release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4
- release-openshift-ocp-e2e-aws-scaleup-rhel7-4.5
- release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6

Affecting all branches (master, release-4.3, release-4.4) for the following repos:
- openshift-installer
- openshift-machine-api-operator
- openshift-machine-config-operator
- openshift-openshift-ansible

When running e2e tests, observing nodes report NotReady:

    {
      "lastHeartbeatTime": "2020-04-15T13:51:42Z",
      "lastTransitionTime": "2020-04-15T13:45:11Z",
      "message": "[container runtime is down, PLEG is not healthy: pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
      "reason": "KubeletNotReady",
      "status": "False",
      "type": "Ready"
    }

Possible fix: https://github.com/cri-o/cri-o/pull/3583 (backporting a fix to 1.17).

Panic in crio:

    Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic: attempted to update last-writer in lockfile without the write lock
    Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: goroutine 135 [running]:
    Apr 15 13:44:43 ip-10-0-129-128.us-east-2.compute.internal crio[70396]: panic(0x560e8b171000, 0x560e8b4bc160)

@Russell Did https://github.com/cri-o/cri-o/pull/3583 solve the issue? AFAIK, the 1.17 PRs were pulled into 4.4.
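As an aside on the NotReady condition quoted above: the kubelet marks a node NotReady when the pod lifecycle event generator (PLEG) has not been seen active within the health threshold (here, last active 7m5s ago against a 3m0s limit). The following Python sketch (illustrative only, not part of any OpenShift tooling; `node_is_ready` and `not_ready_duration` are hypothetical helpers) shows how the quoted condition can be interpreted:

```python
from datetime import datetime

# Node Ready condition as quoted in the bug report above.
condition = {
    "lastHeartbeatTime": "2020-04-15T13:51:42Z",
    "lastTransitionTime": "2020-04-15T13:45:11Z",
    "message": "[container runtime is down, PLEG is not healthy: "
               "pleg was last seen active 7m5.997877271s ago; threshold is 3m0s]",
    "reason": "KubeletNotReady",
    "status": "False",
    "type": "Ready",
}

def node_is_ready(cond: dict) -> bool:
    """A node is Ready only when the Ready condition's status is 'True'."""
    return cond["type"] == "Ready" and cond["status"] == "True"

def not_ready_duration(cond: dict) -> float:
    """Seconds spent in the current state, measured from lastTransitionTime
    to lastHeartbeatTime."""
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    heartbeat = datetime.strptime(cond["lastHeartbeatTime"].replace("Z", "+0000"), fmt)
    transition = datetime.strptime(cond["lastTransitionTime"].replace("Z", "+0000"), fmt)
    return (heartbeat - transition).total_seconds()

print(node_is_ready(condition))       # False
print(not_ready_duration(condition))  # 391.0 (6m31s NotReady at last heartbeat)
```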
@Kirsten The crio issue was resolved by that fix; there were updated builds for each branch:
- 4.3 - cri-o-1.16.6-5.dev.rhaos4.3.git5fb6738.el7
- 4.4 - cri-o-1.17.4-2.dev.rhaos4.4.gitfe61deb.el7
- 4.5 - cri-o-1.18.0-3.dev.rhaos4.5.git2cef6c3.el7

The runc issue was also resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1823374 with new builds:
- 4.2 - runc-1.0.0-67.rc10.rhaos4.2.el7_8
- 4.3 - runc-1.0.0-67.rc10.rhaos4.3.el7
- 4.4 - runc-1.0.0-68.rc10.rhaos4.4.el7_8
- 4.5 - runc-1.0.0-68.rc10.rhaos4.5.el7

Still waiting on a resolution for the other issues identified in the dependent bugs linked above: Prometheus alert tests, Service endpoint tests, and worker config rollout.

The issues associated with this bug are being tracked in the dependent bugs and https://issues.redhat.com/browse/CORS-1429. This is not blocking the 4.5 release.

Still waiting on blocking issues to be resolved.

CI has been broken for two months; raising this up one more level of severity.

Still waiting for the kernel update in this errata: https://errata.devel.redhat.com/advisory/56015

The required kernel update was not added to the errata referenced in comment 20. The issue related to "Services should be rejected when no endpoints exist" will not be addressed until RHEL 7.9 is shipped and available in AWS. The errata [1] for the dependent bug [2] was pushed out, and therefore the needed kernel update will not be available for at least three more weeks.

[1] https://errata.devel.redhat.com/advisory/50704
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1832332

A change in the ssh-bastion deployment script [1] has caused the script to fail in CI. Working on a PR [2] to use a newer version of oc for the ssh-bastion step.

[1] https://github.com/eparis/ssh-bastion/pull/26
[2] https://github.com/openshift/release/pull/11083

The RHEL 7.9 errata [1] was pushed out. In order to fix the failing test until 7.9 ships, a pre-release kernel will be installed for 4.6+ jobs [2].
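For reference, whether a node already carries one of the fixed cri-o builds listed above can be judged by comparing the installed package NVR against the branch's minimum fixed build. A rough Python sketch (illustrative only; it compares upstream versions rather than full rpm NVR semantics, which is a simplification -- real checks should use rpm's version comparison):

```python
import re

# Minimum cri-o builds carrying the lockfile-panic fix, per branch
# (taken from the comment above).
FIXED_CRIO = {
    "4.3": "cri-o-1.16.6-5.dev.rhaos4.3.git5fb6738.el7",
    "4.4": "cri-o-1.17.4-2.dev.rhaos4.4.gitfe61deb.el7",
    "4.5": "cri-o-1.18.0-3.dev.rhaos4.5.git2cef6c3.el7",
}

def crio_version(nvr: str) -> tuple:
    """Extract the upstream version (e.g. 1.17.4) from a cri-o NVR string."""
    m = re.match(r"cri-o-(\d+)\.(\d+)\.(\d+)", nvr)
    if not m:
        raise ValueError(f"unrecognized NVR: {nvr}")
    return tuple(int(x) for x in m.groups())

def has_fix(branch: str, installed_nvr: str) -> bool:
    """True if the installed cri-o upstream version is at or above the
    fixed build for the branch (simplified: ignores the release field)."""
    return crio_version(installed_nvr) >= crio_version(FIXED_CRIO[branch])

print(has_fix("4.4", "cri-o-1.17.4-2.dev.rhaos4.4.gitfe61deb.el7"))  # True
print(has_fix("4.4", "cri-o-1.17.3-1.dev.rhaos4.4.gitabc1234.el7"))  # False
```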
[1] https://errata.devel.redhat.com/advisory/50704
[2] https://github.com/openshift/release/pull/11188

Opened a PR to install the pre-release kernel for scaleup-rhel7 jobs as well.

*** Bug 1878832 has been marked as a duplicate of this bug. ***

With the manual upgrade of the kernel for CI jobs, the 'workers-rhel7' CI jobs are succeeding regularly, with occasional failures unrelated to RHEL usage. Still waiting for RHEL 7.9, which contains the updated kernel, to ship.

RHEL 7.9 errata [1] pushed to 2020-Sep-29.

[1] https://errata.devel.redhat.com/advisory/50704

e2e-aws-scaleup-rhel7 has been replaced by e2e-aws-workers-rhel7, which uses the new kernel. e2e-aws-workers-rhel7 has been succeeding consistently.
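The workaround above amounts to ensuring worker nodes run at least the RHEL 7.9 kernel series. A small sketch of such a gate (illustrative only; the 3.10.0-1160 minimum is an assumption based on the RHEL 7.9 GA kernel and should be verified against the actual errata):

```python
import re

def kernel_release_tuple(uname_r: str) -> tuple:
    """Parse a RHEL kernel release like '3.10.0-1160.2.1.el7.x86_64'
    into a comparable tuple (3, 10, 0, 1160)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", uname_r)
    if not m:
        raise ValueError(f"unrecognized kernel release: {uname_r}")
    return tuple(int(x) for x in m.groups())

# Assumed minimum: the RHEL 7.9 kernel series (verify against the errata).
RHEL79_MIN = (3, 10, 0, 1160)

def has_rhel79_kernel(uname_r: str) -> bool:
    """True if the node's `uname -r` is at or above the RHEL 7.9 kernel."""
    return kernel_release_tuple(uname_r) >= RHEL79_MIN

print(has_rhel79_kernel("3.10.0-1160.el7.x86_64"))       # True
print(has_rhel79_kernel("3.10.0-1127.19.1.el7.x86_64"))  # False
```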