Bug 2057502 - e2e-telco5g is permafailing
Summary: e2e-telco5g is permafailing
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: CNF Platform Validation
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Yuval Kashtan
QA Contact: Nikita
URL:
Whiteboard:
Depends On: 2074483
Blocks:
 
Reported: 2022-02-23 14:19 UTC by Stephen Benjamin
Modified: 2023-09-18 04:32 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g=all job=periodic-ci-openshift-release-master-nightly-4.10-e2e-telco5g=all
Last Closed: 2023-03-09 01:13:26 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift release pull 27243 | 0 | None | Merged | telco5g: default to master branch | 2022-04-11 18:57:43 UTC
Github | openshift release pull 27334 | 0 | None | Merged | fix e2e-telco5g | 2022-04-04 14:24:42 UTC
Github | openshift release pull 27597 | 0 | None | Merged | telco5g: print the right branch name | 2022-04-06 13:51:53 UTC

Description Stephen Benjamin 2022-02-23 14:19:18 UTC
periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g

is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g

There's an error in the logs:

 ************ telco5g cnf-tests commands ************
/bin/bash: line 14: PULL_BASE_REF: unbound variable

Comment 1 Federico Paolinelli 2022-02-23 14:22:28 UTC
This component is for reporting bugs against the cnf-tests Docker image, which does not handle that env variable.
I'm not sure which component this should be assigned to. @stbenjam, would you mind moving or closing it?

Comment 2 Stephen Benjamin 2022-02-23 15:09:29 UTC
@obraunsh You added these e2e-telco5g jobs, where should this bug go?

Comment 3 Yuval Kashtan 2022-02-28 07:39:27 UTC
I believe this is due to the "other" side of the bastion.
Whenever something on the build cluster changes (pod restarted, moved, etc.),
the other side needs to reestablish communication,
and currently it doesn't support that.

Comment 5 Michael Gourin 2022-03-06 16:06:23 UTC
The SSH tunnels are created by the script /home/cloud-user/bastion_connect.sh. This script, along with other related files, is maintained in the cnf-internal-deploy repo.
A service named ocpci.service, run by the root user, executes the bastion_connect.sh script. This service is configured in root's crontab to restart at 7 AM EST each day.
The service defines two append-only log files (one for STDOUT, one for STDERR), and it failed to start because these files did not exist on the file system. I've created them, and the service is now functioning properly.
As for monitoring and restarting the SSH tunnels, this appears to be handled by the ssh-tunnel function in the bastion_connect.sh script: when the command exits with a nonzero code, it is attempted again, up to 30 times (with 30-second sleeps in between), before the script finally exits.
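
The retry behaviour described for the ssh-tunnel function can be sketched roughly as below. This is not the actual content of bastion_connect.sh, which is not shown in this bug; the function name `retry` and its arguments are hypothetical, and the command being retried is left to the caller.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from bastion_connect.sh):
# run a command until it succeeds, at most max_attempts times,
# sleeping delay seconds between failed attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    # Only sleep if another attempt will follow.
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage matching the numbers in the comment above, where open_tunnel
# stands in for whatever command establishes the SSH tunnel:
#   retry 30 30 open_tunnel || exit 1
```

Note that a loop of this shape gives up for good after the last attempt, so an outage longer than roughly 15 minutes would still leave the tunnel down until the next service restart.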

Comment 6 Yuval Kashtan 2022-03-16 11:39:13 UTC
 INFO[2022-03-15T19:26:22Z] Logs for container test in pod e2e-telco5g-telco5g-cnf-tests: 
INFO[2022-03-15T19:26:22Z] ************ telco5g cnf-tests commands ************
/bin/bash: line 14: PULL_BASE_REF: unbound variable
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-03-15T19:26:14Z"}
error: failed to execute wrapped command: exit status 1

Comment 7 Michael Gourin 2022-03-20 18:13:16 UTC
(In reply to Yuval Kashtan from comment #6)
>  INFO[2022-03-15T19:26:22Z] Logs for container test in pod
> e2e-telco5g-telco5g-cnf-tests: 
> INFO[2022-03-15T19:26:22Z] ************ telco5g cnf-tests commands
> ************
> /bin/bash: line 14: PULL_BASE_REF: unbound variable
> {"component":"entrypoint","error":"wrapped process failed: exit status
> 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-
> infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing
> test process","severity":"error","time":"2022-03-15T19:26:14Z"}
> error: failed to execute wrapped command: exit status 1

What bash script does this error come from?

Comment 8 Stephen Benjamin 2022-03-21 18:42:07 UTC
It is coming from https://github.com/openshift/release/blob/master/ci-operator/step-registry/telco5g/cnf-tests/telco5g-cnf-tests-commands.sh

There is no PULL_BASE_REF on periodics.  I imagine this could just be fixed with something like branch="${PULL_BASE_REF:-master}"
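
A minimal sketch of that fallback, assuming the script runs with `set -o nounset` (which is what produces the "unbound variable" error) and that `master` is an acceptable default. The function name `resolve_branch` is hypothetical and only used here for illustration.

```shell
#!/usr/bin/env bash
set -o nounset

# Hypothetical helper: use PULL_BASE_REF when it is set and non-empty,
# otherwise fall back to master. The ${VAR:-default} expansion does not
# trip nounset when VAR is unset.
resolve_branch() {
  echo "${PULL_BASE_REF:-master}"
}

branch="$(resolve_branch)"
echo "using branch: ${branch}"
```

With this form, a periodic job (where PULL_BASE_REF is absent) gets `master`, while a presubmit job keeps the branch that the CI system supplies.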

Comment 9 Yuval Kashtan 2022-03-21 22:42:01 UTC
@stbenjam how can a periodic know what branch to clone, i.e. whether it's master, release-4.10, release-4.9, etc.?

Comment 10 Stephen Benjamin 2022-03-21 23:38:04 UTC
I don't know if you can, but most jobs aren't cloning git repos; they rely on CI to build a container image instead. If you do it that way, you'd have a tag for each release. Some details are at https://docs.ci.openshift.org/docs/how-tos/use-registries-in-build-farm/, but the test platform team (#forum-testplatform) would be in the best position to advise here.

Comment 11 Yuval Kashtan 2022-03-24 11:00:56 UTC
Even with a container image, I still need a way to know what version the periodic is currently supposed to run (and that could be used for either a container image or a git clone).
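
One possible way to obtain the version, sketched under assumptions that this bug does not confirm: that the job name is available to the script in a `JOB_NAME` environment variable, and that periodic job names follow the `...-nightly-<version>-...` pattern shown in the Environment field. The function name `branch_from_job_name` is hypothetical.

```shell
#!/usr/bin/env bash
set -o nounset

# Hypothetical helper: map a periodic job name to a git branch.
# Falls back to master when no "nightly-X.Y-" segment is found.
branch_from_job_name() {
  local job_name="$1"
  if [[ "$job_name" =~ nightly-([0-9]+\.[0-9]+)- ]]; then
    echo "release-${BASH_REMATCH[1]}"
  else
    echo "master"
  fi
}

# JOB_NAME is assumed to be provided by the CI system; default to empty.
branch="$(branch_from_job_name "${JOB_NAME:-}")"
echo "derived branch: ${branch}"
```

A mapping like this is only a heuristic: a release branch for the version still under development may not exist yet, in which case the derived name would have to be checked before cloning.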

Comment 12 Stephen Benjamin 2022-03-24 12:37:03 UTC
> Even with a container image, I still need a way to know what version the periodic is currently supposed to run (and that could be used for either a container image or a git clone).

No, you don't -- because the images are built from the release branches and named appropriately. If you look at the ci-operator configuration for jobs, you'll see release-specific images are used: https://github.com/openshift/release/blob/master/ci-operator/config/openshift/release/openshift-release-master__nightly-4.10.yaml

Comment 14 Stephen Benjamin 2022-04-28 14:07:33 UTC
Hi, it looks like the original problem with the unbound variable is fixed. I see https://bugzilla.redhat.com/show_bug.cgi?id=2074483 is another issue. Is that what's causing the ansible install timeouts I see in the job now? Is there anything else TRT can do to help you all fix this?

Comment 15 Stephen Benjamin 2022-05-10 12:47:36 UTC
Hi, any update on this? Thanks

Comment 16 Shiftzilla 2023-03-09 01:13:26 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9130

Comment 17 Red Hat Bugzilla 2023-09-18 04:32:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

