2057502 – e2e-telco5g is permafailing

Summary: e2e-telco5g is permafailing

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	CNF Platform Validation
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Yuval Kashtan
QA Contact:	Nikita
Docs Contact:
URL:
Whiteboard:
Depends On:	2074483
Blocks:

Reported:	2022-02-23 14:19 UTC by Stephen Benjamin
Modified:	2023-09-18 04:32 UTC
CC List:	6 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:	job=periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g=all job=periodic-ci-openshift-release-master-nightly-4.10-e2e-telco5g=all
Last Closed:	2023-03-09 01:13:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift release pull 27243	None	Merged	telco5g: default to master branch	2022-04-11 18:57:43 UTC
Github	openshift release pull 27334	None	Merged	fix e2e-telco5g	2022-04-04 14:24:42 UTC
Github	openshift release pull 27597	None	Merged	telco5g: print the right branch name	2022-04-06 13:51:53 UTC

Description Stephen Benjamin 2022-02-23 14:19:18 UTC

periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g

is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g

There's an error in the logs:

 ************ telco5g cnf-tests commands ************
/bin/bash: line 14: PULL_BASE_REF: unbound variable

Comment 1 Federico Paolinelli 2022-02-23 14:22:28 UTC

This component is for reporting bugs against the cnf-tests Docker image, which does not handle that environment variable.
I'm not sure where this should be assigned. @stbenjam, would you mind moving or closing it?

Comment 2 Stephen Benjamin 2022-02-23 15:09:29 UTC

@obraunsh You added these e2e-telco5g jobs, where should this bug go?

Comment 3 Yuval Kashtan 2022-02-28 07:39:27 UTC

I believe this is caused by the "other" side of the bastion.
Whenever something on the build cluster changes (a pod restarts, is moved, etc.),
the other side needs to re-establish communication,
and it currently doesn't support that.

Comment 5 Michael Gourin 2022-03-06 16:06:23 UTC

The SSH tunnels are created by the script /home/cloud-user/bastion_connect.sh. This script, along with other related files, is maintained in the cnf-internal-deploy repo.
A service named ocpci.service, run by the root user, runs the bastion_connect.sh script. This service is configured in root's crontab to restart at 7AM EST each day.
The service defines two append-only log files (one for STDOUT, one for STDERR), and it failed to start because these files did not exist on the file system. I've created them, and the service is now functioning properly.
Regarding monitoring and restarting the SSH tunnels: this appears to be handled by the ssh-tunnel function in the bastion_connect.sh script. When the tunnel command exits with a non-zero code, the function retries it up to 30 times (with 30-second sleeps in between) before finally exiting the script.
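
For readers without access to cnf-internal-deploy, the retry behaviour described above can be sketched roughly as follows. This is not the actual ssh-tunnel function from bastion_connect.sh: the function name retry and the variables MAX_ATTEMPTS and SLEEP_SECONDS are hypothetical, and only the 30 attempts and 30-second sleeps come from the description above.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Assumed defaults, taken from the description: 30 attempts, 30 s apart.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-30}"
SLEEP_SECONDS="${SLEEP_SECONDS:-30}"

# Hypothetical helper: run the given command until it exits 0,
# giving up (return 1) once MAX_ATTEMPTS attempts have failed.
retry() {
  local attempt=1
  until "$@"; do
    if (( attempt >= MAX_ATTEMPTS )); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${SLEEP_SECONDS}s" >&2
    attempt=$(( attempt + 1 ))
    sleep "${SLEEP_SECONDS}"
  done
}

# Example call (commented out; the tunnel arguments are placeholders):
#   retry ssh -N <tunnel options> || exit 1
```

Note that a loop like this only covers the tunnel command itself exiting; it does not help if the service wrapping the script never starts, which is what happened with the missing log files.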

Comment 6 Yuval Kashtan 2022-03-16 11:39:13 UTC

 INFO[2022-03-15T19:26:22Z] Logs for container test in pod e2e-telco5g-telco5g-cnf-tests: 
INFO[2022-03-15T19:26:22Z] ************ telco5g cnf-tests commands ************
/bin/bash: line 14: PULL_BASE_REF: unbound variable
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-03-15T19:26:14Z"}
error: failed to execute wrapped command: exit status 1

Comment 7 Michael Gourin 2022-03-20 18:13:16 UTC

(In reply to Yuval Kashtan from comment #6)
>  INFO[2022-03-15T19:26:22Z] Logs for container test in pod
> e2e-telco5g-telco5g-cnf-tests: 
> INFO[2022-03-15T19:26:22Z] ************ telco5g cnf-tests commands
> ************
> /bin/bash: line 14: PULL_BASE_REF: unbound variable
> {"component":"entrypoint","error":"wrapped process failed: exit status
> 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-
> infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing
> test process","severity":"error","time":"2022-03-15T19:26:14Z"}
> error: failed to execute wrapped command: exit status 1

What bash script does this error come from?

Comment 8 Stephen Benjamin 2022-03-21 18:42:07 UTC

It is coming from https://github.com/openshift/release/blob/master/ci-operator/step-registry/telco5g/cnf-tests/telco5g-cnf-tests-commands.sh

There is no PULL_BASE_REF on periodics.  I imagine this could just be fixed with something like branch="${PULL_BASE_REF:-master}"
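
A minimal sketch of that default-expansion idea, assuming the step script runs with `set -u` (which is what turns an unset variable into the "unbound variable" error). The helper name resolve_branch is hypothetical and is not taken from telco5g-cnf-tests-commands.sh.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: use PULL_BASE_REF when the CI system sets it,
# otherwise fall back to master. The ":-" expansion is safe under set -u.
resolve_branch() {
  echo "${PULL_BASE_REF:-master}"
}

# Periodic job: PULL_BASE_REF is not set.
unset PULL_BASE_REF
branch_periodic="$(resolve_branch)"

# Job triggered against a branch: PULL_BASE_REF is set.
PULL_BASE_REF="release-4.10"
branch_presubmit="$(resolve_branch)"

echo "periodic: ${branch_periodic}"    # prints "periodic: master"
echo "presubmit: ${branch_presubmit}"  # prints "presubmit: release-4.10"
```

The fallback stops the script from aborting, but it means every periodic uses master regardless of the release it is meant to test.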

Comment 9 Yuval Kashtan 2022-03-21 22:42:01 UTC

@stbenjam How can a periodic know which branch to clone, i.e. whether it's master, release-4.10, release-4.9, etc.?

Comment 10 Stephen Benjamin 2022-03-21 23:38:04 UTC

I don't know if you can, but most jobs don't clone git repos; they rely on CI to build a container image instead.  If you do it that way, you'd have a tag for each release.  There are some details at https://docs.ci.openshift.org/docs/how-tos/use-registries-in-build-farm/, but the test platform team (#forum-testplatform) would be in the best position to advise here.

Comment 11 Yuval Kashtan 2022-03-24 11:00:56 UTC

even with a container image, I still need a way to know what version the periodic is currently supposed to run (And that could be used for either container image or git clone..)

Comment 12 Stephen Benjamin 2022-03-24 12:37:03 UTC

> even with a container image, I still need a way to know what version the periodic is currently supposed to run (And that could be used for either container image or git clone..)

No, you don't -- because the images are built from the release branches and named appropriately. If you look at the ci-operator configuration for jobs, you'll see release-specific images are used: https://github.com/openshift/release/blob/master/ci-operator/config/openshift/release/openshift-release-master__nightly-4.10.yaml

Comment 14 Stephen Benjamin 2022-04-28 14:07:33 UTC

Hi, it looks like the original problem about the unbound variable is fixed.  I see https://bugzilla.redhat.com/show_bug.cgi?id=2074483 is another issue. Is that what's causing the ansible install timeouts I see in the job now? Is there anything else TRT can do to help support you all fixing this?

Comment 15 Stephen Benjamin 2022-05-10 12:47:36 UTC

Hi, any update on this? Thanks

Comment 16 Shiftzilla 2023-03-09 01:13:26 UTC

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9130

Comment 17 Red Hat Bugzilla 2023-09-18 04:32:32 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
