periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g is failing frequently in CI, see:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-telco5g

There's an error in the logs:

************ telco5g cnf-tests commands ************
/bin/bash: line 14: PULL_BASE_REF: unbound variable
This component is for reporting bugs against the cnf-tests docker image, which does not handle that env variable. I'm not sure where this should be assigned instead. @stbenjam, mind moving or closing it?
@obraunsh You added these e2e-telco5g jobs, where should this bug go?
I believe this is due to the "other" side of the bastion: whenever something on the build cluster changes (a pod restarts, is moved, etc.), the other side needs to re-establish communication, and currently it doesn't support that.
The SSH tunnels are created by the script /home/cloud-user/bastion_connect.sh; this script, along with other related files, is maintained in the cnf-internal-deploy repo. A service named ocpci.service, run by root on the server, runs the bastion_connect.sh script. This service is configured in root's crontab to restart at 7 AM EST each day. The service defines two append-only log files (one for STDOUT, one for STDERR), and it failed to start because these files did not exist on the file system. I've created them and the service is now functioning properly.

Regarding monitoring and restarting the SSH tunnels: this appears to be handled by the ssh-tunnel function in the bastion_connect.sh script, which retries the tunnel command whenever it exits with a non-zero code, up to 30 times (with 30-second sleeps in between) before finally exiting the script.
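The retry behavior described above can be sketched roughly as follows. This is a hypothetical, simplified version: the real logic lives in bastion_connect.sh in the cnf-internal-deploy repo, and the function name, variable names, and placeholder ssh command here are made up for illustration.

```shell
#!/bin/bash
# Sketch of the ssh-tunnel retry loop described in the comment above.
# RETRY_ATTEMPTS and RETRY_DELAY default to the values mentioned there
# (30 attempts, 30-second sleeps); the command to retry is passed in.
retry() {
  local attempts="${RETRY_ATTEMPTS:-30}" delay="${RETRY_DELAY:-30}" i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0   # command succeeded; stop retrying
    echo "attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "${delay}"
  done
  return 1             # all attempts exhausted; caller can exit the script
}

# Usage (placeholder tunnel command, not the real one):
# retry ssh -N -L 6443:api.internal:6443 bastion
```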
INFO[2022-03-15T19:26:22Z] Logs for container test in pod e2e-telco5g-telco5g-cnf-tests:
INFO[2022-03-15T19:26:22Z] ************ telco5g cnf-tests commands ************
/bin/bash: line 14: PULL_BASE_REF: unbound variable
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-03-15T19:26:14Z"}
error: failed to execute wrapped command: exit status 1
(In reply to Yuval Kashtan from comment #6)
> INFO[2022-03-15T19:26:22Z] Logs for container test in pod e2e-telco5g-telco5g-cnf-tests:
> INFO[2022-03-15T19:26:22Z] ************ telco5g cnf-tests commands ************
> /bin/bash: line 14: PULL_BASE_REF: unbound variable
> {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-03-15T19:26:14Z"}
> error: failed to execute wrapped command: exit status 1

What bash script does this error come from?
It is coming from https://github.com/openshift/release/blob/master/ci-operator/step-registry/telco5g/cnf-tests/telco5g-cnf-tests-commands.sh

There is no PULL_BASE_REF on periodics. I imagine this could just be fixed with something like branch="${PULL_BASE_REF:-master}"
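For context, the error message matches bash's nounset option (set -u), which the CI script presumably enables: referencing an unset variable aborts the script, while a ${VAR:-default} expansion does not. A minimal standalone illustration (not the actual CI script):

```shell
#!/bin/bash
set -u   # unset variables are fatal, as in the failing CI script

# On periodics there is no PR context, so PULL_BASE_REF is unset.
# Referencing it bare would abort with "PULL_BASE_REF: unbound variable";
# the :- expansion falls back to "master" instead.
branch="${PULL_BASE_REF:-master}"
echo "cloning branch: ${branch}"
```

On presubmits, where Prow does export PULL_BASE_REF, the same expansion simply passes the real value through.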
@stbenjam how can a periodic know which branch to clone? i.e., whether it's master, release-4.10, release-4.9, etc.?
I don't know if you can, but most aren't cloning git repos but rather relying on CI to build a container image. If you do it that way, you'd have a tag for each release. Some details on https://docs.ci.openshift.org/docs/how-tos/use-registries-in-build-farm/, but the test platform team (#forum-testplatform) would be in the best position to advise here.
Even with a container image, I still need a way to know which version the periodic is currently supposed to run (and that could be used for either a container image or a git clone).
> even with a container image, I still need a way to know what version the periodic is currently supposed to run (And that could be used for either container image or git clone..) No, you don't -- because the images are built from the release branches and named appropriately. If you look at the ci-operator configuration for jobs, you'll see release-specific images are used: https://github.com/openshift/release/blob/master/ci-operator/config/openshift/release/openshift-release-master__nightly-4.10.yaml
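To illustrate the point with made-up registry and tag names (the real ones are defined in the ci-operator config linked above): each release branch builds its own tagged image, so a 4.10 periodic simply references the 4.10 tag and never needs to detect a branch at runtime.

```shell
#!/bin/bash
# Hypothetical tag-per-release naming scheme, for illustration only.
# The actual registry, namespace, and tags come from the ci-operator
# configuration, not from logic in the job itself.
release="4.10"
image="registry.example.com/ocp/${release}:cnf-tests"
echo "periodic for ${release} pulls ${image}"
```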
Hi, it looks like the original problem about the unbound variable is fixed. I see https://bugzilla.redhat.com/show_bug.cgi?id=2074483 is another issue. Is that what's causing the ansible install timeouts I see in the job now? Is there anything else TRT can do to help support you all fixing this?
Hi, any update on this? Thanks
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9130
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days