Bug 2152305

Summary: The ppc64le builds frequently time out and fail.
Product: [Community] Copr
Reporter: Balint Cristian <cristian.balint>
Component: backend
Assignee: Copr Team <copr-team>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: unspecified
CC: jkadlcik, praiskup
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64le   
OS: Unspecified   
Last Closed: 2023-01-02 15:53:00 UTC Type: Bug

Description Balint Cristian 2022-12-10 13:06:39 UTC
Description
===========

First, this happens *only* on ppc64le.
It has not been observed on x86-64/aarch64 builders, even under shortage/queue conditions.

* Many (if not all) ppc64le builds systematically fail.
* See some examples here (these builds were canceled), for exposure:

  $ copr list-builds rezso/HDL | grep cancel | awk '{print $1}'

  5119900, 5119899, 5119882, 5119868, 5119855
  5116866, 5116863, 5116862, 5116861, 5116860
  5116859, 5116857

* Failures *always* end up with some kind of timeout:
  "Connection timed out during banner exchange"
  "Connection to XX.XX.XX.XX port 22 timed out"

  E.g. https://download.copr.fedorainfracloud.org/results/rezso/HDL/fedora-rawhide-ppc64le/05119899-verible/builder-live.log.gz
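
To tell the two timeout messages above apart when reproducing this by hand, a small probe can help. The following is only a minimal sketch, not what Copr itself runs: the function name `probe_ssh_banner` and its status strings are made up for illustration. It connects to port 22 and waits for the "SSH-" identification line that an SSH server sends first, so a timeout while connecting and a timeout while waiting for the banner are reported separately.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=10.0):
    """Return (status, detail) for one SSH reachability probe.

    status is one of: "ok", "connect-timeout", "connect-error",
    "banner-timeout", "no-banner". These labels are hypothetical.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # Roughly the "Connection to ... port 22 timed out" case.
        return ("connect-timeout", "")
    except OSError as exc:
        return ("connect-error", str(exc))

    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            # TCP connected, but the server never sent its banner:
            # roughly the "timed out during banner exchange" case.
            return ("banner-timeout", "")
        except OSError as exc:
            return ("connect-error", str(exc))

    banner = data.decode("ascii", errors="replace").strip()
    # Simplification: a server may send other lines before the
    # "SSH-" line; this sketch only looks at the first chunk read.
    if banner.startswith("SSH-"):
        return ("ok", banner)
    return ("no-banner", banner)
```

Running `probe_ssh_banner("<builder address>")` a few times against an affected builder would show which of the two stages stalls; the address has to be filled in by whoever has access to it.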


Occurrence:
===========

* Before the latest scheduled upgrade there were only a few (waivable) failures.
* After the latest scheduled upgrade these failures occur frequently.

Sometimes there might be a shortage/queue of ppc64le builders, but this doesn't explain such failures.

Comment 1 Balint Cristian 2022-12-11 10:02:12 UTC
Update
=======


 * Another strange behaviour, seen on ppc64le only, that might hint at firewall or bad-configuration issues, even though "internet access" is checked:
 
  <...>
  Cloning into '.'...
  fatal: unable to access 'https://github.com/chipsalliance/Surelog.git/': Failed to connect to github.com port 443 after 1039 ms: No route to host
  <...>
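
A quick way to check outbound connectivity from inside a build is a plain TCP connect that reports the kind of failure. This is a rough sketch under assumptions: the names `classify_connect_error` and `check_tcp` and the returned labels are hypothetical, and the check only shows the symptom, not its cause. "No route to host" is the system message for `EHOSTUNREACH`, which is what the sketch maps to "no-route".

```python
import errno
import socket


def classify_connect_error(exc):
    """Map a connect-time exception to a short, made-up label."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, socket.gaierror):
        # Name resolution failed before any connection attempt.
        return "dns-failure"
    if isinstance(exc, OSError):
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # EHOSTUNREACH is reported as "No route to host".
            return "no-route"
        if exc.errno == errno.ECONNREFUSED:
            return "refused"
        if exc.errno == errno.ETIMEDOUT:
            return "timeout"
    return "other"


def check_tcp(host, port, timeout=5.0):
    """Try one TCP connection and return "ok" or a failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connect_error(exc)
```

For the failure quoted above, `check_tcp("github.com", 443)` run on an affected ppc64le builder would be expected to return "no-route", while the same call on an x86-64 builder should return "ok".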

Comment 2 Jakub Kadlčík 2022-12-13 00:29:52 UTC
Hello Balint,
thank you for the report.

I think it is related to this issue:
https://github.com/fedora-copr/copr/issues/2433

It is my priority for this week.

Comment 3 Jakub Kadlčík 2023-01-02 15:53:00 UTC
Sorry it took so long; I had a hard time debugging this and needed some help from @praiskup.
It should be fixed now; more information is in the GitHub issue above.

Please let us know if you see this happening again.