Description of problem:
The agent calls the GetNextSteps API and the request arrives at the service. While processing the request, the service runs `oc adm release info`. The command may take a long time, and the request eventually times out.

Version-Release number of selected component (if applicable):
Latest (2021/06/08)

How reproducible:
It seems to happen when there are connectivity issues to the registry. The command hangs for 15 seconds, then prints:
```
$ oc adm release info
error: unable to read image registry.ci...: Get "...": context deadline exceeded
```

Steps to Reproduce:
See above

Actual results:
The request takes a long time, and it seems that some intermediate HTTP proxy eventually gives up and returns a 504. A 504 response is not declared in the swagger definition, so the agent reports a cryptic error message about swagger and 504.

Expected results:
The service should be able to tell more quickly that the command will fail, either by caching the result in advance or by enforcing a stricter internal timeout.

Additional info:
We saw that assisted-service runs "oc adm release info" for 2 images, "machine-config-operator" and "must-gather". The current mirrored registry (disconnected env) was slow, and each command took ~15 seconds to run, so the two lookups together took about 30 seconds. On the agent side, assisted-client has a default timeout of 30 seconds. This combination caused the agent to time out every time, so the installation failed to start and kept moving from preparing back to ready and vice versa. We should add a cache per OCP version per image, as there is no benefit in running the command every time.
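A rough Go sketch of such a cache follows. It is an illustration under assumptions, not the fix that was merged: `releaseCache`, `cacheKey`, `imageFor` and the injected `fetch` function are hypothetical names. `fetch` stands in for the slow "oc adm release info" call.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheKey identifies one lookup: a release (OCP version) and an image name.
type cacheKey struct{ releaseImage, imageName string }

// releaseCache memoizes successful lookups for the lifetime of the process.
type releaseCache struct {
	mu      sync.Mutex
	entries map[cacheKey]string
	fetch   func(releaseImage, imageName string) (string, error)
}

func newReleaseCache(fetch func(string, string) (string, error)) *releaseCache {
	return &releaseCache{entries: map[cacheKey]string{}, fetch: fetch}
}

// imageFor returns the cached pull spec, calling fetch only on a miss.
// The lock is held during fetch, so concurrent callers are serialized.
func (c *releaseCache) imageFor(releaseImage, imageName string) (string, error) {
	key := cacheKey{releaseImage, imageName}
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok {
		return v, nil
	}
	v, err := c.fetch(releaseImage, imageName)
	if err != nil {
		return "", err // failures are not cached, so a later call retries
	}
	c.entries[key] = v
	return v, nil
}

func main() {
	calls := 0
	// Fake fetch that counts how often the slow path runs.
	c := newReleaseCache(func(rel, img string) (string, error) {
		calls++
		return "pullspec-for-" + img, nil
	})
	for i := 0; i < 3; i++ {
		c.imageFor("release-4.9", "must-gather")
	}
	fmt.Println("slow lookups:", calls) // prints "slow lookups: 1"
}
```

With this shape, only the first request per release and image pays the registry round trip. Later GetNextSteps calls are served from memory.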
Besides caching, should we also increase the timeout on the agent's side?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759