Bug 1964688

| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | [master] ACM/ZTP with Wan emulation, several clusters fail to step past discovery | | |
| Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
| Component: | assisted-installer | Assignee: | Ori Amizur <oamizur> |
| assisted-installer sub component: | discovery-agent | QA Contact: | Yuri Obshansky <yobshans> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | mcornea, oamizur |
| Version: | 4.8 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | AI-Team-Platform KNI-EDGE-4.8 | | |
| Fixed In Version: | OCP-Metal-v1.0.23.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1971632 (view as bug list) | Environment: | |
| Last Closed: | 2022-08-28 08:45:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1971632 | | |
| Attachments: | | | |
Description
Alex Krzos
2021-05-25 20:39:01 UTC
Created attachment 1786988 [details]
sno00031 journal logs
Alex, any chance we can have a new run of this and collect more logs (including a tcpdump)?

(In reply to Flavio Percoco from comment #2)
> Alex, any chance we can have a new run of this and collect more logs
> (including a tcpdump)

I have re-run this with https disabled, since last time trying to decode what was being sent through ssl seemed impossible with the current setup, and I was unable to reproduce with https disabled (however, I also switched to the fc.7 ocp version). Reading back, I ran with fc.3, so I'll retry with that build.

A quick update: I am still seeing this failure type. I have re-run with newer bits and adjusted the test to more accurately resemble wan emulation:

* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
* hub cluster ocp 4.8.0-fc.8
* clusterimageset / sno ocp is 4.8.0-fc.8
* ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02

1st run: 3/60 failed with this failure type
2nd run: 2/60 failed with this failure type

I expect to get a bit more time to dig at this again today.

I suspect something is wrong with the free_addresses command:

    [root@localhost ~]# podman run --privileged --net=host --rm --quiet --name akrzos_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses ["198.18.8.0/21"]
    invalid character '.' after array element

It is still unclear to me what exactly is failing here. Reviewing a cluster that is actively failing because of this, I see these things.

Following request_id = 0e39f985-ebe1-4b47-98d5-c4ba2707a61b, I put a snippet of logs here - https://gist.github.com/akrzos/4db55dd43109212320504eba50d76bba - since they are lengthy lines and tend to get chopped up when pasted into a bz.

It seems the assisted-service is saying that the host status is insufficient ("host: <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5>, host status: <insufficient>") and to run some commands to gather data/host information; it also asks to scan for free ip addresses. At first I thought there was an error with invoking this command, however it seems to output valid output when invoked off the sno itself:

    [root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]'
    [{"free_addresses":["198.18.8.0","198.18.8.2","198.18.8.10","198.18.8.11",..... ...,"198.18.15.254","198.18.15.255"],"network":"198.18.8.0/21"}]

    [root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]' | jq -r ".[].free_addresses|length"
    1973

Which is the expected number (1973) of free addresses after accounting for all the hosts on this subnet.
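A side note on the two free_addresses invocations above: the only visible difference is the shell quoting of the JSON argument. Without the outer single quotes, bash most likely performs quote removal and the agent receives `[198.18.8.0/21]`, which is not valid JSON; that would explain the `invalid character '.' after array element` message. The Go sketch below is an illustration of that parse behaviour only, not the agent's actual parsing code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var subnets []string

	// Argument as the agent likely sees it when the shell strips the inner
	// double quotes (invocation without the outer single quotes): not valid JSON.
	bad := `[198.18.8.0/21]`
	fmt.Println(json.Unmarshal([]byte(bad), &subnets))
	// -> invalid character '.' after array element

	// Argument as the agent sees it when the JSON is protected by single
	// quotes on the command line: parses cleanly.
	good := `["198.18.8.0/21"]`
	fmt.Println(json.Unmarshal([]byte(good), &subnets), subnets)
	// -> <nil> [198.18.8.0/21]
}
```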
The SNO sends the free addresses over:

    Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=info msg="Sending step <free-network-addresses-18ddd3dc> reply error <> exit-code <0>" file="step_processor.go:48" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

However, the assisted-service says there is a json decoding issue with the data:

    time="2021-06-10T18:41:04Z" level=error msg="Failed decode <free-network-addresses-18ddd3dc> reply for host <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5> cluster <9a797e9e-e37e-44a9-bc05-48c9b6d5d654>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/remote-source/app/internal/bminventory/inventory.go:2828" cluster_id=9a797e9e-e37e-44a9-bc05-48c9b6d5d654 error="unexpected end of JSON input" go-id=635517 host_id=4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5 pkg=Inventory request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

And the sno then gets the reply that the data it sent was bad:

    Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=warning msg="Assisted service returned status code Bad Request after processing step reply. Reason: unexpected end of JSON input" file="step_processor.go:74" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

It is unclear what "packaging" the agent does to the free address data that could have corrupted it. I would suspect that, since we have a /21 configured for all SNOs and the hub cluster, the large number of free addresses is perhaps "too much" data; however, only 2/60 SNOs have this failure.

So I dug into more of the objects off the hub cluster and found the events log for these failed clusters. The most notable events that could indicate an error are:

    {
      "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
      "event_time": "2021-06-09T22:43:27.481Z",
      "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
      "message": "Host sno00017: validation 'container-images-available' that used to succeed is now failing",
      "severity": "warning"
    },
    {
      "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
      "event_time": "2021-06-09T22:43:27.490Z",
      "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
      "message": "Host sno00017: updated status from \"preparing-for-installation\" to \"insufficient\" (Host cannot be installed due to following failing validation(s): Failed to fetch container images needed for installation from quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a ; Host couldn't synchronize with any NTP server)",
      "severity": "warning"
    },
    {
      "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
      "event_time": "2021-06-09T22:43:35.478Z",
      "message": "Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing",
      "severity": "warning"
    },
    {
      "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
      "event_time": "2021-06-09T22:43:35.487Z",
      "message": "Updated status of cluster sno00017 to insufficient",
      "severity": "info"
    },

It seems to me that something "told" the assisted-installer that the sno was having trouble pulling images, and it gave up and prevented the cluster from installing further in this case (instead of a corrupted https packet / failure to decode).

Created attachment 1790017 [details]
sno00017 events log

Created attachment 1790018 [details]
sno00037 events log
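Regarding the `unexpected end of JSON input` decode failure above: that is the error Go's encoding/json returns when a body ends before the JSON value is complete, which fits the truncation of the large free-addresses reply identified later in this bug. The following Go sketch is an illustration only; the struct and the truncation point are assumptions, with only the field names taken from the agent output shown earlier:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// freeNetworkAddresses mirrors the shape of the free-addresses reply shown
// above (illustrative struct, not the assisted-service's actual type).
type freeNetworkAddresses struct {
	Network       string   `json:"network"`
	FreeAddresses []string `json:"free_addresses"`
}

func main() {
	var reply []freeNetworkAddresses

	// A complete reply decodes fine.
	full := `[{"free_addresses":["198.18.8.0","198.18.8.2"],"network":"198.18.8.0/21"}]`
	fmt.Println(json.Unmarshal([]byte(full), &reply)) // -> <nil>

	// If a large reply (a /21 can list roughly 2000 addresses) is cut off in
	// transit, the decoder fails with the exact error seen in the
	// assisted-service log.
	truncated := full[:len(full)-20]
	fmt.Println(json.Unmarshal([]byte(truncated), &reply)) // -> unexpected end of JSON input
}
```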
I believe the smoking gun for this is:
    Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37" level=info msg="Sending step <container-image-availability-6a5ee7e6> reply output <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",\"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},{\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",\"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no such file or directory\"\n> exit-code <0>" file="step_processor.go:50" request_id=1df87a8d-ee20-4a58-b934-8811150766c7

    {\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"}
Looks like the container-image-availability step does not have any mechanism to retry.
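As an illustration of the kind of mechanism that comment points at, here is a sketch of a bounded retry with exponential backoff around the availability check. The function name, the attempt/backoff values, and the use of `podman pull` are assumptions made for the sketch, not the agent's actual implementation:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// pullWithRetry is a hypothetical sketch: retry a flaky image pull a few
// times with backoff before reporting "failure" back to the assisted-service.
func pullWithRetry(image string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		// The real agent performs its availability check differently;
		// "podman pull" here is just a stand-in.
		if err = exec.Command("podman", "pull", image).Run(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // simple exponential backoff between attempts
	}
	return fmt.Errorf("image %s still unavailable after %d attempts: %w", image, attempts, err)
}

func main() {
	img := "quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a"
	if err := pullWithRetry(img, 3, 10*time.Second); err != nil {
		fmt.Println("result: failure:", err)
	} else {
		fmt.Println("result: success")
	}
}
```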
(In reply to Alex Krzos from comment #8)
> I believe the smoking gun for this is:
> [...]
> Looks like the container-image-availability step does not have any mechanism
> to retry.

Great investigation, @akrzos. I think we should investigate both options. From conversations over chat, it looks like cri-o may have a similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1966872. However, it is true that sending `free_addresses` may end up in quite a big json. It would be nice to verify whether there's retry logic for it and that it is, indeed, working as expected. Maybe we can find this info in the assisted-service logs.

Is the testing host still up? Any chance we can take a look at it and run some criocli commands?

@oamizur Let's disable the Free-List tool on operator-based deployments.

There are two separate issues here:

1 - The large free-addresses list is getting truncated and the assisted-service is not able to decode the json
2 - Some images are not being pulled correctly

The second issue is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1964591. The long-term solution should come from the podman fix to https://bugzilla.redhat.com/show_bug.cgi?id=1966872

Removing needinfo since bug is verified.