Bug 1886426
Summary: | Missing step reply log in Agent log | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Fred Rolland <frolland> | ||||||||
Component: | assisted-installer | Assignee: | Omer Tuchfeld <otuchfel> | ||||||||
assisted-installer sub component: | discovery-agent | QA Contact: | Yuri Obshansky <yobshans> | ||||||||
Status: | CLOSED CURRENTRELEASE | Docs Contact: | |||||||||
Severity: | unspecified | ||||||||||
Priority: | high | CC: | aos-bugs, lgamliel | ||||||||
Version: | 4.6 | ||||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | OCP-Metal-v1.0.12 | ||||||||||
Fixed In Version: | OCP-Metal-v1.0.12.1 | Doc Type: | If docs needed, set a value | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2022-08-28 08:45:59 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Description
Fred Rolland
2020-10-08 12:32:02 UTC
Created attachment 1719948 [details]
hosts logs
Please add the install command even if there is no error We should at least print the step.StepID before executing it. Here, I think: https://github.com/openshift/assisted-installer-agent/blob/master/src/commands/step_processor.go#L67 Print the step.StepID before executing it: https://github.com/openshift/assisted-installer-agent/pull/51 The commit that Fred made is just for adding a log before starting the command. We still need to understand why we don't have the steps reply in the log we should add a log after stdout, stderr, exitCode := handler(command, args...) when the exitCode != 0 (In reply to Fred Rolland from comment #4) > Print the step.StepID before executing it: > > https://github.com/openshift/assisted-installer-agent/pull/51 Not sure how to verify the bug. What is correct failure scenario? My cluster failed http://10.19.131.3:8080/clusters/58f01af4-bdf0-4dc5-ac05-fd5d8e6da9be But I did not find <install-****> step Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Executing step: <execute-c88a45f8>, command: </usr/bin/podman>, args: <[stop -i -t 5 assisted-installer]>" file="step_processor.go:78" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Executing step: <execute-49dd511d>, command: <podman>, args: <[run --rm --privileged --net=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log --env PULL_SECRET_TOKEN --name logs-sender --pid=host quay.io/ocpmetal/assisted-installer-agent:latest logs_sender -url http://10.19.131.3:8090 -cluster-id 58f01af4-bdf0-4dc5-ac05-fd5d8e6da9be -host-id aafd9871-ff57-4fe4-be81-695ea2b753d2 --insecure=false -bootstrap=false -with-installer-gather-logging=true]>" file="step_processor.go:78" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Sending step <execute-c88a45f8> reply output <> error <> exit-code <0>" file="step_processor.go:46" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd Dec 14 17:56:45 master-0-0 logs_sender[4333]: time="14-12-2020 17:56:45" level=info msg="Start gathering journalctl logs with tags [agent installer], services [] and installer-gather" file="send_logs.go:238" Thanks There is more than one PR relevant for resolving this ticket: (1) https://github.com/openshift/assisted-installer-agent/pull/51 (2) https://github.com/openshift/assisted-installer-agent/pull/91/files The first one you mentioned yourself. It adds the "Executing step:" log and it seems like from your logs that it appears correctly as expected. The second PR makes it so when the *step itself* fails (exit code != 0), a "Step execution failed" log should occur, with the step's exitcode/stdout/stderr also included in the log. I'll try and look into why exactly your cluster failed, but it might be for a reason other than a failed step, so it's not good enough for the test. In order to test PR #91 (the second one), specifically a step failure needs to occur. Ok, reproduced cluster failure https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/96af0ccc-3e3e-41d5-bf65-f08e4e7c8e31 We can see Executing step: <install-7e44a08d> However it happened even error <> exit-code <0> Not sure we expect this behaviour Dec 15 20:12:18 worker-0-1 next_step_runne[1933]: time="15-12-2020 20:12:18" level=info msg="Executing step: <install-7e44a08d>, command: <bash>, args: <[-c podman run -v /dev:/dev:rw -v /opt:/opt:rw -v /run/systemd/journal/socket:/run/systemd/journal/socket --privileged --pid=host --net=host -v /var/log:/var/log:rw --env PULL_SECRET_TOKEN --name assisted-installer quay.io/ocpmetal/assisted-installer:latest --role worker --cluster-id 96af0ccc-3e3e-41d5-bf65-f08e4e7c8e31 --boot-device /dev/sda --host-id ec8a2162-2927-4a16-b31d-cd00af78f93f --openshift-version 4.6 --mco-image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c8f0b9a2852b15b45487c08c158e10f3b803d7a77538d6dbc1d991994f58bcee --controller-image quay.io/ocpmetal/assisted-installer-controller:latest --url https://api.stage.openshift.com --insecure=false --agent-image quay.io/ocpmetal/assisted-installer-agent:latest]>" file="step_processor.go:78" request_id=fdbe18ce-0740-4ba9-adf9-5ae52fb0a9c5 Dec 15 20:12:18 worker-0-1 next_step_runne[1933]: time="15-12-2020 20:12:18" level=info msg="Sending step <dhcp-lease-allocate-0e2ba009> reply output <{\"api_vip_address\":\"192.168.123.131\",\"api_vip_lease\":\"lease {\\n interface \\\"api\\\";\\n fixed-address 192.168.123.131;\\n option subnet-mask 255.255.255.0;\\n option routers 192.168.123.1;\\n option dhcp-lease-time 3600;\\n option dhcp-message-type 5;\\n option domain-name-servers 192.168.123.1;\\n option dhcp-server-identifier 192.168.123.1;\\n option dhcp-renewal-time 1800;\\n option broadcast-address 192.168.123.255;\\n option dhcp-rebinding-time 3150;\\n option host-name \\\"00-1a-4a-33-3c-6f-api\\\";\\n option domain-name \\\"ocp-cluster-sealusa47-0.qe.lab.redhat.com\\\";\\n renew 2 2020/12/15 20:40:11;\\n rebind 2 2020/12/15 21:04:48;\\n expire 2 2020/12/15 21:12:18;\\n}\",\"ingress_vip_address\":\"192.168.123.109\",\"ingress_vip_lease\":\"lease {\\n interface \\\"ingress\\\";\\n fixed-address 192.168.123.109;\\n option subnet-mask 255.255.255.0;\\n option routers 192.168.123.1;\\n option dhcp-lease-time 3600;\\n option dhcp-message-type 5;\\n option domain-name-servers 192.168.123.1;\\n option dhcp-server-identifier 192.168.123.1;\\n option dhcp-renewal-time 1800;\\n option broadcast-address 192.168.123.255;\\n option dhcp-rebinding-time 3150;\\n option host-name \\\"00-1a-4a-54-a6-00-ingress\\\";\\n option domain-name \\\"ocp-cluster-sealusa47-0.qe.lab.redhat.com\\\";\\n renew 2 2020/12/15 20:41:27;\\n rebind 2 2020/12/15 21:04:48;\\n expire 2 2020/12/15 21:12:18;\\n}\"}> error <> exit-code <0>" file="step_processor.go:46" request_id=fdbe18ce-0740-4ba9-adf9-5ae52fb0a9c5 I don't seem to have access to see the latest cluster you linked. However, I can see from the logs you posted that the step (the one that happens to be in the logs, there should be many other ones) has exit-code 0. This issue's PRs deal with adding logs to step failures specifically, not cluster failures. The cluster failure might be interesting to investigate, but I'm not sure that it's relevant to discuss it as part of this issue's comments. Created attachment 1739467 [details]
Installer log
Added logs. Still do not know how to reproduce Step failure Give me clear scenario Thank you yobshans I think that this bug is not really worth the testing effort that is required from you (In reply to Ronnie Lazar from comment #13) > yobshans I think that this bug is not really worth the testing > effort that is required from you Agreed, let's close it |