Version: 4.7.0-0.nightly-2020-11-12-200927
Platform: Baremetal on Dell hardware driven through IPI install (BIOS/IPMI)
Please specify: * IPI

What happened?
The time to deploy has regressed in 4.7 compared to 4.6. For a 3 master node + 2 worker node deployment on baremetal:

                      4.6.4      4.7 nightly
Bootstrap Complete    46 mins    51 mins
Install Complete      45 mins    1 hr 28 mins

The install finishes 45 minutes after successful bootstrap in 4.6 and 1 hr 28 minutes after bootstrap completion in 4.7, so the total bootstrap + install time is 1 hr 31 mins in 4.6 versus 2 hr 19 mins in 4.7.

It looks like the "provisioning" stage of the workers took a long time in 4.7: instead of a single PXE boot, there was one PXE boot after which the ironic-python-agent kept sleeping for several minutes (as can be seen on the console), and eventually there was another reboot to PXE, after which the RHCOS image was written and ostree ran. The "provisioning" stage itself took around 31 minutes in 4.7. I will attach a screenshot of the console of a worker node during provisioning showing the ironic-python-agent sleeping until the eventual reboot to PXE and subsequent OS installation.

I suspect this is a regression in ironic, as no rate limiting through the BMO or similar is implemented yet.
Created attachment 1729960 [details] console-log-during-provisioning

Please find attached the ironic-python-agent logs as seen on the console of a worker during "provisioning".
Sai - which ironic rpm is being used in this test? There are these log messages regarding the agent token in the ironic-conductor.log file.

The token is properly generated:

2020-11-17 00:11:43.066 1 DEBUG ironic.conductor.manager [req-3d7af0d9-a790-44cd-968c-5c765cb072e8 - - - - -] Generating agent token for node 8af71141-c7af-4922-bc13-2a610d5360c6 get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3535

After a power cycle the conductor can't generate a new token:

2020-11-17 00:17:45.722 1 WARNING ironic.conductor.manager [req-52cb6f42-5470-4aa8-84d3-a6493c9374b4 - - - - -] An agent token generation request is being refused as one is already present for node 8af71141-c7af-4922-bc13-2a610d5360c6

Following another power cycle the token is properly generated:

2020-11-17 00:34:07.890 1 DEBUG ironic.common.utils [req-766d4d1f-89f6-47a0-84ce-2c6b52fd8730 - - - - -] Command stdout is: "Chassis Power Control: Down/Off
...
2020-11-17 00:38:59.848 1 DEBUG ironic.conductor.manager [req-10c4c7a6-7443-47af-a4fe-a03c7457dcb2 - - - - -] Generating agent token for node 8af71141-c7af-4922-bc13-2a610d5360c6 get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3535

We have seen problems in which IPA can't get a token, which ends up resulting in an extra timeout/reset cycle. I thought that was fixed with https://review.opendev.org/#/c/756479/ for https://bugzilla.redhat.com/show_bug.cgi?id=1885355, but we need to check whether that fix is in the version you are testing.
The version of ironic should be whatever was carried in 4.7.0-0.nightly-2020-11-12-200927. I no longer have the environment to check.
Bob, can you please let me know which nightly has the fix so I can try it and report back?
Looking at your logs I'm seeing a lot of "timed out" messages, e.g.

2020-11-17 00:11:42.328 1 DEBUG eventlet.wsgi.server [req-c31f9184-5780-4838-be45-42646e36973d - - - - -] (1) timed out ('::ffff:172.22.0.3', 57408, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823

When I run 4.7 locally, my inspection stalls for 10 minutes or so while posting the collected data. I'm thinking that connections from the BMO to the inspector aren't being closed and, as a result, are holding workers open until they time out.

I can't see your IPA log, so I don't know if the same thing is happening here - do you have the IPA logs? And if so, is there a long delay between these lines?

Nov 18 11:50:51 worker-1 ironic-python-agent[733]: 2020-11-18 11:50:51.627 733 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory': {...
Nov 18 12:00:51 worker-1 ironic-python-agent[733]: 2020-11-18 12:00:51.617 733 INFO ironic_python_agent.inspector [-] inspection finished successfull
(In reply to Derek Higgins from comment #6)
> Looking at your logs I'm seeing a lot of "timed out" messages, e.g.
>
> 2020-11-17 00:11:42.328 1 DEBUG eventlet.wsgi.server
> [req-c31f9184-5780-4838-be45-42646e36973d - - - - -] (1) timed out
> ('::ffff:172.22.0.3', 57408, 0, 0) process_request
> /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
>
> When I run 4.7 locally, my inspection stalls for 10 minutes or so while
> posting the collected data. I'm thinking that connections from the BMO to
> the inspector aren't being closed and, as a result, are holding workers
> open until they time out.
>
> I can't see your IPA log, so I don't know if the same thing is happening
> here - do you have the IPA logs? And if so, is there a long delay between
> these lines?

In my local dev-scripts, watching sockets to ironic-inspector (during inspection), the number of ESTABLISHED sockets rises to 100 and then remains there.

[root@master-1 /]# netstat -apn | grep :5050 | grep barem
tcp6    0    0 fd00:1101::3:54118    fd00:1101::3:5050    ESTABLISHED 25332/baremetal-ope
tcp6    0    0 fd00:1101::3:39390    fd00:1101::3:5050    ESTABLISHED 25332/baremetal-ope
tcp6    0    0 fd00:1101::3:49100    fd00:1101::3:5050    ESTABLISHED 25332/baremetal-ope
tcp6    0    0 fd00:1101::3:52448    fd00:1101::3:5050    ESTABLISHED 25332/baremetal-ope
tcp6    0    0 fd00:1101::3:39982    fd00:1101::3:5050    ESTABLISHED 25332/baremetal-ope
<snip>

[root@master-2 tmp]# netstat -apn | grep 5050 | grep barem | wc
    102     714   10302

Then the inspector stops responding until the connections hit the 15 minute timeout. During inspection in my tests this accounted for roughly 12 to 15 minutes. The same thing may be happening in ironic (I haven't looked through the logs as closely):

[root@master-2 tmp]# netstat -apn | grep 6385 | grep barem | wc
    101     707   10201

I'm not sure this accounts for all of the additional time, but if the same thing is happening in your environment it's at least some of it.

e.g. I think this BMO call needs to be closed but there may be others:
https://github.com/openshift/baremetal-operator/blob/3f71294/pkg/provisioner/ironic/dependencies.go#L48

It might also be that the default server behaviour has changed to enable keepalive.
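To make the suspected failure mode concrete, here is a minimal Go sketch using plain net/http - not the actual BMO or gophercloud code, and checkInspectorUp plus the URL are made-up names - showing why a periodic HTTP availability probe that never drains/closes its response body, or that builds a fresh client per check, can leave one ESTABLISHED socket per probe hanging around until the server-side timeout fires:

package main

import (
	"io"
	"net/http"
	"time"
)

// checkInspectorUp is a hypothetical stand-in for an availability probe
// against an ironic-inspector endpoint; it is not the real BMO code.
func checkInspectorUp(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	// Draining and closing the body lets the underlying TCP connection be
	// reused (or torn down) promptly; skipping this is the kind of pattern
	// that can leave a leaked ESTABLISHED socket per probe, holding a
	// server-side worker busy until its timeout fires.
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	// Reuse one client (and hence one Transport / connection pool) across
	// probes instead of constructing a new one per reconcile loop, so the
	// number of open sockets stays bounded.
	client := &http.Client{Timeout: 10 * time.Second}
	for i := 0; i < 100; i++ {
		checkInspectorUp(client, "http://127.0.0.1:5050/") // hypothetical address
	}
}

If the probe were leaking as described, the behaviour would match the observations above: the socket count plateaus around the probe count and the server only frees workers at its 15 minute timeout.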
So, I agree with Derek that there is a combination of problems. With the nightly 4.7.0-0.nightly-2020-11-18-085225:

[kni@e16-h12-b01-fc640 ~]$ grep "agent token" conductor
2020-11-18 22:41:14.764 1 DEBUG ironic.conductor.manager [req-d3cb2911-1143-4cc9-a9b7-2c677f221dbd - - - - -] Generating agent token for node 9e19b004-be4f-456f-bc5b-5dceb8c50650 get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3540
2020-11-18 22:41:14.785 1 DEBUG ironic.conductor.manager [req-4d4de588-7935-427c-9dd8-b02edc4a9f10 - - - - -] Generating agent token for node 764dedc7-b92e-45d6-a8a4-852bd4a80e1e get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3540

So the token issue Bob mentioned is no longer a problem. But let me look into Derek's comments, as the "install" after bootstrap still took 56 minutes, which is higher than in 4.6.
Can you guys help me figure out how to get access to the IPA logs in the inspector container? I tried

crictl exec 07dc56e736e92 /bin/bash

from the master node running the ironic-inspector pod, and

oc exec metal3-5f5678b464-72kp4 -c metal3-ironic-inspector -- bash

neither of which works.
There's a bunch of "timed out" messages in the ironic-inspector logs:

inspector:24190:2020-11-19 15:28:02.593 1 DEBUG eventlet.wsgi.server [req-1d178150-4ef6-4176-8f2f-557a9dd87693 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37264, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24192:2020-11-19 15:28:02.597 1 DEBUG eventlet.wsgi.server [req-928acc5f-6b8a-4c01-8a91-9b660ac1eee1 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37266, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24193:2020-11-19 15:28:02.664 1 DEBUG eventlet.wsgi.server [req-b74e6395-0e8b-4430-a854-f19bb9ad3522 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37278, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24194:2020-11-19 15:28:04.669 1 DEBUG eventlet.wsgi.server [req-64abc33c-028b-4bd3-89d1-c1f3832a8ee2 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37280, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24195:2020-11-19 15:28:18.513 1 DEBUG eventlet.wsgi.server [req-8ed534e6-390a-4cd2-ae76-6832a4e1aeb2 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37612, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24196:2020-11-19 15:28:21.234 1 DEBUG eventlet.wsgi.server [req-092f61f6-8624-46af-b5ba-559ee82d12af - - - - -] (1) timed out ('::ffff:172.22.0.17', 37614, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24207:2020-11-19 15:28:33.707 1 DEBUG eventlet.wsgi.server [req-53aa1eee-49d5-4e8b-a16f-fb02f2c69ec7 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37618, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24208:2020-11-19 15:28:35.752 1 DEBUG eventlet.wsgi.server [req-eb5bc448-0a36-4f07-b3ee-0995b223f551 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37760, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24212:2020-11-19 15:28:51.361 1 DEBUG eventlet.wsgi.server [req-04c0ab62-61e8-4291-8b95-9e1779a36699 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37850, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24213:2020-11-19 15:29:04.738 1 DEBUG eventlet.wsgi.server [req-50225c96-7f4a-4dab-b2ab-fb01fad29906 - - - - -] (1) timed out ('::ffff:172.22.0.17', 38192, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823

With the agent token fix that Bob referenced, I see deployment taking 11 minutes longer, so kind of in line with Derek's observations.

I missed a -it in my oc and crictl commands pasted above <need more coffee>, so I will get you that information as well.
The ramdisk logs for both nodes in my deployment truncate at /ironic_python_agent/inspector.py:224:

Nov 18 17:37:53 e16-h14-b02-fc640.rdu2.scalelab.redhat.com ironic-python-agent[1804]: 2020-11-18 17:37:53.283 1804 DEBUG ironic_python_agent.inspector [-] BMC IP address: 10.1.30.154 collect_default /usr/lib/python3.6/site-packages/ironic_python_agent/inspector.py:225
Nov 18 17:37:53 e16-h14-b02-fc640.rdu2.scalelab.redhat.com ironic-python-agent[1804]: 2020-11-18 17:37:53.284 1804 INFO ironic_python_agent.utils [-] Collecting system logs and debugging information
Nov 18 17:37:53 e16-h14-b02-fc640.rdu2.scalelab.redhat.com ironic-python-agent[1804]: 2020-11-18 17:37:53.290 1804 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): journalctl --full --no-pager -b -n 10000 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:377

So I couldn't find the log messages Derek was referencing.
I'm OK with removing the TestBlocker flag now.
> e.g. I think this BMO call needs to be closed but there may be others:
> https://github.com/openshift/baremetal-operator/blob/3f71294/pkg/provisioner/ironic/dependencies.go#L48

A workaround for the TCP connection problem has been submitted upstream: https://github.com/metal3-io/baremetal-operator/pull/726

It looks like gophercloud has had functionality added to reuse TCP connections [1], but it isn't reusing them in the inspector status checks.

1- https://github.com/gophercloud/gophercloud/pull/1952
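For context, here is a rough sketch of the two mitigations being discussed - reusing connections via a shared keep-alive transport, or (as a workaround) explicitly dropping idle connections so they can't accumulate on the inspector side. This is plain net/http with made-up names (sharedTransport, probeInspector), not the actual gophercloud or BMO change:

package main

import (
	"io"
	"net/http"
	"time"
)

// One Transport shared by every status check, so keep-alive can reuse the
// same TCP connection instead of opening a new one each time.
var sharedTransport = &http.Transport{
	MaxIdleConns:    10,
	IdleConnTimeout: 30 * time.Second,
}

var sharedClient = &http.Client{
	Transport: sharedTransport,
	Timeout:   10 * time.Second,
}

// probeInspector is a hypothetical stand-in for the inspector status check.
func probeInspector(statusURL string) (int, error) {
	resp, err := sharedClient.Get(statusURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection returns to the idle pool

	// Workaround-style alternative: if connections cannot be reused for some
	// reason, explicitly drop idle ones so they don't pile up server-side:
	// sharedTransport.CloseIdleConnections()

	return resp.StatusCode, nil
}

func main() {
	probeInspector("http://127.0.0.1:5050/") // hypothetical address
}

Which approach the linked PRs actually take is best checked in the PRs themselves; the sketch just illustrates why either one keeps the socket count bounded.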
It's possible that https://github.com/metal3-io/baremetal-operator/pull/712 has fixed the bug.
It does appear that https://github.com/metal3-io/baremetal-operator/pull/712 has fixed the TCP socket issue. That fix was merged downstream in https://github.com/openshift/baremetal-operator/pull/113. At this point both fixes - the token generation fix and the TCP socket fix - should be available in 4.7. Moving this to MODIFIED.
Hey Sai, I'm closing this BZ, as it seems to be solved. If you still get timeouts and long installation times, please open a new BZ.
B.R.
Raviv
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633