Bug 1309300

Summary: [scale] - vdsm initialization timeout (on master machine with 144 cores and vm per core)
Product: [oVirt] vdsm Reporter: Eldad Marciano <emarcian>
Component: Core Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Eldad Marciano <emarcian>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.17.20 CC: bugs, emarcian, gklein, oourfali
Target Milestone: ovirt-4.0.0-alpha Flags: oourfali: ovirt-4.0.0?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-13 07:30:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Eldad Marciano 2016-02-17 12:20:03 UTC
Description of problem:
When restarting vdsm (vdsm start) on a master machine with 144 cores and one VM per core,
the vdsm pre-start/initialization fails because it exceeds the systemd TimeoutStartSec=90 limit.

When it fails, systemctl status and the journal print this:
Active: activating (start-pre) since Wed 2016-02-17 05:57:46 EST; 136ms ago
Process: 113343 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
Main PID: 96235 (code=exited, status=0/SUCCESS);         : 127024 (vdsmd_init_comm)
    CGroup: /system.slice/vdsmd.service
            └─control
              ├─127024 /bin/sh /usr/libexec/vdsm/vdsmd_init_common.sh --pre-start
              └─127030 /usr/bin/python /usr/share/vdsm/get-conf-item /etc/vdsm/vdsm.conf irs repository /rhev/


Once TimeoutStartSec was extended to 500 seconds, vdsm started correctly.
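
For reference, the start timeout can be raised without editing the packaged unit file by placing a systemd drop-in next to it. The following is a minimal sketch (Python 3), not something shipped with vdsm: the drop-in file name timeout.conf and the helper names render_dropin and write_dropin are assumptions made for illustration. Only the [Service] section and the TimeoutStartSec= directive are standard systemd.

```python
import os
import tempfile


def render_dropin(timeout_sec):
    """Return the text of a drop-in that overrides TimeoutStartSec."""
    if timeout_sec <= 0:
        raise ValueError("timeout_sec must be positive")
    return "[Service]\nTimeoutStartSec={}\n".format(timeout_sec)


def write_dropin(dropin_dir, timeout_sec, name="timeout.conf"):
    """Write the drop-in into dropin_dir and return its path.

    On a real host dropin_dir would typically be
    /etc/systemd/system/vdsmd.service.d, and the change takes effect
    after `systemctl daemon-reload`. The file name is hypothetical.
    """
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, name)
    with open(path, "w") as f:
        f.write(render_dropin(timeout_sec))
    return path


if __name__ == "__main__":
    # Demonstrate against a temporary directory instead of /etc.
    demo_dir = os.path.join(tempfile.mkdtemp(), "vdsmd.service.d")
    with open(write_dropin(demo_dir, 500)) as f:
        print(f.read())
```

This only works around the symptom; it does not explain why pre-start takes longer than 90 seconds at this scale.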

It seems the pre-start and init stages run into a performance problem at this scale.

I am not sure whether vdsm supports profiling in this area, since this is the pre-start of vdsm itself.

Version-Release number of selected component (if applicable):
vdsm-4.17.20-0.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Use an extreme host with a large amount of RAM and more than 100 cores.
2. Run one VM per core.
3. Restart vdsm.


Actual results:
vdsm fails to start because the start exceeds TimeoutStartSec.

Expected results:
The initialization stage is optimized so that vdsm starts within the default timeout.

Additional info:

Further investigation is required (profiler results).
There is no such data in the vdsm logs, since the init stage failed; other logs may have some useful lines.
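
In the absence of built-in profiling for the pre-start stage, one rough way to find the slow part is to run the individual pre-start commands by hand and record the wall-clock time of each. The sketch below (Python 3) is only an illustration under that assumption: time_command and time_steps are hypothetical helpers, and the two steps in the demo are placeholders, not the actual commands executed by vdsmd_init_common.sh --pre-start.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()  # monotonic clock is unaffected by clock changes
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return proc.returncode, time.monotonic() - start


def time_steps(steps):
    """Time each (label, argv) pair and return results, slowest first."""
    results = []
    for label, argv in steps:
        code, elapsed = time_command(argv)
        results.append((label, code, elapsed))
    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    # Placeholder steps; on a real host these would be replaced with the
    # commands that the pre-start stage actually runs.
    steps = [
        ("noop", [sys.executable, "-c", "pass"]),
        ("sleep", [sys.executable, "-c", "import time; time.sleep(0.2)"]),
    ]
    for label, code, seconds in time_steps(steps):
        print("{:<10} exit={} {:.3f}s".format(label, code, seconds))
```

Sorting by elapsed time puts the most expensive step first, which is usually the one worth attaching to the report along with the logs.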

Comment 1 Dan Kenigsberg 2016-02-17 13:45:49 UTC
Please attach /var/log/vdsm/* so we can tell what eats so much time during boot.

Comment 2 Dan Kenigsberg 2016-02-17 13:46:42 UTC
(and /var/log/messages as well)

Comment 3 Oved Ourfali 2016-02-24 11:26:06 UTC
Currently targeting 4.0.
We will re-examine once we get more details.

Comment 4 Oved Ourfali 2016-03-13 07:30:42 UTC
Please re-open if still relevant, and give access to the environment and relevant logs.