Description of problem:
Some of the hosts become non-operational due to the following errors reported by the engine:
- Storage Domain storage_real of pool dc_real is in problem
- storage_real check time ot 54.9 is too big

This looks like a storage issue, but it is not: the storage is monitored separately and shows no latency. There are no ERRORs at all on the vdsm side. According to the vdsm logs, repoStat always reports valid true and a low delay (less than 0), but sometimes the lastcheck value is large, more than 30 sec (which is the threshold on the engine side). We should identify why the lastcheck operation is delayed at scale.

Version-Release number of selected component (if applicable):
3.5 VT 13.5

How reproducible:
100%

Steps to Reproduce:
1. Scale up to 37 hosts (3K VMs).
3. Reproduced especially while removing VMs in bulks of 50.

Actual results:
Storage becomes unavailable. Hosts become non-operational.

Expected results:
The repoStat monitor should run as a standalone operation and not be affected by other operations. Storage and hosts stay active while the delay is low.

Additional info:
repoStats monitoring is a storage flow, so the research should start from our side. In the worst case, if the analysis of this bug uncovers an infra issue, we'll move it back there. Nir - please take the lead on this.
Eldad, we need logs from the host that becomes non-operational. Please provide these logs from the machine:
/var/log/messages
/var/log/sanlock.log
/var/log/vdsm/vdsm.log

To understand host health during that time, we also need sar logs.

Let's try to reproduce this with a smaller setup:
- one host running engine
- one host running x vms
- same storage used in your full scale test

Can you reproduce it on such a setup? If you can, we will need access to that setup for investigation.
Next time, please open the bug correctly, with the logs attached, especially when it concerns the scale environment, which is changing rapidly.
Since we have higher-priority performance issues on that env, I have customized the lastCheck threshold on the engine side:
MaxStorageVdsTimeoutCheckSec=120
This issue may reproduce on loaded hosts.
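For anyone following the thread, the check under discussion can be sketched roughly as below. This is only an illustration, not vdsm or engine code: the helper name `stale_domains` and the sample dictionary are hypothetical, the field names (`valid`, `delay`, `lastCheck`) are taken from the report above, and the assumption is that the engine simply compares lastCheck against a threshold in seconds. The 54.9 value comes from the engine message in the description; the other numbers are made up.

```python
# Hypothetical sketch of the lastCheck threshold check; not vdsm or engine code.

DEFAULT_THRESHOLD_SEC = 30.0  # engine-side threshold mentioned in the report


def stale_domains(repo_stats, threshold_sec=DEFAULT_THRESHOLD_SEC):
    """Return {domain: lastCheck} for domains whose lastCheck exceeds the threshold.

    repo_stats is assumed to map a domain identifier to a dict with a
    "lastCheck" entry holding seconds since the last check, as a string.
    """
    stale = {}
    for domain, stats in repo_stats.items():
        last_check = float(stats["lastCheck"])
        if last_check > threshold_sec:
            stale[domain] = last_check
    return stale


# Made-up sample shaped like the values described in the report:
# valid is true and delay is low, but one lastCheck is large.
sample = {
    "storage_real": {"valid": True, "delay": "0.0005", "lastCheck": "54.9"},
    "storage_other": {"valid": True, "delay": "0.0004", "lastCheck": "3.2"},
}

print(stale_domains(sample))         # threshold of 30 sec
print(stale_domains(sample, 120.0))  # threshold raised to 120 sec
```

With the 30 sec threshold the sample flags storage_real; with the threshold raised to 120 nothing is flagged. This illustrates why the workaround hides the symptom without explaining why lastCheck grows.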