Created attachment 990455 [details]
fix regular expression for parsing dd output

Description of problem:
Similar to #948232, vdsm cannot handle high read speeds on attached storage domains if they cause dd to report an "Infinity" speed (e.g. on NFS due to memory caching). This error occurs with oVirt 3.5.0 on an EL6 setup.

Example VDSM log on the host:

241 Thread-17::DEBUG::2015-02-11 10:01:44,089::fileSD::261::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/rhev/data-center/mnt/nfs.sec.t-labs.tu-berlin.de:_srv_nfs_x4200/bf6a9e33-f659-4354-bc63-f38e96118133/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
242 Thread-17::DEBUG::2015-02-11 10:01:44,272::fileSD::261::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n517 bytes (517 B) copied, 0 s, Infinity B/s\n'; <rc> = 0
243 Thread-67249::DEBUG::2015-02-11 10:01:44,273::BindingXMLRPC::318::vds::(wrapper) client [130.149.230.18]
244 Thread-17::ERROR::2015-02-11 10:01:44,275::misc::221::Storage.Misc::(readspeed) Unable to parse dd output: '517 bytes (517 B) copied, 0 s, Infinity B/s'
245 Thread-17::ERROR::2015-02-11 10:01:44,277::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain bf6a9e33-f659-4354-bc63-f38e96118133 monitoring information
246 Traceback (most recent call last):
247   File "/usr/share/vdsm/storage/domainMonitor.py", line 234, in _monitorDomain
248     self.nextStatus.readDelay = self.domain.getReadDelay()
249   File "/usr/share/vdsm/storage/fileSD.py", line 261, in getReadDelay
250     stats = misc.readspeed(self.metafile, 4096)
251   File "/usr/share/vdsm/storage/misc.py", line 222, in readspeed
252     raise se.MiscFileReadException(path)
253 MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/nfs.sec.t-labs.tu-berlin.de:_srv_nfs_x4200/bf6a9e33-f659-4354-bc63-f38e96118133/dom_md/metadata',)
254 Thread-67249::DEBUG::2015-02-11 10:01:44,277::task::595::Storage.TaskManager.Task::(_updateState) Task=`5695699f-2fa2-4a0e-9891-048ff2fcf70b`::moving from state init -> state preparing
255 Thread-17::DEBUG::2015-02-11 10:01:44,279::domainMonitor::264::Storage.DomainMonitorThread::(_monitorDomain) Domain bf6a9e33-f659-4354-bc63-f38e96118133 changed its status to Invalid

Example log on ovirt-engine:

1704 2015-02-11 10:01:30,678 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) domain bf6a9e33-f659-4354-bc63-f38e96118133:nfs in problem. vds: virt2.sec.t-labs.tu-berlin.de

How reproducible:
So far I have no idea how to reproduce this. It only occurs on one specific host with the supplied NFS storage. The error stems from the dd output '517 bytes (517 B) copied, 0 s, Infinity B/s' being unparseable.

Additional info:
I patched vdsm/storage/misc.py in two places to counter this problem. Patch 1 adjusts the regular expression. Patch 2 returns a very small value instead of 0 when the Infinity case occurs, to ensure that a returned 0 does not cause problems somewhere down the road.
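For illustration only, here is a minimal sketch of a more tolerant parser for dd's summary line. This is not the attached patch; the regex and helper name are assumptions:

```python
import re

# Accepts both the normal "0.000999813 s, 354 kB/s" tail and the
# "0 s, Infinity B/s" tail that currently breaks readspeed().
_DD_SUMMARY = re.compile(
    r"(?P<bytes>\d+) bytes? \([\d.]+ \w*B\) copied, "
    r"(?P<seconds>[\d.eE+-]+) s, "
    r"(?:[\d.]+|Infinity) \w*B/s")

def parse_dd_summary(err):
    """Return (bytes_copied, seconds) from dd's stderr output."""
    for line in err.splitlines():
        m = _DD_SUMMARY.search(line)
        if m:
            return int(m.group("bytes")), float(m.group("seconds"))
    raise ValueError("Unable to parse dd output: %r" % err)
```

With the failing line above, such a parser would return (517, 0.0), and the caller can then decide how to treat a zero interval (for example, clamping it to a small positive value, as patch 2 does).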
Created attachment 990456 [details] patch ensuring a value >0 is returned for runtime of throughput check
Bumping severity, as this may cause a domain to become invalid and lead to a migration storm while the storage is in perfect condition.
(In reply to Tobias Fiebig from comment #0) Tobias, did you get one error in the log, or many such errors in a consistent way?
What do you mean by many such errors? The issue consistently appeared over an extended period of time, repeating every 1-2 minutes. Does this answer your question, or do you need a more specific timeline of how often this happened?
(In reply to Tobias Fiebig from comment #4)
> What do you mean by many such errors? The issue consistently appeared over
> an extended period of time, repeating every 1-2 minutes. Does this answer your
> question, or do you need a more specific timeline of how often this happened?

We need more info to understand this issue:
- vdsm logs showing multiple failures over an extended period of time
- /var/log/messages from the same time frame
- the output of rpm -qa on the affected host
- Do you have ntp configured on this machine? How is it configured? Please also attach ntp-related logs.
Ondrej, can you help us understand why dd is returning zero time in this case? Looking at the upstream source, it seems that dd cannot use a monotonic time source, so it is using system time, and the system time is changing backwards during the read operation.
Created attachment 993940 [details] rpm -qa from affected host
Created attachment 993944 [details] current timing behavior ntp
Created attachment 993945 [details] ntp kernel pll jan 2015
Created attachment 993946 [details] ntp jitter jan 2015
(In reply to Tobias Fiebig from comment #10) > Created attachment 993946 [details] > ntp jitter jan 2015 What are the units in these ntp graphs - seconds? milliseconds? microseconds?
The unit is time in seconds; the m's on the y-axis indicate the millisecond range. Sorry for the confusion, I just took what fell out of Munin.
Created attachment 993948 [details] /var/log/messages jan 18 to jan 25
Created attachment 993963 [details] vdsm log jan 18-jan 25, IPs redacted, compressed
OK, the requested details should now be added. Please poke me if this is insufficient.
Adding back needinfo for Ondrej, see comment 6.
In the source code (src/dd.c:773 in F22 coreutils-8.23), you can see that if the "now" time is the same as or lower than "start_time", it prints "Infinity B/s" and sets delta to 0.

Is this what you are asking for, or do you need something else?
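Expressed in Python pseudocode rather than the actual C (names and the nanosecond constant are illustrative only), the behaviour described above amounts to:

```python
XTIME_PRECISION = 10 ** 9  # gethrxtime() resolution: nanoseconds

def summary_suffix(start_time, now, total_bytes):
    # When the clock did not advance (or moved backwards), dd reports a zero
    # elapsed time and an infinite rate - the exact string vdsm fails to parse.
    if now <= start_time:
        return "0 s, Infinity B/s"
    delta_s = (now - start_time) / float(XTIME_PRECISION)
    return "%g s, %g B/s" % (delta_s, total_bytes / delta_s)
```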
(In reply to Ondrej Vasik from comment #17)
> In the source code (src/dd.c:773 in F22 coreutils-8.23), you can see that if
> the "now" time is the same as or lower than "start_time", it prints "Infinity B/s"
> and sets delta to 0.
>
> Is this what you are asking for, or do you need something else?

It is clear that we get Infinity when the elapsed time is zero or negative, but it is unlikely that a read operation over the network (NFS in this case) will finish before it was started, or take less than a nanosecond (clock_gettime) or a microsecond (gettimeofday).

I suspected that this was caused by dd using system time (because a monotonic time source was not available), and the system time changing backwards during a read operation. However, we have thousands of such logs, and it is unlikely that this happens so frequently.

We can easily fix vdsm to parse dd output when it returns zero seconds and infinity bytes per second, but we are worried that there is a bigger issue behind this. There is also the issue of handling zero time in the upper layers of vdsm/ovirt, where it can break code expecting positive values for read delay.

We need help understanding why we get thousands of read operations taking zero time.

This is the dd command we use:

/bin/dd iflag=direct if=/rhev/data-center/mnt/_dev_mapper_vg__virt2-lv__iso/06d8c1c4-3d0f-4f57-af00-b4a1efbf2b32/dom_md/metadata bs=4096 count=1

The input (if=) is an NFS mount in this case.

This is an example of unexpected output:

0+1 records in
0+1 records out
354 bytes (354 B) copied, 0 s, Infinity B/s

And this is expected output:

0+1 records in
0+1 records out
354 bytes (354 B) copied, 0.000999814 s, 354 kB/s

In the attached logs, we have two threads running the dd command above for two NFS mounts, every 10 seconds: Thread-80 and Thread-16. For Thread-80, we have 7969 runs, 2508 returning zero/Infinity. For Thread-16, we have 7961 runs, 1171 returning zero/Infinity.
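For reference, a sketch of how such per-thread counts can be pulled out of vdsm.log; the file name and the exact regex are assumptions based on the log excerpts in the next comment:

```python
import collections
import re

SUCCESS = re.compile(
    r"^(Thread-\d+)::DEBUG::.*\(getReadDelay\) SUCCESS: <err> = '([^']*)'")

runs = collections.Counter()      # total getReadDelay runs per thread
infinity = collections.Counter()  # runs reporting "0 s, Infinity B/s"

with open("vdsm.log") as log:
    for line in log:
        m = SUCCESS.match(line)
        if not m:
            continue
        thread, err = m.groups()
        runs[thread] += 1
        if "Infinity" in err:
            infinity[thread] += 1

for thread in sorted(runs):
    print("%s: %d runs, %d zero/Infinity" % (thread, runs[thread], infinity[thread]))
```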
See attachments thread-80-results.log and thread-16-results.log.

Here are typical runs from thread-80-results.log:

Thread-80::DEBUG::2015-01-18 23:35:44,612::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.000999811 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:35:54,749::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099981 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:36:04,792::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099981 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:36:14,836::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099981 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:36:24,879::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0 s, Infinity B/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:36:34,953::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.000999811 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:36:44,996::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0 s, Infinity B/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:36:55,086::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099981 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:37:05,129::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0 s, Infinity B/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:37:15,176::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099981 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:37:25,220::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099981 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:37:35,309::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0 s, Infinity B/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:37:45,377::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.000999811 s, 354 kB/s\n'; <rc> = 0
Thread-80::DEBUG::2015-01-18 23:37:55,437::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n354 bytes (354 B) copied, 0.00099
Created attachment 994316 [details] dd results from thread-16
Created attachment 994317 [details] dd results from thread-80
Honestly, broken system time is a rather good explanation. I do not really see how there could be another underlying issue, then.

Zeros in the upper layers could be handled as in the second attachment, though.
(In reply to Tobias Fiebig from comment #21)
> Honestly, broken system time is a rather good explanation. I do not really
> see how there could be another underlying issue, then.

System time should not behave like that. Most code assumes that system time is going forward 99.9999% of the time. If we have a system issue here, we should open a bug for the component responsible for it.

> Zeros in the upper layers could be handled as in the second attachment, though.

Sure, but we are not very happy with this approach. I would like to pass the zero value as is and let the upper layer handle it if possible.
dd uses the gethrxtime() gnulib function. It tries to use monotonic clocks where available, but if not, there is a fallback to a simple gettime(), which may jump backwards (this is even noted in the comments) - it uses the system functions mentioned in comment #18. Using strace would probably reveal that the system clock really went backwards.

I don't think I will be able to identify the culprit - it will likely be somewhere in the kernel system time module. Adding P. Brady - the upstream maintainer of coreutils - to see if he has some other ideas about what might be the culprit.
Well, a flaky time source on the system could also be an option. Looking at those ntp graphs, I think that this may be a reasonable explanation, with ntp constantly correcting (sometimes backwards).

Considering the upper-layer thing: if you want, I can disable that part of my patch on that host and see how this influences the system.
Created attachment 994398 [details]
System time debugging tool

This is a simple tool that logs both monotonic time and system time every second. Running it for an hour or so on the host experiencing this issue will reveal whether the problem is the system time going backwards.
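For reference, a rough Python 3 sketch of what such a logger could do; the attached tool itself may be implemented differently, and the output format here is an assumption:

```python
import time

prev_real = None
while True:
    mono = time.clock_gettime(time.CLOCK_MONOTONIC)
    real = time.clock_gettime(time.CLOCK_REALTIME)
    # One line per second: monotonic timestamp, then system (wall-clock) timestamp.
    print("%.9f %.9f" % (mono, real), flush=True)
    if prev_real is not None and real < prev_real:
        print("WARNING: system time went backwards by %.9f s" % (prev_real - real),
              flush=True)
    prev_real = real
    time.sleep(1)
```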
(In reply to Tobias Fiebig from comment #24)
> Well, a flaky time source on the system could also be an option. Looking at
> those ntp graphs, I think that this may be a reasonable explanation, with ntp
> constantly correcting (sometimes backwards).

It will be interesting to try the tool in attachment 994398 [details]. Please run it on the affected system for an hour or so and share its output.

> Considering the upper-layer thing: if you want, I can disable that part of my
> patch on that host and see how this influences the system.

That can be useful.
Using strace on RHEL 6.6, we see that dd is using CLOCK_MONOTONIC:

$ strace dd if=/dev/zero of=/dev/null bs=4096 count=1
...
write(2, "4096 bytes (4.1 kB) copied", 264096 bytes (4.1 kB) copied) = 26
clock_gettime(CLOCK_MONOTONIC, {1859546, 419736614}) = 0
write(2, ", 0.000364096 s, 11.2 MB/s\n", 27, 0.000364096 s, 11.2 MB/s
) = 27
...

Tobias, can you try this on your host and attach the output?
Created attachment 994645 [details] timelog from affected machine over ~2h time log of the affected machine
Created attachment 994646 [details]
time_log analyzer

A small .py that parses all lines and reports an error if time went backwards.
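A minimal sketch of such an analyzer, assuming each line of the time log holds a monotonic timestamp followed by a system timestamp; the attached script and the actual log format may differ:

```python
import sys

prev_mono = prev_real = None
with open(sys.argv[1]) as log:
    for lineno, line in enumerate(log, 1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mono, real = float(fields[0]), float(fields[1])
        if prev_mono is not None and mono < prev_mono:
            print("line %d: monotonic time went backwards" % lineno)
        if prev_real is not None and real < prev_real:
            print("line %d: system time went backwards by %.9f s"
                  % (lineno, prev_real - real))
        prev_mono, prev_real = mono, real
```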
Created attachment 994647 [details] strace-log for dd
dd behaves as expected on the affected system: write(2, "4096 bytes (4.1 kB) copied", 264096 bytes (4.1 kB) copied) = 26 clock_gettime(CLOCK_MONOTONIC, {1634838, 381835004}) = 0 write(2, ", 0.00199961 s, 2.0 MB/s\n", 25, 0.00199961 s, 2.0 MB/s ) = 25
(In reply to Tobias Fiebig from comment #31)
> dd behaves as expected on the affected system

Checking the time log shows no issue; monotonic time and real time seem to advance consistently.

While the time log was collected, did you see any "Infinity" in getReadDelay in the vdsm log?
Created attachment 994683 [details] Tool for measuring the practical resolution of clock_gettime()
Another possibility: while clock_gettime returns its value in nanoseconds, its practical resolution is lower. For example, on this machine (8 cores):

$ cat /proc/cpuinfo | grep 'model name' | head -n 1
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

I get a practical resolution of about 1150 nanoseconds:

1911247.946452809 (0.000001149)
1911247.946453938 (0.000001129)
1911247.946455084 (0.000001146)
1911247.946456229 (0.000001145)
1911247.946457374 (0.000001145)
1911247.946458514 (0.000001140)
1911247.946459662 (0.000001148)
1911247.946460866 (0.000001204)

Please build and run the attached timediff tool on your host to see what the practical resolution is there. If your storage is very fast, or there is very successful caching involved, and the operation takes less than the practical resolution of the monotonic time source, we will get zero time.
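As an illustration only, a rough Python equivalent of that measurement loop is shown below. Note that interpreter call overhead will dominate the numbers in Python, so the attached timediff tool (which needs to be built) is the right thing to run on the host; the output format here just mimics the sample above:

```python
import time

prev = time.clock_gettime(time.CLOCK_MONOTONIC)
for _ in range(10):
    now = time.clock_gettime(time.CLOCK_MONOTONIC)
    while now <= prev:  # spin until the reported clock value actually advances
        now = time.clock_gettime(time.CLOCK_MONOTONIC)
    # Print the timestamp and the smallest observed increment.
    print("%.9f (%.9f)" % (now, now - prev))
    prev = now
```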
Created attachment 994701 [details] timediff log
tfiebig@smith Downloads % cat ./timediff_log|awk '{print $2}'|sort|uniq -c
177 (0.000000001)
175 (0.000999812)
9462 (0.000999813)
183 (0.000999814)
1 (0.001999626)
1 (0.010997944)
1 (0.011997756)

Based on this, I guess your assumption is correct.
(In reply to Tobias Fiebig from comment #36)
> tfiebig@smith Downloads % cat ./timediff_log|awk '{print $2}'|sort|uniq -c
> 177 (0.000000001)
> 175 (0.000999812)
> 9462 (0.000999813)
> 183 (0.000999814)
> 1 (0.001999626)
> 1 (0.010997944)
> 1 (0.011997756)
>
> Based on this, I guess your assumption is correct.

So this is not very fast storage, but erratic clock_gettime(). We did not measure how many times we get the same value from clock_gettime; that could be useful to detect a delay caused by scheduling.

So I think we should continue in two directions:
1. coreutils developers can consider a more useful return value when start_time == now.
2. You can open a kernel bug to understand whether this behavior is expected. Please cc me if you open such a bug.

On the vdsm side, I think the attached patch (http://gerrit.ovirt.org/38082) is correct.

Thanks for your help, Tobias!
This bug reproduced on oVirt 3.6 master; are we sure the fix has been merged?
(In reply to Ori Gofen from comment #38)
> This bug reproduced on oVirt 3.6 master; are we sure the fix has been
> merged?

Yes, the fix is merged. Can you please describe the steps you used to reproduce?
(In reply to Ori Gofen from comment #38)
> This bug reproduced on oVirt 3.6 master; are we sure the fix has been
> merged?

What commit hash?
Just a point on the coreutils "Infinity" output. That's a valid floating point value as can be seen with `sleep Infinity`. I'm not sure what coreutils should do differently here.
(In reply to Allon Mureinik from comment #40)
> (In reply to Ori Gofen from comment #38)
> > This bug reproduced on oVirt 3.6 master; are we sure the fix has been
> > merged?
> What commit hash?

4427cee24bf1d9b02b2e423d3873cd30ac10e0b7
Ori, can you paste the exception you get in the log?
Created attachment 1032780 [details]
logs

Steps taken to verify:
1. Created a fake dd with a static output: "517 bytes (517 B) copied, 0 s, Infinity B/s"
2. Viewed the vdsm log
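For the record, such a fake dd can be as small as the sketch below. This is an assumption about how the verification could be done, not the actual script used; real dd prints its summary on stderr and exits 0, so the fake does the same:

```python
#!/usr/bin/env python
# Hypothetical stand-in for dd, used only to force the "Infinity" case.
import sys

sys.stderr.write("0+1 records in\n"
                 "0+1 records out\n"
                 "517 bytes (517 B) copied, 0 s, Infinity B/s\n")
sys.exit(0)
```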
Created attachment 1044333 [details]
logs

Reproduced again on:
vdsm-cli-4.17.0-1054.git562e711.el7.noarch
vdsm-jsonrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-python-4.17.0-1054.git562e711.el7.noarch
vdsm-4.17.0-1054.git562e711.el7.noarch
vdsm-infra-4.17.0-1054.git562e711.el7.noarch
vdsm-yajsonrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-xmlrpc-4.17.0-1054.git562e711.el7.noarch

Nir, this should probably be moved to Assigned
(In reply to Ori Gofen from comment #45)
> Created attachment 1044333 [details]
> logs
>
> Reproduced again on

Please describe how you reproduced the issue: what action you took, what the expected behavior was, and what the actual behavior is.
Verified on 3.6.0.3; getReadDelay handles 'Infinity'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html