Description of problem:

RHEL6 guest video playback is worse than Windows guest playback at the same bandwidth. When playing a full-screen YouTube Flash movie in Firefox over a 6 Mbps connection, spice-server sometimes fails to recognize the movie as a stream because the time differences between frames are too large. As a result, the frames are not MJPEG-compressed and performance is bad.

I've made some modifications to the driver in order to investigate this. The most significant problem is that each time a driver allocation fails, the driver does an I/O exit with a call to UPDATE_AREA(all_primary_surface) and then does a second I/O exit with an OOM call. The first io_exit is unnecessary: OOM will handle rendering of the oldest drawables if needed. Removing the UPDATE_AREA call improves video performance.

In addition, there are other enhancements in the Windows driver that are worth considering for the RHEL6 driver:
(1) Copy images from guest memory to device memory using SSE2.
(2) Do not compute a hash value for images that are suspected to be part of a video stream.
(3) Replace the lookup3 hash with murmur (Alon has already done this upstream; it needs to be taken into RHEL).

All the changes described here should affect not only video but the general performance of the driver.
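To make the cost of the extra exit concrete, here is a minimal sketch in C that models the allocation-failure path described above and counts I/O exits with and without the pre-OOM UPDATE_AREA call. It is a toy simulation, not driver code: `SimDevice`, `sim_try_alloc`, `sim_update_area`, `sim_notify_oom`, and `count_io_exits` are all hypothetical names, and the allocator simply fails a fixed number of times before succeeding.

```c
#include <stdbool.h>

/* Hypothetical model of the allocation-failure path. None of these names
 * come from the actual driver; they only illustrate the exit counting. */
typedef struct {
    int failures_left; /* how many more times the allocation will fail */
    int io_exits;      /* total guest-to-host I/O exits so far */
} SimDevice;

/* Pretend allocator: fails until failures_left reaches zero. */
static bool sim_try_alloc(SimDevice *dev)
{
    if (dev->failures_left > 0) {
        dev->failures_left--;
        return false;
    }
    return true;
}

/* Stand-in for UPDATE_AREA(all_primary_surface): costs one I/O exit. */
static void sim_update_area(SimDevice *dev)
{
    dev->io_exits++;
}

/* Stand-in for the OOM notification: costs one I/O exit. */
static void sim_notify_oom(SimDevice *dev)
{
    dev->io_exits++;
}

/* Retry the allocation until it succeeds and return the number of I/O
 * exits that were needed. With pre_oom_update_area set, every failure
 * costs two exits; without it, every failure costs one. */
int count_io_exits(int failures, bool pre_oom_update_area)
{
    SimDevice dev = { failures, 0 };

    while (!sim_try_alloc(&dev)) {
        if (pre_oom_update_area)
            sim_update_area(&dev); /* the exit the attached patch removes */
        sim_notify_oom(&dev);
    }
    return dev.io_exits;
}
```

Under these assumptions, `count_io_exits(3, true)` returns 6 and `count_io_exits(3, false)` returns 3, so dropping the pre-OOM call halves the exits on this path. The real saving depends on how expensive each exit is, which this model does not capture.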
Created attachment 579279: removing the pre-oom update_area io exits
(In reply to comment #0)
> In addition there are other enhancements in the Windows driver that worth
> consideration for the Rhel6 driver:
> (1) copy images from guest memory to the device memory using SSE2
> (2) Do not compute hash value for images that are suspected to be a part of a
> video stream
> (3) replace lookup3 hash with murmur (Alon has already done it upstream. Need
> to take it to rhel).
>
> All the changes described here should affect not only video, but the general
> performance of the driver.

If I'm not mistaken, your bug report is about improving the driver's performance with respect to bandwidth usage, while the points above would improve CPU usage. Without actual scenarios where the CPU is a bottleneck, I'm not sure it is that important to optimize this.

Regarding (1): if the driver is using memcpy, then glibc may provide an SSE2 memcpy version through the STT_GNU_IFUNC mechanism (I have no idea whether such an implementation is already available upstream and in RHEL).
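For reference, a hand-written SSE2 copy of the kind item (1) refers to might look roughly like the sketch below. This is an assumption about the general technique, not code from either driver: `copy_image_sse2` is a hypothetical name, it uses unaligned loads and stores for simplicity, and it falls back to plain memcpy when the compiler does not target SSE2. Whether it beats the glibc memcpy would have to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: copy len bytes from src to dst, 16 bytes at a time
 * when SSE2 is available. The buffers must not overlap. */
void copy_image_sse2(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

#if defined(__SSE2__)
    while (len >= 16) {
        /* Unaligned load and store, so no alignment is required. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        len -= 16;
    }
#endif
    /* Copy the tail, or the whole buffer when SSE2 is not available. */
    memcpy(d, s, len);
}
```

A real implementation would probably align the destination and might use non-temporal stores for large images; those choices are left out here because the thread does not say what the Windows driver does.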
(In reply to comment #2)
> (In reply to comment #0)
> > In addition there are other enhancements in the Windows driver that worth
> > consideration for the Rhel6 driver:
> > (1) copy images from guest memory to the device memory using SSE2
> > (2) Do not compute hash value for images that are suspected to be a part of a
> > video stream
> > (3) replace lookup3 hash with murmur (Alon has already done it upstream. Need
> > to take it to rhel).
> >
> > All the changes described here should affect not only video, but the general
> > performance of the driver.
>
> Your bug report is about improving the driver's performance with respect to
> bandwidth usage if I'm not mistaken while the points above would improve CPU
> usage. Without actual scenarios where CPU is a bottleneck, I'm not sure it is
> that important to optimize this.

No, the bug is about CPU. The update_area issue affects video, for example, because the time difference between frames becomes larger, which makes it harder to classify the frames as a video stream.
For SSE2 and the murmur hash: IIRC, experience with the Windows driver showed that they improved performance significantly. So I'm not sure that optimizing them is unimportant. It's at least worth investigating.

> Regarding (1) if the driver is using memcpy, then glibc may provide a SSE2
> memcpy version through the STT_GNU_IFUNC mechanism (no idea if such an
> implementation is already available upstream and in RHEL)
(In reply to comment #3)
> For the sse2 and murmur hash - IIRC the windows driver experience showed it has
> improved performance significantly. So I'm not sure it is not important to
> optimize them. It's at least worth investigation.
>

All I'm saying is that claiming something needs to be changed because it "improves performance" is useless. We need to know whether it improves CPU usage, throughput, ... and have some (possibly rough) measurements of what was improved. It's very easy to work on "performance improvements" that are visible in microbenchmarks but cause no visible change in the application, because the code that was optimized wasn't a bottleneck in the application.
Raising the priority of this bug, as we have customers affected by this issue.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Reopening. I meant to cond-nak, not nak.