The following commit regresses performance hugely with DRI3 in synthetic benchmarks, on both Sandybridge and Broxton:

commit 1e3218bc5ba2b739261f0c0bacf4eb662d377236
Author:     Michel Dänzer <michel.daenzer@amd.com>
AuthorDate: Wed Aug 17 17:02:04 2016 +0900
Commit:     Michel Dänzer <michel@daenzer.net>
CommitDate: Thu Aug 25 17:40:24 2016 +0900

    loader/dri3: Overhaul dri3_update_num_back

    Always use 3 buffers when flipping. With only 2 buffers, we have to
    wait for a flip to complete (which takes non-0 time even with
    asynchronous flips) before we can start working on the next frame. We
    were previously only using 2 buffers for flipping if the X server
    supports asynchronous flips, even when we're not using asynchronous
    flips. This could result in bad performance (the referenced bug report
    is an extreme case, where the inter-frame stalls were preventing the
    GPU from reaching its maximum clocks).

    I couldn't measure any performance boost using 4 buffers with
    flipping. Performance actually seemed to go down slightly, but that
    might have been just noise.

    Without flipping, a single back buffer is enough for swap interval 0,
    but we need to use 2 back buffers when the swap interval is non-0,
    otherwise we have to wait for the swap interval to pass before we can
    start working on the next frame. This condition was previously
    reversed.

    Cc: "12.0 11.2" <mesa-stable@lists.freedesktop.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=97260
    Reviewed-by: Frank Binns <frank.binns@imgtec.com>
    Reviewed-by: Eric Anholt <eric@anholt.net>

Reverting the commit restores the earlier performance (the bisect was done on Broxton and the revert was tested on Sandybridge, so the same commit is the problem on both). The impact is larger for tests with higher FPS, and naturally affects only onscreen versions of the tests. Both fullscreen and windowed+composited tests were affected. On Sandybridge the impact is up to 35% (SynMark Batch tests), 25% in the GpuTest Triangle test, and less in other tests.
On Broxton the drop affects more tests (due to the better GPU, heavier tests have higher FPS), even a few tests that are normally fully ALU bound:
* SynMark v6: up to 40% (Batch tests)
* GfxBench v4: 35% ALU, 25% Driver, 10% Tess tests
* Lightsmark 2008: 20%
* GpuTest 0.7: 15% in the Triangle, Julia32 & Plot3D tests
* GLB 2.7: 10% Egypt

The change doesn't seem to affect HSW, BDW or SKL; I don't know why. The issue doesn't seem to be related to FPS (occasionally) being limited to some multiple of 60 FPS like in the earlier DRI3 perf bug. My assumption is that the buffering change indirectly affects some memory setting, but I don't know which, as SNB & BXT differ in that respect:
* SNB has LLC, but BXT doesn't
* AFAIK the Intel DDX supports SNA for SNB, but not yet for BXT
Please attach the corresponding Xorg log file. It would also be interesting to know which path dri3_update_num_back takes before and after the change for each affected test.
Created attachment 126171 [details]
Xorg log

(In reply to Michel Dänzer from comment #1)
> Please attach the corresponding Xorg log file.

Attached. It didn't seem to have anything significant (the only differences from the version before the issue are timestamps, build IDs and some additional input device lines).
Does the problem also occur with the modesetting driver instead of intel? If yes, I'm afraid we can't make progress on this bug without seeing at least the number of buffers used before and after my change in the affected cases, preferably also the values of the variables used to decide that number.
Created attachment 126214 [details] [review]
Always use at least two buffers

Does this patch fix the problem?
(In reply to Michel Dänzer from comment #4)
> Created attachment 126214 [details] [review] [review]
> Always use at least two buffers
>
> Does this patch fix the problem?

Yes, but that's not all...

BXT performance rose a lot in the affected programs at the end of last week (for some more than the drop, for some less). This was due to Chris' commit adding the BXT & KBL PCI IDs to xf86-video-intel. I assume that with this, X will be using SNA on BXT too, instead of the "default acceleration" that uses the (slow) legacy blitter.

So I did a few tests:
* BXT: reverting the "Overhaul dri3_update_num_back" commit with the new X DDX didn't affect the performance (same good one)
* BXT: with the old X DDX, the attached patch fixes the drops
* BXT: with the new X DDX, the attached patch doesn't give any additional improvement
* SNB: the attached patch fixes the drops

Comments:
- This isn't necessarily a Mesa issue, but an X DDX one
- Something is fishy with SNB, on which it would be good to get a comment from Chris

Still TODO:
- Check modesetting driver performance before and after the commit
- Provide the requested info from GDB
With modesetting:
* BXT: similar to the new X DDX, perf OK (somewhat below the X DDX)
* SNB: perf still bad, but better than with the Intel DDX

-> On SNB the attached patch is needed to get performance back to the previous level regardless of the DDX.

The backtraces are from BXT, both with the old (last month's) X Intel DDX, before and after the commit. Earlier dri3_update_num_back() was called when getting buffers; now it's called when swapping.
Created attachment 126221 [details] backtrace before change
Created attachment 126222 [details] backtrace after change
Thanks for the information and testing! Fix pushed to master:

Module: Mesa
Branch: master
Commit: dc3bb5db8c81e7f08ae12ea5d3ee999e2afcbfd1
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=dc3bb5db8c81e7f08ae12ea5d3ee999e2afcbfd1

Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Tue Sep 6 11:34:49 2016 +0900

    loader/dri3: Always use at least two back buffers
Verified. This also affected KBL (for which the Intel DDX still doesn't seem to use SNA), and potentially BYT too.