Bug 88978

Summary: [bisected] [SI Scheduler] Graphical corruption in Dota 2
Product: Mesa
Reporter: Nick Sarnie <sarnex>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal
Priority: medium
CC: daniel, tstellar
Version: git
Hardware: All
OS: All
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=88561
Attachments:
  patch to disable the machine scheduler for SI
  R600_DEBUG=ps,vs,gs output for the Talos trace with r227460 (no lockups)
  R600_DEBUG=ps,vs,gs output for the Talos trace with r227461 (lockups)
  dmesg log
  sdiff output for suspected bad shader

Description Nick Sarnie 2015-02-05 03:34:17 UTC
Hi guys. If I use LLVM git, I get these graphical glitches in Dota 2 native.

https://i.imgur.com/I4vyWFt.jpg

The bug has been bisected to LLVM: 51a3c27d6e0c66cc8d2d1da8e9205fec7b74ca5c
 R600/SI: Define a schedule model and enable the generic machine scheduler


I'm using Mesa git, Kernel 3.18.5 and Linux Mint. 

Thanks a lot,

sarnex
Comment 1 smoki 2015-02-05 04:39:11 UTC
Which card are you using? I have already seen these artifacts in bug 88758 and tried to reproduce them on Kabini with the two apitraces I have, but I can't. Those traces are from bug 67887 and bug 88301.

Can you also reproduce the artifacts with either of those two? Or make a new trace; it might be that only a recent game update shows the issue, or that it arises only with particular settings.
Comment 2 Nick Sarnie 2015-02-05 05:17:49 UTC
Hi smoki. I am on a Radeon HD 7950 (TAHITI). If I try the trace from https://bugs.freedesktop.org/show_bug.cgi?id=67887, I DO see the same graphical glitches that I get in-game.


Here is an apitrace I just took: https://idontevenlift.no-ip.org/sarnex_dota_linux.trace.xz


The other apitrace is too large, I'll try it tomorrow.

Thanks,
sarnex
Comment 3 smoki 2015-02-05 05:30:19 UTC
OK, thanks for the trace, but I can't reproduce it with your trace on Kabini either.

Since it happens on the R7 265 and the HD 7950, I guess this is likely a GCN 1.0-only bug.
Comment 4 Tom Stellard 2015-02-05 14:35:01 UTC
(In reply to sarnex from comment #0)

Can you run the game with the environment variable R600_DEBUG=ps,vs,gs set and post the output?
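
One way to capture that output is sketched below in Python. This is only an illustration under assumptions: it merges stderr into stdout so the dump is caught on whichever stream the driver prints it, and the function name, the command, and the log file name are hypothetical placeholders, not anything provided by Mesa.

```python
import os
import subprocess
import sys


def run_with_shader_dump(command, log_path, flags="ps,vs,gs"):
    """Run `command` with R600_DEBUG set and save its combined output to log_path.

    `command` is a list such as ["./game-binary"] (hypothetical placeholder).
    Returns the exit code of the launched process.
    """
    env = dict(os.environ)
    env["R600_DEBUG"] = flags
    with open(log_path, "wb") as log:
        # stderr is redirected into stdout so both streams end up in one log file.
        result = subprocess.run(
            command, env=env, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

For example, `run_with_shader_dump(["./game-binary"], "shaders.log")` would leave the dump in `shaders.log`, which can then be compressed before posting if it is too large to attach.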
Comment 5 Nick Sarnie 2015-02-05 16:54:41 UTC
(In reply to Tom Stellard from comment #4)

Hi Tom, thanks for replying.

The log is here, since it's too big to be an attachment. Skip to near the end to see the in-game bugged time, the beginning is mostly the menu.

Log: http://paste.ubuntu.com/10075950/
Comment 6 Tom Stellard 2015-02-09 03:39:33 UTC
(In reply to sarnex from comment #5)

Thanks. Would you also be able to get a dump using the last good commit?
Comment 7 Nick Sarnie 2015-02-09 15:24:05 UTC
(In reply to Tom Stellard from comment #6)

Hi Tom,

Here is the log from the commit directly before "R600/SI: Define a schedule model and enable the generic machine scheduler"; it shows no graphical issues.

Log: http://paste.ubuntu.com/10143733/

Thanks again,
sarnex
Comment 8 Daniel Scharrer 2015-02-19 19:27:40 UTC
The Mesa patch from bug 88561 comment 6 fixes this for me - at least the glitches with the posted apitrace.
Comment 9 Nick Sarnie 2015-02-20 01:52:48 UTC
(In reply to Daniel Scharrer from comment #8)
> The Mesa patch from bug 88561 comment 6 fixes this for me - at least the
> glitches with the posted apitrace.

Hi Daniel, 

Thanks for the information. The patch from Marek significantly reduces the number of artifacts in Dota 2, but it does not completely fix the issue and I still see a few artifacts per second. It seems that this bug and the Portal bug are related, but there is still an underlying bug somewhere.

Thanks,
sarnex
Comment 10 Nick Sarnie 2015-04-12 15:08:45 UTC
This issue is still present on LLVM git and Mesa git, although the frequency of the corruption is significantly lowered with Marek's patch from https://bugs.freedesktop.org/show_bug.cgi?id=88561#c6
Comment 11 Daniel Scharrer 2015-05-23 20:43:12 UTC
Created attachment 115995: patch to disable the machine scheduler for SI

I can confirm that these glitches are still present on current LLVM + Mesa git with a 7950 (TAHITI).

Glitches happen in various games with different engines (Source, Unity, …). Here is a trace of The Talos Principle (first posted in bug #88561 comment 9), that still produces more than just occasional glitches (even with Marek's patch):
 http://constexpr.org/tmp/Talos-radeonsi.3.trace.xz (147 MiB)

Like sarnex, I have bisected this to LLVM 51a3c27d6e0c66cc8d2d1da8e9205fec7b74ca5c (r227461).
I had to revert b8797a7 and a99a16a in current Mesa git for it to build against that LLVM revision.

Some Source engine games (L4D2, Nuclear Dawn, maybe others) don't just produce graphical glitches but also frequently lock up the GPU since a later change to the machine scheduler (r233366) - see bug #90378.

Disabling the machine scheduler for SI on current LLVM (see attached patch) also fixes both the lockups and the graphical glitches.

Additionally, using R600_DEBUG=switch_on_eop with unpatched LLVM also works around both the graphical glitches and the GPU lockups.
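
To try the switch_on_eop workaround alongside other debug flags, the comma-separated R600_DEBUG value can be built programmatically. The following is a minimal Python sketch; the helper name `with_debug_flag` is hypothetical, and the only assumption about the variable is the comma-separated format already used in this report (ps,vs,gs).

```python
import os


def with_debug_flag(env, flag, var="R600_DEBUG"):
    """Return a copy of `env` with `flag` added to the comma-separated list in `var`."""
    new_env = dict(env)
    # Drop empty entries so an unset or empty variable does not yield a leading comma.
    flags = [f for f in new_env.get(var, "").split(",") if f]
    if flag not in flags:
        flags.append(flag)
    new_env[var] = ",".join(flags)
    return new_env


# Environment for a run with the workaround enabled; it can be passed as env=
# to subprocess.run() when launching a game or replaying a trace.
workaround_env = with_debug_flag(os.environ, "switch_on_eop")
```

Running the same trace once with the unmodified environment and once with `workaround_env` gives a direct comparison of whether the glitches and lockups disappear.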
Comment 12 Daniel Scharrer 2015-05-23 20:45:08 UTC
Created attachment 115996: R600_DEBUG=ps,vs,gs output for the Talos trace with r227460 (no lockups)
Comment 13 Daniel Scharrer 2015-05-23 20:45:43 UTC
Created attachment 115997: R600_DEBUG=ps,vs,gs output for the Talos trace with r227461 (lockups)
Comment 14 Tom Stellard 2015-05-30 01:05:26 UTC
Can you post your dmesg log too?
Comment 15 Daniel Scharrer 2015-05-30 14:08:21 UTC
Created attachment 116176: dmesg log

Here is the dmesg log with Linux 4.0.4-gentoo and LLVM patched to disable the machine scheduler for SI, after replaying both sarnex's trace and mine. I don't have an unpatched LLVM build right now, but I don't remember the dmesg output being different.

The log is compressed because there are lots of GPU faults at the end (bug #87278) which pushed the uncompressed log over the attachment size limit - not sure if you wanted those or just the startup part.

Bug #90378 has the dmesg output for L4D2, including GPU lockups, with an unpatched (but older revision of) LLVM and 4.0.1-gentoo in attachment 115653.
Comment 16 Tom Stellard 2015-06-01 15:00:31 UTC
(In reply to Daniel Scharrer from comment #15)

Would you be able to post an API trace of one of the games that is locking up?
Comment 17 Tom Stellard 2015-06-01 15:08:53 UTC
(In reply to Tom Stellard from comment #16)

Never mind, I think the one you posted already should be enough.
Comment 18 Tom Stellard 2015-06-02 03:55:43 UTC
Created attachment 116227: sdiff output for suspected bad shader

Here is a dump from sdiff of a good shader with no GPU protection faults (left side) and a bad shader that causes GPU protection faults (right side).  Search for the pipe character '|' to find the only difference between the two shaders.

I'm not sure yet why this difference would lead to GPU protection faults.
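
The same kind of comparison can be scripted when many shader dumps need checking. Below is a minimal Python sketch, assuming the good and bad dumps are available as plain text. It uses the standard difflib module instead of sdiff, and the function name is hypothetical. Searching sdiff output for '|' by script is avoided here on the assumption that the character may also occur inside the shader text itself.

```python
import difflib


def changed_blocks(good_text, bad_text):
    """Return (tag, good_lines, bad_lines) for every region where the dumps differ.

    `tag` is one of difflib's opcode names: "replace", "delete" or "insert".
    """
    good_lines = good_text.splitlines()
    bad_lines = bad_text.splitlines()
    matcher = difflib.SequenceMatcher(a=good_lines, b=bad_lines, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, good_lines[i1:i2], bad_lines[j1:j2]))
    return changes
```

Each returned "replace" entry corresponds to what sdiff marks with '|' in its gutter, while "delete" and "insert" entries correspond to lines present on only one side.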
Comment 19 Nick Sarnie 2015-07-31 17:55:26 UTC
I cannot reproduce this on Mesa master; it must have been fixed by the commit "radeonsi: completely rework updating descriptors without CP DMA".
Comment 20 Michel Dänzer 2015-08-01 07:02:47 UTC
Resolving per comment 19, thanks for the update.
