Bug 104767

Summary: [CI][BAT] igt@pm_rpm@(module-reload|drm-resources-equal) - dmesg-fail / fail - Failed assertion: c1->count_modes == c2->count_modes
Product: DRI
Component: DRM/Intel
Reporter: Martin Peres <martin.peres>
Assignee: Dhinakaran Pandiyan <dhinakaran.pandiyan>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
Status: CLOSED DUPLICATE
Severity: normal
Priority: high
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: BDW, BXT, GLK, KBL
i915 features: display/Other

Description Martin Peres 2018-01-24 14:00:17 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3677/shard-kbl6/igt@pm_rpm@drm-resources-equal.html

(pm_rpm:1611) CRITICAL: Test assertion failure function assert_drm_connectors_equal, file pm_rpm.c:493:
(pm_rpm:1611) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
(pm_rpm:1611) CRITICAL: Last errno: 2, No such file or directory
(pm_rpm:1611) CRITICAL: error: 52 != 34
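
For readers unfamiliar with the check, the failed assertion compares the mode count reported by two snapshots of the same connector. The following is only a minimal sketch of that kind of comparison, not the IGT source: `struct connector_snapshot` and `connector_mode_counts_equal()` are hypothetical names, and the struct is a simplified stand-in for libdrm's `drmModeConnector`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for libdrm's drmModeConnector. */
struct connector_snapshot {
	unsigned int connector_id;
	int count_modes;
};

/*
 * Returns true when both snapshots report the same number of modes.
 * On mismatch, prints the two counts in the same "a != b" form as the log.
 */
static bool connector_mode_counts_equal(const struct connector_snapshot *c1,
					const struct connector_snapshot *c2)
{
	if (c1->count_modes != c2->count_modes) {
		fprintf(stderr, "connector %u: %d != %d\n",
			c1->connector_id, c1->count_modes, c2->count_modes);
		return false;
	}
	return true;
}
```

With the counts from the log above (52 and 34), such a check would report a mismatch.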
Comment 2 Martin Peres 2018-06-19 12:46:04 UTC
Not seen since drmtip_37 (1 month, 1 week / 28 runs ago), even though it was seen once before on drmtip_31.
Comment 3 Jani Saarinen 2018-06-19 14:25:54 UTC
Closing, thanks.
Comment 4 Martin Peres 2018-07-23 15:21:28 UTC
It is back!

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_74/fi-glk-j4005/igt@pm_rpm@drm-resources-equal.html

(pm_rpm:1781) CRITICAL: Test assertion failure function assert_drm_connectors_equal, file ../tests/pm_rpm.c:506:
(pm_rpm:1781) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
(pm_rpm:1781) CRITICAL: error: 23 != 40
Subtest drm-resources-equal failed.
Comment 5 Martin Peres 2018-09-06 13:25:59 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4780/fi-glk-j4005/igt@pm_rpm@module-reload.html

(pm_rpm:3993) CRITICAL: Test assertion failure function assert_drm_connectors_equal, file ../tests/pm_rpm.c:507:
(pm_rpm:3993) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
(pm_rpm:3993) CRITICAL: error: 40 != 43
Subtest module-reload failed.
Comment 6 Martin Peres 2018-10-23 13:40:15 UTC
Also seen on APL: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1980/fi-apl-guc/igt@pm_rpm@module-reload.html

Starting subtest: module-reload
(pm_rpm:3480) CRITICAL: Test assertion failure function assert_drm_connectors_equal, file ../tests/pm_rpm.c:512:
(pm_rpm:3480) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
(pm_rpm:3480) CRITICAL: error: 45 != 32
Subtest module-reload failed.
Comment 7 Dhinakaran Pandiyan 2018-11-08 23:21:13 UTC
(In reply to Martin Peres from comment #6)
> Also seen on APL:
> https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1980/fi-apl-guc/
> igt@pm_rpm@module-reload.html
> 
> Starting subtest: module-reload
> (pm_rpm:3480) CRITICAL: Test assertion failure function
> assert_drm_connectors_equal, file ../tests/pm_rpm.c:512:
> (pm_rpm:3480) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
> (pm_rpm:3480) CRITICAL: error: 45 != 32
> Subtest module-reload failed.

There is a variation in the number of modes that are getting pruned. The driver decides the max supported dot clock based on source and sink capabilities; given that the source-side capabilities do not change, it is most likely the value returned by drm_dp_downstream_max_clock() for the lspcon device that is changing.
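
The pruning described here can be illustrated with a short sketch. This is not the i915 code: `struct mode`, `count_unpruned_modes()` and both limit parameters are hypothetical. It assumes that the effective dot clock limit is the smaller of the source and sink limits, and that any mode above it is dropped.

```c
#include <stddef.h>

/* Hypothetical, simplified mode description; dot clock in kHz. */
struct mode {
	int hdisplay;
	int vdisplay;
	int clock_khz;
};

/*
 * Counts the modes that survive pruning, assuming the effective limit is
 * the smaller of the source and sink maximum dot clocks.
 */
static size_t count_unpruned_modes(const struct mode *modes, size_t n,
				   int source_max_khz, int sink_max_khz)
{
	int max_khz = source_max_khz < sink_max_khz ? source_max_khz
						    : sink_max_khz;
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (modes[i].clock_khz <= max_khz)
			kept++;
	}
	return kept;
}
```

Under these assumptions, a sink-side limit that differs between two probes changes the returned count even when the mode list and the source limit are identical, which is the kind of count_modes mismatch the test reports.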
Comment 8 Dhinakaran Pandiyan 2018-11-08 23:37:02 UTC
(In reply to Dhinakaran Pandiyan from comment #7)
> (In reply to Martin Peres from comment #6)
> > Also seen on APL:
> > https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1980/fi-apl-guc/
> > igt@pm_rpm@module-reload.html
> > 
> > Starting subtest: module-reload
> > (pm_rpm:3480) CRITICAL: Test assertion failure function
> > assert_drm_connectors_equal, file ../tests/pm_rpm.c:512:
> > (pm_rpm:3480) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
> > (pm_rpm:3480) CRITICAL: error: 45 != 32
> > Subtest module-reload failed.
> 
> There is a variation in the number of modes that are getting pruned. The
> driver decides the max supported dot clock based on source and sink
> capabilities; given that the source side caps. do not change, it most likely
> is the drm_dp_downstream_max_clock() from the lspcon device that is changing.
Scratch that, the logs show lspcon link training failures.
Comment 9 Dhinakaran Pandiyan 2018-11-09 03:20:13 UTC
(In reply to Martin Peres from comment #5)
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4780/fi-glk-j4005/
> igt@pm_rpm@module-reload.html
> 
> (pm_rpm:3993) CRITICAL: Test assertion failure function
> assert_drm_connectors_equal, file ../tests/pm_rpm.c:507:
> (pm_rpm:3993) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
> (pm_rpm:3993) CRITICAL: error: 40 != 43
> Subtest module-reload failed.

These failures are different from those on fi-apl-guc, so we should split this bug.
* fi-apl-guc has lspcon link training failures and should possibly be merged with https://bugs.freedesktop.org/show_bug.cgi?id=108529

* fi-glk-j4005 has EDID read errors.
Comment 10 James Ausmus 2018-12-11 18:50:13 UTC
Should this really be Highest? I'm seeing a 0.3% reproduction rate, last seen almost a month ago. I believe we have other bugs affecting BAT that reproduce more often than 0.3% and better warrant a "Highest" priority.

Thoughts?
Comment 11 Martin Peres 2018-12-11 19:50:02 UTC
(In reply to James Ausmus from comment #10)
> Should this really be Highest? I'm seeing a 0.3% reproduction rate, last
> seen almost a month ago. I believe we have other bugs that affect BAT that
> reproduce more than 0.3%, that more warrant a "Highest" priority.
> 
> Thoughts?

Agreed. Neither the customer impact nor the severity seems that great. It could be demoted to high.
Comment 12 Dhinakaran Pandiyan 2018-12-18 22:56:46 UTC
(In reply to Dhinakaran Pandiyan from comment #9)
> (In reply to Martin Peres from comment #5)
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4780/fi-glk-j4005/
> > igt@pm_rpm@module-reload.html
> > 
> > (pm_rpm:3993) CRITICAL: Test assertion failure function
> > assert_drm_connectors_equal, file ../tests/pm_rpm.c:507:
> > (pm_rpm:3993) CRITICAL: Failed assertion: c1->count_modes == c2->count_modes
> > (pm_rpm:3993) CRITICAL: error: 40 != 43
> > Subtest module-reload failed.
> 
> These failures are different from those on fi-apl-guc, we should split this
> bug.
> * fi-apl-guc has lspcon link training failures and should possibly be merged
> with https://bugs.freedesktop.org/show_bug.cgi?id=108529
> 
> * fi-glk-j4005 has edid read errors.
The problematic display was already changed on this machine; let's open a new bug if this machine has any new issues.


To summarize, the only remaining problem is the lspcon link training failures on fi-apl-guc.

Martin,

How can I get the last failure of this signature on fi-apl-guc? Also, can you make cibuglog track only failures on this machine for this bug?
Comment 13 Dhinakaran Pandiyan 2018-12-19 19:07:42 UTC
Machine: fi-apl-guc
LSPCON link training failures are already being debugged as part of https://bugs.freedesktop.org/show_bug.cgi?id=103313. Given that the LSPCON chip (DP branch: OUI 00-60-ad dev-ID MC2800 HW-rev 2.2 SW-rev 1.75 quirks 0x0000) is the same, let's mark this bug a duplicate so that we have updates in one place.

*** This bug has been marked as a duplicate of bug 103313 ***
Comment 14 Martin Peres 2019-07-02 09:55:12 UTC
Seen on average once every 124 runs; now not seen for 2232 runs.
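
As a rough sanity check on these numbers, one can estimate how likely such a quiet streak would be if the failure were still present. The sketch below assumes each run fails independently with probability 1/124, which is a simplification; `prob_no_failure()` is a hypothetical helper, not part of any CI tooling.

```c
/*
 * Probability of `runs` consecutive passes when each run fails
 * independently with probability 1 / runs_per_failure.
 */
static double prob_no_failure(unsigned int runs, double runs_per_failure)
{
	double pass = 1.0 - 1.0 / runs_per_failure;
	double p = 1.0;

	for (unsigned int i = 0; i < runs; i++)
		p *= pass;
	return p;
}
```

Calling `prob_no_failure(2232, 124.0)` gives a value on the order of 1e-8, so under this assumption a streak of 2232 clean runs would be extremely unlikely if the issue still reproduced at the old rate.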
Comment 15 CI Bug Log 2019-07-02 10:25:20 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.
