Ooooh, that one is nasty :)
I've actually seen this before, and can explain the basic idea of why it happens.
There are multiple "sources" for STAT IRQs, and they seem to share an internal STAT interrupt flag line, and the real STAT interrupt is edge-triggered based on this line. If we only consider mode changes, this is still fairly simple, but LYC=LY coincidence complicates matters, and I'll leave it out for the moment.
Since we're only looking at modes, you can think of the internal interrupt line as if it was something like this:
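A minimal Python sketch of that idea, considering only the mode sources. The function name and the `modeX_irq_enabled` parameter names are placeholders chosen for illustration, and the whole thing is a model of the description above, not a confirmed hardware equation:

```python
def internal_stat_irq_line(mode, mode0_irq_enabled, mode1_irq_enabled, mode2_irq_enabled):
    """Level of the shared internal STAT line, looking at mode sources only.

    Assumption: each of modes 0, 1 and 2 contributes when it is the current
    mode and its enable bit is set. Mode 3 has no enable bit, so the line can
    never be held high by a mode source while in mode 3.
    """
    return int(bool(
        (mode == 0 and mode0_irq_enabled)
        or (mode == 1 and mode1_irq_enabled)
        or (mode == 2 and mode2_irq_enabled)
    ))
```

The result is a level (0 or 1), not an event: it just answers "would some enabled source want an interrupt right now?".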
Now, the actual STAT interrupt involving the normal CPU and IE/IF machinery is only fired when this internal line transitions from 0 to 1. When does that happen? For example when changing to a new mode.
Unlike IF, this internal line is not a flag that gets cleared when the interrupt is handled. It's more like a level signal saying "should there be an interrupt right now", and it keeps its state until the source bits change.
Let's say we start with this:
We move on to mode=3, and finally to mode=0, which changes internal_stat_irq_line to 1, which is an edge from 0 to 1 and causes an interrupt. This is quite simple, but what about:
We move on to mode=0, which triggers a 0->1 transition. Regardless of whether an interrupt is actually handled (depending on IE/IF/IME), we get internal_stat_irq_line=1. Here's the interesting part: we then move on to mode=2, where internal_stat_irq_line is also 1, but since the line is already 1, there is no transition! No interrupt!
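Both cases can be replayed with a small edge-detector sketch in Python. The starting states are my assumption, picked to match the descriptions: mode=2 with only the mode 0 source enabled for the first case, and mode=3 with the mode 0 and mode 2 sources enabled for the second. The helper names are made up for the example.

```python
def stat_line(mode, enabled_modes):
    """Internal line level from mode sources only (mode 3 has no enable bit)."""
    return mode != 3 and mode in enabled_modes


def modes_that_fire(start_mode, later_modes, enabled_modes):
    """Return the modes whose entry causes a 0->1 edge on the internal line."""
    line = stat_line(start_mode, enabled_modes)
    fired = []
    for mode in later_modes:
        new_line = stat_line(mode, enabled_modes)
        if new_line and not line:  # rising edge: this requests a STAT interrupt
            fired.append(mode)
        line = new_line            # the level persists; nothing "clears" it
    return fired


# Case 1: only mode 0 enabled, 2 -> 3 -> 0. Entering mode 0 fires.
simple_case = modes_that_fire(2, [3, 0], enabled_modes={0})

# Case 2: mode 0 and mode 2 enabled, 3 -> 0 -> 2.
# Entering mode 0 fires, entering mode 2 does not (line is already 1).
blocked_case = modes_that_fire(3, [0, 2], enabled_modes={0, 2})

# Continuing case 2 through mode 3 lets the line drop, so mode 0 fires again.
next_line_case = modes_that_fire(3, [0, 2, 3, 0], enabled_modes={0, 2})
```

Note that the mode 2 interrupt is never requested in this model as long as mode 0 is also enabled and mode 0 is directly followed by mode 2.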
So, interrupts only happen if the internal irq line gets a chance to go to 0. Therefore we need to understand what things affect the internal line, and this is where it gets really complicated. I'm just guessing here at this point, but how about:
- Writes to STAT modeX_irq_enabled bits
- Mode changes
- If LYC=LY interrupt is enabled, current values of LYC and LY
- If LYC=LY interrupt is enabled, any written values of LYC and LY
- Whatever internally happens when entering/being inside of/exiting vblank
- Anything that changes the timing of the previous things. For example, the timing of mode0, mode3, and probably even mode1 depends on the current values of, and writes to, many GPU registers and data areas
And all these with sub-M-cycle accuracy, which is needed because the behaviour depends on the precise T-cycle timing of when writes have an effect. And my research about the general GPU timings is still incomplete and we can't even accurately predict the answer to the question: when does mode=0 start?
We've already seen in the timer tests that the timer increments on transitions of certain register and counter bits, regardless of what caused the transition. It's just the same thing here.
For example, let's imagine we have all STAT IRQ sources enabled. This means that the mode sources only allow the internal line to transition to 0 in mode=3, and LYC=LY source only if LYC!=LY. If we cleverly change LYC in sync with the LY values, we can keep the internal irq line as 1 even in mode=3, and never get an interrupt! This is just conjecture, but entirely possible.
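Here is a rough Python sketch of that thought experiment. It assumes the LYC=LY source is simply ORed with the mode sources, and it completely ignores when the LYC writes actually take effect, which is exactly the part that matters on real hardware. So treat it as an illustration of the conjecture, not as evidence for it.

```python
def stat_line(mode, ly, lyc):
    """Internal line with every STAT source enabled (assumed to be a plain OR)."""
    mode_source = mode in (0, 1, 2)  # mode 3 has no source
    lyc_source = ly == lyc
    return mode_source or lyc_source


def count_rising_edges(states):
    """Count 0->1 transitions over a sequence of (mode, ly, lyc) states."""
    levels = [stat_line(*state) for state in states]
    return sum(1 for prev, cur in zip(levels, levels[1:]) if cur and not prev)


# Three visible lines, each going through modes 2 -> 3 -> 0.
# LYC fixed at a value LY never reaches here: the line drops in every mode 3.
fixed_lyc = [(mode, ly, 0xFF) for ly in range(3) for mode in (2, 3, 0)]

# LYC rewritten to follow LY: the LYC=LY source keeps the line at 1 in mode 3.
tracking_lyc = [(mode, ly, ly) for ly in range(3) for mode in (2, 3, 0)]

edges_fixed = count_rising_edges(fixed_lyc)        # one edge per line
edges_tracking = count_rising_edges(tracking_lyc)  # line never drops
```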
And let's not forget that if the GPU decides to pulse some important source bit just for a single 4MHz cycle for some crazy implementation detail reason, it can cause a transition in the internal line even if the pulse is otherwise completely undetectable in normal GB code.
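To illustrate why the resolution matters, here is a toy Python example with a made-up trace of the internal line sampled once per T-cycle. The trace is invented for the example, not measured from hardware. An edge detector running at T-cycle resolution sees the one-cycle pulse, while anything that only looks once per M-cycle (4 T-cycles) misses it.

```python
def count_rising_edges(samples):
    """Count 0->1 transitions between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur and not prev)


# Hypothetical trace: the line pulses high for exactly one T-cycle (index 5).
line_per_t_cycle = [0, 0, 0, 0, 0, 1, 0, 0, 0]

# What you would see if you could only observe the line once per M-cycle.
line_per_m_cycle = line_per_t_cycle[::4]

edges_seen_by_hardware = count_rising_edges(line_per_t_cycle)
edges_seen_per_m_cycle = count_rising_edges(line_per_m_cycle)
```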
The interplay of different GPU things is so incredibly complicated...this is why I'm still investing in automated hardware testing and not writing and running test ROMs :)