My definition of "stable"
I always catch myself saying, "No, not yet stable enough." I cannot release it, even though it passes all tests. You could take that to mean the tests suck: not enough coverage, bad test cases, ...
Well, that is always the case. You can never have enough tests. The problem is that in my case, the compiler, testing costs a lot of time. A LOT of time! I usually plan a week for the final release testing, but more often it takes several weeks, because one round of test results influences the decisions about which tests are TODO, which are SKIP and which must PASS, and then I redo the tests. On all versions, on all platforms. You could rely on cpantesters to do that for you, but it is better to test the most common combinations yourself. That's why I use perlall with a few hundred perls.
But even with passing tests you still have your own definition of TODO tests. A TODO test is always a sign of instability. "Sometimes it works, but not always". Or "It used to work, but it is not so important if it fails". Or "It used to fail, but somehow it looks like I fixed it now. But I'm not so sure".
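To make the point about TODO tests concrete, here is a minimal sketch in Python. It uses unittest's expectedFailure decorator as a stand-in for a Perl TODO test; the semantics are similar but not identical, and the class and test names are made up for illustration. The suite reports success even though one test is known to be broken.

```python
import io
import unittest


class CompilerTests(unittest.TestCase):
    """Hypothetical test case; the names are illustrative only."""

    def test_plain(self):
        self.assertEqual(1 + 1, 2)

    @unittest.expectedFailure
    def test_known_broken(self):
        # Stand-in for a TODO test: it fails, but the failure is "expected".
        self.assertEqual(1 + 1, 3)


def run_suite():
    """Run the suite quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CompilerTests)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


if __name__ == "__main__":
    result = run_suite()
    # The suite counts as passing, yet one test is still known to fail.
    print("suite passed:", result.wasSuccessful())
    print("expected failures:", len(result.expectedFailures))
```

A green result therefore says nothing about how many such tests are hiding behind it; that number has to be looked at separately.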
But in recent years I have come to a completely different definition of "stable". I call my app stable,
if the testsuite passes, AND
if small innocent changes to the source produce the expected results.
That means after a passing testsuite I always play around with the code a bit, making minor improvements or testing new features, and only if the results come out as expected do I call it stable enough. Only then can I trust my code.
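The two-part definition above can be sketched as a small check. This is only an illustration in Python, not the actual release procedure; Perturbation, is_stable and surprises are hypothetical names, and each "small innocent change" is modelled as a function that applies the change and returns what was observed.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Perturbation:
    """One small, innocent change and the result it is expected to produce."""
    name: str
    run: Callable[[], Any]  # applies the change and returns the observed result
    expected: Any


def surprises(perturbations: List[Perturbation]) -> List[str]:
    """Names of the changes whose observed result differs from the expected one."""
    return [p.name for p in perturbations if p.run() != p.expected]


def is_stable(suite_passes: bool, perturbations: List[Perturbation]) -> bool:
    """Stable = the testsuite passes AND no small change produced a surprise."""
    return suite_passes and not surprises(perturbations)


if __name__ == "__main__":
    changes = [
        Perturbation("rename a local variable", lambda: "same output", "same output"),
        Perturbation("load another module", lambda: "crash", "same output"),
    ]
    print(is_stable(True, changes))
    print(surprises(changes))
```

The point of keeping surprises separate is that a surprising change is worth investigating in its own right, even when the testsuite itself is green.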
I have often been bitten by the "action at a distance" anti-pattern. Very often minor changes caused something completely unrelated to fail. E.g. loading another module suddenly broke something which had always worked, for no apparent reason.
A concrete perl example: PL_regex_padav is only relevant for threaded perls, holding the REGEXP bodies of stored qr// SVs. 5.8.1, 5.10 and then 5.16 changed the internal implementation of the PL_regex_padav offsets. 5.16 failed in the C compiler, although the same fix, which looks sane, fixed the problems in the Bytecode compiler. The Bytecode compiler is much simpler than the C compiler, and big implementation changes always cause parallel changes in both: if you fix something in the Bytecode compiler, you have to make the analogous fix in C. But in C everything suddenly fell apart. The good thing is that it only happens in threaded perls > 5.15, so the errors are expected and isolated. The fix is just not right yet.
Fixing compiled C code is always easy by debugging into it with gdb: one native session and one parallel session with the compiled code, find a proper breakpoint and then compare the state. The reason why C failed could be related to something completely different. In C the PL_body_arenas were empty, and when they were initialized as a side effect of the added fix (analogous to the Bytecode fix), the whole PL_regex_padav array fell apart. gdb hardware watchpoints to check who was writing to it did not work.
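The "compare the state" step can be sketched as a plain diff of two state dumps taken at the same breakpoint. This is a Python illustration only: diff_state is a hypothetical helper, the dumps are assumed to have been copied out of the two gdb sessions by hand, and the values in the usage example are invented placeholders, not real debugger output.

```python
def diff_state(native: dict, compiled: dict) -> dict:
    """Return {name: (native_value, compiled_value)} for every entry that differs.

    An entry missing on one side is reported with None on that side.
    """
    missing = object()  # sentinel, so a stored None still compares correctly
    names = sorted(set(native) | set(compiled))
    return {
        name: (native.get(name), compiled.get(name))
        for name in names
        if native.get(name, missing) != compiled.get(name, missing)
    }


if __name__ == "__main__":
    # Placeholder values, invented for illustration.
    native = {"PL_body_arenas": "initialized", "PL_regex_padav": "intact"}
    compiled = {"PL_body_arenas": "empty", "PL_regex_padav": "intact"}
    for name, (n, c) in diff_state(native, compiled).items():
        print(f"{name}: native={n} compiled={c}")
```

Anything that shows up in the diff but has no obvious connection to the change under test is a candidate for action at a distance.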
Okay, you could say the fix was just not good enough. It works in Bytecode by accident but not in the general case. But in my experience that does not sound right. Something else, not yet understood, is going on. So I'm calling it unstable.
The same goes when a fix for a 5.10-5.14 non-threaded case causes changes in threaded code for no apparent reason.
Executive summary: With big, complicated apps, even after a passing testsuite and passing QA, either let QA play with it for some time or, better, play with it yourself and see how it behaves. One additional week is always well spent.
I can't usually say whether my code is stable or not until I put it in production. I like your approach, but time is always limited.