Is there some formal way(s) of quantifying potential flaws, or risk, and ensuring there’s sufficient spread of tests to cover them? Perhaps using some kind of complexity measure? Or a risk assessment of some kind?

Experience tells me I need to be extra careful around certain things - user input, code generation, anything with a publicly exposed surface, third-party libraries/services, financial data, personal information (especially of minors), batch data manipulation/migration, and so on.

But is there any accepted means of formally measuring a system and ensuring that some level of test quality exists?

35 points
*

Pit Mutation testing is useful. It basically tests how effective your tests are and tells you missed conditions that aren’t being tested.

For Java: https://pitest.org

Edit: corrected to the more general name instead of a specific implementation.

permalink
report
reply
8 points

The most extreme examples of the problem are tests with no assertions. Fortunately these are uncommon in most code bases.

Every enterprise I’ve consulted for that had code coverage requirements was full of elaborate mock-heavy tests with a single Assert.NotNull at the end. Basically just testing that you wrote the right mocks!

permalink
report
parent
reply
6 points

That’s exactly the sort of shit tests mutation testing is designed to address. Believe me it sucks when sonar requires 90% pit test pass rate. Sometimes the tests can get extremely elaborate. Which should be a red flag for design (not necessarily bad code).

Anyway I love what pit testing does. I hate being required to do it, but it’s a good thing.

permalink
report
parent
reply
1 point

Yeah. All the same. Create lazy metric - get lazy and useless results.

permalink
report
parent
reply
5 points

This is really interesting, I’ve never heard of such an approach before; clearly I need to spend more time reading up on testing methodologies. Thank you!

permalink
report
parent
reply
4 points

I’d never heard of mutation testing before either, and it seems really interesting. It reminds me of fuzzing, except for the code instead of the input. Maybe a little impractical for some codebases with long build times though. Still, I’ll have to give it a try for a future project. It looks like there’s several tools for mutation testing C/C++.

The most useful tests I write are generally regression tests. Every time I find a bug, I’ll replicate it in a test case, then fix the bug. I think this is just basic Test-Driven-Development practice, but it’s very useful to verify that your tests actually fail when they should. Mutation/Pit testing seems like it addresses that nicely.

permalink
report
parent
reply
2 points

We are running the above pi tests with an extra (Gradle based) build plugin so that it only runs mutations for the changed lines in that pull request. That drastically reduces runtime and still ensures that new code is covered to the mutation test level we want. Maybe something similar can be done for C or C++ projects.

permalink
report
parent
reply
2 points

I’m currently working on a C++ project that takes about 10 minutes to do a clean build (Plus another 5 minutes in CI to actually run the tests). Incremental builds are set up, and work quite well, but any header changes can easily result in a 5 minute incremental build.

As much as I’d like to try, I don’t see mutation testing being worthwhile for this project outside of maybe a few isolated modules that could be tested independently. It’s a highly interconnected codebase, and I’ve personally reviewed (or written) every test, so I already know they’re of fairly high quality, but it would be nice to be able to measure.

permalink
report
parent
reply
3 points

Does something like this exist for Python?

permalink
report
parent
reply
3 points
1 point

Oh sweet! This introduced a whole new world to me. Also seeing mutmut, is one better than the other?

permalink
report
parent
reply
20 points

80%. Much beyond that and you get into a decreasing return on the investment of making the tests.

permalink
report
reply
11 points
*

I think this is a good rule-of-thumb in general. But I think the best way to decide on the correct coverage is to go through uncovered code and make a conscious decision about it. In some classes it may be OK to have 30%, in others one wants to go all the way up to 100%. That’s why I’m against having a coverage percentage as a build/deployment gate.

permalink
report
parent
reply
6 points

Bingo, exactly this. I said 80 because that’s typically what I see our projects get to after writing actually useful tests. But if your coverage is 80% and it’s all just tests verifying that a constant is still set to whatever value, then yeah, thats a useless metric.

permalink
report
parent
reply
1 point

God I fucking wish my projects were like this

permalink
report
parent
reply
1 point

The 80-20 rule is for everything. Don’t waste 80% of effort to get that last 20% of coverage.

permalink
report
parent
reply
17 points

“When a measure becomes a target, it ceases to be a good measure”. — Goodhart’s law

permalink
report
reply
15 points
2 points

Great read

permalink
report
parent
reply
12 points

But is there any accepted means of formally measuring a system and ensuring that some level of test quality exists?

Formally? No, this is basically impossible by Rice’s Theorem. There is not even a guarantee that if you have 100% test coverage, the program is good (the tests could be flawed).

This is just a natural limitation of turing completeness. You can’t decide these properties while also having full computational power. In order to decide such things, you need a less powerful mode of computation (something not turing complete) that can be analyzed more thoroughly and with more guarantees.

permalink
report
reply
2 points

That makes sense, thank you. Yes, it’s specifically “test quality” I’m looking to measure, as 100% coverage is effectively meaningless if the tests are poor.

permalink
report
parent
reply
7 points

Yea I’m afraid the only real way to “measure” that is to read through the tests and the code and make a good ol human value judgement on the state of the code and tests. But it won’t give you a number.

permalink
report
parent
reply

Programming

!programming@programming.dev

Create post

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person’s post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Rules

  • Follow the programming.dev instance rules
  • Keep content related to programming in some way
  • If you’re posting long videos try to add in some form of tldr for those who don’t want to watch videos

Wormhole

Follow the wormhole through a path of communities !webdev@programming.dev



Community stats

  • 3.5K

    Monthly active users

  • 1.7K

    Posts

  • 28K

    Comments