Thursday, December 30, 2021

Thoughts on Competition Scoring Methods

With Tapa Train currently running on LMI, I've been thinking about puzzle contests again. On paper, these all measure one thing: how fast people are at solving the presented puzzles. And this tends to be borne out in the results, with the same group of people near the top every time. But the order is variable, and can vary by quite a bit depending on the format. It's not immediately obvious why, but the way results are aggregated and compared makes a huge difference to how these things can feel. I'm going to go into all the formats I can think of and try to pinpoint what they're measuring, beyond the obvious. I'll be citing some example cases, often from my own experiences, and also often from cases where the format has not suited me well. That's not to say that I think I deserve a better placement under a different system, just that I can speak to my own experiences, and how the systems have affected me, better than I could to someone else's.

Single Puzzle

This is the simplest form of competition- give everyone the same puzzle, fastest solve wins. In a 1v1 matchup the better solver will usually take it, unless they break the puzzle or the slower solver makes a lucky guess. But this holds true even in larger groups, as evidenced by Puzzle Duel results- the fastest solve often goes to EKBM, Freddie Hand or me (with Azade often taking sudokus), but equally often it goes to someone else. It's exhilarating to get a really fast clean solve that beats everyone, but I'll be the first to admit that this format tends to say more about genre experience first, the specific puzzle second, and the specific solvers/attempts third.

Points for Puzzles

This is the standard form of a puzzle contest on the internet- some number of puzzles that all have point values, roughly weighted by expected time to solve, with scores determined by earning points for correctly solved puzzles within a time limit. This format works for a reason, though it's still somewhat dependent on the mix of puzzle types. The difference between a 100 point puzzle being a Double Choco (Japan's 2021 GP round, 3 minute solve) and a Skyscrapers (WSPC Round 5, 15 minutes to break and get 0 points, or England's 2020 GP round, 10 minutes to break and get 0 points) is, for me, quite striking. At the top level this is almost entirely about finish time, so there's no avoiding weaker genres, and rounds can range from a sprint to finish them all (like WSPC Round 2) to tactical puzzle selection (like WSPC Round 3). There is still a big flaw with this format though, and it has to do with the point values themselves. If they're not well calibrated then the results can be a bit of a crapshoot for unfinishable sets, and a mistake on a high point value puzzle is much more costly than one on a low point value puzzle. It's also a really bad feeling to run out of time deep into a puzzle, when picking a different puzzle to go for with the remaining time would have earned you a finish- this can create "rounding errors" of sorts away from an expected placement. I believe Muhorka mentioned being a few seconds away from a finish on the 20x20 round of the WSPC, missing only the big pointer Index Yajilin. Clearly, approaching the puzzles in the opposite order (ending with the low value Akari) would create a swing of over 100 points on a 600 point round, just from the order of approach!
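
To put a number on that kind of swing, here's a minimal Python sketch with made-up point values and solve times (not the actual WSPC round): same solver, same speeds, same time limit, just a different order of approach.

    # Hypothetical puzzles as (points, minutes to solve) - made-up
    # numbers, not the real WSPC round. Only finished puzzles score.
    puzzles = [(25, 6), (30, 8), (120, 30)]
    TIME_LIMIT = 40

    def score(order):
        elapsed, total = 0, 0
        for points, minutes in order:
            if elapsed + minutes <= TIME_LIMIT:
                elapsed += minutes
                total += points
            # else: the clock runs out mid-puzzle and it's worth nothing
        return total

    print(score(puzzles))        # small puzzles first: 55 points
    print(score(puzzles[::-1]))  # big pointer first: 150 points

Same solving either way, but the order alone is worth 95 points here.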

In the case of multiple solvers finishing, finish time is the obvious tiebreaker. This tends to be the purest sort of result as well, as everyone did the same thing; one just did it faster. Like with single puzzles, this often says more about the genre of the hardest puzzles, but time on the easier ones matters, and so does being consistent overall.

Aggregating Contests

Puzzle Ramayan, Tapa Train, and the Puzzle GP/WSPC do this in different ways, and all of these have their pros and cons. Puzzle Ramayan on LMI normalizes scores such that 10th place is worth 100 points and the median is worth 50 (I think, I could be remembering this wrong), and then drops each solver's lowest two normalized scores. Dropping the lowest two means a bad round isn't costly, though for me it means Number Placement never counts for anything but ratings (I'll get to that!). Normalizing scores also means that finishing 5 minutes clear of the rest of the pack has a different meaning depending on how long the round takes. For a concrete example, the region division set this year had Ken Endo finish at 40:42, worth 149.3 points. Third was at 55:32, and 10th all the way back at about 70 minutes- a difference of 75% / 30 minutes. Shading this year had me take the win at 21:51, with third at 27:21 (again about 25% slower) and 10th at 42:39, a difference of nearly 100% / 20 minutes. However, due to the median scores, the regions round was worth 125 normalized points, and the shading round 114.6- it was a much easier set. Is this correct? Impossible to say, but it is what it is. I think the final PR results ended up correct, with Ken Endo taking a commanding win, Walker Anderson in 2nd, followed by Freddie Hand and me close together. As PR tends towards easier puzzles and I tend towards logical solves over intuitive ones, the puzzle selection plays more to my strengths, and being able to drop a poor Number Placement round didn't hurt.
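
Roughly, I'd expect the normalization to look something like this sketch- the 10th-place and median anchors are as I remember them (possibly wrongly), and the linear interpolation between them is my own guess, not LMI's actual formula.

    from statistics import median

    def normalize(raw_scores):
        # Linearly map raw round scores so 10th place = 100 and the
        # median = 50 (assumes 10th place scores above the median).
        ranked = sorted(raw_scores, reverse=True)
        tenth, med = ranked[9], median(raw_scores)
        return [50 + 50 * (s - med) / (tenth - med) for s in raw_scores]

    def pr_total(normalized_rounds):
        # One solver's season total: sum rounds, dropping the lowest two.
        return sum(sorted(normalized_rounds)[2:])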

Puzzle GP/WSPC aggregation is much simpler in comparison: take all the round scores and add them up, and that's your final score. This evens things out quite a bit, but it really makes the point values matter, especially the scaling between rounds. Using the 2021 Puzzle GP as my example, many of the top results took a drop on Round 1, which had a peak score of 842.8 (2nd being 754). I got a respectable 702, which was the 5th highest score that was not dropped. Round 5 had a top score of 1155, due to a finish with 30 spare minutes AND a higher base value. Given the international format and not wanting to force puzzle authors into a no-win situation, it makes sense to drop the lowest rounds, but what that really means is that whatever round is most undervalued tends to not count, and any errors on high scoring rounds are significantly more costly. The two rounds I dropped were round 3 (606 points, but I had finished with 2 incorrect answer keys on correctly solved puzzles, costing me 800+) and round 6 (657 points, made the wrong call of last puzzle to go for- I think it was the right choice, it just didn't totally pan out). If you move those answer key mistakes from round 3 to round 4 instead, a relatively lower scoring round, then I immediately gain 2 placements in the overall standings. To me, THIS IS NOT DESIRABLE- 100 points of mistakes (200 with bonus) should cost roughly the same regardless of where it happens. It shouldn't be that critical, and with normalizing, it wouldn't be. It would still matter, but less.
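
Here's a toy illustration (made-up round scores, not my actual GP results) of why it matters so much where a mistake lands under sum-with-drops:

    # Eight toy round scores; the final score sums everything after
    # dropping each solver's lowest two rounds.
    def gp_total(rounds, drops=2):
        return sum(sorted(rounds)[drops:])

    base = [702, 650, 606, 580, 720, 657, 690, 710]

    low = base.copy();  low[3] -= 200   # mistake hits a round that was dropped anyway
    high = base.copy(); high[4] -= 200  # the same mistake hits a counting round

    print(gp_total(low), gp_total(high))  # 4129 vs 4015: same mistake, very different cost

That sensitivity to where the mistake lands is exactly what normalizing would soften.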

WSPC had no drops, and I think its results are one of the closest reflections of performance as a result. I still think round 3 was questionable, and arguably unsuitable, due to its point values rarely being representative and the round being nowhere near finishable. Of course, WSPC had its own issue of requiring a lot of solving time within a relatively short span, and some people had much more time available to prepare / find suitable times to do each round than others. In my case, as previously written, I could barely fit in the 10 puzzle rounds, let alone practice for some of the weirder stuff like the Night and Day round, and it cost me in a way that a single round would not.

Aggregating Single Puzzles

Of course, I mentioned Puzzle Duel and Tapa Train earlier, and between them these aggregate individual puzzles in three different ways. First, there's Puzzle Duel's rating system, which normalizes the median time to a 2000 rating and calculates a per-puzzle rating based on the proportion faster or slower than that. I find that the number says more about the genre and difficulty than anything else, but the relative movements of solvers within that rating tend to be accurate to within that solver's error rate. The harder puzzles are more important here, as you can potentially make up a lot of time and gain a lot of rating points, but a slow solve on an easy one can cost a lot. Overall, every puzzle feels like it matters, with the only drawback being that you can simply not attempt a puzzle if the genre / solve times aren't appealing. Personally, I skip Queens on principle and do everything else, even if I know I'll lose rating (hardish sudoku!). You'll also note that the fastest solve, if it's not fastest by enough, can lose rating points! This is extremely rare, but it has happened. I feel like a well made rating measures consistency the best.
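
The actual Puzzle Duel formula isn't public as far as I know, so the sketch below is just one plausible shape consistent with that description- a per-puzzle performance pinned to 2000 at the median, with the running rating chasing it.

    from statistics import median

    def performance(all_times, t):
        # 2000 at the median time, scaled by the proportion faster or
        # slower than it (my guess at the shape, not the real formula).
        return 2000 * median(all_times) / t

    def update(rating, perf, k=0.05):
        # Nudge the running rating toward the per-puzzle performance.
        return rating + k * (perf - rating)

Note how the quirk above falls out of any scheme shaped like this: a high-rated solver can post the fastest time, but if that performance still lands below their current rating, the update pulls them down.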

Puzzle Duel has also run contests on a series of puzzles with a different, unknown normalizing system. Every solve is worth points, with errors on the hard ones being relatively more costly purely due to the opportunity cost of a higher potential gain. Errors on easier puzzles still matter, but losing 2 points out of a max score of 4.5 is significantly less painful than scoring 6 out of a top score of 15. Doubleshow has worked okay, with only a couple of specific puzzles deciding most of the results, but I think PuzzleXL 2021 was a pretty good event. Every puzzle was big and hard, so every puzzle mattered, and there was tension until the very end. The drawback is obvious, of course: when every puzzle is scored, and scored positively, then authors have no chance to compete if they contribute even one puzzle, as happened to me in the first Puzzle XL.

Finally, Tapa Train works entirely based on placement, aggregating those placements across puzzles. Unlike every other format, the easy puzzles matter significantly more here! Having to restart, or taking a penalty, matters so much more when times are short and the difference between top solvers and fast/moderate solvers is only a few seconds. On bigger and harder puzzles, I expect errors will mostly serve to order solvers within a "tier". I unfortunately broke the second puzzle, and with this format that probably takes me completely out of contention, while with a different system, or with dropping, I'd still be in it. Heck, even a system of total time would still be competitive (though that would bring back its own issue of easy puzzles barely mattering). The one advantage of a system like this is that it can actually account for not solving a puzzle without breaking the system- you get (101 - placement) points, so not solving is just taking a 0.

I also dislike ranking by placement because it doesn't capture differences in speed. On the first puzzle, the top 3 times were 7.5, 8.8 and 11.6 seconds- a 4.1 second spread, worth 2 points. You know what else is a 4.1 second spread? 14th to 28th, worth 14 points. And it's harder to gain time as you get faster, so 3rd gaining 2.5 seconds gains NOTHING, while 23rd gaining 2.5 seconds gains 10 points, despite being a much easier improvement to make. 40th would gain 12 placements with the same time gain as well. The system doesn't really reflect performance meaningfully, I feel, even if its unique properties make it the best available fit for this event- I still hope to see a different, more suitable scoring system used for future iterations. Neither of these last two methods measures consistency much, even over a contest- other than perhaps consistently not having a bad solve.
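
Using the puzzle-1 times above (with a dense, entirely hypothetical mid-pack behind them), here's that asymmetry as a sketch:

    import bisect

    # Top-3 times are the real ones quoted above; the 0.25s-spaced
    # mid-pack is hypothetical, just to show the shape of the problem.
    times = sorted([7.5, 8.8, 11.6] + [12.0 + 0.25 * i for i in range(40)])

    def points(t):
        # Tapa Train style: (101 - placement) for a solve time of t.
        return 101 - (bisect.bisect_left(times, t) + 1)

    print(points(11.6 - 2.5) - points(11.6))    # 3rd going 2.5s faster: +0 points
    print(points(16.75 - 2.5) - points(16.75))  # mid-pack going 2.5s faster: +10 points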

Other methods?

I can't think of any other methods used, but if you've seen any that I didn't mention, or have further thoughts on the topic, let me know! I'm happy to hear about it and am usually up for talking about stuff that's caught my interest.

In conclusion, even though "time" and "points" are absolute measures, the way they're handled can place the emphasis on different puzzles or approaches, without changing the contents of the contests. Every method has its drawbacks and flaws, and when setting up a contest it's important to understand the nuances of what the chosen method is measuring. Personally, I lean towards total time when applicable, points when it's not, and normalized aggregation between contests. When aggregating contests, dropping the lowest tends to be unfair, since it usually just drops whichever round had the lowest scoring potential; when aggregating puzzles, it tends to be fairer, since it removes outlier performances. Honestly, in a case like this I think I'd also recommend dropping the HIGHEST score, not just the lowest! I don't think I've seen a puzzle contest aggregation do that before- it might be worth it.
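
In code, that's just a trimmed sum (the scores below are made up):

    # Drop each solver's single best and worst rounds before summing.
    def trimmed_total(rounds):
        ranked = sorted(rounds)
        return sum(ranked[1:-1])

    print(trimmed_total([702, 606, 657, 720, 842, 650]))  # 2729: 606 and 842 don't count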
