Dear Fortuna

Why You Should Not Trust Language Benchmarks

Wed, 11 Feb 2026 14:00:00 -0400

Intro

Recently I felt drawn back to OCaml. It was always one of my favorite languages, but it did have its issues with the standard library (there is Base, but I’d rather not). And OCaml always lacked some of the optimization powers of a lower-level language like C. However, this is 2026, the Year of OCaml, maybe.

Plenty of QoL improvements and cool new things like effects have been added to the language and the actual standard library. There is also OxCaml, a bunch of extensions for more QoL, better control over allocation, layouts, etc. Some of those extensions are even being added to OCaml itself! A dream come true for me, certainly.

With all these thoughts, though, I made the mistake of trying to look at OCaml benchmarks. It is well known that there are lies, damn lies, and (language) benchmarks. Comparing languages and their implementations is often a futile effort. At best you could argue for tiers, e.g. system language vs GCed language vs Python. But with things like OxCaml and NumPy Python, even that can be unreliable. However, it is an innocent idea in theory: just search for “OCaml performance benchmarks”.

On Google, the first few results are two 2020 discuss.ocaml.org posts, one informationless blog post that I hope was AI-generated, and a proudly Vercel benchmarking site, Programming-Language-Benchmarks. Lower down you can also see Benchmarks Game, a Rust vs OCaml Medium article, and some resources for benchmarking actual OCaml code. The article requires an account, so I can’t see the actual benchmark, but judging by the intro it seems to focus on actual language differences like GC. We will be looking at the benchmark aggregates in PLB and BG though.

But first, a warmup example.

C++ is slower than JavaScript

There are plenty of posts on every kind of tech and tech-adjacent site about how JavaScript somehow outperforms C or C++, e.g.:

Why does JavaScript appear to be 4 times faster than C++?

The provided sources are simple and feel equivalent (well maybe Date vs clock()):

(function() {
    var a = 3.1415926, b = 2.718;
    var i, j, d1, d2;
    for(j=0; j<10; j++) {
        d1 = new Date();
        for(i=0; i<100000000; i++) {
            a = a + b;
        }
        d2 = new Date();
        console.log("Time Cost:" + (d2.getTime() - d1.getTime()) + "ms");
    }
    console.log("a = " + a);
})();

int main() {
    double a = 3.1415926, b = 2.718;
    int i, j;
    clock_t start, end;
    for(j=0; j<10; j++) {
        start = clock();
        for(i=0; i<100000000; i++) {
            a = a + b;
        }
        end = clock();
        printf("Time Cost: %dms\n", (end - start) * 1000 / CLOCKS_PER_SEC);
    }
    printf("a = %lf\n", a);
    return 0;
}

So why is JS faster? Some could say V8 magic. All kinds of tricks are used like pointer tagging, NaN-boxing, type predictions, not to mention just having a great GC. What should have been interpreted floats in theory magically turn into assembly bits. Maybe JIT has finally gotten to the point where it makes AOT obsolete. But it really is just microbenchmarks. JITs are great at optimizing hot loops. Real-world performance, though, tends to fall off, especially with all the objects and type messes.

Use of the language in real software, rather than in specialized algorithms or data structures, can be very different. Hence the term “microbenchmarks”. But even benchmarks of real software can hide that the difference is not in the language, but in the implementation of the software itself. Oftentimes there are differences in architecture, design, and use cases. Not to mention that newer works benefit from the mistakes of previous works. Sometimes rewriting a C codebase in Rust (within reason) is beneficial for the sole reason that you now know what to do.

Of course, in this case, the OP just forgot to use optimizations for C++ whereas Node did it for him in JS. I am not even talking about any advanced performance sensitive flags, just good old -O2.

Another common issue is non-equivalent programs. E.g. a C program that is forced to convert a long to a float every iteration vs a JS one that simply used floats from the start (in theory everywhere, but reality is different). There is also a subissue there with “idiomatic code”. Idiomatic is not necessarily the most performant, but it is often the expected one for the language. Thus, there is the issue of when to forsake idiomatic code for performant code.
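To make the non-equivalence point concrete, here is a minimal sketch of two loops that compute the same sum but do different work per iteration. The function names are made up for illustration, and whether the conversion costs anything measurable depends on the compiler and flags.

```c
#include <assert.h>

/* Variant A: integer counter, converted to double on every iteration. */
double sum_with_conversion(long n) {
    assert(n >= 0);
    double acc = 0.0;
    for (long i = 0; i < n; i++) {
        acc += (double)i; /* long -> double conversion inside the hot loop */
    }
    return acc;
}

/* Variant B: the counter is a double from the start, as JS numbers nominally are. */
double sum_all_double(long n) {
    assert(n >= 0);
    double acc = 0.0;
    double limit = (double)n;
    for (double x = 0.0; x < limit; x += 1.0) {
        acc += x;
    }
    return acc;
}
```

Both return the same value for small n, yet a benchmark that pits variant A in one language against variant B in another is not really measuring the languages.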

The Numerous Problems

SIMD, Multi-threading?

When you use lower languages like C and C++, is it fair to use SIMD? That is a good question as SIMD can more accurately represent the ceiling of the language. At the same time, not all languages even provide SIMD intrinsics.

No need to mention Python or Ruby, when even Go required you to just write straight assembly until Go 1.26 (released yesterday) and it is still under an experimental flag! OCaml by virtue of OxCaml extensions now has SIMD intrinsics, but it did not have them before in any non-C way. Given that Go and OCaml theoretically have SIMD support, it is harder to say that comparing them without SIMD to C with SIMD is fair.

You could call C for many of these other languages. However, at that point, you can lose a lot of performance due to FFI overhead. The overhead itself also depends on where and how SIMD was used. And if we are talking about using C FFI, all bets are off anyway. What is the difference between implementing a small part in C vs just the entire algorithm? Not ideal.

There is also the question of whether it is realistic. Even in C you don’t just reach for SIMD whenever, due to the lack of abstractions. But then what should you do if the languages do provide those nice abstractions? They can make SIMD a lot easier, more portable, and simply more accessible. Zig has fine abstractions for SIMD in the core language. Rust has them in its nightly builds. Similar abstractions also seem to be the goal for Go once they are finished with implementing intrinsics.

Using multiple CPU threads is another issue. Many languages support POSIX threads, but ergonomics can be awful. Some languages provide better abstractions, many don’t. C and C++ surprisingly make it easy with OpenMP, at least for the more embarrassingly parallel problems.
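As a rough illustration of how little OpenMP asks for in the embarrassingly parallel case, here is a sketch of a parallel reduction. The function name is hypothetical; the pragma only takes effect when the compiler is given its OpenMP flag (-fopenmp for GCC), and without it the loop simply runs serially.

```c
#include <assert.h>

/* Sum of squares 0^2 + 1^2 + ... + (n-1)^2, split across threads by OpenMP. */
double sum_squares(int n) {
    assert(n >= 0);
    double acc = 0.0;
    /* Each thread keeps a private partial sum; they are added together at the end. */
    #pragma omp parallel for reduction(+:acc)
    for (int i = 0; i < n; i++) {
        acc += (double)i * (double)i;
    }
    return acc;
}
```

One line of annotation is the entire difference between the single-threaded and multi-threaded version, which is exactly what makes "is it fair to use threads?" such an awkward question.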

Implementation?

Furthermore, different languages may optimize for different purposes. Scheme implementations, e.g., must be properly tail-recursive. The standard even specifies exactly in what ways tail calls must be handled. JavaScript implementations (besides Safari somehow) do not support it. Yet, it is actually in the ES6 spec (from 2015). C compilers often do it, but generally there is neither a requirement nor a guarantee. Tail-call optimization can turn recursion into a while loop, so it is a very important feature of the language implementation.
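A small sketch of what that transformation amounts to, with made-up function names. The first function is written in tail-recursive style; the second is the loop an optimizing implementation can effectively reduce it to. Nothing in the C standard promises that a compiler will actually do this.

```c
#include <assert.h>

/* Tail-recursive form: the recursive call is the last thing the function does. */
unsigned long sum_to_rec(unsigned long n, unsigned long acc) {
    if (n == 0) {
        return acc;
    }
    return sum_to_rec(n - 1, acc + n);
}

/* Same logic as a loop: no new stack frame per step. */
unsigned long sum_to_loop(unsigned long n) {
    unsigned long acc = 0;
    while (n != 0) {
        acc += n;
        n -= 1;
    }
    return acc;
}

/* Both forms must agree for any input. */
void check_equivalence(unsigned long n) {
    assert(sum_to_rec(n, 0) == sum_to_loop(n));
}
```

Without the optimization, the recursive form uses stack proportional to n, so a benchmark built on deep recursion partly measures whether the implementation eliminates tail calls at all.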

These problems relate to the issue of equivalence too. Should a C program using OpenMP, tail-recursion, and SIMD even be considered similar to any program that does not or cannot? Hell, even if you are not using SIMD yourself, GCC automatically enables SSE2 for x86_64 programs. SSE2 matters for libc (e.g. memcpy or memset), not to mention potential auto-vectorization. To go even further, what if a C program uses -ffast-math to optimize floating point? If the results are the same, does it even matter if that C program ignores IEEE compliance? Means do matter.

There is an argument for separating language spec and implementation. In the majority of cases there is already one least unpopular choice, even if it is questionable in performance, like CPython. The issue is that some languages don’t really have a spec or a common implementation, see Scheme for the latter. Or, the spec might not be respected in all cases, like the tail-recursion in ES6. As another example, OxCaml provides many performance-related extensions that vanilla OCaml does not have yet.

Different implementations may suffer from a lack of support. E.g. PyPy is much faster than CPython, and benchmarks will show as much. However, there is a reason why PyPy, with its incompatibilities, has not replaced CPython. Another example is tinygo, which uses LLVM as its backend. It can outperform go in some benchmarks, but it lacks certain features and has a different focus from go for real use.

Talking about LLVM, far too many languages are just LLVM frontends. This is for pragmatic reasons; however, it does make comparing them a bit difficult. An extra annoyance is that some languages may indeed optimize above LLVM (e.g. Rust), while others may not at all.
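The reason IEEE compliance matters for optimization is that floating-point addition is not associative, so the compiler may not reorder it unless told it can. A minimal sketch with hypothetical function names; volatile is only there to keep the compiler from folding the constants at compile time.

```c
/* Evaluated as written: 1.0 is far below the precision of 1e20 and is absorbed. */
double left_to_right(void) {
    volatile double big = 1e20, one = 1.0;
    return (big + one) - big;
}

/* The algebraically identical reordering keeps the 1.0. */
double reassociated(void) {
    volatile double big = 1e20, one = 1.0;
    return (big - big) + one;
}
```

Under strict IEEE evaluation the two functions return different values. Flags in the fast-math family allow the compiler to treat such expressions as interchangeable, which is where both the speedup and the changed results come from.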

Note that the issue is not only in that these questions are often unanswered, but in that answers to them are often controversial. It is easy to say or enforce one way, like banning SIMD, but it is hard to convince those who disagree.

Exhibit #111

There are plenty of sites and blog posts for ranking languages by performance (as most other metrics are even more pointless). Programming-Language-Benchmarks (I will use PLB for short) is certainly one of the better ones. It states that it is influenced by Benchmarks Game (BG), and you can see the similarities. Their primary purpose is to gather the fastest solutions for different languages.

PLB and BG both provide a set of problems and solutions in different languages. Some of these problems are very basic like “helloworld”, others are more complicated (relatively) data structures and algorithms like LRU, “Least Recently Used”. Unlike many other benchmarks, these sites do provide the source code, the flags, and everything that they did. Though you would have to trust them on properly benchmarking these with warmup and whatnot.

PLB and BG do well to mention whether the code uses multi-threading, SIMD, or in BG’s case some other forms of unsafe/horrid optimizations. PLB marks it in the file name (e.g. 1.c vs 1-i.c), BG uses a star, which I would prefer (as in all bets are off). The big benefit of BG is that it has a lot more implementations and problems, but PLB has plenty itself. And these do matter, as some implementations are significantly worse. For example, the n-body problem on BG has * C gcc #9 at the top when sorted by time, optimized with 256bit AVX SIMD intrinsics. On the other hand, * C gcc #4 only uses 128bit SSE, so unless its algorithm is a direct upgrade, it simply has to be worse. And this is the same language and compiler. Not to mention all the different implementations in between, in similarly optimized Rust and C++ code.

There are still plenty of more naive solutions as well, which in some cases can make C slower than (optimized) Java.

Regardless, from these problems it is very hard to tell what language is “better”. It is possible that someone wrote a very fast C++ solution with AVX against a poor SSE C solution. Both have SIMD, but different tiers.

BG does provide some box plots which show you the general floor, ceiling, mean, as well as outliers. This in theory provides a better description of the language’s capability. In reality, there really isn’t enough data, and all the problems I mentioned can even be compounded.

Another issue was, as I mentioned, potentially misleading remarks. The original PLB page that I shared was the OCaml vs Go comparison (interesting that this was the top result and not OCaml’s page). The first problem for me was that 1-m.go, the multi-threaded version, was slower than 1.go, the base one. Granted, 1.go was tinygo and 1-m.go was go, but unless go is an order of magnitude slower to cancel out multi-threading, something was off. Indeed, if you just hover over the links you will realize that the 1-m.go link points to 1.go. Interestingly, all -m.go problems except for spectral norm point to non-m variants, which does track with the numbers we get. The -m variants are in the benchmark yaml file for go, so I am not sure why the website values don’t match.

If you look over all the problems you will indeed see that tinygo and go are trading blows in speed and memory use. OCaml is behind in speed and memory, but usually not too far off, even winning in a few select cases. If we use C solution 2.c (no SIMD) for nbody problem as a baseline, then go’s speed is 11% slower, OCaml’s is 17% slower. So now that we have such a comparison, what does it really tell us? Nothing.

C is a good baseline as it is typically the fastest with the lowest overhead. But realistically 11% and 17% are pretty close. The difference is essentially ~300 ms vs ~350 ms, not something you will even tell apart. The larger difference is memory usage, being 2MB vs 3.5MB vs 5MB, but you would expect that from GCs. We are not using floppies as our RAM, so that is extremely negligible. So C is not fast, Go and OCaml are somewhat close. But that was just one nbody problem.

I could instead pick nsieve, where C is ~250 ms, Go is ~300 ms, and OCaml is ~900 ms. Huh, suddenly OCaml looks like a grade below those two. Now you can suddenly make an argument that OCaml is much worse. Indeed on most of these problems, OCaml is trailing lightly behind Go, but on others it is nearly an order of magnitude away. You could just chalk it up to GC being GC, maybe some optimization differences, maybe boxes since people like ref too much. Still, how can you make an argument that OCaml is close to or far from Go, when the results are so varied? I guess you could say that because OCaml is varying so much, it is strictly worse. But what if your use cases align perfectly? What if you don’t pick a language based on a language benchmark?

In the end, the reality is that to make a conclusion you would need to read the source code, check the compile flags, etc. to understand the differences, trade-offs, and benefits. This inevitably requires at least a passing understanding of the languages you look at. But at that point you could already be familiar enough with a language to gauge its speed. A beginner, on the other hand, might misinterpret the results completely without any knowledge to guide them.

Note that these sites are likely made in jest, or mostly to collect optimized solutions to problems in a way that is gradable. I doubt (as in hope this is not the case) the authors genuinely believe that they are fairly comparing languages. But these sites can be misused, misunderstood, or twisted in ways that are orthogonal to their purpose. At the same time, a much worse example would be someone with something to prove.

Intent matters

As for any statistic, often it is not the numbers but the flags. PLB and BG don’t seem to own any horses in the race, though, it is not impossible for even personal bias to show up. In a lot of cases, it is on the author of the benchmark to tweak algorithms to make their predetermined winner look better. Here is a cool guide if you want to lie better.

It is not rare in this day and age for an innocent blog to be a thinly veiled advertisement for something. Many GitHub repos these days are faces of a product to sell or popularize. Benchmarks far too often are convenient for this, as people look at faster speed and better resource usage in awe. The same benchmarks that can easily be misleading, manipulated, cherry-picked, and abused.

The End

Don’t do language benchmarks, kids. It is a waste of time, memory, and gzipped bytes.

Nicest (small) LCG numbers

Mon, 09 Feb 2026 22:00:00 -0400

Intro

Recently I wanted to give a nice small example of a Linear Congruential Generator (LCG) to show how easy and simple PseudoRandom Number Generators (PRNG) are.

An LCG is typically x[i] = a * x[i - 1] + c (mod m), where x[0] is the initial seed, a is the multiplier, c is the increment, and m is the modulus, i.e. the size of our space.
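The recurrence above fits in a single function. A minimal sketch, where the name lcg_next and the choice of 32-bit parameters are assumptions for illustration; the intermediate product is widened to 64 bits so it cannot wrap before the modulo.

```c
#include <assert.h>
#include <stdint.h>

/* One LCG step: returns (a * x + c) mod m. */
uint32_t lcg_next(uint32_t x, uint32_t a, uint32_t c, uint32_t m) {
    assert(m > 0);
    return (uint32_t)(((uint64_t)a * x + c) % m);
}
```

Feeding the result back in as x produces the sequence, starting from the seed x[0].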

There are many good numbers online, and even plenty of bad numbers have a decent enough period, i.e. when the sequence starts repeating. You will not notice the issue immediately in many of those cases, even if a computer or a bad actor would. LCGs are simple and efficient, but are one of the least cryptographically secure after all.

The issue is that LCG numbers are frequently large. They have to be for good randomness over a large space, e.g. 2^32, the typical size of an int. This does raise a good question though: what would be the nicest LCG numbers, i.e. ones that still look “random”, yet are small (under 100 preferably)?

Brute Forcing into N

Now I am certain that there are analytical methods to find good numbers with the properties that I want. I would not be surprised if someone did that already either. But, these numbers are small, I can quite literally just loop over the entire input space and pick the best ones. We do need a scoring function to check how good the inputs are though.

Let’s start with uniqueness as our scoring function. That is, for each possible number we just check whether it occurs in the sequence, and count the ones that do. This rewards long periods and using all of our numbers, since any later repeats of the period or identical numbers will not impact the score.
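A sketch of such a uniqueness score, under my own assumptions: the name uniqueness_score, the MAX_M bound, and the choice to look at exactly the first m values (seed included) are all illustrative and not necessarily what the brute-force loop here does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_M 64 /* hypothetical upper bound on the modulus */

/* Count the distinct values among the first m outputs, seed included. */
int uniqueness_score(unsigned x0, unsigned a, unsigned c, unsigned m) {
    assert(m > 0 && m <= MAX_M);
    bool seen[MAX_M];
    memset(seen, 0, sizeof seen);

    int score = 0;
    unsigned x = x0 % m;
    for (unsigned i = 0; i < m; i++) {
        if (!seen[x]) {
            seen[x] = true; /* only the first occurrence scores */
            score++;
        }
        x = (a * x + c) % m;
    }
    return score;
}
```

A full-period generator scores m, and anything that falls into a shorter cycle scores less, regardless of the order in which the values appear.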

Let us check all numbers up to (but excluding) 12.

The best result we get is m = 11, c = 1, a = 1, x[0] = 1 or:

1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 ...

Oh how great and random it is, most would almost certainly disagree. The sequence is not ideal in what we usually think of as “random”.

Yeah, I should have expected that. Despite not looking random, it has the best period of 11, known as full period (= m). It even uses all of our numbers!

What about second or third or fifteenth best? Just shifted, you get tweaks to c or x[0], but it is practically the same obvious ordered sequence.

The flaw is in the scoring system as it does not care about anything except uniqueness. Indeed, to it 1 2 3 4 5 and 4 2 5 3 1 are identical as long as same numbers are involved, even if to us the second is more “random”.

Discouraging Unfair Play

Well the core issue is that the algorithm can set a = 1, m = 11 and tweak c and x[0] with a simple ordered sequence. One way to deal with it somewhat is to check whether the difference between numbers changes. If it does not, reduce the score for each violation.

This system can have trouble tracking changes modulo. E.g. 10 0 1, has differences of 10 and 1, rather than 1 mod 11 for both. But we can just ignore those and look at other options. I call this technology Human-In-The-Loop™.
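One way the penalty could look, as a sketch: count every position where the step between neighbours equals the previous step. The function name is hypothetical, and how the count is weighed against the uniqueness score is left open. The differences are plain subtractions, so it has the same blind spot around the wrap modulo m that is described above.

```c
#include <assert.h>

/* Number of positions where the step to the next value repeats the previous step. */
int repeated_diff_count(const int *seq, int len) {
    assert(seq != 0 && len >= 0);
    int violations = 0;
    for (int i = 2; i < len; i++) {
        int prev = seq[i - 1] - seq[i - 2];
        int cur = seq[i] - seq[i - 1];
        if (cur == prev) {
            violations++;
        }
    }
    return violations;
}
```

The ordered sequence 1 2 3 ... collects a violation at almost every step, while a sequence whose steps keep changing collects none.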

The results look much better:

x[0] = 5, a = 1, c = 5, m = 11 :: 5 10 4 9 3 8 2 7 1 6 0 5 ...
x[0] = 5, a = 1, c = 6, m = 11 :: 5 0 6 1 7 2 8 3 9 4 10 5 ...
x[0] = 1, a = 2, c = 0, m = 11 :: 1 2 4 8 5 10 9 7 3 6 1 2 ...
x[0] = 1, a = 6, c = 0, m = 11 :: 1 6 3 7 9 10 5 8 4 2 1 6 ...
x[0] = 1, a = 7, c = 0, m = 11 :: 1 7 5 2 3 10 4 6 9 8 1 7 ...
x[0] = 1, a = 8, c = 0, m = 11 :: 1 8 9 6 4 10 3 2 5 7 1 8 ...
x[0] = 1, a = 2, c = 1, m = 11 :: 1 3 7 4 9 8 6 2 5 0 1 3 ...
x[0] = 1, a = 6, c = 1, m = 11 :: 1 7 10 6 4 3 8 5 9 0 1 7 ...
x[0] = 1, a = 7, c = 1, m = 11 :: 1 8 2 4 7 6 10 5 3 0 1 8 ...
x[0] = 1, a = 8, c = 1, m = 11 :: 1 9 7 2 6 5 8 10 4 0 1 9 ...

The first two do only tweak c and x[0], but they don’t look too bad in comparison to 1 2 3 4 .... Also x[0] is set to 1, but this is more of a first mover advantage (my brute force loop starts with x[0] = 1), and this is a seed to begin with. I wouldn’t mind using these, but it would be useful to test it on more numbers.

Larger numbers

Let us then try up to (including) 50, this takes a second, but does find me some interesting results:

x[0] = 1, a = 11, c = 1, m = 50 :: 1 12 33 14 5 6 17 38 19 10 11 22 ...
x[0] = 1, a = 21, c = 1, m = 50 :: 1 22 13 24 5 6 27 18 29 10 11 32 ...
x[0] = 1, a = 31, c = 1, m = 50 :: 1 32 43 34 5 6 37 48 39 10 11 42 ...
x[0] = 1, a = 41, c = 1, m = 50 :: 1 42 23 44 5 6 47 28 49 10 11 2  ...
x[0] = 1, a = 11, c = 3, m = 50 :: 1 14 7 30 33 16 29 22 45 48 31   ...
x[0] = 1, a = 21, c = 3, m = 50 :: 1 24 7 0 3 16 39 22 15 18 31 4   ...
x[0] = 1, a = 31, c = 3, m = 50 :: 1 34 7 20 23 16 49 22 35 38 31   ...
x[0] = 1, a = 41, c = 3, m = 50 :: 1 44 7 40 43 16 9 22 5 8 31 24   ...
x[0] = 1, a = 11, c = 7, m = 50 :: 1 18 5 12 39 36 3 40 47 24 21    ...
x[0] = 1, a = 21, c = 7, m = 50 :: 1 28 45 2 49 36 13 30 37 34 21   ...

It of course prefers 50 as m, since that provides the largest period. The biggest issue is that a = 11 and +10 cousins have their issues, especially for c = 1. It is a little off to see so many multiples of 11. Likely due to m = 50. But with c = 7 or even c = 3, it looks decent enough. I will have to look into a way to improve it by adding more to the score.

I feel that while m = 50 is great for long periods, it is not good for “randomness”. So that is another avenue for improvement.

We can also look at m = 61 instead of m = 50 as it might be nicer:

x[0] = 1, a = 26, c = 0, m = 61 :: 1 26 5 8 25 40 3 17 15 24 14 59 9  ...
x[0] = 1, a = 30, c = 0, m = 61 :: 1 30 46 38 42 40 41 10 56 33 14 54 ...
x[0] = 1, a = 31, c = 0, m = 61 :: 1 31 46 23 42 21 41 51 56 28 14 7  ...
x[0] = 1, a = 35, c = 0, m = 61 :: 1 35 5 53 25 21 3 44 15 37 14 2 9  ...
x[0] = 1, a = 43, c = 0, m = 61 :: 1 43 19 24 56 29 27 2 25 38 48 51  ...
x[0] = 1, a = 44, c = 0, m = 61 :: 1 44 45 28 12 40 52 31 22 53 14 6  ...
x[0] = 1, a = 51, c = 0, m = 61 :: 1 51 39 37 57 40 27 35 16 23 14 43 ...
x[0] = 1, a = 54, c = 0, m = 61 :: 1 54 49 23 22 29 41 18 57 28 48 30 ...
x[0] = 1, a = 55, c = 0, m = 61 :: 1 55 36 28 15 32 52 54 42 53 48 17 ...
x[0] = 1, a = 59, c = 0, m = 61 :: 1 59 4 53 16 29 3 55 12 37 48 26 9 ...

I look at bigger values of a, as small ones give obvious powers. These look alright, but have their issues too.

Final Results

My quest for better numbers continues. There is also a limit to how many inputs I can handle, so maybe I should consider using some better state exploration than “loop over everything”. Naturally, more work is required for a better scoring system.

Finding out My Hashtable is Awful

Sun, 18 Jan 2026 16:00:00 -0400

Intro

I once found myself bored, though, not quite the useful kind of boredom. I did not want to do my projects or something nice. At the same time, I did not want to just spend it watching youtube or similar. Thus, I thought, might as well do some leetcode.

I am not particularly fond of leetcode generally. Some algorithms are nice, most are not, and I rarely learn as opposed to “memorize” patterns. Doing union-find on leetcode rarely feels as nice as using it for constant-folding optimizations in a compiler. Still, leetcode is necessary for a lot of interviews (for now, given AI and grindflation).

I went through a couple of algorithms, doing all of them in C to provide a modicum of joy. Well as close as you can get to joy given extra annoyances. E.g. leetcode C compiler setup fails on signed integer overflow. This requires -fsanitize=signed-integer-overflow on my gcc setup. This setting has its uses, but not when I just wanted to do a quick fnv1a.
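For reference, a sketch of 32-bit FNV-1a that sidesteps the sanitizer entirely by doing all arithmetic in uint32_t, where wrapping is well defined. The function name is my own; the offset basis and prime are the standard 32-bit FNV constants.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a byte buffer. */
uint32_t fnv1a_32(const void *data, size_t len) {
    const unsigned char *p = data;
    uint32_t h = 2166136261u;  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];             /* xor the byte in first (the "1a" order) */
        h *= 16777619u;        /* FNV prime; wraps mod 2^32 by design */
    }
    return h;
}
```

The signed overflow check only trips when the hash state is kept in a signed int, which is easy to do by accident when the keys themselves are ints.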

Anyway, some problems went well, usually those I knew or those that are obvious to me. Some did not, and I had to give up and look up the solution and try to understand it. There were a few that relied on stuff that would take me a while to implement in C too.

One problem I had was “219. Contains Duplicate II”. I started with a simple naive double loop, essentially doing a sliding window, but it left me wanting. I was nearing the end of my leetcode energy, so I decided to look up the solution. Hashtable, obviously. Very simple too, just a get and a put in a loop.

C does not have a hashtable natively. Leetcode apparently provides uthash, though I had never seen it in the wild (it becomes obvious why when you see the ergonomics). There is also the libc hash table, but that is just one of POSIX’s April Fools jokes.

Anyway, I implemented one recently while testing out my C build system (Guile script, nothing too fancy, though it does cache). It even has SSE2 SIMD and 64bit SWAR (SIMD Within A Register). So I thought it would be good practice to do another one, and it very much was in hindsight.

Spec matters

got, good-old-table as I called it, based itself on Abseil SwissTable. Except I tried to avoid just reimplementing someone’s solution. I only read an overview and skimped on the details. It sounded simple enough.

There was a point though where I wondered why Abseil seemed to use tombstones. But I did not overthink it and figured I would learn why eventually.

I also thought about doing benchmarks to compare to at least C++ STL std::unordered_map. But, that could have been too much work for a not too serious hashtable. Especially since it was more useful to compare to more than one library.

Thus, I decided to do something similar for my solution. But doing SWAR, and especially with ints as opposed to 64 bit values seemed like a waste of time. A speedup, sure, but not that serious. As such, I took to just doing linear probing. It should be good enough, that’s the base for a SwissTable anyway.

Absolute failure

So straight from memory I implemented a simple linear probing hashtable.

Here is its struct:

struct HashMap {
    int capacity;
    int length;
    struct {
        int valid, key, val;
    } arr[];
};

Nothing too involved besides the flexible array member at the end (the old zero length array trick). That is just used to allocate everything in a single allocation.
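To show what the single allocation looks like, here is a sketch of a constructor for that struct. It is not the create_ht used in the solution, which is not shown here; the name create_ht_sketch and the use of calloc are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

struct HashMap {
    int capacity;
    int length;
    struct {
        int valid, key, val;
    } arr[];
};

/* Header and all slots live in one zeroed block, so every slot starts invalid. */
struct HashMap *create_ht_sketch(int capacity) {
    assert(capacity > 0);
    struct HashMap *map =
        calloc(1, sizeof *map + (size_t)capacity * sizeof map->arr[0]);
    if (!map) {
        return NULL;
    }
    map->capacity = capacity;
    return map;
}
```

One calloc means one free, and the slots sit right behind the header in memory.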

I originally did not include length for funny reasons, but it is necessary. Well, you can optimize it out given that leetcode tests will never get that bad, but let’s not get into that.

So after finishing up and cleaning up any immediate compile-time errors, I ran the solution on basic tests. Success, and given how simple the solution really is, that was to be expected.

Here is how the “solution function” looks:

bool containsNearbyDuplicate(int *nums, int numsSize, int k) {
    struct HashMap *map = create_ht(64);

    for (int i = 0; i < numsSize; i++) {
        int *p = get_ht(map, nums[i]);
        if (p && i - *p <= k) {
            return true;
        }
        put_ht(&map, nums[i], i);
    }

    return false;
}

Freeing wastes cycles. Anyway, satisfied, I hit the submit button… Timed out.

That was weird. Even if my table is not particularly fast, that was orders of magnitude too slow. This was on a testcase of only 54500 ints, so it should not take that long.

I tried a couple of optimizations. E.g. valid field is not necessary since a key can never be more than 2^30. I even tried to just increase the preallocated memory to see if that could improve runtime, even if just for this one.

All of them were bandaids at best. This was a fundamental issue.

Evil assumptions

So what was at the core of my put and get that made this many times slower than it should be?

Well I had a simple assumption. The valid entry could be anywhere after the initial index from hash. If you do not see the problem immediately, think about it. What made it worse was that I do not need to delete anything.

Every entry is allocated right after the last one. So I was forcibly and completely needlessly going through the entire hash table. For every call to get and most calls to put. Most calls to put was because I had this useful thing called early return.

Ironic that the problem itself also has an early return.

The fix to get was adding an else return 0;. Immediate improvement.

Here it is so you can see the error of my ways:

for (int i = init; i < map->capacity; i++) {
  if (map->arr[i].valid) {
    if (map->arr[i].key == key) {
      return &map->arr[i].val;
    }
  } else { // just this part
    return 0;
  }
}

This is also a case of a small optimization hiding a better one. The original code used map->arr[i].valid && map->arr[i].key == key. This couples them when they really should have been separate.

The fix to put was slightly longer, but essentially the same.
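For completeness, a sketch of how both operations look once the probe stops at the first empty slot. This is not the code from the solution: the names, the fixed CAP, the modulo stand-in for a hash, and the wrap-around instead of a resize are all assumptions made to keep it short.

```c
#include <stdbool.h>

#define CAP 64 /* hypothetical fixed capacity, no resizing in this sketch */

struct slot {
    int valid, key, val;
};

/* Insert or update; returns false only when the table is completely full. */
bool put_sketch(struct slot *arr, int key, int val) {
    unsigned init = (unsigned)key % CAP;   /* stand-in for a real hash */
    for (unsigned n = 0; n < CAP; n++) {
        unsigned i = (init + n) % CAP;     /* wrap around the end */
        if (!arr[i].valid) {               /* first empty slot ends the probe */
            arr[i].valid = 1;
            arr[i].key = key;
            arr[i].val = val;
            return true;
        }
        if (arr[i].key == key) {           /* existing key: overwrite */
            arr[i].val = val;
            return true;
        }
    }
    return false;
}

/* Lookup stops at the first empty slot: the key cannot be stored past it. */
int *get_sketch(struct slot *arr, int key) {
    unsigned init = (unsigned)key % CAP;
    for (unsigned n = 0; n < CAP; n++) {
        unsigned i = (init + n) % CAP;
        if (!arr[i].valid) {
            return 0;
        }
        if (arr[i].key == key) {
            return &arr[i].val;
        }
    }
    return 0;
}
```

The early stop is only valid because nothing is ever deleted. With deletions, an emptied slot in the middle of a probe chain would hide the keys behind it, which is presumably what tombstones are for.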

I only realized this after being very annoyed by it and testing it locally.

Thus, the problem was submitted and solved in reasonable time.

Unreasonable time

Ok, no. The time was ~800ms. This counts as “solved”, but realistically extremely slow. This is array search in a loop speeds, if not worse. The other solutions that leetcode presented were 100ms at worst. I was bottom 0.20%.

This prompted me to do some extra testing. Nothing too precise or involved, but enough to see the problems. You can find the code for it here. I will include the (incomplete) table here anyway to avoid spoilers:

Name	Time	Time (-O2)
fine.c	5 ms	4 ms
fine.cpp	26 ms	6 ms
with_got.c	2227 ms	691 ms
awful.c	8292 ms	1893 ms

fine.c is the good solution, awful.c is the original bad solution. There is also fine.cpp which uses unordered_map and with_got.c which uses my got library.

The only test case I used was the one I timed out on. Thus, this table is a little useless to compare my table and unordered_map in my opinion. But it does show you the magnitudes of difference from the bad ones.

Still, it is interesting to see that my fine.c is better than or comparable to unordered_map locally. Yet, on leetcode the C++ solution finishes in ~80ms, and mine does not.

I then decided to apply the valid field removal I talked about. I was able to get the time down to ~350ms, which is still far from ideal, but more manageable.

I also tried the uthash that leetcode provides. That one gave me ~90ms. The ergonomics and documentation were questionable. Lots of macros, which are not friendly to leetcode. You have to create your own entry and even include a magic field for a hash handle. The primary purpose of which seemed to be iteration, but could be more. Well now that is a little sad.

I then went on to do some stronger optimizations. I started using static preallocated memory. Got rid of length for good with that. Tested different preallocation sizes. I found that allocating 256KB is enough to pass the tests with flying colors. Any more slowed it down, and so did any less.

Done in 2ms, and 16MB of memory per what leetcode reports. C++ uses ~100MB and uthash uses ~60MB for comparison. Beats ~98% and ~96% respectively. Pride restored, technically.

I included this as best.c in that same repo.

What happened

Going from ~800ms to ~350ms by just removing the valid field is not too surprising. This simply uses less memory which means we have to seek less if we hit a collision. Better cache usage and whatnot, possibly compiler optimizations too (leetcode uses -O2).

Going from ~350ms to 2ms is a different question. But the trick is that by having so much capacity (256KB), I basically turned get and put into array access operations, not “amortized”, actual O(1). At that point you could just create buckets for every used number.
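A sketch of the general shape of that approach, combining the static preallocation with the sentinel that replaces the valid field. It is not best.c: the table size, the multiplicative hash, and the function name are assumptions, and it relies on the input being shorter than SLOTS so that an empty slot always exists.

```c
#include <limits.h>
#include <stdbool.h>

#define SLOTS (1u << 17)  /* hypothetical size, must exceed the input length */
#define EMPTY INT_MIN     /* sentinel outside the key range, replaces `valid` */

static struct {
    int key, val;
} table[SLOTS];

bool contains_nearby_duplicate_sketch(const int *nums, int n, int k) {
    for (unsigned i = 0; i < SLOTS; i++) {
        table[i].key = EMPTY;
    }
    for (int i = 0; i < n; i++) {
        /* multiplicative hash, keeping the top 17 bits as the slot index */
        unsigned s = ((unsigned)nums[i] * 2654435761u) >> 15;
        while (table[s].key != EMPTY && table[s].key != nums[i]) {
            s = (s + 1) & (SLOTS - 1); /* linear probing with wrap-around */
        }
        if (table[s].key == nums[i] && i - table[s].val <= k) {
            return true;
        }
        table[s].key = nums[i]; /* remember the latest index of this value */
        table[s].val = i;
    }
    return false;
}
```

With this many slots relative to the number of keys, most lookups hit their slot on the first try, so each get and put is close to a plain array access.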

I could and probably should do some perf testing to see whether there is another obvious mistake. E.g. if anything hashes to the end of the array that would always force a resize, a potentially serious issue. Still, I am satisfied with getting 2ms for now.

What did we learn

I should go fix my got library. I call it not too serious, but I cannot allow this level of underperformance.

The main lesson would probably be the importance of assumptions. If you take in wrong or expensive assumptions, you may suffer. On the other hand, if you take in correct assumptions, you can benefit a lot from them. This often involves a tradeoff with generality like what I did for best.c, but other kinds exist too.

Update: the got library should now be fixed (for now, before more bad assumptions show up). I updated the repo by adding the (new) version, and it is roughly on par with unordered_map there. Though, as I said, comparing hash tables based on one testcase and only on sorted integers is not a good idea.

To Debug or Apply AI

Mon, 25 Aug 2025 23:04:15 -0400

Intro

Recently, I have come across a peculiar issue. A bug most deranged. A kind of unspoken true evil you would not easily find in your kind and forgiving languages.

Can you find it in this cleaned-up function with totally all the necessary info?

int send_data(int sd, unsigned char *txk, unsigned char *data,
              uint64_t data_len, struct header *pack,
              struct sockaddr_in *saddr) {
  struct header spack;
  const uint64_t max_size = sizeof(pack->packet) - ABYTES;
  pack->packet.off = 0;

  while (data_len > max_size) {
    memcpy(&spack, pack, sizeof(spack));
    send_pack(sd, txk, data, max_size, &spack, saddr);

    pack->packet.off += max_size;
    data_len -= max_size;
  }

  return 0;
}

To Debug

Is to find wrongdoings

Ignore the code quality for a bit.

There are only so many things that could fail in these few lines, but I will spare you the need to write everything around it yourself (somehow).

The issue is an infinite loop, pretty simple.

We only have one loop (ignore send_pack for a moment), so it’s got to be that. But what could be wrong? max_size is clearly being subtracted from data_len, and nothing else really matters for this while loop. We could prove correctness or use a debugger, but it is easier to just put a few printfs like a true master.

Alright, after sprinkling them like ~~friends~~ mines on a minefield, we find a trace:

data_len: 512
max_size: 248

memcpy()..
send_pack()..

data_len: 512
max_size: 0

pack->packet.off..
data_len..

data_len: 512
max_size: 0

A 0 in search of a better place

Hmm, that is a 0 where it was not supposed to be. Now, this would not be surprising if we modified max_size somewhere. Maybe its address is messed with in some pointer I did not see. But, it is a const, right?

Checking the full code again.. and no. max_size is a const after all. There are no assignments or addresses misused.

Alright, there is clearly a change after the memcpy and send_pack. Got to be those then. Well, the memcpy is fine, those are equal sizes and sizeof is not incorrect thankfully.

What about my send_pack? I have tested it before and it seemed to work, but maybe it is completely broken inside for this specific use case.

Here it is in all its (cleanish) glory:

int send_pack(int sd, unsigned char *txk, unsigned char *data,
              uint64_t data_len, struct header *pack,
              struct sockaddr_in *saddr) {
  pack->packet.size = data_len + ABYTES;
  memcpy(pack->packet.payload, data + pack->packet.off, data_len);

  hash(pack->packet.hash, sizeof(pack->packet.hash),
       pack->packet.payload, data_len, 0, 0);

  encrypt(&pack->packet, 0, &pack->packet,
          sizeof(pack->packet) - ABYTES, 0,
          0, 0, pack->nonce, txk);

  sendto(sd, pack, sizeof(*pack), 0, (struct sockaddr *)saddr,
         sizeof(*saddr));
  return 0;
}

Could look nicer, let’s ignore that, not important or related at all.

Well, nothing immediately stands out, but look closer.

Const does not mean const always

data_len is used to decide the packet size and it is even in the memcpy and hash. Of course, in this case size refers to the payload, but what is max_size set to again?

const uint64_t max_size = sizeof(pack->packet) - ABYTES;

Hmm, I guess I never showed, but pack->packet is the entire struct, not the payload.

struct packet {
  // ...
  uint64_t off;
  unsigned char payload[200];
};

struct header {
  // ...
  struct packet packet;
};
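To put the difference between those two sizes side by side, here is a small sketch. ABYTES is assumed to be 16 (its real value is never stated here), and the hash and size fields are guesses standing in for the elided part of the struct. The helper names are hypothetical too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: ABYTES (authentication bytes) is 16; the real value is not
   given. The hash and size fields stand in for the elided "// ..." part
   of the struct and are guesses as well. */
#define ABYTES 16

struct packet {
  unsigned char hash[32]; /* hypothetical field */
  uint64_t size;          /* hypothetical field */
  uint64_t off;
  unsigned char payload[200];
};

/* What the buggy code computes: bytes in the whole struct minus ABYTES. */
size_t buggy_max_size(void) { return sizeof(struct packet) - ABYTES; }

/* How many bytes the payload array can actually hold. */
size_t payload_capacity(void) {
  return sizeof(((struct packet *)0)->payload);
}

/* Bytes a memcpy of buggy_max_size() would write past the payload. */
size_t overflow_bytes(void) {
  assert(buggy_max_size() > payload_capacity());
  return buggy_max_size() - payload_capacity();
}
```

With any layout where the fields in front of payload add up to more than ABYTES, overflow_bytes() is positive, and that many bytes land outside the array.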

So taking the size of that gives you too many bytes to work with (it is larger than just the payload). Not so much that everything is corrupted, but a handful of things will be. The end result is this friend:

memcpy(pack->packet.payload, data + pack->packet.off, data_len);

It will go too far and overwrite some stack space. And, of course, what comes after that spack we use inside of send_pack on the stack?

struct header spack;
const uint64_t max_size =
    sizeof(pack->packet) - ABYTES;

So we just write too much. Stack corruption, yay!

data has space because packet.off is also incremented by max_size, a.k.a. 0. And of course we blow past spack into max_size, because both of them are right next to each other on the stack. const is just a compiler hint when you write to arbitrary stack memory.
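Overrunning a real stack variable is undefined behavior, so it cannot be shown reliably in a portable example. What can be sketched is the adjacency itself. The struct below is a hypothetical stand-in for the stack frame, with a buffer and a value sitting right behind it; fake_frame and copy_into_spack are made-up names, and a real compiler is free to lay out locals differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the stack frame of send_data. The struct only
   makes the "right next to each other" layout explicit and well defined. */
struct fake_frame {
  unsigned char spack[16]; /* plays the role of struct header spack */
  uint64_t max_size;       /* plays the role of const uint64_t max_size */
};

/* Copies len zero bytes starting at the spack field and returns max_size
   afterwards. The write stays inside the fake_frame object, but any len
   above sizeof spack spills into max_size. */
uint64_t copy_into_spack(size_t len) {
  static const unsigned char zeros[sizeof(struct fake_frame)] = {0};
  struct fake_frame frame;

  memset(&frame, 0xAB, sizeof(frame));
  frame.max_size = 100; /* any nonzero value works */

  assert(len <= sizeof(frame) - offsetof(struct fake_frame, spack));
  memcpy((unsigned char *)&frame + offsetof(struct fake_frame, spack),
         zeros, len);
  return frame.max_size;
}
```

Calling copy_into_spack(sizeof(((struct fake_frame *)0)->spack)) leaves max_size alone, while copying the full sizeof(struct fake_frame) bytes zeroes it, which mirrors the 0 in the trace.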

And here I thought that using the stack was always superior to malloc. Haha.

Well, this was all in the debugging build without really any optimizing, so what about that -O2?

It just works. Ok, no it doesn’t. But there is no infinite loop, it just sends the packets as it is supposed to and ends the program. Those packets are not correct and the other end kindly lets me know it did not like what it had to receive.

Data is probably corrupted in some other ways (max_size is probably not on the stack anymore). Thankfully fixing it (i.e. using sizeof(pack->packet.payload) in max_size) makes it run perfectly on either build. At least as far as I am aware and that is good enough for prod (i.e. me).
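A minimal sketch of that fix, with one extra guard, might look like the following. It is not the actual project code: ABYTES is assumed to be 16, the struct is trimmed down, fill_packet and chunk_data are hypothetical names, and the hashing, encryption, and sendto() steps are left out. Unlike the cleaned-up loop earlier, this version also handles the final partial chunk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ABYTES 16 /* assumption: the real value is not stated */

struct packet {
  uint64_t size;
  uint64_t off;
  unsigned char payload[200];
};

/* Hypothetical helper: fills one packet with the chunk at offset off.
   Refuses chunks that would not fit instead of overrunning payload. */
int fill_packet(struct packet *p, const unsigned char *data,
                uint64_t off, uint64_t len) {
  if (len > sizeof(p->payload) - ABYTES)
    return -1;
  p->off = off;
  p->size = len + ABYTES;
  memcpy(p->payload, data + off, len);
  return 0;
}

/* Splits data into chunks and returns how many packets were filled,
   or -1 on error. */
int chunk_data(const unsigned char *data, uint64_t data_len) {
  /* Sized from the payload, not the whole struct. */
  const uint64_t max_size = sizeof(((struct packet *)0)->payload) - ABYTES;
  uint64_t off = 0;
  int sent = 0;

  /* A zero max_size would make the loop below spin forever. */
  assert(max_size > 0);

  while (off < data_len) {
    struct packet p;
    uint64_t len = data_len - off;
    if (len > max_size)
      len = max_size;
    if (fill_packet(&p, data, off, len) != 0)
      return -1;
    /* A real sender would hash, encrypt, and sendto() the packet here. */
    off += len;
    sent++;
  }
  return sent;
}
```

The length check in fill_packet is the design choice worth noting: even if a caller gets max_size wrong again, the result is an error code rather than a write past payload.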

Thus, lessons learned. ~~Use a safer language~~, ~~write nicer code~~, ask AI?

To Apply AI

It is not included solely for joke purposes

Indeed, it did not take me long really, probably not much longer than what it took you to get to here reading-wise. Not the hardest bug, nor the most annoying one, but more protein in code is rarely good.

Now I am not a vibe coder nor do I condone such violence upon the computer. And I would not really say I use all that much AI tooling (certainly not the latest and greatest). Still, I find it useful to occasionally indulge in these simple chatbots that most people would use.

Brainstorming, summarizing, finding info, maybe even being of help in debugging sometimes?

So I shall run some basic tests: will it (GPT-5, Claude Son 4) find this bug?

Tests for tests’ sake

First the basic basic, copy and paste the entire file (~300 LOC). Same question to both:

Can you fix this bug for me please. send_data is in an infinite loop. pasted code

Not the finest prompt engineering, but what do I know of the arcane. This is realistic in that this is the info I know from just running it. The total code size should not be an issue, it is not long, and not that dense. Yet, both of these nice machines reply with essentially the same issue and fixes.

Has to be those offsets. Granted, they are done inside send_pack, so the bots must be confused upon seeing send_data without any offsets in the argument list. They must assume that offset handling is broken and that the infinite loop somehow follows from there?

If I listened to the voices, I would be led astray.

Except what was that, Claude Son 4? Ah right, it randomly suggested using pack->packet.payload in place of pack->packet in that cursed sizeof.

It was an off-hand change, but a correct solution nonetheless.

Line 9: Also fixed max_size calculation to use sizeof(pack->packet.payload) instead of sizeof(pack->packet) - this ensures we’re using the actual payload size

I did not even notice it immediately. GPT-5 did not even compete; it ignored that entirely. I had repeated trials, but given the kinds of “memory” chatbots may use, I am skeptical of how useful that would be.

Another test I prepared was feeding it just the essentials: the structs, the send_pack, and the send_data, all you need. Same question. (Yes, could be this mystical “memory”, but it is not the exact same code at least)

GPT-5 is doing the same as it did before.

Son 4 is still obsessed with offsets, but it has the right idea about it being sizeof(payload). Except that for some reason it excluded ABYTES, the authentication bytes, which should count given that they are appended to the payload (i.e. in it)?

Ok, last test: what if I introduce it piecemeal?

First the send_data, front and center. Same good propositions from GPT: those data offsets, underflow, max_size being 0, but without solutions. Though what do you know, GPT does not know about the internals of send_pack. Let us give it that next then. Same boring stuff about offsets again. I share the structs, and GPT finally realizes that it indeed was that sizeof thing.

Claude’s second dearest does not perform as well as you would think given the two tests above. Obsession over offsets given only send_data is expected. But then the same happens with send_pack? Well, it probably fixed it for me according to the previous function, so whatever. Finally, the structs open its eyes and it fixes the sizeof, while also removing ABYTES. Hmm.

To be advised by the enemy

From these simple tests, you could make a quick generalization of “Claude likes big code”, and “GPT likes chewing slowly”. I would also like to note that GPT-5 does not produce much text at all in comparison to Son 4, which could be significant and even explain this difference. But this is no peer-reviewed study across three generations. I would recommend not dwelling on these points too much; draw your own conclusions.

And Conclusion

The result is not the most impressive. GPT-5 required specific massaging of pieces to get it right, which is annoying. Also note that if I did not know that the struct sizes were the issue, I would probably not have given them to GPT-5. Yet they were the last piece that it needed.

Claude did better, but in weird ways. Sure, it got the right answer given the full code, but that seems like an unnecessary divulgence (I have not hosted it on GitHub YET). And when given less information, it seems to have gotten more lost and started removing ABYTES for some reason.

Both also went heavy on offsets; basically all their tokens were spent on that. It probably confused them too much. Maybe it is wrong, but doubt is an enemy.

This is a hobby project in the end. I don’t put too much focus on perfection, but I do care about learning and experiencing as much as possible myself. Will I use AI for debugging here? Maybe in desperation, after hours of incorrect turns. I have yet to get there with this project. But soon enough.