
Couple suggestions #264

Open
rdong8 opened this issue Aug 5, 2020 · 5 comments

Comments

@rdong8

rdong8 commented Aug 5, 2020

I know I could probably do it myself in Excel, but I think it would be useful to have a geometric mean of the test results. Also, could the Zig language be added? And one last thing: I read this comment on Reddit in response to this benchmark:

> I don't know if you have other benchmarks on newer hardware, but at least the one you linked is not timing compilation and runtime consistently. The Python scripts, for example, are being timed in their entirety (it doesn't really do compilation); compiled languages like C are getting compiled first, and only timed on their runtime; Java is getting compiled to JVM bytecode first, but modern JVMs have Just-in-Time compilation to make things faster over time (modern Java isn't as slow as people used to gripe about); and Julia gets run as a script, but the JIT compilation is being timed with the execution (first time a function is called is usually slow cause it has to compile, but subsequent calls are fast).

@nuald
Collaborator

nuald commented Aug 17, 2020

Geometric mean calculation requires more resources and I'm not sure it's worth it. In theory I'd prefer to report the median and quartiles, as in my own project (https://github.com/nuald/simple-web-benchmark), but that could be overkill here: the algorithms used are not consistent or optimized enough to provide accurate timing between iterations. The numbers are good enough for comparing languages in average situations, but highly accurate, optimized benchmarks are out of scope for this project. Please consider it a playground - we have some numbers, but if you want high precision and deep comparisons, please use appropriate algorithms and tools (and see below for some examples).
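For reference, this is roughly what the median-and-quartiles style of reporting looks like, standard library only (the sample timings here are made up for illustration):

```python
# A sketch of median-and-quartiles reporting, standard library only.
# The sample timings are invented for illustration.
from statistics import median, quantiles

samples = [1.92, 1.88, 2.10, 1.95, 2.31, 1.90, 1.99]  # seconds per run

q1, _, q3 = quantiles(samples, n=4)  # quartile cut points
print(f"median={median(samples):.2f}s  IQR=[{q1:.2f}s, {q3:.2f}s]")
```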

Zig tests have already been requested, but we decided not to add them - please see the discussion in #188.

Performance variance for JIT languages has been considered too: see #248. Right now the proposed approach is to run some minimal tests before the benchmarks (WIP) - that addresses lazy initialization and some precompilation. But a proper warm-up doesn't necessarily give better results, as explained in that ticket. I guess proper benchmarks for JIT languages would have GC disabled, but that would be out of scope for this project: we don't aim for heavily optimized code, but rather for average code.

As for the quoted text, I'm not quite sure what the author's point is. Surely there can be edge cases where someone is interested in "compile + run" time, but the majority of developers don't need it (especially if compilation has several phases like linting, preprocessing, etc.). The same goes for JIT - surely it can optimize frequently called methods, but in reality that depends on the task. Profilers and other tools could help with optimization too, but that is out of scope for this project. The goal is to show average performance; it's certainly possible to have well-optimized versions in many languages (especially using tricks like inline assembler or calls into C).

@rdong8
Author

rdong8 commented Oct 31, 2020

> Geometric mean calculation requires more resources and I'm not sure it's worth it. In theory I'd prefer to report the median and quartiles, as in my own project (https://github.com/nuald/simple-web-benchmark), but that could be overkill here: the algorithms used are not consistent or optimized enough to provide accurate timing between iterations. The numbers are good enough for comparing languages in average situations, but highly accurate, optimized benchmarks are out of scope for this project.

By geometric mean, I meant taking the geometric mean of all test results for a language implementation, and doing this for every language to build a final results table. You'd take the geomean of, e.g., CPython's brainfuck, base64, json, matmul, and havlak times and put that into a new table at the end, then do the same for every other language. This way you could find the best performer overall. This is what the Benchmarks Game does.
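A rough sketch of what I mean (the test names match the suite, but the timings below are invented for illustration):

```python
# A rough sketch of the proposed summary table. The test names match
# the suite, but the timings are hypothetical.
from math import prod

def geomean(times):
    """Geometric mean of a sequence of positive timings."""
    return prod(times) ** (1 / len(times))

# Hypothetical per-test results in seconds: {language: {test: time}}.
results = {
    "CPython": {"brainfuck": 210.0, "base64": 4.1, "json": 9.3,
                "matmul": 150.0, "havlak": 310.0},
    "C (gcc)": {"brainfuck": 1.9, "base64": 1.3, "json": 0.6,
                "matmul": 3.3, "havlak": 12.4},
}

# Final table: one geomean score per language, best performer first.
for score, lang in sorted((geomean(list(t.values())), l)
                          for l, t in results.items()):
    print(f"{lang:10} {score:8.2f}")
```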

As for the quoted text, I believe the criticism is that this benchmark is not consistent in how it times compiled languages: Julia's JIT compilation gets timed together with execution, while Java and C are compiled first and timed on runtime alone.

@nuald
Collaborator

nuald commented Oct 31, 2020

> taking a geometric mean of all test results for a language implementation

The only pure language tests are the bf ones; all the others are mostly library tests. The havlak tests could be considered language tests too, but I expect we'll remove them in the near future, as I found some inconsistencies there and it's possible they are not quite fair. Given that, I don't think it's worth making any overall mean calculation, as libraries are generally orthogonal to languages (like NumPy, with its dependencies written in C and Fortran).

As for JIT, unfortunately I don't see any way to do a fair comparison. However, it looks like there is a misunderstanding about the measurement, so I've added notes in #281: time is measured only for the benchmark itself, so Julia's JIT compilation doesn't affect the results, as it happens before the benchmark. Run-time JIT optimization is another story, though, and I guess for now we're just going to live with that.
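To illustrate the measurement point, here is a simplified sketch of where the timer starts (`bench` is a stand-in workload, not the actual harness code):

```python
# A simplified illustration of where the timer starts. `bench` is a
# stand-in workload, not the actual harness code.
import time

def bench():
    return sum(i * i for i in range(1_000_000))

bench()  # warm-up call: lazy initialization (and JIT compilation, in
         # runtimes that have it) happens here, outside the timed window

start = time.perf_counter()  # timing covers the benchmark body only
bench()
print(f"{time.perf_counter() - start:.3f}s")
```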

@ibsusu

ibsusu commented Mar 8, 2021

@nuald Taking a look at that Zig issue - I don't even like the language, but I think your stance of doing a big allocation inside a hot loop is pretty crazy, ngl.

@nuald
Collaborator

nuald commented Mar 8, 2021

Heh, 132 KB is not that big, plus it's closer to real-world situations: in the majority of use cases one would need to allocate memory for the encoding/decoding operations. Please note that the name "Base64" could be a little misleading here, and the notes clearly indicate that:

> Testing base64 encoding/decoding of the large blob into the newly allocated buffers.

All the other tests allocate memory too (granted, lower-level languages have an advantage here, as they can use stack allocation). I don't see any particular reason to make an exception, sorry.
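For comparison, this is roughly the allocation pattern in question, sketched in Python (the blob size mirrors the ~132 KB figure; the iteration count is arbitrary, and this is not the actual benchmark code):

```python
# Roughly the allocation pattern in question: each iteration
# encodes/decodes into freshly allocated buffers rather than reusing
# one. Blob size mirrors the ~132 KB figure; iteration count is arbitrary.
import base64
import os

blob = os.urandom(132 * 1024)

for _ in range(1000):
    encoded = base64.b64encode(blob)     # new bytes object each time
    decoded = base64.b64decode(encoded)  # ditto

assert decoded == blob
```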
