Veserv 10 hours ago [-]
What is the point of making up claims of "extreme" performance without any accompanying benchmarks or comparisons?
It really should be shameful to use unqualified adjectives in headline claims without also providing the supporting evidence.
MDA2AV 5 hours ago [-]
I agree, I'll try adding some. We use the tool on a benchmarking platform, so we run it hundreds of times daily and have done dozens of tests against pretty much every other load generator I know of. Numbers are always tied to the hardware where you run it, though, and benchmarks provided by the maintainer himself are typically biased and won't match what you get.
I personally never put much stock in published benchmarks; it's much better to use the tool and see for myself, so I didn't think much about having a table of values there, but I can understand how it may help.
raks619 10 hours ago [-]
did you scroll down?
ziml77 10 hours ago [-]
I did and I still didn't see any numbers. Just a bunch of AI generated text about why it's supposedly fast. It even says it records numbers multiple times, so why aren't there any presented?
0x000xca0xfe 5 hours ago [-]
Interesting, I made something similar years ago, back before io_uring was around; it's just a couple of threads blocking on sendfile: https://github.com/evelance/sockbiter
It does need to pre-generate the request file, and you need enough RAM for both the running server and the cached file, but it needs almost zero CPU during the test run and can probably produce even more load than this io_uring tool.
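The sendfile-based design can be sketched roughly like this (a minimal Python illustration, not sockbiter's actual code; the tiny request and the byte-draining "server" are made up for the demo). The kernel copies straight from the page cache to the socket, so the sending thread burns almost no CPU:

```python
import os
import socket
import tempfile
import threading

# Pre-generate a file of back-to-back HTTP/1.1 requests, once, up front.
req = b"GET / HTTP/1.1\r\nHost: x\r\n\r\n"
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(req * 1000)
    path = f.name

# A loopback stand-in for the server under test that just drains bytes.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

total = []
def drain():
    conn, _ = srv.accept()
    n = 0
    while chunk := conn.recv(65536):
        n += len(chunk)
    conn.close()
    total.append(n)

t = threading.Thread(target=drain)
t.start()

# Load-generating side: one blocking sendfile loop per connection.
cli = socket.create_connection(srv.getsockname())
with open(path, "rb") as f:
    size = os.fstat(f.fileno()).st_size
    sent = 0
    while sent < size:
        # sendfile(2): kernel pushes file bytes to the socket directly.
        sent += os.sendfile(cli.fileno(), f.fileno(), sent, size - sent)
cli.shutdown(socket.SHUT_WR)
t.join()
cli.close()
os.unlink(path)
print(total[0])  # bytes the "server" saw: exactly the file size
```

With more connections you'd run one such loop per thread, all sharing the same cached request file.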
MDA2AV 4 hours ago [-]
Very cool!
So I just tried your tool and it just hangs. I see you're sending Connection: close requests; is this configurable to keep-alive, or even better, to send neither? In HTTP/1.1 it's best not to send keep-alive/close at all; never try to enforce it, as it isn't mandatory.
A lot of servers just ignore the close header and don't close the connection (like the one I am using), so this may be the issue I'm hitting.
0x000xca0xfe 3 hours ago [-]
Cool, thanks for trying it.
Try the -shutwr option if the server doesn't close the connection itself. I used it to test lots of exotic implementations, and there are weird things going on in overload situations and around connection management. NodeJS, for example, started dropping connections on localhost(!!) under high load.
The tool was built for high keepalive request counts; if the server is too fast, just use more requests, e.g. -n 1000000 or similar. Unfortunately, some servers close keepalive connections after relatively few requests; nginx has a default of 1000, for example.
This is just a simple tool I hacked together as a student to collect some data, didn't spend any time making it more accessible/user friendly, sorry.
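For reference, the nginx limit mentioned above is its keepalive_requests directive; raising it for a benchmark run looks roughly like this (the values here are illustrative, not recommendations):

```nginx
http {
    keepalive_requests 1000000;  # requests allowed per keep-alive connection
                                 # (default 1000; it was 100 before nginx 1.19.10)
    keepalive_timeout  75s;      # how long an idle keep-alive connection is kept
}
```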
MDA2AV 2 hours ago [-]
I ran into some Lua errors and fixed them; eventually I got it running with -shutwr, but the results are basically impossible:
----------- Summary ----------
Successful connections: 8 out of 8 (0 failed).
Total bytes sent . . . . . 2599999960.00 B
Total bytes received . . . 82520.00 B
Benchmark duration . . . . 85.94 ms
Send throughput . . . . . 30252779546.89 B/sec
Receive throughput . . . . 960176.69 B/sec
Aggregate req/second . . . 93085476.96
The received data is far too low. Also, 93 million requests per second: the only way this is possible is that the load generator isn't waiting for the server's response and processing it. But I guess that's to be expected, since there may be issues from me using a much more recent kernel than you did when building this.
I used -n 10000000 (10M)
0x000xca0xfe 1 hour ago [-]
As per the README responses are not checked by the tool.
If you only received ~80KB for 10M requests the server probably terminated the connection early before processing all requests (like nginx does after 1k requests on one TCP keepalive socket if you use the default configuration). Check the responses-XXX.txt files to see what happened. You then need to either adjust the server configuration or use multiple sockets with the max keepalive requests the server can handle.
If you run this tool on the same machine as the server process, the requests file is likely held in the file system cache (RAM and shared by all threads) and every recv() call by the server under test is essentially a memory copy at the speed of the machine's memory bandwidth, which can easily be >>10GB/s or millions of requests per second per connection. This is also way faster than typical servers can even parse HTTP/1.
But highly optimized servers running straight HTTP/1 without TLS or backend logic on multiple threads should absolutely hit multiple millions of requests per second with this tool. Researching how fast an HTTP/1 server can get was the reason I made this in the first place.
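A quick back-of-envelope check of that memory-bandwidth point (the 27-byte request and the 10 GB/s figure are illustrative assumptions, not measurements):

```python
# If recv() on localhost is essentially a memory copy, bandwidth alone
# permits enormous request rates before any HTTP parsing happens.
request_len = len(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")  # 27-byte toy request
mem_bw = 10e9                                            # assume 10 GB/s, conservative
print(int(mem_bw / request_len))                         # ~370M requests/s worth of bytes
```

Real servers fall far short of that because parsing, routing, and response generation dominate, which is exactly why the copy itself is "way faster than typical servers can even parse HTTP/1".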
MDA2AV 51 minutes ago [-]
Ah, I think I understand now: we're bombarding the server in an HTTP/1.1 pipelined fashion, not waiting for the server's response before sending the next request, and theoretically using infinite pipeline depth, since we never check any response and simply jam the server with as many requests as possible. That would explain the results. The issue is that we don't check whether the server can actually process this extremely high number of requests; most of them are likely just lost and never processed, so we're basically measuring the tool's output capability, not the server's performance.
I can see in the results .txt that only a small portion of the sent requests actually get a response. Also, not every server supports HTTP/1.1 pipelining, so those will flush once per request (the typical workload); servers that do support pipelining will show much higher throughput.
0x000xca0xfe 30 minutes ago [-]
Exactly. For GET requests, HTTP/1.1-conformant servers must support pipelining or close the connection.
So this is the best way to generate extreme load and stress-test the internal architecture of an HTTP/1 server. But yeah, the sendfile approach only works for this kind of testing, not in the general case.
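The pipelining pattern under discussion, writing every request before reading any response, can be sketched over a socketpair (a toy illustration; the mini server and the request are made up, not gcannon or sockbiter code):

```python
import socket
import threading

def serve(conn, n_expected):
    # Toy HTTP/1.1 server: answers every pipelined request on one connection.
    buf = b""
    answered = 0
    while answered < n_expected:
        data = conn.recv(4096)
        if not data:
            break
        buf += data
        # Each toy request ends with a blank line; answer each one found.
        while b"\r\n\r\n" in buf:
            _, buf = buf.split(b"\r\n\r\n", 1)
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
            answered += 1
    conn.close()

a, b = socket.socketpair()
n = 100
t = threading.Thread(target=serve, args=(b, n))
t.start()

# Pipelining: send all n requests back to back, THEN read responses.
# A non-pipelined client would wait for each response before the next send.
a.sendall(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n" * n)
got = b""
while got.count(b"200 OK") < n:
    got += a.recv(4096)
t.join()
print(got.count(b"200 OK"))  # -> 100
```

Drop the "THEN read" part and issue a recv after every sendall, and you get the one-request-in-flight behavior a server without pipelining support effectively enforces.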
qcoudeyr 4 hours ago [-]
From my benchmarks, I will keep using oha (https://github.com/hatoo/oha). oha is more complete than gcannon and has a similar req/s rate while handling IPv6, HTTPS, etc.
G3nt0 4 hours ago [-]
oha is one of the slowest load generators; you should look into h2load if you need h2/h3 support. I just tried oha and it uses more CPU than the server I am testing, not to mention its h2 and h3 results are just nonsense.
bawolff 9 hours ago [-]
Really stupid question from someone who doesn't know much about io_uring: wouldn't doing all this I/O async make the latency measurements less accurate? How do you know when the I/O starts if you are submitting it async in batches of 2048?
tuetuopay 7 hours ago [-]
The main difference with io_uring is that you're not blocking the thread, just as with O_NONBLOCK + epoll, but you don't have to rely on per-operation syscalls to do so: there's no expensive context switch to kernel mode. Using O_NONBLOCK + epoll is already async :)
In fact, you never know when a syscall actually starts executing, even with regular blocking calls. The only thing you're sure of is that the kernel "knows" about the syscall you want; you have no indication of whether it has started to run.
The real question is: are the classical measures accurate? All we have is an upper bound on the time it took: I fired the write at t0 and finished reading the response at t1. This does not really change with io_uring. Batches will mostly change one fact: multiple measurements will share a t0, and possibly a t1 when multiple replies arrive at once.
Is it important? Yes and no. The most important thing in such benchmarks is for the added delay to be consistent between measurements, and when it starts to break down. So it's important if you're chasing every µs in the stack, but not if your goal is lowering the p99 which happens under heavy load. In this case, consistency between measurements is paramount in order to get histograms and such that make sense.
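The shared-t0 effect described above can be modeled in a few lines (a toy simulation, not io_uring code; the 1 ms service time is an arbitrary stand-in):

```python
import time

def run_batch(batch_size, service_s):
    # Batched submitter: every request in the batch shares one t0 (taken at
    # submission), while each completion gets its own t1. The measured
    # t1 - t0 is therefore an UPPER BOUND on the true per-request latency:
    # late completions are also charged for the batch's earlier work.
    t0 = time.monotonic_ns()          # one t0 for the whole batch
    measured = []
    for _ in range(batch_size):
        time.sleep(service_s)         # stand-in for one completion arriving
        measured.append(time.monotonic_ns() - t0)
    return measured

lat = run_batch(8, 0.001)  # 8 requests completing ~1 ms apart
print([round(x / 1e6, 2) for x in lat])  # ms; monotonically growing
```

The bias is the same for every batch, which is why the histograms stay internally consistent even though each individual number is an upper bound.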
dijit 8 hours ago [-]
It's not a stupid question.
Normally, when I have run latency measurements in the past, I run them from the perspective of the caller, not the server.
In most cases this is over the network, a named pipe, or a socket file.
I guess it should be possible to run multiple runtimes inside a program that run independently.