Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mekhanik evgenii/1196 itog 2 #1973

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

EvgeniiMekhanik
Copy link
Contributor

No description provided.

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 8 times, most recently from d121627 to 2c52de8 Compare September 12, 2023 09:47
Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the scheduling algorithm only and it seems it has implementation bugs as well as design weakness - I think we can do better.

I'll review the modern research (e.g. https://www.usenix.org/system/files/nsdi22-paper-gao_peixuan.pdf considers several old approaches and introduces a new one), but so far at least the idea to link stream objects into the list and traverse them looks silly: spatial locality is just poor having that the stream objects are relatively large and not so many of them can fit in the same page (even using the slab allocator).

The first idea is that instead of dealing with real linked list nodes (especially double linked!) with 8 bytes for each pointer, we can move the scheduler metadata out from TfwStream and introduce a packed data structure, which can be placed on the same page with the whole scheduler data and all the TfwStreamSchedEntry nodes. The new data structure can be linked with only 1-2 byte offsets within the memory area. I.e. in the most trivial case we can leave the algorithm, but use a packed data structure with better spacial locality.

Maybe, having good spacial locality (need to estimate how many entities can fit single cache line) we can traverse the anchors list to find the right place for the current deficit.

It makes sense to skew the existing unfairness to that streams already being scheduled (some frames have been sent on them) should get slightly more priority than streams having higher weight to get lower average load time.

fw/http_stream_sched.h Outdated Show resolved Hide resolved
fw/http_stream_sched.h Outdated Show resolved Hide resolved
fw/http_stream_sched.h Outdated Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
tfw_h2_sched_stream_enqueue(stream, *parent);
entry = &stream->sched;
} else {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we break here? It seems stream is blocked and have not children, so why don't we try the next stream from the linked list? This should be either fixed or commented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use BUG() here, add appropriate commit for new algorithm

fw/http_stream_sched.c Outdated Show resolved Hide resolved
fw/http_stream.h Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 2 times, most recently from eb9f218 to a97b4fa Compare September 14, 2023 14:06
Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address all the comments and discussions (verbal meetings as well) about the algorithm in the comments for the file.

A couple of more comments.

It'd be good to benchmark the current implementation (as H2O reference) with an optimized one.

fw/http_stream_sched.c Outdated Show resolved Hide resolved
*dep = &sched->root;
}

void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://calendar.perfplanet.com/2018/http2-prioritization/ :

Chrome is the only browser to use the exclusive bit and it uses it on EVERY resource.

I hope it's an outdated story, but it'd be good to test Tempesta FW with the algorithm against Chrome. https://tempesta-tech.com/knowledge-base/HTTP2-streams-prioritization/ clearly shows that this isn't an outdated story.

*/
tfw_h2_stream_sched_remove(stream);
tfw_h2_change_stream_weight(stream, new_weight);
tfw_h2_add_stream_dep(stream, new_parent, excl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a common pattern for the scheduler: pull a stream from the priority queue, recalculate it's deficit and reinsert the stream back according to the new deficit. If streams A and B have the same weight, then we just intermix them, i.e. both of the resources will be downloaded with the highest delay.

It seems only progressive JPEGs benefit from the behavior while all other resources, including WEBP or PNGs images, aren't. This should be a browser's job to assign different weight to different resources and the same weights to images only, but I'm not sure that browsers are able to do this. I found only https://calendar.perfplanet.com/2018/http2-prioritization/, but it's old and it seems for example than Firefox assigns the same weights to CSS.

Please test how Firefox and Chrome assign weights with the current implementation (also good to check resources like Instagram with many images or Gmail), document it and probably we'll need to make the algorithm aware about progressive JPEGs and intermix only their streams (it seems this is what Cloudflare does or did).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is called during stream reprioritization, because of stream priority frame

@EvgeniiMekhanik
Copy link
Contributor Author

This PR also fixes #1884

fw/sock_clnt.c Outdated
Comment on lines 228 to 233
cwnd_avail = (tp->snd_cwnd > tp->packets_out ?
tp->snd_cwnd - tp->packets_out : 0);
cwnd_avail_in_bytes = cwnd_avail * mss_now;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a doubts about calculating cwnd. As I understand this place must use tcp_cwnd_test() for calculation of cwnd, our version is very rough. Further we make a frames relying on cwnd and this is also a question for me, because cwnd it's not an exact size of data that can be pushed into write_queue, exact value is tcp window. For more accurate calculations we must use tcp_mss_split_point() that relies on cwnd. Maybe such rough calculations chosen intentionally, if so please describe it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rework this place, but you also not right here. tcp_cwnd_test returns min(halfcwnd, cwnd - in_flight) on each iteration of the loop for calculation count of bytes that can be sent per ONE loop iteration. Since we call this function on each loop iteration it returns 0 only when cwnd - in_flight became equal to zero! So we can use cwnd - in_flight for our purpose, but also we should take into account len of socket write queue, because we can push a lot of skbs on previous iteration if headers are big (we ignore cwnd when we make headers) or if there are some special frames were sent. We also cant use tcp_mss_split_point because it deal with skb. We can use it only if we make frame for each skb (use skb->len as a limit of frame length), but it i think, that it is bad, since we should use tls encryption for each skb in this case. So i think that rough calculation is better.

fw/http_frame.c Outdated Show resolved Hide resolved
fw/http_stream.h Outdated Show resolved Hide resolved
fw/http_frame.c Outdated
#endif

static int
tfw_h2_make_frames_for_stream(TfwH2Ctx *ctx, TfwStream *stream,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to rename this function. Now we also encode headers here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

fw/http_stream.h Outdated
* @processed - count of bytes, processed during prepare xmit
* callback;
* @nskbs - count of skbs processed during prepare xmit callback;
* @mark - mark of the resp skb_head;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can loose mark in tcp_write_xmit() -> tso_fragment(). Do we have tests for mark forwarded to backend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create new issue #1987 because this problem is not resolved in this patch

fw/http_stream.h Outdated Show resolved Hide resolved
fw/http_frame.c Outdated
@@ -2548,6 +2561,8 @@ do { \
stream->xmit.is_blocked = !stream->rem_wnd;
T_FSM_EXIT();
}

ADJUST_TMP_AVAILABLE_CWND(*cwnd_awail, tmp_cwnd_awail);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can suggest do this on transition to next state from HTTP2_MAKE_HEADERS_FRAMES and HTTP2_MAKE_CONTINUATION_FRAMES, because check it every DATA frame which will be much more than HEADERS, not looks right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

fw/sock_clnt.c Outdated Show resolved Hide resolved
fw/ss_skb.h Outdated Show resolved Hide resolved
fw/sock.c Outdated Show resolved Hide resolved
fw/http_frame.c Outdated

stream->xmit.state = HTTP2_RELEASE_RESPONSE;

T_FSM_NEXT();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can use T_FSM_JMP here and other appropriate places that do not switch on every transition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as draft September 22, 2023 09:07
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 7 times, most recently from d549c9c to bb97f30 Compare September 29, 2023 13:32
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 3 times, most recently from 0912d77 to 8d636f1 Compare October 16, 2023 12:16
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as ready for review October 16, 2023 12:17
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 3 times, most recently from dfa5448 to 837e065 Compare June 5, 2024 13:33
There are a lot of changes in this commit:
- All skbs which contain HEADERS or DATA are placed in the
  queue of the corresponding stream, instead of being added
  to the socket write queue.
- In tcp_push_pending_frames() we call special socket callback,
  which is used to move skbs from stream with highest priority
  to the socket write queue. Count of moved skbs is determined
  by current TCP CWND. Delayed queuing of skb to the socket
  write queue allows higher priority data to be placed ahead
  of lower priority one.
- We encode headers during the previos operation, because
  otherwise there can be a situation then we send header
  which is encoded using hpack dynamic table, before the
  same header which was not encoded and client can't
  decode it.
- We make frames during the previous operation, when skb is not
  located in the socket write queue, so we can not care about
  sequence numbers and so on.
- Rewrite HTTP2 framing to state machine.
- We don't need skb flags anymore.
- Move applying of new settings to ss_do_send, because new
  settings should be applyed just before ack to this settings
  is pushed to the socket write queue, otherwise new settings
  can be applyed and some frames with this new settings will
  be sent before ack to this settings.

Closes #1946
Previously when we send RST stream to the client
we immediately move stream to the closed queue.
If client open a new stream we shrink this queue
and delete streams from this queue, so there is
a chance that we have no time to send RST stream
to the client. Now we move such stream to the
closed queue only after sending RST stream.
Also make little fix in stream processing.
We should move stream to HTTP2_STREAM_REM_CLOSED
state when client don't close stream from it's
side.
We should process stream in HTTP2_STREAM_CLOSED
state since we don't move client to closed queue
immediately.
Since all hpack encoding was moved to the
'ss_do_send' function we don't need any
additional locks, since 'ss_do_send'
is always called under the socket lock.
HTTP2_PRIORITY frames are allowed for the idle streams. Moreover
some clients use this ability to set weight and dependency for
idle stream, so we should handle HTTP2_PRIORITY for such streams.
According to our investigation ebtree is the fastest
data structure for stream prioritization, so we should
port it to our project.
Implement stream scheduler to have ability to get
the most priority stream. HTTP2 connection context
contains stream scheduler and each stream also
contains scheduler for its children, so we have a
tree of schedulers. Each scheduler contains ebtree
of children streams. This tree is used to find the
most priority stream.
By default we use delayed queuing of skbs to
socket write queue, that allows higher priority
data to be placed ahead of lower priority data.
Setting "latency_optimized_write" option to
"false" disable this behaviour.
During working on broken tests, the following
problems were identified and fixed:
- we should zeroed skb->cb before pass skb
  to the kernel, because we use it for our
  purpose.
- we should call `ss_skb_init_for_xmit` before
  setting mark and tls type to prevent it possible
  zeroed in this function.
- HTTP2_STREAM_REM_CLOSED state is a state when we
  send RST_STREAM or END_STREAM flag to remote peer
  from any state other then open or local-half closed.
  (This state need to handle a situation when stream
  had been already closed on server side, but the
  client is not aware about that yet). Rework stream
  processing according this description.
- STREAM_RECV_PROCESS set ctx state to
  HTTP2_IGNORE_FRAME_DATA in case when stream processing
  fails. But we don't take it into account in our state
  machine. Fix this problem (see changes for state
  machine HTTP2_RECV_DATA, HTTP2_RECV_HEADER,
  HTTP2_RECV_CONT).
- Codestyle fixed
- Remove `tfw_h2_destroy_stream_send_queue` because we
  already iterate over all streams, when we close the
  connection and free all associated resources. So we
  don't need special function and should destroy stream
  send queue when we free all other resourses.
- Change algorithm of calculation of available send
  window. Now we calculate sender and receiver window
  and select the smallest of them.
- Do not change Conn_Closing value.
- Do not setup sk_fill_write_queue for not HTTP2
  connections. We use this function to make frames
  according current available window. For HTTP1 we
  don't need to make frames and don't need to call
  this function.
- Calculate stream deficit using precomputed values
  from the table not using devision, because devision
  is very slow operation.
- Remove unnecessary macros XMIT_FSM_JMP
- Move comment about `tfw_http_msg_linear_transform`
  before function definition
- Use memset instead of bzero_fast to zeroed skb->sb
- Add comments
According review we decide to remove this
option, because this option is true by default
and there is no sense to disable implement
ability to disable it.
Use BITMAP instead of array of bool values
to save settings which we should apply.
Previously we send error response and close
connection after sending all pending data.
But we should not send responses in case of
connection level error (when we close connection
after error response). This patch fix this problem.
Also decrease size and remove holes from some
structures.
- Ajust counf of skbs, which were previously (during making
  frames) pushed to socket write queue. In `tcp_write_xmit`
  main loop `cong_win` is calculated on each loop iteration
  and if we calculate `cong_win` for making frames without
  taking into account previously pushed skbs we push more
  data into socket write queue then we can send.
- Other frames (from any stream) MUST NOT occur between
  the HEADERS frame and any CONTINUATION frames that might
  follow. So add such frames to postponed list and send
  them later.
In case of TCP segmentation we can catch ENOMEM
when we try to encrypt several skb. The reason
is that we don't limit count of skb for encryption
only the total len. But if the len of each skb is
equal to 1, count of nents will be equal to
TLS_MAX_PAYLOAD_SIZE (16384) and sgt list allocation
fails. Probably a lot of memory allocation errors
in TCP segmentation tests causes because of this
problem.
Pass ss_action as a new argument to `sk_fill_write_queue`.
Send TCP shutdown if ss_action is SS_SHUTDOWN (function
is called from `ss_shutdown`) and there is no pending data
in our scheduler and do not send it if it is called from
`ss_do_close`.
Purge stream send queue and do not terminate connection
in case of STREAM_FSM_RES_TERM_STREAM error.
Also free stream->xmit.postponed skbs when connection
is closed.
We initialize http2 context in `tfw_tls_over`, but
if this function was not called, because tls handshake
fails we clean uninitialized context, that can lead to
kernel BUG. So we should clean tls context only if
tls->state is equal to TTLS_HANDSHAKE_OVER.
}

/*
* Linux kernel caclule count of out packets on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Linux kernel caclule count of out packets on
* Linux kernel calculates the count of out packets on

* This value is used to calculate cwnd_quota and
* break the loop if it is exceeded. We need to
* adjust count of packets_out which will be added
* later here, to recalculate cwnd_couta in our code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* later here, to recalculate cwnd_couta in our code.
* later here, to recalculate cwnd_quota in our code.

/*
* if *tls_record_len is equal to zero, that means that here we deal with
* new tls record. During tls record encryption extra TLS_MAX_OVERHEAD
* data can be added and we should adjuct it during snd_wnd calculation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* data can be added and we should adjuct it during snd_wnd calculation.
* data can be added and we should adjust it during snd_wnd calculation.

#define CALC_SND_WND_AND_SET_FRAME_TYPE(type) \
do { \
if (*tls_record_len == 0) \
(*not_account_in_flight += 2); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(*not_account_in_flight += 2); \
*not_account_in_flight += 2; \

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we always increment the number of in-flight segments to 2? Can we have only 1 in-flight segment?

if (WARN_ON_ONCE(limit <= FRAME_HEADER_SIZE))
return -EINVAL;
/*
* if *tls_record_len is equal to zero, that means that here we deal with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* if *tls_record_len is equal to zero, that means that here we deal with
* if *tls_record_len is equal to zero, then here we deal with

@@ -20,21 +20,19 @@
#ifndef __HTTP_STREAM__
#define __HTTP_STREAM__

#include <linux/rbtree.h>

#include "http_stream_sched.h"
#include "msg.h"
#include "http_parser.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a super-mega important, but it's nice to place headers in lexographical order - it's easier to find a header among many includes

stream->xmit.frame_length;
struct sk_buff *skb;

while(len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while(len) {
while (len) {

T_DBG3("%s: ctx [%p] conn %p sched %p\n", __func__, ctx, conn, sched);

tfw_h2_remove_idle_streams(ctx, UINT_MAX);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously there was tfw_h2_destroy_stream_send_queue(), but it's now removed. It seems you don't need to call tfw_h2_stream_sched_remove() because the schedule is going to be cleaned up with the HTTP/2 context. But why don't you call tfw_h2_stream_purge_send_queue() now? Where the pending skbs are removed from all the streams?

Also I see that tfw_h2_stream_purge_send_queue() is only called from tfw_h2_insert_frame_header(), which confuses me - it seems nobody frees the skbs now and we have a memleak now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tfw_h2_stream_purge_all_and_free_response clear all possible unsent skbs and response if present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tfw_h2_stream_purge_send_queue now clear only headers and data frames, but goaway|tls alert|rst stream not cleared. They still should be sent

/*
* Here we move children of dep scheduler to the current stream
* scheduler. If current stream scheduler has no children we move
* dep children as is (saving there deficit in the priority WFQ).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The H2O is just a some implementation, which can be correct or not, but it'd be good to reference the initial algorithm (typically an academic paper). In my understanding this is a mistake to call this algorithm WFQ, which don't operate with any deficits at all.

fw/http_stream_sched.c Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants