Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Remote port forwarding buffers can grow without limits (issue #658) #913

Merged
merged 3 commits into from
Nov 21, 2023

Conversation

rasantel
Copy link
Contributor

@rasantel rasantel commented Nov 21, 2023

This PR addresses #658.

Problem

The Buffer class can grow out of control when its consumer is unable to read data quickly enough to keep up with the producer for an extended period of time. This results in an out of memory (OOM) condition during ssh port forwarding. This issue has been hit in production at Delphix many times.

Diagnosis

When ChannelInputStream receives new data from its producer, it calls Buffer.putRawBytes which will double its internal array if there is not enough room to store data starting at its current write position wpos. Meanwhile, when the consumer reads data from ChannelInputStream, the buffer will advance its read position rpos to the next unread byte. If, and only if, the read position catches up with the write position, ChannelInputStream will call Buffer.clear to reset both positions to 0.

However, if for an extended amount of time the consumer is slower than the producer and the latter produces a large amount of data, the buffer will continue to double itself until it cannot fit in memory anymore, causing an OOM.

Solution

The key issue with this behavior of Buffer is that the amount of data between rpos and wpos can be relatively small compared to the space in the array before rpos, which is being wasted. Yet, as long as rpos doesn't catch up with wpos (i.e. as long as Buffer.clear does not get called), Buffer will continue to add new data after wpos and grow the array past its current end -- no matter how much space before rpos is available.

This PR proposes to use a circular buffer instead, so that the space before rpos is no longer wasted and the buffer no longer needs to grow beyond any limits while utilizing only a fraction of the necessary space. Also, the circular buffer can grow just like Buffer when more capacity is needed, but now it throws an exception when a configurable size limit is reached instead of potentially triggering an OOM (a scenario that is a lot less likely anyway because the space before rpos is now being utilized).

At first, I thought of refactoring the entire Buffer class to make it circular, but there are too many uses of Buffer in the code base, too many operations in the interface of Buffer whose implementation would need to change, and perhaps more critically too many instances of the internal data, rpos and wpos fields being exposed via getters to outside manipulation (e.g. PacketReader.readPacket which calls Buffer.array to get the internal data array) to make this practical and safe from side effects.

My proposed solution instead is to focus on the ChannelInputStream use case of Buffer, which implements only a subset of all Buffer methods, and create a new CircularBuffer class that implements only what ChannelInputStream needs, while solving the race condition that causes this OOM. As far as I can tell, ChannelInputStream appears to be the only use of Buffer where this race condition can happen. We use SSHJ at Delphix for all SSH needs and we have only hit this issue in that use case. (If other parts of SSHJ turn out to be affected, we can always expand CircularBuffer later to be used there as well.)

Testing in Controlled Environment

In a controlled environment, the issue can be reproduced by placing a debugger breakpoint at ChannelInputStream.read to temporarily stop the consumption of data (which moves rpos closer to wpos) while ChannelInputStream.receive continues to receive new data (which moves wpos farther from rpos and eventually forces buffer expansions).

After this fix, we can see with this breakpoint that the underlying array is being fully utilized and that it doesn't get expanded until there is truly no more space in it.

Just for reference, I include in this diff the test RemotePFPerformanceTest which allowed me to reproduce the issue locally. It's disabled, however, and I might just remove it in the end, because of its heavy impact on the performance of the system running it.

Testing in Production

In production, we hit this issue by using port forwarding to read a very large file on one end and write it to disk on the other end, usually when disk writes are slowed down by other processes also writing to it.

After applying this patch on top of SSHJ, we have been running the fix for almost three years at Delphix (including re-applying it on top of each new release of SSHJ). We haven't hit this issue again and we haven't experienced any negative side effects.

@rasantel rasantel requested a review from hierynomus as a code owner November 21, 2023 17:29
@hierynomus
Copy link
Owner

Wow, now that's what I call battle-hardened, running it for 2 years before submitting the PR! The code looks great. Nothing stands out AFACT. I'm fine with keeping the test for now and having it @Disabled.

Thanks a lot for the extended diagnosis and PR.

@hierynomus hierynomus merged commit 1c54788 into hierynomus:master Nov 21, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants