Skip to content

[agent] Cleanup unused, shimmed websocket connections after a configurable timeout.#166

Merged
ojarjur merged 1 commit intogoogle:masterfrom
hareeshv:mem-leak
Feb 24, 2026
Merged

[agent] Cleanup unused, shimmed websocket connections after a configurable timeout.#166
ojarjur merged 1 commit intogoogle:masterfrom
hareeshv:mem-leak

Conversation

@hareeshv
Copy link
Contributor

@hareeshv hareeshv commented Jan 23, 2026

This PR adds a configurable value for how long shimmed websocket connections can remain idle before they are garbage collected.

Previously, the underlying connection from the proxy agent to the backend server was only cleaned up when there was either an error while sending or receiving messages, or if the agent received an explicit close message.

When a client of shim protocol stopped communicating without first sending a close message, this caused the underlying connection between the proxy agent and backend server to be retained indefinitely. That, in turn, can result in a resource leak within the agent (and possibly in the backend server as well).

To account for that scenario, we needed to find some way to identify that a client of the shim protocol had stopped communicating, so that we could clean up those lost connections.

The shim protocol does not include any sort of ping or pong messages, but it does have polling requests sent by the client which can be used as a dead-man switch; once a sufficient amount of time has elapsed since the last such request, we can assume that the client is gone and the underlying connection can be removed.

This PR adds a configuration option to the agent to specify what that amount of time should be (with a default of 1 hour), and implements the agent-side cleanup logic to identify and close expired connections.

Copy link
Collaborator

@ojarjur ojarjur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

There are some formatting issues that need to be addressed, but also please update the PR description to give background on what the issue is (connections leaking of the close message gets dropped) and how this PR addresses it.

Thanks!

@hareeshv hareeshv force-pushed the mem-leak branch 2 times, most recently from a045972 to d24d23c Compare January 29, 2026 11:33
// The server messages channel has been closed.
return nil, fmt.Errorf("attempt to read a server message from a closed websocket connection")
}
conn.updateActivity()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs to be called at the beginning of this method (i.e. on line 258).

Otherwise, a connection that the client is continuously polling on, but where the server has not yet responded, will be closed as inactive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be removed now; it is no longer necessary with the updates on line 258 and 259

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout here needs to be the minimum of 20 seconds and some amount less than the configured idle timeout (e.g. half of the idle timeout).

That way, a polling request from an actively connected client will timeout and be re-run before the activity tracking logic decides that the connection is idle.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could put in some sort of check to enforce that the idle timeout is greater than 20 seconds (e.g. setting a minimum of 30 seconds).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Omar,
are you suggesting to implement the above fix for case <-time.After(time.Second * 20): ?
and we should parameterized the timeout value?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying is that the timeout used here (on line 279) needs to be smaller than the timeout used for detecting idle connections.

The issue is that the client might have a connection open and be waiting for a message from the server.

If that happens, then the connection activity is only updated at the start of each polling request from the client.

However, the polling request will be held open as long as the timeout here on line 279.

So, if the idle timeout for connections is not longer than the timeout on line 279, then connection will appear to be idle even though there is a client actively polling it.

To prevent that, we need to do something to make sure that the timeout used on line 279 is shorter than the timeout used for idleness detection.

This can either setting a lower bound on the timeout for idle detection, or making the timeout on line 279 shorter in the case that the idle detection timeout is not longer than 20 seconds.

We also want some buffer between the two timeouts; i.e. the timeout on line 279 needs to be strictly shorter than the timeout used for idle detection.

Finally, we need to add a test case that exercises this scenario to verify that our change doesn't cause the connections to be prematurely pruned.

Copy link
Collaborator

@ojarjur ojarjur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks good aside from the one extraneous line that can be removed now (commented below)

// The server messages channel has been closed.
return nil, fmt.Errorf("attempt to read a server message from a closed websocket connection")
}
conn.updateActivity()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be removed now; it is no longer necessary with the updates on line 258 and 259

@skjasimul-lang
Copy link

All review comments are addressed ,please review.

@ojarjur
Copy link
Collaborator

ojarjur commented Feb 14, 2026

The build is failing due to a missing import.

@skjasimul-lang
Copy link

is there any way to trigger workflow , so we can get the test results as soon as possible.

@skjasimul-lang
Copy link

Hi Omar, can you please trigger the workflow and see if this looks good to you.

@ojarjur
Copy link
Collaborator

ojarjur commented Feb 24, 2026

Hi Omar, can you please trigger the workflow and see if this looks good to you.

Done, but the build is still broken.

The ba-proxy-agent currently experiences increasing memory consumption,
leading to daily or weekly restarts. Previous mitigation attempts were
insufficient, and the issue has been isolated to high memory usage on
the agent connection side.

This CL mitigates the issue by enforcing a timeout on idle connections.
Since the root cause remains elusive, forcefully closing unused
connections prevents memory accumulation from persistent links.

Implementation details:
* Introduced `lastActivityTime` property to agent connections, which
    updates upon usage.
* Added a background routine to monitor connection activity.
* Configured the routine to explicitly close connections that remain
    idle for more than 30 seconds.
Copy link
Collaborator

@ojarjur ojarjur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes are good now, but this still needs a proper PR description.

@ojarjur ojarjur changed the title handling agent memory leak [agent] Expire (and remove) unused, shimmed websocket connections after a configurable timeout. Feb 24, 2026
@ojarjur ojarjur changed the title [agent] Expire (and remove) unused, shimmed websocket connections after a configurable timeout. [agent] Cleanup unused, shimmed websocket connections after a configurable timeout. Feb 24, 2026
@ojarjur ojarjur merged commit 4124b20 into google:master Feb 24, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants