fix: add timeout parameter to to_dataframe and to_arrow met… #2354

chalmerlowe · 2026-01-06T14:09:11Z

Description

This PR adds a timeout parameter to the to_dataframe() and to_arrow() methods (and their corresponding *_iterable, *_geodataframe and QueryJob wrappers) in the BigQuery client library.

This addresses an issue where these methods could hang indefinitely if the underlying BigQuery Storage API stream blocked (e.g., due to firewall issues or network interruptions) during the download phase. The added timeout parameter ensures that the download operation respects the specified time limit and raises a concurrent.futures.TimeoutError if it exceeds the duration.

Changes

Modified google/cloud/bigquery/_pandas_helpers.py:
- Updated _download_table_bqstorage to accept a timeout argument.
- Implemented a timeout check within the result processing loop.
- Updated wrapper functions download_dataframe_bqstorage and download_arrow_bqstorage to accept and pass the timeout parameter.
Modified google/cloud/bigquery/table.py:
- Updated RowIterator methods (to_arrow_iterable, to_arrow, to_dataframe_iterable, to_dataframe, to_geodataframe) to accept and pass timeout.
- Updated _EmptyRowIterator methods to match the RowIterator signature, preventing TypeError when a timeout is provided for empty result sets.
Modified google/cloud/bigquery/job/query.py:
- Updated QueryJob methods (to_arrow, to_dataframe, to_geodataframe) to accept timeout and pass it to the result iterator.
Updated unit tests in tests/unit/job/test_query_pandas.py, tests/unit/test_table.py, and tests/unit/test_table_pandas.py to reflect the signature changes.

Fixes internal bug: b/468091307

…hods

gemini-code-assist · 2026-01-06T14:09:27Z

Summary of Changes

Hello @chalmerlowe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the BigQuery client library by integrating a timeout mechanism into its data retrieval methods, specifically to_dataframe and to_arrow. This change prevents operations from hanging indefinitely during data download from the BigQuery Storage API, offering users better control over execution duration and improving application stability in the face of network inconsistencies.

Highlights

Timeout Parameter: Introduced a timeout parameter to to_dataframe() and to_arrow() methods, and their corresponding iterable and QueryJob wrappers, in the BigQuery client library.
Indefinite Hang Prevention: Addresses an issue where data download operations could hang indefinitely due to underlying BigQuery Storage API stream blocking (e.g., from firewall issues or network interruptions).
Error Handling: The new timeout parameter ensures that a concurrent.futures.TimeoutError is raised if the download operation exceeds the specified time limit, preventing unresponsive applications.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a timeout parameter to the to_dataframe() and to_arrow() methods, which is a valuable addition to prevent indefinite hangs during data download. The core timeout logic in _download_table_bqstorage is well-implemented, correctly using manual thread pool management to handle timeouts gracefully.

I've identified a critical issue where the method signatures for _EmptyRowIterator in google/cloud/bigquery/table.py have not been updated to include the timeout parameter. This will cause a TypeError for empty result sets and needs to be addressed. I've left a detailed comment on this.

Additionally, I noticed that QueryJob.to_dataframe in google/cloud/bigquery/job/query.py is missing the timeout parameter, which is inconsistent with the other updated methods. Please consider adding it for API consistency.

google/cloud/bigquery/table.py

…natures and QueryJob.to_dataframe

…ntiate new _EmptyRowIterator for each test in test_methods_w_timeout to prevent ValueError\n- Remove total_rows property from _EmptyRowIterator to fix AssertionError\n- Remove redundant __iter__ method in _EmptyRowIterator to fix lint error

tests/unit/test_table.py

chalmerlowe · 2026-01-07T13:19:15Z

google/cloud/bigquery/_pandas_helpers.py

-                    # Call result() on any finished threads to raise any
-                    # exceptions encountered.
-                    future.result()
+    # Manually manage the pool to control shutdown behavior on timeout.


You may wonder why we switched away from using a contextmanager here.

Here's a breakdown of the change:

Manual Pool Creation: Instead of the with statement, the code now creates the
ThreadPoolExecutor directly:

pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, total_streams))

Conditional Shutdown: The finally block now uses a wait_on_shutdown boolean flag:

pool.shutdown(wait=wait_on_shutdown)
This flag is set to False only when a TimeoutError is raised.

Timeout Handling: Inside the while not_done: loop, there's new logic to check if the
elapsed time has exceeded the specified timeout. If it has, it raises a
concurrent.futures.TimeoutError and sets wait_on_shutdown to False.

Why this change?

The standard context manager for ThreadPoolExecutor always waits for all futures to complete
upon exiting the block. This behavior is not desirable when a timeout is implemented. If the
download times out, we want to stop waiting for the worker threads immediately and not block
until they all finish.

By manually managing the pool and using the wait_on_shutdown flag, we can tell the
pool.shutdown() method not to wait for the threads to complete if a timeout has occurred.
This allows the TimeoutError to be propagated up quickly, rather than being stuck waiting for
threads that are potentially hanging.

So, the change was necessary to ensure the timeout parameter works effectively and the
function doesn't hang unnecessarily when the timeout duration is exceeded. The core logic of
submitting tasks to the pool and retrieving results from the queue remains very similar, but
the shutdown process is now more nuanced to handle timeouts.

Good catch!

chalmerlowe · 2026-01-07T13:22:15Z

google/cloud/bigquery/table.py


-    pages = ()
-    total_rows = 0
-


A few words on why these were deleted/modified.

Summary of Changes to _EmptyRowIterator

In the _EmptyRowIterator class, the class-level attributes pages = () and total_rows = 0 were replaced with new handling.

Reasoning:

These class-level attributes were causing issues with how instances of _EmptyRowIterator were behaving, particularly in test scenarios.

total_rows = 0: While seemingly correct, having this at the class level interfered with the instance-level _total_rows attribute inherited from the RowIterator parent class. The parent class expects to manage _total_rows at the instance level. Removing the class-level attribute allows the instance-level attribute to be a single source of truth, initialized to 0 in the __init__ method, and prevents potential conflicts or unexpected behavior.

pages = (): This class-level attribute was not used and added unnecessary clutter. The _EmptyRowIterator is designed to represent an empty result set, so it never has any pages. The __iter__ method already correctly returns an empty iterator (iter(())). Additionally, to ensure methods like to_arrow_iterable function correctly without making network requests, a pages property was added to return an empty iterator.

Replacement and Improvement:

total_rows: The class-level total_rows = 0 was removed. Instead, self._total_rows is now consistently initialized to 0 within the __init__ method of _EmptyRowIterator. This aligns with the parent class's instance-level attribute management and ensures each instance has its own _total_rows set to 0, reflecting an empty iterator.

pages: The class-level pages = () was removed. To provide the necessary empty iterable for methods like to_arrow_iterable, a pages property was added:

@property def pages(self): return iter(())

This approach is cleaner as it explicitly provides an empty iterator when the pages attribute is accessed, making the class's intent clearer and ensuring compatibility with methods expecting an iterable of pages.

These changes make the _EmptyRowIterator more robust and predictable by adhering to the instance attribute patterns of its parent class and explicitly defining an empty pages iterable.

chalmerlowe · 2026-01-07T13:56:33Z

We worked up a script to reproduce/simulate the customer's issue.

After updating the code, when we run the repro script, we get the following results depending on whether we include an argument for the new timeout parameter OR not.

--- Testing without timeout ---                                 
Calling to_dataframe()... (Expect hang for 10 seconds)          
   [MockStorage] read_rows called. Blocking for 10 seconds...   
   [MockStorage] read_rows finished.                            
to_dataframe() returned.                                        
Time taken: 10.21 seconds                                       
SUCCESS: Call hung as expected.                                 
Setting up mocks...                                             
                                                                
--- Testing with timeout (5 seconds) ---                        
Calling to_dataframe()... (Expect TimeoutError after ~5 seconds)
   [MockStorage] read_rows called. Blocking for 10 seconds...   
SUCCESS: Caught expected exception: Download timed out after 5.0 seconds.
Time taken: 5.01 seconds                                        
SUCCESS: Timeout occurred within the expected range.            
   [MockStorage] read_rows finished.

vchudnov-g

This looks good to me. I'm not very knowledgeable about bigquery in particular, but the changes from the status quo look like they do what's intended.

I really appreciate your detailed explanations of the various changes. Those made the review easier.

fix(bigquery): add timeout parameter to to_dataframe and to_arrow met…

c908a5c

…hods

chalmerlowe requested a review from a team as a code owner January 6, 2026 14:09

blunderbuss-gcf bot assigned gkevinzheng Jan 6, 2026

product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery API. labels Jan 6, 2026

gemini-code-assist bot reviewed Jan 6, 2026

View reviewed changes

google/cloud/bigquery/table.py Show resolved Hide resolved

chalmerlowe added 3 commits January 6, 2026 09:25

fix(bigquery): address review feedback - update _EmptyRowIterator sig…

f57810c

…natures and QueryJob.to_dataframe

fix(bigquery): update tests to expect timeout parameter

031adf1

test(bigquery): add unit tests for timeout logic in _pandas_helpers.py

39febf9

product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jan 6, 2026

chalmerlowe commented Jan 7, 2026

View reviewed changes

tests/unit/test_table.py Outdated Show resolved Hide resolved

Update tests/unit/test_table.py

43e6a40

chalmerlowe commented Jan 7, 2026

View reviewed changes

vchudnov-g approved these changes Jan 7, 2026

View reviewed changes

Merge branch 'main' into fix/to-dataframe-timeout

5acd054

tswast approved these changes Jan 8, 2026

View reviewed changes

chalmerlowe merged commit 4f67ba2 into main Jan 8, 2026
30 checks passed

chalmerlowe deleted the fix/to-dataframe-timeout branch January 8, 2026 23:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add timeout parameter to to_dataframe and to_arrow met… #2354

fix: add timeout parameter to to_dataframe and to_arrow met… #2354

Uh oh!

chalmerlowe commented Jan 6, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Jan 6, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

chalmerlowe Jan 7, 2026

Uh oh!

tswast Jan 8, 2026

Uh oh!

chalmerlowe Jan 7, 2026

Uh oh!

chalmerlowe commented Jan 7, 2026

Uh oh!

vchudnov-g left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fix: add timeout parameter to to_dataframe and to_arrow met… #2354

fix: add timeout parameter to to_dataframe and to_arrow met… #2354

Uh oh!

Conversation

chalmerlowe commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Uh oh!

gemini-code-assist bot commented Jan 6, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

chalmerlowe Jan 7, 2026

Choose a reason for hiding this comment

Here's a breakdown of the change:

Why this change?

Uh oh!

tswast Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

chalmerlowe Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

chalmerlowe commented Jan 7, 2026

Uh oh!

vchudnov-g left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chalmerlowe commented Jan 6, 2026 •

edited

Loading