
Conversation

Collaborator

@chalmerlowe chalmerlowe commented Jan 6, 2026

Description

This PR adds a timeout parameter to the to_dataframe() and to_arrow() methods (and their corresponding *_iterable, *_geodataframe and QueryJob wrappers) in the BigQuery client library.

This addresses an issue where these methods could hang indefinitely if the underlying BigQuery Storage API stream blocked (e.g., due to firewall issues or network interruptions) during the download phase. The added timeout parameter ensures that the download operation respects the specified time limit and raises a concurrent.futures.TimeoutError if it exceeds the duration.

Changes

  • Modified google/cloud/bigquery/_pandas_helpers.py:
    • Updated _download_table_bqstorage to accept a timeout argument.
    • Implemented a timeout check within the result processing loop.
    • Updated wrapper functions download_dataframe_bqstorage and download_arrow_bqstorage to accept and pass the timeout parameter.
  • Modified google/cloud/bigquery/table.py:
    • Updated RowIterator methods (to_arrow_iterable, to_arrow, to_dataframe_iterable, to_dataframe, to_geodataframe) to accept and pass timeout.
    • Updated _EmptyRowIterator methods to match the RowIterator signature, preventing TypeError when a timeout is provided for empty result sets.
  • Modified google/cloud/bigquery/job/query.py:
    • Updated QueryJob methods (to_arrow, to_dataframe, to_geodataframe) to accept timeout and pass it to the result iterator.
  • Updated unit tests in tests/unit/job/test_query_pandas.py, tests/unit/test_table.py, and tests/unit/test_table_pandas.py to reflect the signature changes.

Fixes internal bug: b/468091307

@chalmerlowe chalmerlowe requested a review from a team as a code owner January 6, 2026 14:09
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery API. labels Jan 6, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @chalmerlowe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the BigQuery client library by integrating a timeout mechanism into its data retrieval methods, specifically to_dataframe and to_arrow. This change prevents operations from hanging indefinitely during data download from the BigQuery Storage API, offering users better control over execution duration and improving application stability in the face of network inconsistencies.

Highlights

  • Timeout Parameter: Introduced a timeout parameter to to_dataframe() and to_arrow() methods, and their corresponding iterable and QueryJob wrappers, in the BigQuery client library.
  • Indefinite Hang Prevention: Addresses an issue where data download operations could hang indefinitely due to underlying BigQuery Storage API stream blocking (e.g., from firewall issues or network interruptions).
  • Error Handling: The new timeout parameter ensures that a concurrent.futures.TimeoutError is raised if the download operation exceeds the specified time limit, preventing unresponsive applications.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a timeout parameter to the to_dataframe() and to_arrow() methods, which is a valuable addition to prevent indefinite hangs during data download. The core timeout logic in _download_table_bqstorage is well-implemented, correctly using manual thread pool management to handle timeouts gracefully.

I've identified a critical issue where the method signatures for _EmptyRowIterator in google/cloud/bigquery/table.py have not been updated to include the timeout parameter. This will cause a TypeError for empty result sets and needs to be addressed. I've left a detailed comment on this.

Additionally, I noticed that QueryJob.to_dataframe in google/cloud/bigquery/job/query.py is missing the timeout parameter, which is inconsistent with the other updated methods. Please consider adding it for API consistency.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jan 6, 2026
…ntiate new _EmptyRowIterator for each test in test_methods_w_timeout to prevent ValueError
- Remove total_rows property from _EmptyRowIterator to fix AssertionError
- Remove redundant __iter__ method in _EmptyRowIterator to fix lint error
# Call result() on any finished threads to raise any
# exceptions encountered.
future.result()
# Manually manage the pool to control shutdown behavior on timeout.
Collaborator Author


You may wonder why we switched away from using a context manager here.

Here's a breakdown of the change:

  1. Manual Pool Creation: Instead of the with statement, the code now creates the
    ThreadPoolExecutor directly:

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, total_streams))

  2. Conditional Shutdown: The finally block now uses a wait_on_shutdown boolean flag:

    pool.shutdown(wait=wait_on_shutdown)
    This flag is set to False only when a TimeoutError is raised.

  3. Timeout Handling: Inside the while not_done: loop, there's new logic to check if the
    elapsed time has exceeded the specified timeout. If it has, it raises a
    concurrent.futures.TimeoutError and sets wait_on_shutdown to False.

Why this change?

The standard context manager for ThreadPoolExecutor always waits for all futures to complete
upon exiting the block. This behavior is not desirable when a timeout is implemented. If the
download times out, we want to stop waiting for the worker threads immediately and not block
until they all finish.

By manually managing the pool and using the wait_on_shutdown flag, we can tell the
pool.shutdown() method not to wait for the threads to complete if a timeout has occurred.
This allows the TimeoutError to be propagated up quickly, rather than being stuck waiting for
threads that are potentially hanging.

So, the change was necessary to ensure the timeout parameter works effectively and the
function doesn't hang unnecessarily when the timeout duration is exceeded. The core logic of
submitting tasks to the pool and retrieving results from the queue remains very similar, but
the shutdown process is now more nuanced to handle timeouts.
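The pattern described above can be sketched in isolation. The following is a runnable simplification with hypothetical names (download_with_timeout, work_items, worker), not the library's actual _download_table_bqstorage code:

```python
import concurrent.futures
import time


def download_with_timeout(work_items, worker, timeout=None):
    """Run `worker` over `work_items` in a thread pool, enforcing a deadline."""
    start = time.monotonic()
    # Manual pool creation instead of a `with` block, so shutdown is controllable.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(work_items)))
    wait_on_shutdown = True
    results = []
    try:
        not_done = {pool.submit(worker, item) for item in work_items}
        while not_done:
            done, not_done = concurrent.futures.wait(
                not_done,
                timeout=0.1,
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            for future in done:
                # Call result() on finished threads to raise any exceptions.
                results.append(future.result())
            if timeout is not None and time.monotonic() - start > timeout:
                # Don't block on hung workers during shutdown.
                wait_on_shutdown = False
                raise concurrent.futures.TimeoutError(
                    f"Download timed out after {timeout} seconds."
                )
    finally:
        pool.shutdown(wait=wait_on_shutdown)
    return results
```

With wait=False, shutdown returns immediately and the TimeoutError propagates to the caller without waiting for the blocked worker threads.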

Contributor


Good catch!


pages = ()
total_rows = 0

Collaborator Author


A few words on why these were deleted/modified.

Summary of Changes to _EmptyRowIterator

In the _EmptyRowIterator class, the class-level attributes pages = () and total_rows = 0 were removed and replaced with instance-level and property-based handling.

Reasoning:

These class-level attributes were causing issues with how instances of _EmptyRowIterator were behaving, particularly in test scenarios.

  1. total_rows = 0: While seemingly correct, having this at the class level interfered with the instance-level _total_rows attribute inherited from the RowIterator parent class. The parent class expects to manage _total_rows at the instance level. Removing the class-level attribute allows the instance-level attribute to be a single source of truth, initialized to 0 in the __init__ method, and prevents potential conflicts or unexpected behavior.

  2. pages = (): This class-level attribute was not used and added unnecessary clutter. The _EmptyRowIterator is designed to represent an empty result set, so it never has any pages. The __iter__ method already correctly returns an empty iterator (iter(())). Additionally, to ensure methods like to_arrow_iterable function correctly without making network requests, a pages property was added to return an empty iterator.

Replacement and Improvement:

  • total_rows: The class-level total_rows = 0 was removed. Instead, self._total_rows is now consistently initialized to 0 within the __init__ method of _EmptyRowIterator. This aligns with the parent class's instance-level attribute management and ensures each instance has its own _total_rows set to 0, reflecting an empty iterator.

  • pages: The class-level pages = () was removed. To provide the necessary empty iterable for methods like to_arrow_iterable, a pages property was added:

    @property
    def pages(self):
        return iter(())

    This approach is cleaner as it explicitly provides an empty iterator when the pages attribute is accessed, making the class's intent clearer and ensuring compatibility with methods expecting an iterable of pages.

These changes make the _EmptyRowIterator more robust and predictable by adhering to the instance attribute patterns of its parent class and explicitly defining an empty pages iterable.

@chalmerlowe
Collaborator Author

We worked up a script to reproduce/simulate the customer's issue.

After updating the code, running the repro script gives the following results, depending on whether or not we pass an argument for the new timeout parameter.

--- Testing without timeout ---                                 
Calling to_dataframe()... (Expect hang for 10 seconds)          
   [MockStorage] read_rows called. Blocking for 10 seconds...   
   [MockStorage] read_rows finished.                            
to_dataframe() returned.                                        
Time taken: 10.21 seconds                                       
SUCCESS: Call hung as expected.                                 
Setting up mocks...                                             
                                                                
--- Testing with timeout (5 seconds) ---                        
Calling to_dataframe()... (Expect TimeoutError after ~5 seconds)
   [MockStorage] read_rows called. Blocking for 10 seconds...   
SUCCESS: Caught expected exception: Download timed out after 5.0 seconds.
Time taken: 5.01 seconds                                        
SUCCESS: Timeout occurred within the expected range.            
   [MockStorage] read_rows finished.              


@vchudnov-g vchudnov-g left a comment


This looks good to me. I'm not very knowledgeable about bigquery in particular, but the changes from the status quo look like they do what's intended.

I really appreciate your detailed explanations of the various changes. Those made the review easier.

@chalmerlowe chalmerlowe merged commit 4f67ba2 into main Jan 8, 2026
30 checks passed
@chalmerlowe chalmerlowe deleted the fix/to-dataframe-timeout branch January 8, 2026 23:25
