-
Notifications
You must be signed in to change notification settings - Fork 520
[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit fixes the load_numpy string performance test case. It keeps the CLI usage consistent with the other test cases, but converts the dtype to the correct one internally.
This commit fixes the array boolean convert breaking for row numbers above 64. It also adds a bit more error handling to prevent cases like this in the future.
This commit parallelizes the column processing in the pandas DataFrame to FrameBlock conversion.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2154 +/- ##
============================================
+ Coverage 72.03% 72.33% +0.30%
- Complexity 43937 44211 +274
============================================
Files 1441 1443 +2
Lines 166106 166353 +247
Branches 32428 32477 +49
============================================
+ Hits 119655 120334 +679
+ Misses 37199 36789 -410
+ Partials 9252 9230 -22 ☔ View full report in Codecov by Sentry. |
|
LGTM! Thank you for your contribution @Nakroma! |
|
LGTM as well. How did you measure the time? |
|
@Baunsgaard I used the IO benchmark scripts for the figure provided above: https://github.com/apache/systemds/blob/main/scripts/perftest/runAllIO.sh |
Great! Then you can get better numbers:
Modify the script to start the context not in the 'run' part, instead move it to the 'setup' part, and remember to shut down the system with ctx.close() after you are done measuring. |
|
@Baunsgaard Okey yeah that makes sense - pushed a commit for that 👍 I didnt move it to the setup but rather inside the global context, so .close() time is not included in the timing and also to support the |
what are the times then? |
So there is some more constant time, the building of the frameblock a few lines before the I looked at the profiling a bit more and it seems like most time is spent on socket communication between Java and Python. My assumption would be that this adds quite a bit of overhead and doesn't parallelize well. |
We could put the allocation into the parallel transfer call for each column?
I was afraid this was the case, Maybe there is a way to set the thread pool to use on the java side to make it better at receiving requests in parallel. However, that is probably just a dream. |
This commit moves the assignment of column data to the FrameBlock to the parallel column processing.
|
LGTM, I have now merged it. While merging I played around with parallelizing the Python API and found out that it spawns a thread per connection. There is indeed an overhead in this connection, but it is not the main problem. To make the transfer more efficient we could:
I see around 20% utilization of my CPU when transferring 10k by 10k integer matrices, so there is room for improvement. |


Draft for the student project
SYSTEMDS-3548.Current contributions: