
Conversation

@Nakroma
Contributor

Nakroma commented Dec 17, 2024

Draft for the student project SYSTEMDS-3548.

Current contributions:

  • Fixes some minor bugs related to the performance tests
  • Parallelizes pandas_to_frame_block column processing (see the figure below for the speedup, tested on my machine)

[Figure: load_pandas benchmark results]

This commit fixes the load_numpy string performance test case. It keeps the CLI usage consistent with the other test cases, but converts the dtype to the correct one internally.
This commit fixes the boolean array conversion breaking for row counts above 64. It also adds a bit more error handling to prevent cases like this in the future.
This commit parallelizes the column processing in the pandas DataFrame to FrameBlock conversion.
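
(A minimal sketch of the parallelization idea, not the actual SystemDS converter code; convert_column is a hypothetical stand-in for the per-column work done in pandas_to_frame_block:)

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def convert_column(idx, series):
    # Stand-in for the per-column Python-to-Java transfer; in SystemDS the
    # column data ends up in a JVM-side FrameBlock via Py4J.
    return idx, series.to_numpy()


def pandas_to_frame_block_parallel(df: pd.DataFrame):
    # Process the columns concurrently instead of in a sequential loop.
    # Threads suffice here because most of the time is spent in socket I/O
    # to the JVM rather than in Python bytecode.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert_column,
                             range(df.shape[1]),
                             (df[c] for c in df.columns)))
```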
@codecov

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.33%. Comparing base (d3fcfb1) to head (6b1f68c).
Report is 16 commits behind head on main.

Files with missing lines                                 Patch %   Lines
.../apache/sysds/runtime/util/Py4jConverterUtils.java   20.00%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2154      +/-   ##
============================================
+ Coverage     72.03%   72.33%   +0.30%     
- Complexity    43937    44211     +274     
============================================
  Files          1441     1443       +2     
  Lines        166106   166353     +247     
  Branches      32428    32477      +49     
============================================
+ Hits         119655   120334     +679     
+ Misses        37199    36789     -410     
+ Partials       9252     9230      -22     


Nakroma marked this pull request as ready for review December 18, 2024 14:04
@christinadionysio
Contributor

LGTM! Thank you for your contribution @Nakroma!

@Baunsgaard
Contributor

LGTM as well.

How did you measure the time?
Does it include the startup time of the system?

@Nakroma
Contributor Author

Nakroma commented Dec 18, 2024

@Baunsgaard I used the IO benchmark scripts for the figure provided above:

https://github.com/apache/systemds/blob/main/scripts/perftest/runAllIO.sh
https://github.com/apache/systemds/blob/main/scripts/perftest/python/io/load_pandas.py

@Baunsgaard
Contributor

Great!

Then you can get better numbers:

Modify the script so that the context is started in the 'setup' part instead of the 'run' part, and remember to shut down the system with ctx.close() after you are done measuring.
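
(A minimal sketch of what this suggestion could look like, assuming a timeit-based benchmark; from_pandas and the surrounding structure are illustrative rather than the exact load_pandas.py code:)

```python
import timeit

import pandas as pd
from systemds.context import SystemDSContext

df = pd.DataFrame({"a": range(1_000_000)})

ctx = SystemDSContext()  # startup happens once, outside the measured section
try:
    def run():
        # only the DataFrame -> FrameBlock transfer (and compute) is measured
        ctx.from_pandas(df).compute()

    print(timeit.timeit(run, number=10))
finally:
    ctx.close()  # shut the backend down after the measurement is done
```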

@Nakroma
Contributor Author

Nakroma commented Dec 19, 2024

@Baunsgaard Okay, yeah, that makes sense - pushed a commit for that 👍 I didn't move it into the setup but rather inside the global context, so the .close() time is not included in the timing, and the args.number parameter is still supported.
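
(A rough sketch of the arrangement described above, with illustrative argument handling; the actual script may differ:)

```python
import argparse
import timeit

import pandas as pd
from systemds.context import SystemDSContext

parser = argparse.ArgumentParser()
parser.add_argument("--number", type=int, default=10)  # illustrative flag
args = parser.parse_args()

df = pd.DataFrame({"a": range(1_000_000)})

# The context is opened once around the whole measurement, so neither the JVM
# startup nor ctx.close() ends up inside the timed statement.
with SystemDSContext() as ctx:
    total = timeit.timeit(lambda: ctx.from_pandas(df).compute(),
                          number=args.number)
    print(total / args.number)
```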

@Baunsgaard
Contributor

@Baunsgaard Okay, yeah, that makes sense - pushed a commit for that 👍 I didn't move it into the setup but rather inside the global context, so the .close() time is not included in the timing, and the args.number parameter is still supported.

What are the times then?

@Nakroma
Contributor Author

Nakroma commented Dec 19, 2024

What are the times then?

Seems to be about a difference of 1-2s, at least on my local machine.

[Figure: load_pandas benchmark timings]

@Baunsgaard
Contributor

What are the times then?

Seems to be about a difference of 1-2s, at least on my local machine.

[Figure: load_pandas benchmark timings]

60% speedup on int32 and 100% on int64 is great!
However, it seems to me like something else is taking up time in your results. I would expect a speedup closer to the number of cores in your system.

@Nakroma
Contributor Author

Nakroma commented Dec 19, 2024

60% speedup on int32 and 100% on int64 is great! However, it seems to me like something else is taking up time in your results. I would expect a speedup closer to the number of cores in your system.

So there is some additional constant time; for example, building the FrameBlock a few lines before the .convert calls takes around 400ms.

I looked at the profiling a bit more and it seems like most time is spent on socket communication between Java and Python. My assumption would be that this adds quite a bit of overhead and doesn't parallelize well.
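
(One way such a profile could be taken, as an illustration rather than the exact set-up used here:)

```python
import cProfile
import pstats

import pandas as pd
from systemds.context import SystemDSContext

df = pd.DataFrame({"a": range(1_000_000)})

with SystemDSContext() as ctx:
    profiler = cProfile.Profile()
    profiler.enable()
    ctx.from_pandas(df).compute()  # a single transfer is enough to profile
    profiler.disable()

# Sorting by cumulative time shows whether Py4J socket send/recv dominates.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```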

@Baunsgaard
Contributor

So there is some additional constant time; for example, building the FrameBlock a few lines before the .convert calls takes around 400ms.

We could put the allocation into the parallel transfer call for each column?

I looked at the profiling a bit more and it seems like most time is spent on socket communication between Java and Python. My assumption would be that this adds quite a bit of overhead and doesn't parallelize well.

I was afraid this was the case. Maybe there is a way to configure which thread pool to use on the Java side to make it better at receiving requests in parallel. However, that is probably just a dream.

This commit moves the assignment of column data to the FrameBlock into the parallel column processing.
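
(A rough sketch of this change, assuming a hypothetical set_column call for the JVM-side assignment; each worker now converts and assigns its own column instead of leaving assignment to a sequential pass:)

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def convert_and_assign(frame_block, idx, series):
    data = series.to_numpy()            # per-column conversion
    frame_block.set_column(idx, data)   # assignment now happens in the worker


def pandas_to_frame_block(frame_block, df: pd.DataFrame):
    # No separate assignment loop is needed after the parallel section.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda i: convert_and_assign(frame_block, i, df.iloc[:, i]),
                      range(df.shape[1])))
```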
@Baunsgaard
Contributor

LGTM, I have now merged it.

While merging I played around with parallelizing the Python API and found out that it spawns a thread per connection. There is indeed some per-connection overhead, but it is not the main problem.

To make the transfer more efficient we could:

  1. Reduce the number of calls by fusing many operations into single calls to Java.
  2. Reduce the current serialization bottleneck by slicing the array into many smaller byte arrays when sending it over (a rough sketch follows below).

I see around 20% utilization of my CPU when transferring 10k by 10k integer matrices, so there is room for improvement.
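
(An illustration of idea 2 above, not the SystemDS transfer code: the buffer is cut into fixed-size byte slices that could each be sent to the JVM separately:)

```python
import numpy as np


def byte_chunks(arr: np.ndarray, chunk_bytes: int = 4 * 1024 * 1024):
    # Yield the array's raw bytes in fixed-size slices so each slice could be
    # handed over the connection as its own (smaller) byte array.
    buf = arr.tobytes()
    for start in range(0, len(buf), chunk_bytes):
        yield buf[start:start + chunk_bytes]


mat = np.arange(1_000 * 1_000, dtype=np.int32).reshape(1_000, 1_000)
for i, chunk in enumerate(byte_chunks(mat)):
    # here each chunk would be sent over the Py4J connection separately
    print(i, len(chunk))
```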
