Conversation
stsievert
left a comment
There was a problem hiding this comment.
Thanks for the contribution! At first glance, this looks pretty good. I'd appreciate some more review.
All these comments and linting errors are minor. If you want to run the linting yourself, look at https://ml.dask.org/contributing.html#style
Co-authored-by: Scott Sievert <stsievert@users.noreply.github.com>
Co-authored-by: Scott Sievert <stsievert@users.noreply.github.com>
|
ok, I am ready for more feedback. I should mention that I am running |
|
There are a couple of extra things I want to point out. While scikit-learn returns a scipy.sparse.coo_array,I return a sparse.COO. I don't know how well the rest of dask-ml plays with sparse.COO, and I am not sure how committed to 1 to 1 matching scikit-learn dask-ml is. I have also seen two other implementations of '_handle_zeros_in_scale' in the dask-ml code base. I would have used the existing implementations but none used epsilon to catch near zero values. |
stsievert
left a comment
There was a problem hiding this comment.
You've also got some style errors. Could you look at run the linting commands at https://ml.dask.org/contributing.html#style ?
|
@stsievert what else would you need for this pr to be merged/approved? |
|
@stsievert are you willing to take a second look at this, hoping to push this push this through. If you are busy, would you happen to know of another maintiner we can bring in? |
|
This looks good. A couple suggestions:
|
|
I have added the 'dask_ml.utils.assert_estimator_equal' to the tests. Regarding linting, it seems that flake8 and black are in conflict. Would you like me to prioritize one over the other? |
Hi,
I am trying to add TfidfTransformer to dask. I am going to assume that adding TfidfVectorizer can be easily done later down the line, by stacking CountVectorizer and TfidfTransformer .
This is not the first attempt to add tfidf to dask-ml(https://github.com/dask/dask-ml/pull/869/files#diff-421c76129d7900a3ae83237eab5785858c0652d9f550d92e0202e45ebdcb2977). I can't comment on the differences between my attempt and previous attempts; but I can guaranty that my implementation only use's the dask array api, no other api.