[Re-opened elsewhere] Handle nullable types and empty partitions before Dask-ML predict #783
sarahyurick wants to merge 2 commits into dask-contrib:datafusion-sql-planner
Conversation
delayed_model = [delayed(model.fit)(x_p, y_p) for x_p, y_p in zip(X_d, y_d)]
model = delayed_model[0].compute()
model = ParallelPostFit(estimator=model)
output_meta = np.array([])
With this, output_meta is always []. Should this maybe be in some sort of try/except block since we're only handling the CPU case?
I don't think we can just hardcode the meta to be np.array([]). We also use cuML for this case, and that outputs a cuDF Series.
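One way to avoid the hardcoded meta is to derive it from the fitted estimator itself, by predicting on a tiny non-empty sample and keeping an empty slice. A minimal sketch (`DummyEstimator` and `infer_predict_meta` are hypothetical stand-ins, not Dask-ML API):

```python
import numpy as np

class DummyEstimator:
    """Stand-in for a fitted CPU (sklearn) or GPU (cuML) estimator."""
    def predict(self, X):
        return np.zeros(len(X))

def infer_predict_meta(estimator, nonempty_sample):
    # Instead of hardcoding np.array([]), predict on a small non-empty
    # sample and keep an empty slice of the result. The meta then has
    # the right container type and dtype whether the estimator returns
    # a NumPy array (sklearn) or a cuDF Series (cuML).
    sample_pred = estimator.predict(nonempty_sample)
    return sample_pred[:0]

meta = infer_predict_meta(DummyEstimator(), np.ones((2, 3)))
print(type(meta).__name__, meta.shape)  # ndarray (0,)
```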
prediction = model.predict(df[training_columns])
part = df[training_columns]
output_meta = model.predict_meta
if part.shape[0].compute() == 0 and output_meta is not None:
compute() is needed on the Delayed object to get the number of rows in the partition. I believe that right now, output_meta will always be []?
You don't need to compute to do this; we can do it lazily too.
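For example, the empty check can live inside the function that runs on each concrete partition, so the row count is known at task execution time and no eager compute() on the collection is needed. A sketch with hypothetical stand-ins (`predict_partition`, `DummyModel`):

```python
import numpy as np

def predict_partition(part, estimator, predict_meta):
    # Runs inside each task, where `part` is a concrete partition,
    # so part.shape[0] is a plain integer -- no .compute() required.
    if part.shape[0] == 0 and predict_meta is not None:
        return predict_meta  # empty output with the correct type/dtype
    return estimator.predict(part)

class DummyModel:
    """Stand-in for a fitted estimator."""
    def predict(self, X):
        return np.ones(len(X))

# Simulate a non-empty and an empty partition:
print(predict_partition(np.ones((4, 2)), DummyModel(), np.array([])).shape)  # (4,)
print(predict_partition(np.ones((0, 2)), DummyModel(), np.array([])).shape)  # (0,)
```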
df = context.sql(sql_select)
prediction = model.predict(df[training_columns])
part = df[training_columns]
output_meta = model.predict_meta
AttributeError: 'KMeans' object has no attribute 'predict_meta'
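Since predict_meta exists only on Dask-ML's ParallelPostFit wrapper and not on raw estimators like sklearn's KMeans, one option is to guard the attribute access with a default (the stand-in classes below are hypothetical):

```python
class RawEstimator:
    """Stand-in for a raw estimator such as sklearn's KMeans."""
    pass

class WrappedEstimator:
    """Stand-in for dask_ml.wrappers.ParallelPostFit."""
    predict_meta = None

# getattr with a default avoids the AttributeError on raw estimators:
print(getattr(RawEstimator(), "predict_meta", "missing"))      # missing
print(getattr(WrappedEstimator(), "predict_meta", "missing"))  # None
```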
VibhuJawa
left a comment
We should not hardcode any meta values, and should only handle the case when the model is a ParallelPostFit.
empty_output = self.handle_empty_partitions(output_meta)
if empty_output is not None:
    return empty_output
prediction = model.predict(part)
We should wrap the predict like the following, only for cases when we have a ParallelPostFit model.

if isinstance(model, ParallelPostFit):
    predict_meta = model.predict_meta
    if predict_meta is None:
        predict_meta = model.estimator.predict(part._meta_nonempty)
    prediction = part.map_partitions(
        _predict, predict_meta, model.estimator, meta=predict_meta
    )

def _predict(part, predict_meta, estimator):
    if part.shape[0] == 0 and predict_meta is not None:
        return handle_empty_partitions(predict_meta)
    return estimator.predict(part)
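The dispatch pattern suggested above can be exercised without Dask using stand-in classes. Everything below (`FakeParallelPostFit`, `FakeEstimator`, `predict_with_meta`) is a hypothetical sketch, not the project's code:

```python
import numpy as np

class FakeParallelPostFit:
    """Stand-in for dask_ml.wrappers.ParallelPostFit."""
    def __init__(self, estimator, predict_meta=None):
        self.estimator = estimator
        self.predict_meta = predict_meta

class FakeEstimator:
    """Stand-in for a fitted estimator."""
    def predict(self, X):
        return np.zeros(len(X))

def predict_with_meta(model, part):
    # Only the wrapped case carries meta; raw estimators predict directly.
    if isinstance(model, FakeParallelPostFit):
        predict_meta = model.predict_meta
        if predict_meta is None:
            # Infer meta by predicting on a tiny sample, as with
            # _meta_nonempty in the real Dask code path.
            predict_meta = model.estimator.predict(part[:1])[:0]
        if part.shape[0] == 0:
            return predict_meta
        return model.estimator.predict(part)
    return model.predict(part)

model = FakeParallelPostFit(FakeEstimator())
print(predict_with_meta(model, np.ones((3, 2))).shape)  # (3,)
print(predict_with_meta(model, np.ones((0, 2))).shape)  # (0,)
```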
VibhuJawa
left a comment
Please add more tests
Tagging @randerzander and @VibhuJawa

Changes in create_model.py handle nullable types, such as for this example:

Changes in predict.py handle empty partitions, modeled on this Dask-ML PR: dask/dask-ml#912
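As background on the nullable-types side, pandas extension dtypes such as Int64 can trip up estimators that expect plain NumPy dtypes. A hedged sketch of a pre-fit conversion step (`to_numpy_dtypes` is a hypothetical helper, not this PR's code):

```python
import numpy as np
import pandas as pd

def to_numpy_dtypes(df):
    # Estimators used via Dask-ML generally expect plain NumPy dtypes;
    # pandas nullable extension types (Int64, boolean, ...) can break
    # them. Cast extension-typed columns to float64 so <NA> becomes NaN.
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_extension_array_dtype(out[col].dtype):
            out[col] = out[col].astype("float64")
    return out

df = pd.DataFrame({"x": pd.array([1, 2, None], dtype="Int64")})
print(to_numpy_dtypes(df)["x"].dtype)  # float64
```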