
Conversation

@davidjpyu

As we discussed earlier, there were concurrent Mistral implementations on both sides. After comparing them, there is no meaningful difference in mistral.py, mistral_layer.py, or templates.py, so only auto_model.py is updated here; it has been tested and produces correct output.

Although everything works fine, one thing to note is that generate.py runs correctly with DEVICE = "cuda:0" on line 20, but fails if that line is changed to another GPU, e.g. DEVICE = "cuda:1". The failure occurs at cache.py line 79:
hidden_states = flashinfer.single_prefill_with_kv_cache(
with the following error:
File "/home/zhuominc/anaconda3/envs/junpuy_test/lib/python3.10/site-packages/flashinfer/prefill.py", line 186, in single_prefill_with_kv_cache
packed_custom_mask = packbits(
File "/home/zhuominc/anaconda3/envs/junpuy_test/lib/python3.10/site-packages/flashinfer/quantization.py", line 65, in packbits
return _kernels.packbits(x, bitorder)
RuntimeError: PackBits failed with error code an illegal memory access was encountered

I'm not sure whether that is expected behavior or something we need to figure out.
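One hypothesis (not a confirmed root cause): flashinfer launches its kernels on the *current* CUDA device, so if the process never switches the current device away from cuda:0 while the q/k/v tensors live on cuda:1, the packbits kernel could touch memory on the wrong GPU and trigger exactly this kind of illegal memory access. Below is a minimal, self-contained sketch of that idea; the tensor shapes and the DEVICE constant are made up for illustration and are not taken from this repo.

```python
# Hypothetical repro / workaround sketch -- assumes the crash comes from a
# current-device mismatch, which has not been verified against this codebase.
import torch
import flashinfer

DEVICE = "cuda:1"  # the setting that triggers the crash in the report

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
q = torch.randn(128, num_qo_heads, head_dim, dtype=torch.float16, device=DEVICE)
k = torch.randn(128, num_kv_heads, head_dim, dtype=torch.float16, device=DEVICE)
v = torch.randn(128, num_kv_heads, head_dim, dtype=torch.float16, device=DEVICE)

# Pin the current CUDA device to the tensors' device before calling the
# kernel, so the launch happens on cuda:1 instead of the default cuda:0.
with torch.cuda.device(q.device):
    out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(out.shape)
```

If wrapping the cache.py call site in torch.cuda.device(...) (or calling torch.cuda.set_device(DEVICE) once in generate.py) makes the error go away, that would confirm the device-mismatch theory; if not, it is probably something inside flashinfer worth reporting upstream.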
