Description
I need a script that takes a UD corpus and splits it into parts. Ideally the split could be specified either as proportions (e.g. 10/90) or as sentence counts (e.g. 100/900). Other ideas are welcome too. I assumed something like this already exists, since the corpora are split at some point, but I didn't find anything. I don't know what the typical way to do the split is, but one useful option would be to choose whether sentences are selected randomly or taken consecutively from the top.
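To make the request concrete, here is a rough sketch of what I have in mind. It is not an existing script from this repository; the function names `read_conllu_sentences` and `split_sentences` and their parameters are made up for illustration. It assumes the corpus is in CoNLL-U format with sentences separated by blank lines, and it treats a float as a proportion and an int as a sentence count.

```python
import random
import re


def read_conllu_sentences(text):
    """Split CoNLL-U text into sentence blocks (blank-line separated)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block for block in blocks if block.strip()]


def split_sentences(sentences, first_part, shuffle=False, seed=0):
    """Split a list of sentences into two parts.

    first_part: a float in [0, 1] is read as a proportion of the corpus,
                an int is read as an absolute number of sentences.
    shuffle:    if True, sentences are picked randomly (reproducibly,
                using `seed`); otherwise they are taken from the top.
    """
    pool = list(sentences)
    if shuffle:
        random.Random(seed).shuffle(pool)
    if isinstance(first_part, float):
        if not 0.0 <= first_part <= 1.0:
            raise ValueError("proportion must be between 0 and 1")
        n = round(len(pool) * first_part)
    else:
        n = first_part
        if not 0 <= n <= len(pool):
            raise ValueError("sentence count out of range")
    return pool[:n], pool[n:]


def write_conllu(sentences, path):
    """Write sentence blocks back out, one blank line after each."""
    with open(path, "w", encoding="utf-8") as handle:
        for block in sentences:
            handle.write(block + "\n\n")
```

With this, a 10/90 split would be `split_sentences(sentences, 0.1)`, a 100/900 split would be `split_sentences(sentences, 100)`, and adding `shuffle=True` would switch from consecutive to random selection.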
Just to give some background: I would use it within a loop that creates subsets of different sizes from the corpus, after one portion has been extracted as a fixed test set. I'm not sure how best to generalize this for different uses, but it sounds like a generally useful idea, so I'm adding it here.
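The loop I describe could look roughly like the sketch below. Again, this is only an illustration under my own assumptions: `held_out_subsets` is a hypothetical name, the sentences are assumed to be already loaded as a list, and the training subsets are nested (each larger subset contains the smaller ones), which may or may not be what others would want.

```python
import random


def held_out_subsets(sentences, test_size, train_sizes, seed=0):
    """Hold out a fixed test portion, then build training subsets.

    Returns (test, subsets), where subsets maps each requested size
    to the first `size` sentences of the remaining shuffled pool.
    """
    pool = list(sentences)
    random.Random(seed).shuffle(pool)  # reproducible random order
    test, rest = pool[:test_size], pool[test_size:]
    subsets = {}
    for size in train_sizes:
        if size > len(rest):
            raise ValueError("not enough sentences left for size %d" % size)
        subsets[size] = rest[:size]
    return test, subsets


if __name__ == "__main__":
    corpus = ["sentence %d" % i for i in range(1000)]
    test, subsets = held_out_subsets(corpus, 100, [100, 300, 900])
    for size, subset in subsets.items():
        print(size, len(subset))
```

The test portion stays the same across all subset sizes because it is cut off before the loop starts.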
Thanks a lot for many useful scripts here!