CoNLL support #10

Open
ksteimel opened this issue Apr 26, 2018 · 3 comments

Comments

@ksteimel

It'd be good to support CoNLL format in a generic sense (and then perhaps some of the more specific CoNLL formats as an offshoot). I'd be happy to work on this if this is something you think would be worth it.

@oxinabox
Member

Yes, I think it is worth it.

@Evizero has support for it in MLDatasets.jl
https://github.com/JuliaML/MLDatasets.jl/blob/master/src/CoNLL.jl
which would be a starting point.

If that is ported across and enhanced to match the CorpusLoaders style:

  • Lazily loaded from disk
  • Using MultiResolutionIterators.jl

And if it is working well, perhaps we can talk about deprecating it out of MLDatasets.jl.
Though there are perhaps pros to having two loaders, since the one in MLDatasets.jl is probably much simpler.
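
To make "lazily loaded from disk" concrete, here is a minimal sketch of a generic CoNLL-style reader. It is written in Python purely for illustration, not in Julia, and it is not the MLDatasets.jl or CorpusLoaders.jl implementation. The names `iter_conll_sentences` and `read_conll` are hypothetical, and it assumes the common layout of one token per line, whitespace-separated columns, and a blank line between sentences.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]   # one tuple of column values per token line
Sentence = List[Token]


def iter_conll_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one sentence at a time from CoNLL-style lines.

    Assumes whitespace-separated columns and blank lines as sentence
    boundaries; the number and meaning of columns is left to the caller.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(tuple(line.split()))
    # Emit the final sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


def read_conll(path: str) -> Iterator[Sentence]:
    """Lazily read sentences from a file; lines are consumed only on demand."""
    with open(path, encoding="utf-8") as handle:
        yield from iter_conll_sentences(handle)
```

Because both functions are generators, nothing is read until the caller iterates, and only one sentence is held in memory at a time. Specific CoNLL variants (fixed column meanings, document markers, comment lines) would be layered on top of a generic reader like this one.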

@Ayushk4
Member

Ayushk4 commented May 29, 2019

I am starting by adding the CoNLL 2003 corpus. The original files from the shared task are freely available.

To extract the required files from it, one needs to have the Reuters Corpus file rcv1.tar.xz and build the original files with it. This is available from the Harvard Dataverse or the NIST website. However, obtaining the Reuters corpus requires signing a user agreement, and it may take some time for it to get approved.

Alternatively, there are CoNLL 2003 files that have already been built and are openly available.

I feel it will be very difficult to take care of the downloading part with the former method, so I think I should go with the latter approach. What do you suggest in this case?

Edit: I feel the latter approach will be simpler overall, as well as easier to replicate for other CoNLL datasets.

@oxinabox
Member

The latter sounds legit
