Source: GitHub
Brat2BIO
A simple script that converts Brat format to BIO format.
Conversion code adapted from a GitHub Gist by thatguysimon.
What is this repo for?
These scripts provide simple tools to convert the Brat standoff format to BIO format.
What’s different:
- supports entities made up of multiple (discontinuous) spans of text
- saves the result to a sentence-level BIO CSV file
- visualizes the output annotations (e.g. distribution of sentence lengths, entity distribution)
- multi-process speed-up
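To illustrate the multi-span case, here is a minimal sketch of how a Brat text-bound annotation with several fragments could be mapped onto BIO tags. It is not the code used in this repo: the function names (`parse_ann_line`, `tokenize_with_offsets`, `to_bio`), the whitespace tokenizer (the actual pipeline relies on CoreNLP), and the choice to tag later fragments with `I-` are all assumptions made for the example. The only thing taken from the Brat standoff format itself is the line layout `ID<TAB>Type start end;start end<TAB>text`.

```python
import re


def parse_ann_line(line):
    """Parse one Brat text-bound annotation line into (label, spans).

    Example input: 'T1\tBodyPart 8 12;23 26\tleft arm'
    Discontinuous entities list their fragments separated by ';'.
    """
    parts = line.rstrip("\n").split("\t")
    label, offsets = parts[1].split(" ", 1)
    spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
    return label, spans


def tokenize_with_offsets(text):
    """Whitespace tokenizer (an assumption) that keeps character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def to_bio(text, ann_lines):
    """Tag each token with B-/I-/O using the character spans of the entities."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for line in ann_lines:
        if not line.startswith("T"):  # only text-bound annotations carry spans
            continue
        label, spans = parse_ann_line(line)
        first = True
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to the entity if it overlaps any fragment.
            if any(start < s_end and end > s_start for s_start, s_end in spans):
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


text = "Pain in left and right arm"
ann = ["T1\tBodyPart 8 12;23 26\tleft arm"]
for token, tag in to_bio(text, ann):
    print(f"{token}\t{tag}")
```

In this sketch the tokens between the two fragments ("and", "right") stay `O`, while "arm" continues the entity started at "left". Other schemes for discontinuous entities exist, so check the script's output if the exact convention matters for your model.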
Step-by-step instructions
Prerequisites
pycorenlp; corenlp
Install CoreNLP
Follow the setup instructions. Make sure the environment variables are added to the system path (important).
Run bash script
sudo chmod +x convert.sh ## skip this step if the script already has execute permission
./convert.sh sample output ## first argument: path to the sample data; second argument: path to the output directory
What to expect from the output
|file|description|
|:---|:---:|
|ner-crf-training-data.tsv|the output BIO annotations|
|re-training-data.corp|original data corpus|
(Optional) Ready to train deep learning models?
The demo.ipynb notebook provides a simple data preparation pipeline that separates the data into sentences and labels:
- Sentences: a list of tokenized sentences
- Label: a list of the corresponding IOB tags
Check the notebook for details
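As a rough illustration of the sentence/label split, and not a copy of what the notebook does, the sketch below groups token and tag pairs into two parallel lists. It assumes a layout with one `token<TAB>tag` pair per line and a blank line between sentences; the actual layout of the generated file may differ, so adjust the parsing accordingly. The function name `read_bio` is hypothetical.

```python
def read_bio(lines):
    """Group 'token<TAB>tag' lines into parallel sentence and label lists.

    Assumes a blank line marks a sentence boundary (an assumption about
    the file layout, not something guaranteed by the conversion script).
    """
    sentences, labels = [], []
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Sentence boundary: flush the current sentence, if any.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")[:2]
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush the last sentence when the input has no trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


sample = [
    "Pain\tB-Symptom",
    "in\tO",
    "arm\tB-BodyPart",
    "",
    "No\tO",
    "fever\tB-Symptom",
]
sentences, labels = read_bio(sample)
print(sentences)
print(labels)
```

Because `read_bio` only iterates over lines, an open file handle can be passed in place of the sample list.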