Source: GitHub

Brat2BIO

A simple script that converts the Brat format to the BIO format

Conversion code adapted from a GitHub Gist by thatguysimon


What is this repo for?

This script provides a simple tool to convert the Brat standoff format to the BIO format.

What's different:

  • supports entities made up of multiple text spans
  • saves sentence-level BIO annotations to a CSV file
  • visualizes the output annotations (e.g. distribution of sentence lengths, entity distribution)
  • uses multiple processes to speed up conversion
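To illustrate the multi-span case, here is a minimal sketch, not the repository's actual code. It parses one Brat text-bound annotation line, where fragments of a discontinuous entity are separated by `;`, and tags whitespace tokens with BIO labels. The function names `parse_ann_line` and `bio_tags` are hypothetical, whitespace tokenization stands in for CoreNLP, and the convention of tagging tokens after a gap with `I-` is an assumption; the script in this repo may handle these differently.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def parse_ann_line(line: str) -> Tuple[str, List[Span]]:
    """Parse one Brat text-bound annotation line into (label, spans).

    Example line: "T1<TAB>Location 0 5;16 23<TAB>North America"
    """
    _ann_id, type_and_offsets, _text = line.rstrip("\n").split("\t")
    label, offsets = type_and_offsets.split(" ", 1)
    spans = []
    for fragment in offsets.split(";"):  # one fragment per continuous span
        start, end = fragment.split()
        spans.append((int(start), int(end)))
    return label, spans


def bio_tags(text: str, entities: List[Tuple[str, List[Span]]]) -> List[Tuple[str, str]]:
    """Tag whitespace tokens with BIO labels (assumed convention, see lead-in)."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for label, spans in entities:
        first = True
        for i, (_tok, start, end) in enumerate(tokens):
            # A token belongs to the entity if it overlaps any of its spans.
            if any(start < s_end and end > s_start for s_start, s_end in spans):
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


text = "North and South America are continents"
entity = parse_ann_line("T1\tLocation 0 5;16 23\tNorth America")
result = bio_tags(text, [entity])
for token, tag in result:
    print(f"{token}\t{tag}")
```

Here "North" receives `B-Location` and "America" receives `I-Location`, while the tokens in the gap stay `O`.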

Step-by-step instructions

Prerequisites

pycorenlp; corenlp

Install CoreNLP

Follow the CoreNLP setup instructions. Make sure the environment variables are added to the system path (important).

Run bash script

sudo chmod +x convert.sh ## skip this step if convert.sh is already executable
./convert.sh sample output ## first argument: path to the sample data; second argument: path to the output directory

What to expect in the output

| file | description |
|:---|:---:|
| ner-crf-training-data.tsv | the output BIO annotations |
| re-training-data.corp | the original data corpus |

(Optional) Ready to train deep learning models?

The demo.ipynb notebook provides a simple data preparation pipeline that separates the data into sentences and labels:

  • Sentences: a list of tokenized sentences
  • Labels: a list of the corresponding IOB tag sequences
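As a rough sketch of this split (not the notebook's actual code), the snippet below reads BIO data into two parallel lists. It assumes a CoNLL-style layout with one tab-separated `token<TAB>tag` pair per line and a blank line between sentences. The real layout of the output file may differ, and the function name `read_bio` is hypothetical.

```python
import io
from typing import List, TextIO, Tuple


def read_bio(handle: TextIO) -> Tuple[List[List[str]], List[List[str]]]:
    """Split token/tag lines into parallel lists of sentences and labels.

    Assumed layout: "token<TAB>tag" per line, blank line between sentences.
    """
    sentences, labels = [], []
    tokens, tags = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush the last sentence when the file has no trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


# In-memory sample standing in for the real output file.
sample = (
    "Sony\tB-Organization\n"
    "launched\tO\n"
    "\n"
    "North\tB-Location\n"
    "America\tI-Location\n"
)
sentences, labels = read_bio(io.StringIO(sample))
print(sentences)
print(labels)
```

To try it on a real file, an open file handle can be passed in place of the `io.StringIO` sample, provided the file follows the assumed layout.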

Check the notebook for details.