KcBERT Pre-Training Corpus
KcBERT Pre-Training Corpus is the training data for KcBERT (Korean comments BERT), released by beomi@github. The data specification is as follows:
- author: beomi@github
- repository: https://github.com/Beomi/KcBERT
- size:
  - train: 86,246,285 examples
 
1. In Python
Launch a Python console, then download and read the corpus.
Downloading the corpus
You can download KcBERT Pre-Training Corpus to your local machine with the following procedure.
from Korpora import Korpora
Korpora.fetch("kcbert")
By default, the corpus is downloaded into the Korpora directory under the user's home directory (~/Korpora). If you want to download it to another path, pass root_dir=custom_path when you call the fetch function.
If you pass force_download=True when you call the fetch function, the corpus is downloaded again even if it already exists locally. The default is False.
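For example, to download into a custom directory and force a re-download, you could combine both options as below (the path ./custom_corpora is only an illustrative value):
from Korpora import Korpora
# download into a custom directory and re-download even if the corpus already exists locally
Korpora.fetch("kcbert", root_dir="./custom_corpora", force_download=True)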
Reading the corpus
You can read KcBERT Pre-Training Corpus in a Python console as follows. If the corpus is not present locally, it is downloaded first.
from Korpora import Korpora
corpus = Korpora.load("kcbert")
You can also read KcBERT Pre-Training Corpus as shown below; the result is the same as the operation above.
from Korpora import KcBERTKorpus
corpus = KcBERTKorpus()
Execute one of the snippets above, and the corpus is assigned to the variable corpus. The train attribute holds the train data of KcBERT Pre-Training Corpus, and you can check the first instance as follows:
>>> corpus.train[0]
우리에게 북한은 꼭 없애야 할 적일뿐
The get_all_texts method returns all the texts (news comments) in KcBERT Pre-Training Corpus.
>>> corpus.get_all_texts()
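Because the train split holds 86,246,285 comments, materializing all of them with get_all_texts can be memory-heavy; a minimal sketch for checking the size instead, assuming corpus.train supports len() as other Korpora corpora do:
>>> len(corpus.train)
86246285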
2. In terminal
You can also download the corpus without launching a Python console. The command is as follows.
korpora fetch --corpus kcbert
By default, the corpus is downloaded into the Korpora directory under the user's home directory (~/Korpora). If you want to download it to another path, pass --root_dir custom_path to the fetch command in the terminal.
If you pass --force_download to the fetch command, the corpus is downloaded again even if it already exists locally.
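For example, a command that combines both options could look like the following (the path ./custom_corpora is only an illustrative value):
korpora fetch --corpus kcbert --root_dir ./custom_corpora --force_download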