Modu: Newspaper

Modu: Newspaper is a dataset released by National Institute of Korean Language. Data specification is as follows.

author: National Institute of Korean Language
repository: https://corpus.korean.go.kr
references: document
size:
- train: about 3,500,000 examples

Data structure is as follows:

Attributes	Property
document_id	Unique id of the article
title	Title of the metadata (not the actual title of the article)
author	author of the article
publisher	newspaper publisher
date	published date
topic	topic of the article (politics, business, social affairs, lifestyle, IT/science, entertainment, sports, culture, beauty/health)
original_topic	original topic categorized by the newpaper publishers
paragraph	body of the article (the first line seems to the title of the article)

Warning

Due to the licensing issue of Modu corpus, Korpora does not provide any download functions for this corpus. Rather, it only offers a load function.If you wish to use this corpus, please complete the authentication process required by the National Institue of Korean Language and manually download the corpus.

You can load the corpus from your Python console as follows.

from Korpora import Korpora
corpus = Korpora.load("modu_news")
  

Warning

The code assumes that the corpus has already been unzipped into NIKL_NEWSPAPER directory within ~/Korpora (~/Korpora/NIKL_NEWSPAPER).If the root directory is not ~/Korpora, add root_dir=custom_path argument to the load method.

You can also load the corpus as follows. The output of these codes is identical to that of previous codes.

from Korpora import ModuNewsKorpus
corpus = ModuNewsKorpus()
  

Warning

The codes assumes that the corpus has already been unzipped into ~/Korpora/NIKL_NEWSPAPER within the current user’s local root. If the corpus exists in another directory, add root_dir=custom_path argument in ModuNewsKorpus class declaration.

Tip

If load_light=True, only the paragraphs and document_id are loaded. If it it set as False, all metadata are loaded as well. The default value of load_light is True.

If you use either one of these previous examples, you can load the corpus into the variable corpus. train refers to the training dataset of the corpus, and you can check its first training instance as follows.

>>> corpus.train[0]
ModuNews(document_id='NPRW1900000010.1', title='한국경제 2018년 기사', author='김현석', publisher='한국경제신문사', date='20180101', topic='생활', original_topic='국제', paragraph=['"라니냐로 겨울 가뭄 온다"…', '...'])
  

By executing the get_all_texts method, you can access all paragraphs (bodies of all articles) within the corpus.

>>> corpus.get_all_texts()
[''"라니냐로 겨울 가뭄 온다"...', ... ]