AI Hub Ko-En Parallel Corpus

The AI Hub Ko-En Parallel Corpus is a Korean-English parallel corpus released by AI Hub. The data specification is as follows:

Data                Property  Volume
Spoken language     train     400,000
Conversation        train     100,000
News                train     801,387
Korean culture      train     100,646
Decree              train     100,298
Government website  train     100,087
TOTAL               train     1,602,418
Warning

Due to licensing restrictions, the Korpora package provides only loading, not downloading, for the AI Hub Ko-En Parallel Corpus. To use the corpus, you must download it manually from AI Hub after completing the verification process. The translation data from AI Hub is distributed as compressed archives of Excel (.xlsx) files. When the files are unzipped, their names are in Hangul, the Korean alphabet. Hangul file names may cause unexpected problems on some operating systems. Korpora therefore assumes that the corpus has been downloaded and that all file names have been changed to the Latin-alphabet names below (a renaming sketch follows the table).

Hangul file name Latin alphabet file name
1_구어체(1)_200226.xlsx 1_spoken(1)_200226.xlsx
1_구어체(2)_200226.xlsx 1_spoken(2)_200226.xlsx
2_대화체_200226.xlsx 2_conversation_200226.xlsx
3_문어체_뉴스(1)_200226.xlsx 3_news(1)_200226.xlsx
3_문어체_뉴스(2)_200226.xlsx 3_news(2)_200226.xlsx
3_문어체_뉴스(3)_200226.xlsx 3_news(3)_200226.xlsx
3_문어체_뉴스(4)_200226.xlsx 3_news(4)_200226.xlsx
4_문어체_한국문화_200226.xlsx 4_korean_culture_200226.xlsx
5_문어체_조례_200226.xlsx 5_decree_200226.xlsx
6_문어체_지자체웹사이트_200226.xlsx 6_government_website_200226.xlsx
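
The renaming step can be scripted. The snippet below is a minimal sketch that applies the mapping above with pathlib, assuming the files were unzipped into ~/Korpora/AIHub_translation (adjust corpus_dir if you placed them elsewhere):

from pathlib import Path

# Mapping from the original Hangul file names to the Latin-alphabet names above.
RENAME_MAP = {
    "1_구어체(1)_200226.xlsx": "1_spoken(1)_200226.xlsx",
    "1_구어체(2)_200226.xlsx": "1_spoken(2)_200226.xlsx",
    "2_대화체_200226.xlsx": "2_conversation_200226.xlsx",
    "3_문어체_뉴스(1)_200226.xlsx": "3_news(1)_200226.xlsx",
    "3_문어체_뉴스(2)_200226.xlsx": "3_news(2)_200226.xlsx",
    "3_문어체_뉴스(3)_200226.xlsx": "3_news(3)_200226.xlsx",
    "3_문어체_뉴스(4)_200226.xlsx": "3_news(4)_200226.xlsx",
    "4_문어체_한국문화_200226.xlsx": "4_korean_culture_200226.xlsx",
    "5_문어체_조례_200226.xlsx": "5_decree_200226.xlsx",
    "6_문어체_지자체웹사이트_200226.xlsx": "6_government_website_200226.xlsx",
}

corpus_dir = Path.home() / "Korpora" / "AIHub_translation"  # assumed unzip location
for hangul_name, latin_name in RENAME_MAP.items():
    source_file = corpus_dir / hangul_name
    if source_file.exists():
        source_file.rename(corpus_dir / latin_name)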

Reading all the data at once

The following example reads the whole AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.
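
For example, if the files are unzipped under /data/Korpora/AIHub_translation (a hypothetical path used here only for illustration), the call would look like this:

from Korpora import Korpora
corpus = Korpora.load("aihub_translation", root_dir="/data/Korpora")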

You can also read the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubTranslationKorpus
corpus = AIHubTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubTranslationKorpus class.
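
Likewise, the same hypothetical custom root can be passed when the class is instantiated:

from Korpora import AIHubTranslationKorpus
corpus = AIHubTranslationKorpus(root_dir="/data/Korpora")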

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text="'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 앱입니다.", pair="Bible Coloring' is a coloring application that allows you to experience beautiful stories in the Bible.")
>>> corpus.train[0].text
'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 앱입니다.
>>> corpus.train[0].pair
Bible Coloring' is a coloring application that allows you to experience beautiful stories in the Bible.
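
Because every instance is a SentencePair whose text attribute holds the Korean sentence and whose pair attribute holds the English translation, the training data can be exported for downstream use. The sketch below assumes corpus.train supports iteration and writes the pairs to a tab-separated file with an arbitrary name:

# Write the Korean-English pairs to a tab-separated file (sketch; file name is arbitrary).
with open("aihub_translation_train.tsv", "w", encoding="utf-8") as f:
    for sentence_pair in corpus.train:  # assumes corpus.train is iterable
        f.write(f"{sentence_pair.text}\t{sentence_pair.pair}\n")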

Reading only Spoken language data

The following example reads the Spoken language data from the AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_spoken_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.

You can also read the Spoken language data from the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubSpokenTranslationKorpus
corpus = AIHubSpokenTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubSpokenTranslationKorpus class.

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the Spoken language training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text="'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 앱입니다.", pair="Bible Coloring' is a coloring application that allows you to experience beautiful stories in the Bible.")
>>> corpus.train[0].text
'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 앱입니다.
>>> corpus.train[0].pair
Bible Coloring' is a coloring application that allows you to experience beautiful stories in the Bible.

Reading only Conversation data

The following example reads the Conversation data from the AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_conversation_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.

You can also read the Conversation data from the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubConversationTranslationKorpus
corpus = AIHubConversationTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubConversationTranslationKorpus class.

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the Conversation training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text='이번 신제품 출시에 대한 시장의 반응은 어떤가요?', pair="How is the market's reaction to the newly released product?")
>>> corpus.train[0].text
이번 신제품 출시에 대한 시장의 반응은 어떤가요?
>>> corpus.train[0].pair
How is the market's reaction to the newly released product?

Reading only News data

The following example reads the News data from the AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_news_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.

You can also read the News data from the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubNewsTranslationKorpus
corpus = AIHubNewsTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubNewsTranslationKorpus class.

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the News training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text='스키너가 말한 보상은 대부분 눈으로 볼 수 있는 현물이다.', pair="Skinner's reward is mostly eye-watering.")
>>> corpus.train[0].text
스키너가 말한 보상은 대부분 눈으로 볼 수 있는 현물이다.
>>> corpus.train[0].pair
Skinner's reward is mostly eye-watering.

Reading only Korean culture data

The following example reads the Korean culture data from the AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_korean_culture_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.

You can also read the Korean culture data from the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubKoreanCultureTranslationKorpus
corpus = AIHubKoreanCultureTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubKoreanCultureTranslationKorpus class.

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the Korean culture training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text='강릉 기생 매화가 등장하는 판소리 열두마당의 하나인 「강릉매화전」은 판소리 특유의 해학이 담겨져 있기도 하다.', pair="<Gangneung Maehwajeon>, one of the twelve madang of pansori that Gangneung's gisaeng Maehwa appears, also contains a unique humor of pansori.")
>>> corpus.train[0].text
강릉 기생 매화가 등장하는 판소리 열두마당의 하나인 「강릉매화전」은 판소리 특유의 해학이 담겨져 있기도 하다.
>>> corpus.train[0].pair
<Gangneung Maehwajeon>, one of the twelve madang of pansori that Gangneung's gisaeng Maehwa appears, also contains a unique humor of pansori.

Reading only Decree data

The following example reads the Decree data from the AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_decree_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.

You can also read the Decree data from the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubDecreeTranslationKorpus
corpus = AIHubDecreeTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubDecreeTranslationKorpus class.

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the Decree training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text='의원의 회의규칙 제47조제1항', pair="Article 47(1) of the Members' Meeting Rules")
>>> corpus.train[0].text
의원의 회의규칙 제47조제1항
>>> corpus.train[0].pair
Article 47(1) of the Members' Meeting Rules

Reading only Government website data

The following example reads the Government website data from the AI Hub Ko-En Parallel Corpus in a Python console:

from Korpora import Korpora
corpus = Korpora.load("aihub_government_website_translation")
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation. If your root directory differs from ~/Korpora, pass root_dir=custom_path when calling the load function.

You can also read the Government website data from the AI Hub Ko-En Parallel Corpus as shown below; the result is the same as the above operation.

from Korpora import AIHubGovernmentWebsiteTranslationKorpus
corpus = AIHubGovernmentWebsiteTranslationKorpus()
Warning

The code above works only if the corpus has been unzipped into ~/Korpora/AIHub_translation under the user's home directory. If the corpus is located elsewhere, pass root_dir=custom_path when instantiating the AIHubGovernmentWebsiteTranslationKorpus class.

Select and run one of the two snippets above, and the corpus is assigned to the variable corpus. train holds the Government website training data of the AI Hub Ko-En Parallel Corpus, and you can check the first instance as follows:

>>> corpus.train[0]
SentencePair(text='"경기도가 말산업 육성을 위해 총예산 245,193천원으로 2013년 경기도 용인시 남사면 소재의 축산위생연구소 가축연구팀 부지에 경기도말시험사육장을 신축하고, 올해 2월 승용마 8두를 입식하여 본격적인 승용마 시험 연구에 돌입하였다고 밝혔다."', pair='"The Gyeonggi provincial government announced that it has established a Gyeonggi-do test farm on the site of the livestock research team of livestock sanitation Institute in Namsa-myeon, Yongin, Gyeonggji province in 2013 with a total budget of 245 million and 193 thousand won to foster the horse industry, and that it has begun full-fledged testing of eight riding horses in February this year."')
>>> corpus.train[0].text
"경기도가 말산업 육성을 위해 총예산 245,193천원으로 2013년 경기도 용인시 남사면 소재의 축산위생연구소 가축연구팀 부지에 경기도말시험사육장을 신축하고, 올해 2월 승용마 8두를 입식하여 본격적인 승용마 시험 연구에 돌입하였다고 밝혔다."
>>> corpus.train[0].pair
"The Gyeonggi provincial government announced that it has established a Gyeonggi-do test farm on the site of the livestock research team of livestock sanitation Institute in Namsa-myeon, Yongin, Gyeonggji province in 2013 with a total budget of 245 million and 193 thousand won to foster the horse industry, and that it has begun full-fledged testing of eight riding horses in February this year."