To perform Korean morphological analysis in Python, this post walks through installing KoNLPy.
Installation
$ pip install konlpy
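Installing MeCab
To use MeCab as the tagger, MeCab itself must be installed separately. KoNLPy supports several taggers; this post installs MeCab because, in the benchmarks quoted below, it is by far the fastest both to load and to run.
Loading time (class and dictionary)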
- Kkma: 5.6988 secs
- Komoran: 5.4866 secs
- Hannanum: 0.6591 secs
- Okt (previous Twitter): 1.4870 secs
- Mecab: 0.0007 secs
Execution time (100K characters)
- Kkma: 35.7163 secs
- Komoran: 25.6008 secs
- Hannanum: 8.8251 secs
- Okt (previous Twitter): 2.4714 secs
- Mecab: 0.2838 secs
Source: "Morphological analysis and POS tagging" in the KoNLPy 0.5.2 documentation
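To reproduce this kind of comparison locally, a rough timing helper along the following lines could be used. This is only a sketch: `time_tagger` and the sample text are hypothetical names introduced here, and it is not the procedure the KoNLPy documentation used to produce the numbers above. It assumes only that the tagger class can be constructed without arguments and exposes a `pos()` method, as `Mecab` does in the check at the end of this post.

```python
import time


def time_tagger(factory, text):
    """Return (load_seconds, run_seconds) for a single tagger.

    `factory` is any zero-argument callable that builds a tagger
    object with a .pos(text) method (hypothetical helper, not part
    of KoNLPy).
    """
    start = time.perf_counter()
    tagger = factory()            # loads the class and its dictionary
    loaded = time.perf_counter()
    tagger.pos(text)              # tags the whole text once
    finished = time.perf_counter()
    return loaded - start, finished - loaded


if __name__ == "__main__":
    try:
        from konlpy.tag import Mecab

        sample = "빨리 가자. " * 1000   # placeholder text, not the docs' corpus
        load_s, run_s = time_tagger(Mecab, sample)
        print(f"Mecab: load {load_s:.4f} secs, run {run_s:.4f} secs")
    except Exception as exc:      # konlpy or MeCab may not be installed yet
        print(f"Skipping measurement: {exc}")
```

Absolute numbers will differ by machine and dictionary version, so only the relative ordering between taggers is worth comparing.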
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
Internally, the script installs mecab-ko, mecab-ko-dic, and mecab-python. (In my case, the first two were already installed.)
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
Installing automake (A dependency for mecab-ko)
==> Downloading https://homebrew.bintray.com/bottles/automake-1.16.2.mojave.bottle.tar.gz
==> Downloading from https://akamai.bintray.com/fe/fe26d4df57481b6a7ca0a6915c37c53648c27ffb41926b3570c45f80fdd8888e?__gda__=exp=1587821736~h
######################################################################## 100.0%
==> Pouring automake-1.16.2.mojave.bottle.tar.gz
🍺 /usr/local/Cellar/automake/1.16.2: 131 files, 3.4MB
mecab-ko is already installed
mecab-ko-dic is already installed
Install mecab-python
/tmp ~/api-env/python-analysis-api
Cloning into 'mecab-python-0.996'...
remote: Counting objects: 17, done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 17 (delta 3), reused 0 (delta 0)
Unpacking objects: 100% (17/17), done.
~/api-env/python-analysis-api
Processing /tmp/mecab-python-0.996
Installing collected packages: mecab-python
  Running setup.py install for mecab-python ... done
Successfully installed mecab-python-0.996-ko-0.9.2
You are using pip version 9.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Done.
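Once the script has finished, a quick way to see whether the Python packages are visible to the current interpreter is to look them up without importing them. The sketch below is an illustration only: `installed_modules` is a hypothetical helper, and the module name `MeCab` is an assumption about what the mecab-python package exposes, so adjust it if your environment differs.

```python
import importlib.util


def installed_modules(names=("konlpy", "MeCab")):
    """Map each top-level module name to True if it can be found.

    The default names are assumptions: `konlpy` for KoNLPy itself and
    `MeCab` for the binding installed by mecab-python.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed_modules().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" result for either name usually means the install ran against a different Python environment than the one being checked.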
Verifying the installation
$ python
>>> from konlpy.tag import Mecab
>>> from konlpy.utils import pprint
>>> mecab = Mecab()
>>> pprint(mecab.pos("빨리 가자."))
[('빨리', 'MAG'), ('가', 'VV'), ('자', 'EF'), ('.', 'SF')]
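Because `pos()` returns a plain list of (surface, tag) tuples, the result can be post-processed with ordinary Python. As a minimal sketch, the hypothetical helper `select_by_tag` below keeps only the morphemes whose tag starts with one of the given prefixes; the input is the output shown above, hard-coded so the example runs without KoNLPy installed.

```python
# Output of mecab.pos("빨리 가자.") as shown above, hard-coded for illustration.
tagged = [('빨리', 'MAG'), ('가', 'VV'), ('자', 'EF'), ('.', 'SF')]


def select_by_tag(tagged, prefixes):
    """Return the surface forms whose POS tag starts with any prefix.

    Hypothetical helper, not part of KoNLPy.
    """
    prefixes = tuple(prefixes)
    return [surface for surface, tag in tagged if tag.startswith(prefixes)]


if __name__ == "__main__":
    # Keep the adverb-tagged and verb-tagged morphemes, drop ending and punctuation.
    print(select_by_tag(tagged, ["MAG", "VV"]))  # ['빨리', '가']
```

Matching on prefixes rather than exact tags is a design choice that lets one prefix cover a family of related tags; pass full tag names if exact matching is needed.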