ElasticPress 4.3

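I have written several times before about ElasticPress, the plugin for using Elasticsearch from WordPress, and about the adjustments needed for Japanese content whenever Elasticsearch itself or ElasticPress is upgraded. The move from major version 3 to 4 needed no significant work, and my earlier changes kept working. The upgrade from 4.2 to 4.3, however, brought major changes, and the previous fixes no longer produce the expected results.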

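First, as a recap, set up Sudachi (a Japanese morphological analysis plugin) for Elasticsearch. I use Elasticsearch 6.8.23. Download the plugin source from the official Sudachi repository and build the plugin for the version in use (6.8.23). The main commands are as follows.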

$ git clone https://github.com/WorksApplications/elasticsearch-sudachi.git
$ cd elasticsearch-sudachi
$ ./gradlew -PelasticsearchVersion=6.8.23 build
$ /usr/share/elasticsearch/bin/elasticsearch-plugin install file:///analysis-sudachi-6.8.23-2.1.0.zip

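Once Elasticsearch is ready, install the WordPress plugin. On the settings page, specify that you are using your own Elasticsearch host. After confirming that the connection works, create an index with the original, unmodified settings and check that no errors occur.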

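Now for the main topic, the modifications. The file to modify depends on the Elasticsearch version. With an older version such as 6.8.23, as here, edit "5-2.php". With a relatively newer version such as 7.10.2, edit "7-0.php". Check which Elasticsearch version you are running. Before getting to the mapping, change the replica setting, because otherwise the index status becomes "yellow": a single-node cluster has nowhere to allocate a replica. If your cluster can host one or more replicas, leave the setting as it is.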

//      'index.number_of_replicas'         => apply_filters( 'ep_default_index_number_of_replicas', 1 ),
        'index.number_of_replicas'         => apply_filters( 'ep_default_index_number_of_replicas', 0 ),
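
The line above already passes the value through the "ep_default_index_number_of_replicas" filter, so the same change can probably be made without touching the plugin file. The following is a minimal sketch, assuming it lives in a small custom plugin or the theme's functions.php (that placement is my assumption, not part of the setup described here). It should take effect the next time the index is created.

```php
<?php
// Sketch: override the replica count through the filter seen in the mapping file,
// instead of editing the file itself.
// Assumes WordPress has loaded, so add_filter() is available.
add_filter(
    'ep_default_index_number_of_replicas',
    function ( $replicas ) {
        // A single-node cluster cannot allocate replicas, so request none.
        return 0;
    }
);
```

The advantage of a hook over a direct edit is that it is not overwritten when ElasticPress is updated.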

Next, add the tokenizer settings. They go between the "analysis" and "analyzer" entries. Then change the tokenizer of the "default" analyzer to "sudachi_tokenizer".

        'analysis' => array(
            'tokenizer' => array(
                'sudachi_tokenizer' => array(
                    'type' => 'sudachi_tokenizer',
                    'split_mode' => 'A',
                    'settings_path' => '/etc/elasticsearch/sudachi/sudachi.json',
                    'resources_path' => '/etc/elasticsearch/sudachi_dict',
                ),
            ),
            'analyzer'   => array(
                'default'          => array(
//                  'tokenizer'   => 'standard',
                    'tokenizer'   => 'sudachi_tokenizer',

Nothing is changed from the defaults, but for reference here is the content of "sudachi.json".

{
    "systemDict" : "system_full.dic",
    "inputTextPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.DefaultInputTextPlugin" },
        { "class" : "com.worksap.nlp.sudachi.ProlongedSoundMarkInputTextPlugin",
          "prolongedSoundMarks": ["ー", "-", "⁓", "〜", "〰"],
          "replacementSymbol": "ー"}
    ],
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.MeCabOovProviderPlugin" },
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "補助記号", "一般", "*", "*", "*", "*" ],
          "leftId" : 5968,
          "rightId" : 5968,
          "cost" : 3857 }
    ],
    "pathRewritePlugin" : [
        { "class" : "com.worksap.nlp.sudachi.JoinNumericPlugin",
          "joinKanjiNumeric" : true },
        { "class" : "com.worksap.nlp.sudachi.JoinKatakanaOovPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ],
          "minLength" : 3
        }
    ]
}

Some documentation says that "system_core.dic" must sit directly under the settings path directory, "/etc/elasticsearch/sudachi", so I placed it there just in case. "system_full.dic" goes into "/etc/elasticsearch/sudachi_dict", as specified by "resources_path".

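Add the filter settings for Sudachi. Configuring "split", "part_of_speech", "baseform", "ja_stop", and "normalizedform" should be enough. The char_filter list also adds "icu_normalizer" and "kuromoji_iteration_mark", which come from the analysis-icu and analysis-kuromoji plugins, so those need to be installed as well.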

//                  'filter'      => apply_filters( 'ep_default_analyzer_filters', array( 'standard', 'ewp_word_delimiter', 'lowercase', 'stop', 'ewp_snowball' ) ),
                    'filter'      => apply_filters( 'ep_default_analyzer_filters', array( 'split', 'part_of_speech', 'baseform', 'ja_stop', 'normalizedform', 'ewp_word_delimiter', 'lowercase', 'stop', 'ewp_snowball' ) ),
//                  'char_filter' => apply_filters( 'ep_default_analyzer_char_filters', array( 'html_strip' ) ),
                    'char_filter' => apply_filters( 'ep_default_analyzer_char_filters', array( 'html_strip', 'icu_normalizer', 'kuromoji_iteration_mark' ) ),

            'filter'     => array(
                'split' => array(
                    'type' => 'sudachi_split',
                    'mode' => 'search',
                ),
                'part_of_speech' => array(
                    'type' => 'sudachi_part_of_speech',
                    'stoptags' => array( '助詞', '助動詞', '補助記号,句点', '補助記号,読点' ),
                ),
                'baseform' => array(
                    'type' => 'sudachi_baseform',
                ),
                'ja_stop' => array(
                    'type' => 'sudachi_ja_stop',
                    'stopwords' => array( '_japanese_', 'は', 'です' ),
                ),
                'normalizedform' => array(
                    'type' => 'sudachi_normalizedform',
                ),
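
The two lists edited above are wrapped in the "ep_default_analyzer_filters" and "ep_default_analyzer_char_filters" hooks, so they could also be changed from outside the mapping file. This is a rough sketch under that assumption. The hooks only change the lists of names, so the tokenizer and the filter definitions themselves still have to exist in the mapping.

```php
<?php
// Sketch: adjust the analyzer's filter and char_filter lists through the hooks
// visible in the mapping file. Assumes WordPress has loaded (add_filter exists).
add_filter(
    'ep_default_analyzer_filters',
    function ( $filters ) {
        // Put the Sudachi filters first and keep what ElasticPress already lists,
        // except 'standard', which the direct edit above also drops.
        $sudachi = array( 'split', 'part_of_speech', 'baseform', 'ja_stop', 'normalizedform' );
        $filters = array_diff( $filters, array( 'standard' ) );
        return array_values( array_merge( $sudachi, $filters ) );
    }
);

add_filter(
    'ep_default_analyzer_char_filters',
    function ( $char_filters ) {
        // Append the two extra char filters after the existing 'html_strip'.
        $extra = array( 'icu_normalizer', 'kuromoji_iteration_mark' );
        return array_values( array_merge( $char_filters, $extra ) );
    }
);
```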

If these settings work, you get an index that produces results like the following. The "split" (split mode) and "part_of_speech" filters work well together, and this is one of my favorite index configurations of late.

$ curl 'localhost:9200/arigatojp-post-1/_analyze?pretty' -H "Content-Type: application/json" -d '{"analyzer":"default","text":"日本人のやすみはみじかいね"}'
{
  "tokens" : [
    {
      "token" : "日本",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "休み",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "短い",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 5
    }
  ]
}
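
When comparing results before and after a settings change, it can help to reduce a response like the one above to just its tokens. The following is a small standalone sketch. The function name "extract_tokens" and the abbreviated sample string are mine, introduced only for illustration.

```php
<?php
// Sketch: pull just the token strings out of an _analyze response.
function extract_tokens( string $json ): array {
    $data = json_decode( $json, true );
    // Return an empty list for invalid JSON or a response without "tokens".
    if ( ! is_array( $data ) || ! isset( $data['tokens'] ) || ! is_array( $data['tokens'] ) ) {
        return array();
    }
    return array_column( $data['tokens'], 'token' );
}

// Abbreviated copy of the response shown above (offsets and types omitted).
$sample = '{"tokens":[{"token":"日本","position":0},{"token":"人","position":1},{"token":"休み","position":3},{"token":"短い","position":5}]}';

echo implode( ' / ', extract_tokens( $sample ) ), PHP_EOL;
```

The gaps in "position" (2 and 4 are missing) are where the particles "の" and "は" were removed by the part-of-speech filter.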

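One more thing: the search algorithm, which changed significantly in this version. The original setting is "4.0", but that algorithm version narrows the results too much, so I change it to use the "default" algorithm.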

        public function get_search_algorithm( string $search_text, array $search_fields, array $query_vars ) : \ElasticPress\SearchAlgorithm {
//              $search_algorithm_version_option = \ElasticPress\Utils\get_option( 'ep_search_algorithm_version', '4.0' );
                $search_algorithm_version_option = \ElasticPress\Utils\get_option( 'ep_search_algorithm_version', 'default' );
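
As an alternative to editing the default in the file, the version might be overridden with a hook. This sketch assumes that the installed ElasticPress applies a filter named "ep_search_algorithm_version" to the option read above. I have not confirmed that against this exact release, so check the plugin source before relying on it.

```php
<?php
// Sketch only: assumes ElasticPress applies a filter named
// 'ep_search_algorithm_version' to the option read in get_search_algorithm().
// Verify the hook name in the installed plugin source.
// Assumes WordPress has loaded, so add_filter() is available.
add_filter(
    'ep_search_algorithm_version',
    function ( $version ) {
        // Same value as the edited default above.
        return 'default';
    }
);
```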

Last updated: November 5, 2022
