UTF8 encoding is longer than the max length 32766


crosscheck 2020. 12. 12. 10:06


I upgraded my Elasticsearch cluster from 1.1 to 1.2 and now get errors when indexing a somewhat large string.

{
  "error": "IllegalArgumentException[Document contains at least one immense term in field=\"response_body\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[7b 22 58 48 49 5f 48 6f 74 65 6c 41 76 61 69 6c 52 53 22 3a 7b 22 6d 73 67 56 65 72 73 69]...']",
  "status": 500
}

The index mapping:

{
  "template": "partner_requests-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "request": {
      "properties": {
        "asn_id": { "index": "not_analyzed", "type": "string" },
        "search_id": { "index": "not_analyzed", "type": "string" },
        "partner": { "index": "not_analyzed", "type": "string" },
        "start": { "type": "date" },
        "duration": { "type": "float" },
        "request_method": { "index": "not_analyzed", "type": "string" },
        "request_url": { "index": "not_analyzed", "type": "string" },
        "request_body": { "index": "not_analyzed", "type": "string" },
        "response_status": { "type": "integer" },
        "response_body": { "index": "not_analyzed", "type": "string" }
      }
    }
  }
}

I searched the documentation but found nothing about a maximum field size. Going by the core types section, I don't understand why I should "correct the analyzer" for a not_analyzed field.

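So you are running into the maximum size of a single term. When you set a field to not_analyzed, the whole value is treated as one single term. The maximum size of a single term in the underlying Lucene index is 32766 bytes, which I believe is hard-coded.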
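The limit is counted in UTF-8 bytes, not characters, so non-ASCII text reaches it sooner. A minimal Python sketch of a pre-indexing check follows; the helper names are hypothetical, and only the 32766 figure comes from the error message above.

```python
# Lucene's per-term limit quoted in the error message, in bytes.
MAX_TERM_BYTES = 32766


def utf8_size(value: str) -> int:
    """Return the number of bytes the string occupies when UTF-8 encoded."""
    return len(value.encode("utf-8"))


def exceeds_term_limit(value: str, limit: int = MAX_TERM_BYTES) -> bool:
    """True if the whole value, indexed as one single term, would be too long."""
    return utf8_size(value) > limit


# ASCII characters take 1 byte each; the Hangul syllable below takes 3.
ascii_body = "a" * 32766
hangul_body = "가" * 10923

print(exceeds_term_limit(ascii_body))   # fits exactly at the limit
print(exceeds_term_limit(hangul_body))  # fewer characters, but more bytes
```

A check like this could run before sending a document, to decide whether a not_analyzed field would be rejected.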

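Your two primary options are to change the type to binary, or to keep using string but set the index type to "no".

If you really want not_analyzed on the property because you need exact filtering, you can use "ignore_above": 256. Values longer than that are then not indexed as a term.

Here is an example of how I use it in PHP: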

'mapping'    => [
    'type'   => 'multi_field',
    'path'   => 'full',
    'fields' => [
        '{name}' => [
            'type'     => 'string',
            'index'    => 'analyzed',
            'analyzer' => 'standard',
        ],
        'raw' => [
            'type'         => 'string',
            'index'        => 'not_analyzed',
            'ignore_above' => 256,
        ],
    ],
],

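In your case you probably want to do as John Petrone said and set "index": "no", but for anyone else who, like me, finds this question after searching for the exception, the options are:

set "index": "no"
set "index": "analyzed"
set "index": "not_analyzed" and "ignore_above": 256

Which one fits depends on whether and how you want to filter on that property.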
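For reference, the three options can be written out as mapping fragments. This is only a sketch in the Elasticsearch 1.x string syntax used in the question; the option labels and the helper name are hypothetical.

```python
import json

# Field mapping fragments in the Elasticsearch 1.x syntax used in this thread.
# The dictionary keys are made-up labels for illustration.
MAPPING_OPTIONS = {
    # Kept in the returned document, but not searchable.
    "stored_only": {"type": "string", "index": "no"},
    # Split into terms by an analyzer, so full-text search works.
    "full_text": {"type": "string", "index": "analyzed"},
    # Exact match on short values; longer values are not indexed.
    "exact_short": {"type": "string", "index": "not_analyzed", "ignore_above": 256},
}


def response_body_mapping(option: str) -> dict:
    """Return a properties fragment for response_body using the chosen option."""
    return {"response_body": dict(MAPPING_OPTIONS[option])}


print(json.dumps(response_body_mapping("exact_short"), indent=2))
```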

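There is a better option than the one John posted, because with that solution you can no longer search on the value.

Back to the problem:

The problem is that by default the field value is used as a single term (the complete string). If that term/string is longer than 32766 bytes, it cannot be stored in Lucene.

Older versions of Lucene only registered a warning when a term was too long and ignored the value. Newer versions throw an exception. See the bugfix: https://issues.apache.org/jira/browse/LUCENE-5472

Solution:

The best option is to define a (custom) analyzer on the field that holds the long string value. The analyzer can split the long string into smaller strings/terms, which fixes the problem of terms that are too long.

If you use the "_all" field, don't forget to add an analyzer to it as well.

Analyzers can be tested with the REST API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html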
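To see why splitting helps, here is a rough Python illustration. It is not the Elasticsearch standard analyzer, only a regex-based approximation of a tokenizer, and the sample body is made up: the whole value is far over the limit, while each individual term stays tiny.

```python
import re

# Lucene's per-term limit from the error message, in bytes.
MAX_TERM_BYTES = 32766


def rough_terms(text: str) -> list:
    """Split on runs of non-word characters (a crude stand-in for a tokenizer)."""
    return re.findall(r"\w+", text)


def longest_term_bytes(terms: list) -> int:
    """Return the UTF-8 size of the longest term, or 0 if there are none."""
    return max((len(term.encode("utf-8")) for term in terms), default=0)


# Hypothetical response body, repeated to push it well past the limit.
body = '{"hotel": "example", "rooms": 2} ' * 2000

print(len(body.encode("utf-8")) > MAX_TERM_BYTES)              # one term: too long
print(longest_term_bytes(rough_terms(body)) <= MAX_TERM_BYTES)  # split: every term fits
```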

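I needed to change the index part of the mapping to no instead of not_analyzed. That way the value is not indexed. It remains available in the returned document (from a search, a get, ...) but I can't query it.

One way of handling tokens that exceed the Lucene limit is to use the truncate filter, which is similar to ignore_above for keywords. Elasticsearch suggests using ignore_above = 32766 / 4 = 8191, since a UTF-8 character may occupy at most 4 bytes. https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html To demonstrate the filter, I am using a length of 5: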

curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
  "filter" : [{"type": "truncate", "length": 5}],
  "tokenizer": {
    "type":    "pattern"
  },
  "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}'

Output:

{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "movie",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "AAAAA",
      "start_offset": 14,
      "end_offset": 52,
      "type": "word",
      "position": 2
    }
  ]
}
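The arithmetic behind the suggested ignore_above value, and the effect of truncating tokens to a fixed length, can be checked with a short Python sketch. The function name is hypothetical, and the slicing only mimics what the truncate filter does to each token.

```python
# Lucene's per-term limit in bytes, and the worst-case size of one UTF-8 character.
MAX_TERM_BYTES = 32766
MAX_BYTES_PER_UTF8_CHAR = 4

# Largest character count that can never exceed the byte limit.
ignore_above = MAX_TERM_BYTES // MAX_BYTES_PER_UTF8_CHAR

# A worst-case value: every character is a 4-byte code point.
worst_case = "\U0001F600" * ignore_above


def truncate_tokens(tokens: list, length: int) -> list:
    """Cut every token down to at most `length` characters."""
    return [token[:length] for token in tokens]


print(ignore_above)
print(len(worst_case.encode("utf-8")) <= MAX_TERM_BYTES)
print(truncate_tokens(["This", "movie", "A" * 38], 5))
```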

I got around this problem by changing my analyzer.

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "standard" : {
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "stop"]
                }
            }
        }
    }
}

If you are using searchkick, upgrade elasticsearch to >= 2.2.0 and make sure you are using searchkick 1.3.4 or later.

This version of searchkick sets ignore_above = 256 by default, so you won't get this error when a value's UTF-8 encoding is longer than 32766 bytes.

This is discussed here.

In Solr v6+ I changed the field type to text_general and it solved my problem.

<field name="body" type="string" indexed="true" stored="true" multiValued="false"/>   
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>

Using logstash to index those long messages, I use this filter to truncate the long string :

    filter {
        ruby {
            code => "event.set('message_size',event.get('message').bytesize) if event.get('message')"
        }
        ruby {
            code => "
                if (event.get('message_size'))
                    event.set('message', event.get('message')[0..9999]) if event.get('message_size') > 32000
                    event.tag 'long message'  if event.get('message_size') > 32000
                end
            "
         }
     }

It adds a message_size field so that I can sort the longest messages by size.

It also adds the long message tag to those that are over 32000 bytes so I can select them easily.

It doesn't solve the problem if you intend to index those long messages completely, but if, like me, you don't want them in elasticsearch in the first place and want to track them down to fix the source, it's a working solution.
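Note that the Ruby filter measures the message in bytes but slices it by characters, so a message made of multi-byte characters could in principle still be larger than the cut suggests. A minimal Python sketch of the same idea, truncating by bytes instead, is shown here; the function names and the event dictionary shape are assumptions for illustration, not Logstash API.

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" drops a trailing partial multi-byte sequence left by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def process_event(event: dict, threshold: int = 32000, keep_bytes: int = 10000) -> dict:
    """Record the message size, then truncate and tag messages over the threshold."""
    message = event.get("message")
    if message is None:
        return event
    size = len(message.encode("utf-8"))
    result = dict(event, message_size=size)
    if size > threshold:
        result["message"] = truncate_utf8(message, keep_bytes)
        result["tags"] = list(event.get("tags", [])) + ["long message"]
    return result


print(process_event({"message": "a" * 40000})["message_size"])
```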

I've stumbled upon the same error message with Drupal's Search api attachments module:

Document contains at least one immense term in field="saa_saa_file_entity" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

Changing the field's type from string to Fulltext (in /admin/config/search/search-api/index/elastic_index/fields) solved the problem for me.

Reference URL: https://stackoverflow.com/questions/24019868/utf8-encoding-is-longer-than-the-max-length-32766
