UTF8 to/from wide char conversion in STL


crosscheck 2020. 11. 8. 09:20

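Is it possible to convert a UTF8 string in a std::string to a std::wstring, and vice versa, in a platform-independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I am limited to the standard C++ library.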

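I asked this question five years ago. This thread was very helpful to me back then; I came to a conclusion and moved on with my project. Funnily enough, I recently needed something similar, totally unrelated to that earlier project. While researching possible solutions, I stumbled upon my own question. :)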

The solution I chose this time is based on C++11. The Boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, the conversions look like this:

UTF-8 to UTF-16

std::string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::u16string dest = convert.from_bytes(source);

UTF-16 to UTF-8

std::u16string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string dest = convert.to_bytes(source);
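
For reference, here is a minimal self-contained sketch that wraps the two snippets above into functions. The function names utf8_to_utf16 and utf16_to_utf8 are made up for illustration and do not come from the original answer. Note that std::wstring_convert and the <codecvt> header were deprecated in C++17, so some compilers may warn about this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
// from_bytes throws std::range_error if the input cannot be converted.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
// to_bytes throws std::range_error if the input cannot be converted.
std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

A fresh converter object is created per call so that no conversion state is shared between calls.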

As you can see from the other answers, there are several approaches to the problem. That is why I have not picked an accepted answer.

UTF8-CPP: UTF-8 with C++ in a Portable Way

You can extract utf8_codecvt_facet from the Boost serialization library.

Usage example:

  typedef wchar_t ucs4_t;

  std::locale old_locale;
  std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

  // Set a New global locale
  std::locale::global(utf8_locale);

  // Send the UCS-4 data out, converting to UTF-8
  {
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(),ucs4_data.end(),
          std::ostream_iterator<ucs4_t,ucs4_t>(ofs));
  }

  // Read the UTF-8 data back in, converting to UCS-4 on the way in
  std::vector<ucs4_t> from_file;
  {
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    while (ifs >> item) from_file.push_back(item);
  }

Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.

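The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem: all it takes is a little bit-twiddling to convert from one UTF spec to another.

Just look at the encodings on the Wikipedia pages for UTF-8, UTF-16, and UTF-32.

The principle is simple: go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding. That is what makes this a simple problem.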
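
To make the "emit" half of that principle concrete, here is a small sketch that takes one code point and writes it out in two encodings. The names encode_utf8 and encode_utf16 are illustrative only, and the sketch assumes the code point is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <string>

// Emit one code point as 1 to 4 UTF-8 bytes.
// Assumes cp is a valid scalar value; no checking is done here.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Emit the same code point as 1 or 2 UTF-16 code units.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp <= 0xFFFF) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 across the surrogate pair
        out += static_cast<char16_t>(0xD800 + (cp >> 10));
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}
```

For example, U+20AC (the euro sign) becomes the bytes E2 82 AC in UTF-8 and the single unit 20AC in UTF-16, while U+1F600 becomes F0 9F 98 80 in UTF-8 and the surrogate pair D83D DE00 in UTF-16.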

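Here is a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded; the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.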

std::string wchar_to_UTF8(const wchar_t * in)
{
    std::string out;
    unsigned int codepoint = 0;
    for (;  *in != 0;  ++in)
    {
        if (*in >= 0xd800 && *in <= 0xdbff)
            codepoint = ((*in - 0xd800) << 10) + 0x10000;
        else
        {
            if (*in >= 0xdc00 && *in <= 0xdfff)
                codepoint |= *in - 0xdc00;
            else
                codepoint = *in;

            if (codepoint <= 0x7f)
                out.append(1, static_cast<char>(codepoint));
            else if (codepoint <= 0x7ff)
            {
                out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else if (codepoint <= 0xffff)
            {
                out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else
            {
                out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            codepoint = 0;
        }
    }
    return out;
}

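The above code works for both UTF-16 and UTF-32 input, simply because the range d800 through dfff contains no valid code points; values in that range indicate that you are decoding UTF-16. If you know that wchar_t is 32 bits, you could remove some code to optimize the function.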

std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    unsigned int codepoint = 0;  // initialized so malformed input never reads an indeterminate value
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (sizeof(wchar_t) > 2)
                out.append(1, static_cast<wchar_t>(codepoint));
            else if (codepoint > 0xffff)
            {
                // Remove the 0x10000 offset before splitting into a surrogate pair.
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}

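Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it should not make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.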
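
Since the functions above assume well-formed input, the separate validation step mentioned earlier could look roughly like the following. This is only a sketch: the name is_valid_utf8 is hypothetical, and the checks (lead byte ranges, continuation bytes, overlong forms, surrogates, the 0x10FFFF limit) follow the usual UTF-8 rules rather than anything given in the original answer.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. Hypothetical helper for illustration.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned int cp = 0;

        if (lead <= 0x7f)                      { len = 1; cp = lead; }
        else if (lead >= 0xc2 && lead <= 0xdf) { len = 2; cp = lead & 0x1f; }
        else if (lead >= 0xe0 && lead <= 0xef) { len = 3; cp = lead & 0x0f; }
        else if (lead >= 0xf0 && lead <= 0xf4) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > s.size())
            return false;   // sequence is cut off at the end of the string

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xc0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3f);
        }

        if (len == 3 && cp < 0x800)
            return false;  // overlong 3-byte form
        if (len == 4 && (cp < 0x10000 || cp > 0x10ffff))
            return false;  // overlong 4-byte form or out of range
        if (cp >= 0xd800 && cp <= 0xdfff)
            return false;  // surrogates are not valid in UTF-8

        i += len;
    }
    return true;
}
```

Calling such a check before UTF8_to_wchar keeps the conversion code itself short, which is the design choice the answer argues for.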

There are several ways to do this, but the results depend on what the character encodings are in the string and wstring variables.

If you know the string is ASCII, you can simply use wstring's iterator constructor:

string s = "This is surely ASCII.";
wstring w(s.begin(), s.end());

If your string has some other encoding, however, you'll get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.

If your string contains characters in a code page, then may $DEITY have mercy on your soul.
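
For the ASCII-only case above, the assumption can be made explicit with a guard, so that non-ASCII input fails loudly instead of producing bad results. This is a small sketch; is_ascii and widen_ascii are names invented for this example.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// True if every byte is in the 7-bit ASCII range.
bool is_ascii(const std::string& s)
{
    return std::all_of(s.begin(), s.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}

// Widen with the iterator constructor, but only after checking the input.
std::wstring widen_ascii(const std::string& s)
{
    if (!is_ascii(s))
        throw std::invalid_argument("widen_ascii: input contains non-ASCII bytes");
    return std::wstring(s.begin(), s.end());
}
```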

ConvertUTF.h ConvertUTF.c

Credit to bames53 for providing updated versions

You can use the codecvt locale facet. There's a specific specialisation defined, codecvt<wchar_t, char, mbstate_t> that may be of use to you, although, the behaviour of that is system-specific, and does not guarantee conversion to UTF-8 in any way.
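
As a rough illustration of the facet approach, the sketch below pulls codecvt<wchar_t, char, mbstate_t> out of a locale and calls its in() member. The function name widen_with_codecvt is made up, and the sketch assumes each wide character consumes at least one input byte, so a buffer of in.size() wide characters is enough. As noted above, which narrow encoding the facet understands depends on the locale and the system, so this is only UTF-8 aware if the supplied locale happens to be.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow -> wide using the locale's codecvt facet.
std::wstring widen_with_codecvt(const std::string& in,
                                const std::locale& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    std::wstring out(in.size(), L'\0');  // assumed upper bound on output length
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    const std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv)
        return std::wstring(in.begin(), in.end());  // facet did nothing; widen bytes
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("widen_with_codecvt: conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With the classic "C" locale this should be dependable only for plain ASCII, which is exactly the portability caveat the answer raises.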

UTFConverter - check out this library. It does this kind of conversion, but you also need the ConvertUTF class - I've found it here

Created my own library for utf-8 to utf-16/utf-32 conversion - but decided to make a fork of existing project for that purpose.

https://github.com/tapika/cutf

(Originated from https://github.com/noct/cutf )

API works with plain C as well as with C++.

Function prototypes look like this (for the full list see https://github.com/tapika/cutf/blob/master/cutf.h ):

//
//  Converts utf-8 string to wide version.
//
//  returns target string length.
//
size_t utf8towchar(const char* s, size_t inSize, wchar_t* out, size_t bufSize);

//
//  Converts wide string to utf-8 string.
//
//  returns filled buffer length (not string length)
//
size_t wchartoutf8(const wchar_t* s, size_t inSize, char* out, size_t outsize);

#ifdef __cplusplus

std::wstring utf8towide(const char* s);
std::wstring utf8towide(const std::string& s);
std::string  widetoutf8(const wchar_t* ws);
std::string  widetoutf8(const std::wstring& ws);

#endif

Sample usage / simple test application for utf conversion testing:

#include <cstdint>   // uint8_t
#include <cstdio>    // printf
#include "cutf.h"

#define ok(statement)                                       \
    if( !(statement) )                                      \
    {                                                       \
        printf("Failed statement: %s\n", #statement);       \
        r = 1;                                              \
    }

int simpleStringTest()
{
    const wchar_t* chineseText = L"主体";
    auto s = widetoutf8(chineseText);
    size_t r = 0;

    printf("simple string test:  ");

    ok( s.length() == 6 );
    uint8_t utf8_array[] = { 0xE4, 0xB8, 0xBB, 0xE4, 0xBD, 0x93 };

    for(int i = 0; i < 6; i++)
        ok(((uint8_t)s[i]) == utf8_array[i]);

    auto ws = utf8towide(s);
    ok(ws.length() == 2);
    ok(ws == chineseText);

    if( r == 0 )
        printf("ok.\n");

    return (int)r;
}

If this library does not satisfy your needs, feel free to open the following link:

http://utf8everywhere.org/

then scroll down to the end of the page and pick any heavier library you like.

I don't think there's a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.

As Chris suggested, your best bet is to play with codecvt.

Reference URL: https://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl
