Strip HTML from text in JavaScript

crosscheck 2020. 10. 3. 09:58

Is there an easy way to take a string of HTML in JavaScript and strip out the HTML?

If you're running in a browser, the easiest way is to let the browser do it for you.

function stripHtml(html)
{
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

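Note: as people have pointed out in the comments, this is best avoided if you do not control the source of the HTML (for example, do not run it on anything that could have come from user input). For those scenarios you can still let the browser do the work for you: see Saba's answer on using the now widely available DOMParser.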

myString.replace(/<[^>]*>?/gm, '');
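To see what this one-line regex does and does not handle, here is a small sketch that wraps it in a helper. The name stripTagsSimple is hypothetical and not part of the original answer, and the sample inputs are assumptions chosen to show the behavior.

```javascript
// Hypothetical wrapper around the regex shown above.
function stripTagsSimple(text) {
  // "<", then anything up to the next ">", with the ">" itself optional
  // so that an unclosed tag at the end of the string is removed as well.
  return text.replace(/<[^>]*>?/gm, '');
}

console.log(stripTagsSimple('<p>Hello <b>world</b></p>')); // Hello world
console.log(stripTagsSimple('Some text <img'));            // "Some text "
// Limitation: a bare "<" in ordinary text is treated as the start of a tag.
console.log(stripTagsSimple('1 < 2 is true'));             // "1 "
```

Because it cannot tell a tag from a stray "<", this regex is probably best kept for markup you already know to be well formed.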

The simplest way:

jQuery(html).text();

That retrieves all the text from a string of HTML.

I would like to share an edited version of Shog9's accepted answer.

As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you".

So here is my edited version, using DOMParser:

function strip(html){
   var doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is the code to test the inline JavaScript:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

It also does not request resources (such as images) while parsing:

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")
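DOMParser only exists in browsers, so code shared with Node.js needs a guard. The following is a minimal sketch under that assumption: the name stripHtmlAnywhere and the regex fallback are illustrative additions, not part of the original answer, and the fallback does not decode entities the way a real parser does.

```javascript
// Hypothetical helper: use DOMParser when it exists, otherwise fall back to a regex.
function stripHtmlAnywhere(html) {
  if (typeof DOMParser !== "undefined") {
    // Browser path: parse into an inert document and read its text.
    var doc = new DOMParser().parseFromString(String(html), "text/html");
    return doc.body.textContent || "";
  }
  // Non-browser path (e.g. plain Node.js): remove anything that looks like a tag.
  return String(html).replace(/<[^>]*>/g, "");
}

console.log(stripHtmlAnywhere("<p>Hello <b>world</b></p>")); // Hello world
```

The two paths can differ on malformed markup and on entities such as &amp;amp;, so a dedicated parser library may be the better choice when results must match exactly.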

As an extension to the jQuery method, if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field),

jQuery(html).text();

will return an empty string if there is no HTML.

Use:

jQuery('<p>' + html + '</p>').text();

instead.

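Update: as pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html if the value of html could be influenced by an attacker, so use a different solution.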

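Converting HTML for plain text emailing, keeping hyperlinks (a href) intact

The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich text editor (for example FCKEditor), clearing out all the HTML but leaving the links, because I wanted both the HTML and the plain text version in order to build the correct parts of an SMTP email (both HTML and plain text).

After a long time searching Google, my colleagues and I came up with this, using the regex engine in JavaScript: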

str='this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
str=str.replace(/<br>/gi, "\n");
// Match only the opening <p ...> tag; a greedy /<p.*>/ would also swallow the text after it
str=str.replace(/<p\b[^>]*>/gi, "\n");
str=str.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<(?:.|\s)*?>/g, "");

The str variable starts out like this:

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

and after the code has run it looks like this:

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1


Now back to normal text and stuff

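As you can see, all the HTML has been removed and the link has been preserved, with the hyperlinked text still intact. The <p> and <br> tags have also been replaced with \n (the newline character) so that some visual formatting is retained.

To change the link format (e.g. BBC (Link->http://www.bbc.co.uk)), just edit $2 (Link->$1), where $1 is the href URL/URI and $2 is the hyperlinked text. With the links placed directly in the body of the plain text, most SMTP mail clients convert them so that the user can click on them.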
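The link format described above can also be passed in as a function rather than edited inside the regex. This is only a sketch: the helper name htmlToPlainText, its formatLink parameter, and the slightly stricter patterns are assumptions made for illustration, not the original author's code.

```javascript
// Hypothetical helper: formatLink(text, url) decides how each hyperlink is rendered.
function htmlToPlainText(html, formatLink) {
  formatLink = formatLink || function (text, url) {
    return " " + text + " (Link->" + url + ") "; // same format as the answer above
  };
  return html
    .replace(/<br\s*\/?>/gi, "\n")            // line breaks
    .replace(/<p\b[^>]*>/gi, "\n")            // paragraph openings
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi,
      function (match, url, text) { return formatLink(text, url); })
    .replace(/<[^>]*>/g, "");                 // every remaining tag
}

var sample = 'See <a href="http://www.bbc.co.uk">BBC</a> now';
console.log(htmlToPlainText(sample));
// Markdown-style links instead of the default format:
console.log(htmlToPlainText(sample, function (text, url) {
  return "[" + text + "](" + url + ")";
}));
```

A custom format should avoid angle brackets around the URL, since the final replace would strip them as if they were a tag.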

Hope you find this useful.

An improvement to the accepted answer.

function strip(html)
{
   var tmp = document.implementation.createHTMLDocument("New").body;
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

This way, running something like the following will do no harm:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Firefox, Chromium and Explorer 9+ are safe. Opera Presto is still vulnerable. Images mentioned in the string are also not downloaded in Chromium and Firefox, which saves HTTP requests.

This should do the job in any JavaScript environment (NodeJS included):

text.replace(/<[^>]+>/g, '');

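I altered Jibberboy2000's answer to handle several <BR /> tag formats, remove everything inside <SCRIPT> and <STYLE> tags, tidy the result by collapsing multiple line breaks and spaces, and convert some HTML-encoded characters back to normal. After some testing, it appears to convert most full web pages into simple text in which the page title and content are retained.

In the simple example,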

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->

<head>

<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>

    body {margin-top: 15px;}
    a { color: #D80C1F; font-weight:bold; text-decoration:none; }

</style>
</head>

<body>
    <center>
        This string has <i>html</i> code i want to <b>remove</b><br>
        In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;                 
    </center>
</body>
</html>

becomes

This is my title

This string has html code i want to remove

In this line BBC (http://www.bbc.co.uk) with link is mentioned.

Now back to "normal text" and stuff using <html encoding>

The JavaScript function and test page look like this:

function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- remove BR tags and replace them with line break
    returnText=returnText.replace(/<br>/gi, "\n");
    returnText=returnText.replace(/<br\s\/>/gi, "\n");
    returnText=returnText.replace(/<br\/>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
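    //-- match only the opening <p ...> tag; a greedy /<p.*>/ would also remove the text that follows it
    returnText=returnText.replace(/<p\b[^>]*>/gi, "\n");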
    returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove all inside SCRIPT and STYLE tags
    returnText=returnText.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
    returnText=returnText.replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, "");
    //-- remove all else
    returnText=returnText.replace(/<(?:.|\s)*?>/g, "");

    //-- get rid of more than 2 multiple line breaks:
    returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");

    //-- get rid of more than 2 spaces:
    returnText = returnText.replace(/ +(?= )/g,'');

    //-- get rid of html-encoded characters:
    returnText=returnText.replace(/&nbsp;/gi," ");
    returnText=returnText.replace(/&quot;/gi,'"');
    returnText=returnText.replace(/&lt;/gi,'<');
    returnText=returnText.replace(/&gt;/gi,'>');
    //-- decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<"
    returnText=returnText.replace(/&amp;/gi,"&");

    //-- return
    document.getElementById("output").value = returnText;
}

It was used with this HTML:

<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
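The entity-decoding step at the end of convertHtmlToText can also be done in a single pass, which avoids decoding the same text twice. The sketch below is an illustration only: the names ENTITIES and decodeBasicEntities are hypothetical, and the table covers just the five entities handled above.

```javascript
// Hypothetical lookup table for the five entities handled by convertHtmlToText.
var ENTITIES = {
  "&nbsp;": " ",
  "&amp;": "&",
  "&quot;": '"',
  "&lt;": "<",
  "&gt;": ">"
};

// Replace each known entity exactly once, scanning left to right.
function decodeBasicEntities(text) {
  return text.replace(/&(?:nbsp|amp|quot|lt|gt);/gi, function (entity) {
    return ENTITIES[entity.toLowerCase()];
  });
}

console.log(decodeBasicEntities("&quot;normal text&quot; using &lt;html encoding&gt;"));
// An escaped entity is decoded one level only:
console.log(decodeBasicEntities("&amp;lt;b&amp;gt;")); // &lt;b&gt;
```

Since replaced text is never rescanned, the order of the table entries does not matter.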

var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");

This is a regex version that is more resilient to malformed HTML, such as:

Unclosed tags

Some text <img

"<", ">" inside tag attributes

Some text <img alt="x > y">

Newlines

Some <a href="http://google.com">

The code

var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
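To check the claims above, the same pattern can be run against each malformed case. This is a small sketch; the names TAG_PATTERN and stripTagsResilient are hypothetical and exist only for this example.

```javascript
// The regex from the answer above, stored once for reuse.
var TAG_PATTERN = /<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g;

// Hypothetical wrapper used only for this demonstration.
function stripTagsResilient(html) {
  return html.replace(TAG_PATTERN, "");
}

console.log(JSON.stringify(stripTagsResilient('Some text <img')));              // "Some text "
console.log(JSON.stringify(stripTagsResilient('Some text <img alt="x > y">'))); // "Some text "
console.log(JSON.stringify(stripTagsResilient('Some <a\nhref="http://google.com">link</a>'))); // "Some link"
```

Quoted attribute values are consumed as a unit, which is why the ">" inside alt="x > y" does not end the tag early.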

Another, admittedly less elegant, solution than nickf's or Shog9's is to recursively walk the DOM starting at the <body> tag and append each text node.

var bodyContent = document.getElementsByTagName('body')[0];
var result = appendTextNodes(bodyContent);

function appendTextNodes(element) {
    var text = '';

    // Loop through the childNodes of the passed in element
    for (var i = 0, len = element.childNodes.length; i < len; i++) {
        // Get a reference to the current child
        var node = element.childNodes[i];
        // Append the node's value if it's a text node
        if (node.nodeType == 3) {
            text += node.nodeValue;
        }
        // Recurse through the node's children, if there are any
        if (node.childNodes.length > 0) {
            text += appendTextNodes(node);
        }
    }
    // Return the final result
    return text;
}
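The recursion can be tried outside a browser with a hand-built tree of plain objects that mimic DOM nodes. Everything here is an assumption made for illustration: collectText, el and txt are hypothetical names, and the objects expose only the nodeType, nodeValue and childNodes properties that the walk reads.

```javascript
// Node type constants as defined by the DOM specification.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Concatenate the value of every text node beneath the given node.
function collectText(node) {
  var text = "";
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      text += child.nodeValue;
    } else if (child.childNodes && child.childNodes.length > 0) {
      // The result of the recursive call must be appended, not discarded.
      text += collectText(child);
    }
  }
  return text;
}

// Hypothetical stand-ins for DOM nodes.
function el(children) { return { nodeType: ELEMENT_NODE, childNodes: children }; }
function txt(value) { return { nodeType: TEXT_NODE, nodeValue: value, childNodes: [] }; }

var body = el([txt("Hello "), el([txt("world")]), txt("!")]);
console.log(collectText(body)); // Hello world!
```

In a browser the same function should accept document.body directly, since real DOM nodes expose the same three properties.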

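If you want to keep the links and the structure of the content (h1, h2, etc.), you should check out TextVersionJS. It was created to convert an HTML email to plain text, but you can use it with any HTML.

The usage is very simple. For example, in node.js: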

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Or in the browser with pure JS:

<script src="textversion.js"></script>
<script>
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
</script>

It also works with require.js:

define(["textversionjs"], function(createTextVersion) {
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
});

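After trying all of the answers mentioned, I found that most if not all of them had edge cases and could not completely support my needs.

I started exploring how PHP does it and came across the php.js lib, which replicates the strip_tags method: http://phpjs.org/functions/strip_tags/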

function stripHTML(my_string){
    var charArr   = my_string.split(''),
        resultArr = [],
        htmlZone  = 0,
        quoteZone = 0;
    for( var x=0; x < charArr.length; x++ ){
     switch( charArr[x] + htmlZone + quoteZone ){
       case "<00" : htmlZone  = 1;break;
       case ">10" : htmlZone  = 0;resultArr.push(' ');break;
       case '"10' : quoteZone = 1;break;
       case "'10" : quoteZone = 2;break;
       case '"11' : 
       case "'12" : quoteZone = 0;break;
       default    : if(!htmlZone){ resultArr.push(charArr[x]); }
     }
    }
    return resultArr.join('');
}

It accounts for > inside attributes and for <img onerror="javascript"> in newly created DOM elements.

Usage:

clean_string = stripHTML("string with <html> in it")
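The same state-machine idea can be written with named state variables instead of concatenated digits, which may be easier to follow. This is a sketch rather than the author's code: the name stripTagsStateMachine and the single quote variable are assumptions, and like the function above it emits one space in place of each tag.

```javascript
// Hypothetical rewrite of the character-by-character scanner with named states.
function stripTagsStateMachine(input) {
  var out = "";
  var inTag = false;  // true while between "<" and the closing ">"
  var quote = null;   // the quote character that opened the current attribute value
  for (var i = 0; i < input.length; i++) {
    var ch = input[i];
    if (!inTag) {
      if (ch === "<") inTag = true;
      else out += ch;
    } else if (quote) {
      // Inside a quoted attribute value: only the matching quote ends it.
      if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === ">") {
      inTag = false;
      out += " ";
    }
  }
  return out;
}

console.log(JSON.stringify(stripTagsStateMachine('<img alt="x > y" src=a>ok'))); // " ok"
```

Because nothing is ever parsed into a DOM, no markup in the input can run, whatever handlers it carries.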

Demo:

https://jsfiddle.net/gaby_de_wilde/pqayphzd/

Demo of the top answer doing the terrible things:

https://jsfiddle.net/gaby_de_wilde/6f0jymL6/1/

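A lot of people have answered this already, but I thought it might be useful to share the function I wrote, which strips HTML tags from a string but lets you pass an array of tags that you do not want stripped. It is pretty short and has been working nicely for me.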

function removeTags(string, array){
  return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
  function f(array, value){
    return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
  }
}

var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>

The simplest way is to use regular expressions, as noted above, although there is no reason to use a bunch of them. Try:

stringWithHTML = stringWithHTML.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");
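Unlike a plain /<[^>]*>/ pattern, this regex requires a letter right after "<" (or "</"), so some non-tag uses of "<" survive. The sketch below shows where that helps and where it still fails; the wrapper name stripNamedTags and the sample inputs are assumptions for illustration.

```javascript
// Hypothetical wrapper around the regex shown above.
function stripNamedTags(text) {
  return text.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");
}

// A "<" followed by a space is not treated as a tag.
console.log(stripNamedTags("1 < 2 and <b>bold</b>")); // 1 < 2 and bold
// Limitation: "<b && c>" still looks like a tag named "b".
console.log(stripNamedTags("if (a<b && c>d)"));       // if (ad)
// Limitation: comments do not start with a letter, so they are left in place.
console.log(stripNamedTags("<!-- c -->x"));           // <!-- c -->x
```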

I made some modifications to the original Jibberboy2000 script. Hope it will be useful for someone.

str = '**ANY HTML CONTENT HERE**';

str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");

Here is a version that addresses @MikeSamuel's security concern:

function strip(html)
{
   try {
       var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
       doc.documentElement.innerHTML = html;
       return doc.documentElement.textContent||doc.documentElement.innerText;
   } catch(e) {
       return "";
   }
}

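Note that it returns an empty string if the HTML markup is not valid XML (that is, tags must be closed and attributes must be quoted). This is not ideal, but it does avoid the potential security exploit.

If you need to handle markup that is not valid XML, you could try using:

var doc = document.implementation.createHTMLDocument("");

but that is not a perfect solution either, for other reasons.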

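You can safely strip HTML tags using the iframe sandbox attribute.

The idea here is that instead of trying to regex our string, we take advantage of the browser's native parser by injecting the text into a DOM element and then querying the textContent/innerText property of that element.

The best-suited element in which to inject our text is a sandboxed iframe; that way we can prevent any arbitrary code execution (also known as XSS).

The downside of this approach is that it only works in browsers.

Here is what I came up with (not battle-tested):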

const stripHtmlTags = (() => {
  const sandbox = document.createElement("iframe");
  sandbox.sandbox = "allow-same-origin"; // <--- This is the key
  sandbox.style.setProperty("display", "none", "important");

  // Inject the sandbox in the current document
  document.body.appendChild(sandbox);

  // Get the sandbox's context
  const sanboxContext = sandbox.contentWindow.document;

  return (untrustedString) => {
    if (typeof untrustedString !== "string") return ""; 

    // Write the untrusted string in the iframe's body
    sanboxContext.open();
    sanboxContext.write(untrustedString);
    sanboxContext.close();

    // Get the string without html
    return sanboxContext.body.textContent || sanboxContext.body.innerText || "";
  };
})();

Usage (demo):

console.log(stripHtmlTags(`<img onerror='alert("could run arbitrary JS here")' src='bogus'>XSS injection :)`));
console.log(stripHtmlTags(`<script>alert("awdawd");</` + `script>Script tag injection :)`));
console.log(stripHtmlTags(`<strong>I am bold text</strong>`));
console.log(stripHtmlTags(`<html>I'm a HTML tag</html>`));
console.log(stripHtmlTags(`<body>I'm a body tag</body>`));
console.log(stripHtmlTags(`<head>I'm a head tag</head>`));
console.log(stripHtmlTags(null));

With jQuery you can simply retrieve it by using:

$('#elementID').text()

The code below lets you keep some HTML tags while stripping all the others.

function strip_tags(input, allowed) {

  allowed = (((allowed || '') + '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)

  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
      commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;

  return input.replace(commentsAndPhpTags, '')
      .replace(tags, function($0, $1) {
          return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
      });
}
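A short usage sketch for the function above. The function body is repeated in condensed form only so that the snippet runs on its own, and the sample input is an assumption made for illustration.

```javascript
// Condensed copy of strip_tags from above, repeated so this snippet is self-contained.
function strip_tags(input, allowed) {
  // Normalize the allowed list to a lowercase string such as "<b><i>".
  allowed = (String(allowed || "").toLowerCase().match(/<[a-z][a-z0-9]*>/g) || []).join("");
  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
  var commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
  return input.replace(commentsAndPhpTags, "").replace(tags, function (tag, name) {
    return allowed.indexOf("<" + name.toLowerCase() + ">") > -1 ? tag : "";
  });
}

var sample = '<p>Hello <b>world</b> <i>foo</i></p><!-- note -->';
console.log(strip_tags(sample, "<b><i>")); // Hello <b>world</b> <i>foo</i>
console.log(strip_tags(sample));           // Hello world foo
```

Allowed tags are kept with all of their attributes, so an allowed tag carrying an event handler such as onclick passes through unchanged; that matters if the result is later inserted into a page.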

It is also possible to use the fantastic htmlparser2 pure JS HTML parser. Here is a working demo:

var htmlparser = require('htmlparser2');

var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';

var result = [];

var parser = new htmlparser.Parser({
    ontext: function(text){
        result.push(text);
    }
}, {decodeEntities: true});

parser.write(body);
parser.end();

result.join('');

The output will be This is a simple example.

See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html

This works in both Node and the browser if you pack your web application using a tool like webpack.

I just needed to strip out the <a> tags and replace them with the text of the link.

This seems to work great.

htmlContent= htmlContent.replace(/<a.*href="(.*?)">/g, '');
htmlContent= htmlContent.replace(/<\/a>/g, '');

I have created a working regular expression myself:

str=str.replace(/(<\?[a-z]*(\s[^>]*)?\?(>|$)|<!\[[a-z]*\[|\]\]>|<!DOCTYPE[^>]*?(>|$)|<!--[\s\S]*?(-->|$)|<[a-z?!\/]([a-z0-9_:.])*(\s[^>]*)?(>|$))/gi, '');

A simple two-line jQuery snippet to strip the HTML:

 var content = "<p>checking the html source&nbsp;</p><p>&nbsp;</p><p>with&nbsp;</p><p>all</p><p>the html&nbsp;</p><p>content</p>";

 var text = $(content).text();//It gets you the plain text
 console.log(text);//check the data in your console

 $("#text_area_id").val(text);//set your content to text area using text_area_id

The accepted answer works fine mostly, however in IE if the html string is null you get the "null" (instead of ''). Fixed:

function strip(html)
{
   if (html == null) return "";
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Using jQuery:

function stripTags(textToEscape) {
    return $('<p></p>').html(textToEscape).text()
}

The input element supports only one-line text:

The text state represents a one line plain text edit control for the element's value.

function stripHtml(str) {
  var tmp = document.createElement('input');
  tmp.value = str;
  return tmp.value;
}

Update: this works as expected

function stripHtml(str) {
  // Remove some tags
  str = str.replace(/<[^>]+>/gim, '');

  // Remove BB code
  str = str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');

  // Remove html and line breaks
  const div = document.createElement('div');
  div.innerHTML = str;

  const input = document.createElement('input');
  input.value = div.textContent || div.innerText || '';

  return input.value;
}

    (function($){
        $.html2text = function(html) {
            if($('#scratch_pad').length === 0) {
                $('<div id="scratch_pad"></div>').appendTo('body');
            }
            return $('#scratch_pad').html(html).text();
        };

    })(jQuery);

Define this as a jQuery plugin and use it as follows:

$.html2text(htmlContent);

For escaped characters, this will also work, using pattern matching:

myString.replace(/(?:&lt;|<)[\s\S]*?(?:&gt;|>)/g, '');

Reference URL: https://stackoverflow.com/questions/822452/strip-html-from-text-javascript
