How to truncate article content containing HTML in PHP

Author: 袖梨 2022-06-24

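After a blogger finishes an article, the blog back end usually shows the article title plus a truncated excerpt of the article on search or list pages, as an entry point for further reading.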

Function: mb_substr( $str, $start, $length, $encoding )

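$str: the string to truncate
$start: where truncation starts
$length: the length (note that, unlike mb_strimwidth, a value of 1 means one Chinese character)
$encoding: the encoding; I set it to utf-8

Example: truncate an article title to 15 characters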

<?php echo mb_substr('www.111com.net原创', 0, 15,"utf-8"); ?>
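To make the difference between mb_substr and mb_strimwidth concrete, here is a small sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the sample string is made up for illustration.

```php
<?php
// Illustrative sample string (not from the article): five Chinese characters.
$title = '中文字符串';

// mb_substr counts characters: a length of 2 keeps two Chinese characters.
$byChars = mb_substr($title, 0, 2, 'UTF-8');

// mb_strimwidth counts display width: each CJK character has width 2,
// so a width of 4 also keeps two characters.
$byWidth = mb_strimwidth($title, 0, 4, '', 'UTF-8');

echo $byChars, "\n"; // 中文
echo $byWidth, "\n"; // 中文
```

In other words, to keep the same number of Chinese characters, the width passed to mb_strimwidth has to be twice the length passed to mb_substr.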

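That works for plain text, but my content has HTML tags in the middle, which is where the trouble starts: how do you truncate an article? Note that the article is not just an ordinary text string; it contains all kinds of formatting tags and style content. If it is handled carelessly, tags can be left unclosed, which breaks the whole document flow.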
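The problem is easy to reproduce. The following sketch uses a made-up HTML fragment (an assumption, not taken from the article) and shows mb_substr cutting in the middle of a tag:

```php
<?php
// Illustrative HTML fragment, assumed for this example only.
$html = '<p>Hello <strong>world</strong></p>';

// mb_substr has no notion of markup: it keeps the first 14 characters,
// which ends in the middle of the opening <strong> tag.
$cut = mb_substr($html, 0, 14, 'UTF-8');

echo $cut, "\n"; // <p>Hello <stro
```

The result contains an unclosed <p> and a half-written <strong>, which is exactly the kind of output that breaks the surrounding page.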

For plain text alone, the following function is more or less sufficient.

<?php
/**
 * Truncate a string, with support for Chinese and other multi-byte encodings.
 *
 * @param string $str     String to truncate
 * @param int    $start   Start position, in characters
 * @param int    $length  Number of characters to keep
 * @param string $charset Character encoding
 * @param string $suffix  Suffix appended to the truncated string
 * @return string
 */
function substr_ext($str, $start, $length, $charset = "utf-8", $suffix = "")
{
    if (function_exists("mb_substr")) {
        return mb_substr($str, $start, $length, $charset) . $suffix;
    } elseif (function_exists('iconv_substr')) {
        return iconv_substr($str, $start, $length, $charset) . $suffix;
    }
    // Fallback: split the string into characters with a per-encoding byte pattern.
    $re['utf-8'] = '/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/';
    $re['gb2312'] = '/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/';
    $re['gbk'] = '/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/';
    $re['big5'] = '/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/';
    preg_match_all($re[strtolower($charset)], $str, $match);
    $slice = join("", array_slice($match[0], $start, $length));
    return $slice . $suffix;
}
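The regular-expression fallback works by matching one character at a time and then slicing the resulting array. Here is a minimal sketch of that idea for UTF-8 only; the sample string and variable names are assumptions for illustration, and the four-byte range is narrowed to the lead bytes UTF-8 actually uses.

```php
<?php
// One alternative per UTF-8 sequence length: 1, 2, 3 and 4 bytes.
$utf8Char = '/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}/';

// Illustrative input: two ASCII characters followed by two Chinese characters.
preg_match_all($utf8Char, 'ab中文', $match);
$chars = $match[0];                            // one array element per character

// Slicing the array is the character-based equivalent of substr().
$slice = implode('', array_slice($chars, 1, 2));

echo count($chars), "\n"; // 4
echo $slice, "\n";        // b中
```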

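However, if what needs truncating is a piece of formatted text from a web page, the function above is not enough. It has no ability to handle formatting tags.

What is needed is a new function, an upgraded version of the one above, that handles tags correctly. Here is one I found: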

The strip_tags() function strips HTML, XML, and PHP tags.

Example 1

<?php echo strip_tags("Hello <b>world!</b>"); ?>

Output:

Hello world!
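strip_tags also accepts an optional second argument listing tags that should be kept. A minimal sketch with an illustrative fragment (the input string is an assumption, not from the article):

```php
<?php
// Illustrative input, assumed for this example only.
$html = '<p>Hello <b>world!</b></p>';

// With one argument, every tag is removed.
$plain = strip_tags($html);

// With an allow-list, the named tags survive and everything else is removed.
$keepBold = strip_tags($html, '<b>');

echo $plain, "\n";    // Hello world!
echo $keepBold, "\n"; // Hello <b>world!</b>
```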

That makes things easy: we just build on the function above as follows

<?php
$a = strip_tags("Hello <b>world!</b>");
echo substr_ext($a, 0, 10);
?>

However, the HTML is gone after this, so it is not a good solution either.
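What a tag-aware truncation has to do can be sketched in a few lines: count only the visible text, copy tags through untouched, and close whatever is still open at the cut. The helper below, truncate_html_sketch, is a hypothetical name and a simplified illustration, not the function discussed in this article. It assumes well-nested input, the mbstring extension, and UTF-8 text, and it counts an entity such as &amp;amp; as several characters.

```php
<?php
/**
 * Hypothetical helper: truncate HTML to $limit visible characters,
 * keeping tags intact and closing any that are still open.
 */
function truncate_html_sketch($html, $limit, $suffix = '...')
{
    // Split into tags and text runs, keeping the tags in the result.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $void = array('br', 'hr', 'img', 'input', 'meta', 'link'); // tags that never need closing
    $open = array(); // stack of currently open tag names
    $out = '';
    $count = 0;      // visible characters emitted so far

    foreach ($parts as $part) {
        if (preg_match('/^<[^>]*>$/', $part)) {
            // Markup is copied through and never counted.
            $out .= $part;
            if (preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/', $part, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    // Closing tag: pop it if it matches the innermost open tag.
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif (!in_array($name, $void, true) && substr($part, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            continue;
        }
        // Text run: keep it whole if it fits, otherwise cut it and stop.
        $len = mb_strlen($part, 'UTF-8');
        if ($count + $len > $limit) {
            $out .= mb_substr($part, 0, $limit - $count, 'UTF-8') . $suffix;
            break;
        }
        $out .= $part;
        $count += $len;
    }
    // Close the tags that are still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncate_html_sketch('<p>Hello <strong>world</strong>, nice day</p>', 8), "\n";
// <p>Hello <strong>wo...</strong></p>
```

A regular expression is not a real HTML parser, so a production version would more likely rely on an HTML parser or on tidy; the sketch only shows the bookkeeping involved.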

Searching Google further, I found that cns had written a string-truncation function that supports HTML

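/**
 * Get the position of the Nth occurrence of a substring in a string.
 * Returns the length of the string if there are fewer than N occurrences.
 * @param string $text The string to search
 * @param string $key  The substring to look for
 * @param int    $int  N
 * @return int Byte offset
 */
function strpos_int($text, $key, $int)
{
    $keylen = strlen($key);
    $offset = 0;
    $pos = false;
    // Walk forward one occurrence at a time until the Nth one is reached.
    // A loop avoids the global/static state of a recursive version, which
    // would leak between calls, and handles a match at position 0 correctly.
    for ($i = 0; $i < $int; $i++) {
        $pos = strpos($text, $key, $offset);
        if ($pos === false) {
            // Fewer than N occurrences: fall back to the end of the string.
            return strlen($text);
        }
        $offset = $pos + $keylen;
    }
    return $pos === false ? strlen($text) : $pos;
}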
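Because strpos_int is built on strpos and strlen, the position it returns is a byte offset, not a character offset. The difference matters for Chinese text, as this small sketch shows (sample string assumed for illustration, source file saved as UTF-8, mbstring loaded):

```php
<?php
// Illustrative string: two Chinese characters followed by one ASCII letter.
$text = '中文a';

// strpos counts bytes: each Chinese character takes 3 bytes in UTF-8.
$bytePos = strpos($text, 'a');

// mb_strpos counts characters.
$charPos = mb_strpos($text, 'a', 0, 'UTF-8');

// A byte offset pairs with substr(); a character offset pairs with mb_substr().
$byBytes = substr($text, 0, $bytePos);
$byChars = mb_substr($text, 0, $charPos, 'UTF-8');

echo $bytePos, ' ', $charPos, "\n"; // 6 2
echo $byBytes, ' ', $byChars, "\n"; // 中文 中文
```

Mixing the two, for example passing a byte offset to mb_substr, cuts at the wrong place as soon as the text contains multi-byte characters.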

/**
 * Truncate HTML
 * @param string $string HTML string
 * @param int    $length Number of characters to keep
 * @param string $dot    Ellipsis appended after the cut
 * @param string $append Extra text appended after the ellipsis
 * @return string
 */
function cuthtml($string, $length, $dot = ' ...', $append = "")
{
    $str = strip_tags($string); // strip the tags first
    $new_str = iconv_substr($str, 0, $length, 'utf-8');
    $last = iconv_substr($new_str, -1, 1, 'utf-8');
    $sc = substr_count($new_str, $last);
    $position = strpos_int($string, $last, $sc); // byte offset of the real cut point in the HTML
    if (function_exists('tidy_parse_string')) { // if tidy is enabled on the server, let it complete the HTML
        $options = array("show-body-only" => true);
        // $position is a byte offset, so cut with substr() and include the last kept character.
        return tidy_parse_string(substr($string, 0, $position + strlen($last)) . $dot . $append, $options, 'UTF8');
    } else { // tidy is not enabled
        if (strlen($string) <= $length) {
            return $string;
        }

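        $pre = chr(1);
        $end = chr(1);
        // Temporarily replace HTML entities with marker-wrapped single characters so that an entity is not cut in half.
        $string = str_replace(array('&amp;', '&quot;', '&lt;', '&gt;'), array($pre . '&' . $end, $pre . '"' . $end, $pre . '<' . $end, $pre . '>' . $end), $string);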

$strcut = '';

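        // $n: byte index, $tn: byte length of the current character,
        // $noc: width counted so far (ASCII counts 1, multi-byte characters count 2).
        $n = $tn = $noc = 0;
        while ($n < strlen($string)) {

            $t = ord($string[$n]);
            if ($t == 9 || $t == 10 || (32 <= $t && $t <= 126)) {
                $tn = 1;
                $n++;
                $noc++;
            } elseif (194 <= $t && $t <= 223) {
                $tn = 2;
                $n += 2;
                $noc += 2;
            } elseif (224 <= $t && $t <= 239) {
                $tn = 3;
                $n += 3;
                $noc += 2;
            } elseif (240 <= $t && $t <= 247) {
                $tn = 4;
                $n += 4;
                $noc += 2;
            } elseif (248 <= $t && $t <= 251) {
                $tn = 5;
                $n += 5;
                $noc += 2;
            } elseif ($t == 252 || $t == 253) {
                $tn = 6;
                $n += 6;
                $noc += 2;
            } else {
                $n++;
            }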