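PHP examples: extracting specific page content with regular expressions

Author: 袖梨, 2022-06-25

The example code below is often useful for web scraping.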


1. Get the page title

// Extract the title
preg_match('/<title[^>]*>(?<title>.*?)<\/title>/is', $html, $titleArr);
$title = isset($titleArr['title']) ? $titleArr['title'] : null;

2. Get the body content and rewrite image and background-image paths to another URL root

/**
 * Get the content of the BODY region.
 * @param $html
 * @param $urlRoot
 * @return mixed
 */
function getBody($html, $urlRoot = null){
    // Extract the body, preferring explicit <!--body--> markers
    preg_match('/<!--body-->(.*?)<!--body-->/is', $html, $bodyArr);
    if(!$bodyArr){
        preg_match('/<body.*?>(.*?)<\/body>/is', $html, $bodyArr);
    }
    // Fall back to an empty string when no body is found
    $body = isset($bodyArr[1]) ? $bodyArr[1] : '';
    // Rewrite <img> sources that point into the img folder
    $body = preg_replace('/(<img\b[^>]*?\bsrc=[\'"])(\.\.\/)*(img\/[^\'"]+)/i', '${1}' . $urlRoot . '${3}', $body);
    // Rewrite CSS background images inside the HTML
    $body = preg_replace('~\b(background(-image)?\s*:([^;{}]*?)\(\s*[\'"]?)(\.\.\/)*(img\/.*?)\s*\)~i', '${1}' . $urlRoot . '${5})', $body);
    return $body;
}

3. Extract the page description

function getDescription($html){
    // Get the 'content' attribute value in a <meta name="description" ... />
    $matches = array();

    // Search for <meta name="description" content="Buy my stuff" />
    preg_match('/<meta.*?name=("|\')description("|\').*?content=("|\')(.*?)("|\')/i', $html, $matches);
    if (count($matches) > 4) {
        return trim($matches[4]);
    }

    // Order of attributes could be swapped around: <meta content="Buy my stuff" name="description" />
    preg_match('/<meta.*?content=("|\')(.*?)("|\').*?name=("|\')description("|\')/i', $html, $matches);
    if (count($matches) > 2) {
        return trim($matches[2]);
    }

    // No match
    return null;
}

4. Replace background-image paths in a CSS file

/**
 * Get the CSS content.
 * @param $cssCnt
 * @param $urlRoot
 * @return mixed
 */
function getCss($cssCnt, $urlRoot = null){
    // Match images referenced by a relative path that contains the img folder (absolute paths are not matched).
    // The replacement is not necessarily exact: it only rewrites addresses containing ../ to the URL root
    // and ignores the nesting level implied by ../../ and the like.
    $css = preg_replace('~\b(background(-image)?\s*:([^;{}]*?)\(\s*[\'"]?)(\.\.\/)*(img\/.*?)\s*\)~i', '${1}' . $urlRoot . '${5})', $cssCnt);
    // Add the CSS prefix (applied to $css so the path rewrite above is kept)
    $css = preg_replace('/\b.(.*?)[,|{]/', 'pat .$0', $css);
    // TODO: minify the CSS
    return $css;
}

As the examples above show, the approach is simple: find regularly structured tags to serve as the start and end nodes, then capture whatever lies between them, which is the content we want to extract. The only extra work is handling stray characters or whitespace in between.
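
The pattern shared by these examples, capturing whatever sits between a known start marker and a known end marker, can be sketched as one small helper. This is a minimal illustration rather than code from the article: the function name `extractBetween` and the sample `$html` string are assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper (not part of the original article): return the text
 * between the first $start marker and the next $end marker, or null when
 * the markers are not found.
 */
function extractBetween($html, $start, $end)
{
    // preg_quote() escapes regex metacharacters in the literal markers;
    // "~" is passed as the delimiter so it is escaped as well.
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~is';

    if (preg_match($pattern, $html, $m)) {
        // Trim the whitespace that often surrounds the captured content.
        return trim($m[1]);
    }
    return null;
}

// Sample input, invented for this sketch.
$html = "<html><head><title>\n  Demo page </title></head><body><!--body--><p>Hi</p><!--body--></body></html>";

var_dump(extractBetween($html, '<title>', '</title>'));        // string(9) "Demo page"
var_dump(extractBetween($html, '<!--body-->', '<!--body-->')); // string(9) "<p>Hi</p>"
var_dump(extractBetween($html, '<h1>', '</h1>'));              // NULL
```

The `s` modifier lets `.` match newlines so content spanning several lines is captured, and the lazy `(.*?)` stops at the first end marker instead of the last. Regular expressions remain a rough tool for HTML; for nested or irregular markup a real parser is usually the safer choice.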