Implementing Duplicate Detection in PHP

2026-03-13 04:41:39 | PHP

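Methods for Text Duplicate Detection in PHP

Text duplicate detection in PHP can be approached in several ways. The most common ones are described below.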

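Duplicate detection based on string similarity

PHP's built-in string functions can compute how similar two texts are:

$text1 = "This is the first text to compare";
$text2 = "This is the second text to compare";

// similar_text() writes the similarity percentage into $percent (passed by reference).
// It compares bytes, so for multibyte UTF-8 text the result reflects byte overlap, not character overlap.
similar_text($text1, $text2, $percent);
echo "Similarity: ".$percent."%";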
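
Because similar_text() works on bytes, a character-level measure can be more meaningful for UTF-8 text such as Chinese. The following is a minimal sketch, not part of the original method: the helper names utf8Chars, mbLevenshtein and editSimilarity are hypothetical, and the similarity is defined here as 1 minus the edit distance divided by the longer length.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8Chars(string $text): array {
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Character-level Levenshtein distance (row-by-row dynamic programming).
function mbLevenshtein(string $a, string $b): int {
    $s = utf8Chars($a);
    $t = utf8Chars($b);
    $prev = range(0, count($t));
    foreach ($s as $i => $sc) {
        $curr = [$i + 1];
        foreach ($t as $j => $tc) {
            $cost = ($sc === $tc) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1, // deletion
                $curr[$j] + 1,     // insertion
                $prev[$j] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[count($t)];
}

// Similarity as a percentage: 100 means identical, 0 means nothing in common.
function editSimilarity(string $a, string $b): float {
    $maxLen = max(count(utf8Chars($a)), count(utf8Chars($b)));
    if ($maxLen === 0) {
        return 100.0;
    }
    return (1 - mbLevenshtein($a, $b) / $maxLen) * 100;
}

// Two 11-character strings that differ in one character.
echo round(editSimilarity("这是要比较的第一段文本", "这是要比较的第二段文本"), 1); // prints 90.9
```

The dynamic programming table costs time proportional to the product of the two lengths, so this sketch suits short texts only.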

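Duplicate detection based on hash values

Generate a hash for each text and compare the hashes. This only detects texts that are identical after normalization: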

function textHash($text) {
    $text = preg_replace('/\s+/', '', $text); // strip all whitespace
    return md5($text);
}

$hash1 = textHash($text1);
$hash2 = textHash($text2);

if ($hash1 === $hash2) {
    echo "The texts are identical";
}
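
The same idea extends from a pair of texts to a whole batch: index every text by its hash and report the hashes that occur more than once. The sketch below is illustrative; groupExactDuplicates is a hypothetical name, and the input is assumed to be an array mapping an ID to its text.

```php
<?php
// Group the IDs of texts that are identical once whitespace is removed.
// $texts is assumed to map an ID (array key) to the text content.
function groupExactDuplicates(array $texts): array {
    $groups = [];
    foreach ($texts as $id => $text) {
        $key = md5(preg_replace('/\s+/', '', $text));
        $groups[$key][] = $id;
    }
    // Keep only hashes shared by at least two texts.
    return array_values(array_filter($groups, fn($ids) => count($ids) > 1));
}

$duplicates = groupExactDuplicates([
    'a' => "hello world",
    'b' => "hello   world\n",
    'c' => "something else",
]);
print_r($duplicates); // one group containing 'a' and 'b'
```

Each text is hashed once, so the cost grows linearly with the number of texts instead of with the number of pairs.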

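Duplicate detection based on the SimHash algorithm

SimHash is well suited to near-duplicate detection on large texts. Similar texts produce fingerprints that differ in only a few bits, so two texts can be compared by the Hamming distance between their fingerprints: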

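function simHash($text) {
    // Split on whitespace; text without spaces (e.g. Chinese) needs a real tokenizer.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $weights = array_fill(0, 64, 0);

    foreach ($words as $word) {
        // crc32() yields only 32 bits, which would leave the upper 32 positions constant.
        // Use the first 8 bytes (64 bits) of the raw MD5 digest instead.
        $wordHash = substr(md5($word, true), 0, 8);
        for ($i = 0; $i < 64; $i++) {
            $bit = (ord($wordHash[$i >> 3]) >> ($i & 7)) & 1;
            $weights[$i] += ($bit ? 1 : -1);
        }
    }

    // A fingerprint bit is 1 when the summed weight at that position is positive.
    $simhash = '';
    foreach ($weights as $weight) {
        $simhash .= ($weight > 0 ? '1' : '0');
    }

    return $simhash; // 64-character string of '0' and '1'
}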

function hammingDistance($hash1, $hash2) {
    $distance = 0;
    for ($i = 0; $i < strlen($hash1); $i++) {
        if ($hash1[$i] != $hash2[$i]) {
            $distance++;
        }
    }
    return $distance;
}
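
The simHash function above splits on whitespace, which produces a single token for text written without spaces, such as Chinese. One common workaround is to use overlapping character n-grams as tokens. The following sketch assumes this approach; ngramTokens is a hypothetical helper, and a dedicated word segmenter would usually give better features.

```php
<?php
// Split text into overlapping n-character tokens (character shingles).
// Whitespace is removed first; texts shorter than $n become a single token.
function ngramTokens(string $text, int $n = 2): array {
    $clean = (string) preg_replace('/\s+/', '', $text);
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    if ($count === 0) {
        return [];
    }
    if ($count < $n) {
        return [implode('', $chars)];
    }
    $tokens = [];
    for ($i = 0; $i + $n <= $count; $i++) {
        $tokens[] = implode('', array_slice($chars, $i, $n));
    }
    return $tokens;
}

print_r(ngramTokens("查重算法")); // 查重, 重算, 算法
```

These tokens could be fed to the hashing loop in place of the whitespace-separated words.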

Database-backed duplicate detection

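Store the text fingerprints in a database to check texts in bulk. The queries below assume MySQL and a content_features table whose simhash column is BIGINT UNSIGNED:

// Store a text's fingerprint
function storeTextFeature($pdo, $text, $contentId) {
    $hash = simHash($text); // 64-character bit string
    // CONV() turns the bit string into a decimal value that fits BIGINT UNSIGNED.
    $stmt = $pdo->prepare("INSERT INTO content_features (content_id, simhash) VALUES (?, CAST(CONV(?, 2, 10) AS UNSIGNED))");
    $stmt->execute([$contentId, $hash]);
}

// Find texts whose fingerprint is within $threshold bits of the given text
function findSimilarTexts($pdo, $text, $threshold = 3) {
    $hash = simHash($text);
    // BIT_COUNT of the XOR is the Hamming distance between the two fingerprints.
    $stmt = $pdo->prepare("SELECT content_id FROM content_features WHERE BIT_COUNT(simhash ^ CAST(CONV(?, 2, 10) AS UNSIGNED)) <= ?");
    $stmt->execute([$hash, $threshold]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}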
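
The similarity query above evaluates BIT_COUNT for every row, so it cannot use an index. A common way to narrow the candidates is to cut the 64-bit fingerprint into 4 segments: if two fingerprints differ in at most 3 bits, at least one segment must be identical, so an indexed exact match on the segments finds every candidate. The sketch below only derives the segment keys; simhashBands is a hypothetical helper, and storing the keys in an indexed column or table is left as an assumption.

```php
<?php
// Cut a bit-string fingerprint into $bands equal segments.
// Each key is prefixed with its segment position so that segments
// at different positions never match each other.
// Assumes strlen($bits) is divisible by $bands.
function simhashBands(string $bits, int $bands = 4): array {
    $width = intdiv(strlen($bits), $bands);
    $keys = [];
    for ($i = 0; $i < $bands; $i++) {
        $keys[] = $i . ':' . substr($bits, $i * $width, $width);
    }
    return $keys;
}

// Two fingerprints that differ in 3 bits still share one segment key.
$a = str_repeat('0', 64);
$b = $a;
$b[0] = '1';
$b[20] = '1';
$b[40] = '1';
print_r(array_intersect(simhashBands($a), simhashBands($b)));
```

Candidates found through a shared segment key still need the exact Hamming distance check, because sharing one segment does not guarantee a small overall distance.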

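Tips for improving duplicate detection

Preprocessing the text before comparison improves accuracy, because differences in case, punctuation and spacing no longer count as differences:

function preprocessText($text) {
    $text = mb_strtolower($text); // convert to lowercase
    $text = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text); // remove punctuation, keep letters, digits and whitespace
    $text = preg_replace('/\s+/', ' ', $text); // collapse runs of whitespace
    return trim($text);
}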

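Cache computed fingerprints so that the same text is not hashed repeatedly:

// $cache is expected to provide has(), get() and set(), as PSR-16 caches do.
function cachedSimHash($text, $cache) {
    $key = 'simhash_'.md5($text);
    if ($cache->has($key)) {
        return $cache->get($key);
    }
    $hash = simHash($text);
    $cache->set($key, $hash, 3600); // keep for one hour
    return $hash;
}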
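
To try cachedSimHash without a cache server, any object with has, get and set methods will do. The class below is a minimal in-memory stand-in; ArrayCache is a hypothetical name, it does not implement the full PSR-16 interface, and its contents last only for the current request.

```php
<?php
// Minimal in-memory cache with the three methods cachedSimHash() calls.
final class ArrayCache {
    private array $items = [];

    public function has(string $key): bool {
        return isset($this->items[$key]) && $this->items[$key]['expires'] > time();
    }

    public function get(string $key, $default = null) {
        return $this->has($key) ? $this->items[$key]['value'] : $default;
    }

    // $ttl is the lifetime in seconds.
    public function set(string $key, $value, int $ttl = 3600): bool {
        $this->items[$key] = ['value' => $value, 'expires' => time() + $ttl];
        return true;
    }
}

$cache = new ArrayCache();
$cache->set('simhash_example', '0101', 60);
var_dump($cache->has('simhash_example')); // bool(true)
```

In production, a shared cache would let separate requests reuse the same fingerprints.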