
Implementing duplicate detection in PHP

2026-02-15 05:42:51 PHP

Methods for detecting duplicate text in PHP

Duplicate-text detection can be implemented in several ways. The following are some common approaches in PHP:

Duplicate detection based on string similarity

Compute the similarity of two texts with PHP's built-in similar_text() function:

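// Note: similar_text() compares bytes, so multibyte text (e.g. UTF-8 Chinese)
// is scored per byte rather than per character.
$text1 = "This is the first text to compare";
$text2 = "This is the second text to compare";

// The third argument receives the similarity as a percentage (0-100).
similar_text($text1, $text2, $percent);
echo "Similarity: " . round($percent, 2) . "%";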
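Because similar_text() is sensitive to case, punctuation and spacing, trivially reformatted copies can score lower than expected. One possible mitigation is to normalize both texts before comparing them. The sketch below assumes the mbstring extension is available; normalizedSimilarity() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper (not a PHP built-in): normalize two texts, then
// score them with similar_text().
function normalizedSimilarity(string $a, string $b): float
{
    $normalize = function (string $text): string {
        // Lowercase in a multibyte-safe way.
        $text = mb_strtolower($text, 'UTF-8');
        // Remove Unicode punctuation and all whitespace.
        return preg_replace('/[\p{P}\s]+/u', '', $text) ?? '';
    };

    similar_text($normalize($a), $normalize($b), $percent);
    return (float) $percent;
}

printf("%.2f%%\n", normalizedSimilarity("Hello, World!", "hello world"));
```

Whether to strip punctuation at all depends on the material being checked; for source code or formulas it may carry meaning and should probably be kept.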

Duplicate detection based on SimHash

SimHash is well suited to detecting near-duplicates among large texts:

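function simhash($text) {
    // Split on whitespace and drop empty tokens. Text without spaces
    // (e.g. Chinese) needs a word segmenter or n-grams instead.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $hash = array_fill(0, 64, 0);

    foreach ($tokens as $token) {
        $tokenHash = hash('md5', $token);
        // Expand the first 16 hex digits of the MD5 into 64 bits.
        $binary = '';
        for ($i = 0; $i < 16; $i++) {
            $binary .= str_pad(decbin(hexdec($tokenHash[$i])), 4, '0', STR_PAD_LEFT);
        }

        // Each token votes +1 or -1 on every bit position.
        for ($i = 0; $i < 64; $i++) {
            $hash[$i] += ($binary[$i] == '1') ? 1 : -1;
        }
    }

    // A fingerprint bit is 1 where the votes are positive.
    $simhash = '';
    foreach ($hash as $bit) {
        $simhash .= ($bit > 0) ? '1' : '0';
    }

    return $simhash;
}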

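// Count the bit positions at which two 64-bit fingerprints differ.
// A small distance suggests the source texts are near-duplicates.
function hammingDistance($hash1, $hash2) {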
    $distance = 0;
    for($i=0; $i<64; $i++) {
        if($hash1[$i] != $hash2[$i]) {
            $distance++;
        }
    }
    return $distance;
}
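Two practical gaps remain in the functions above: whitespace tokenization yields a single token for text without spaces, and the Hamming distance still has to be turned into a decision. The following sketch shows one way to handle both, assuming the mbstring extension is available. charNgrams() and isNearDuplicate() are hypothetical names, and the default threshold of 3 bits is only a commonly used starting point for 64-bit fingerprints that should be tuned on real data.

```php
<?php
// Hypothetical helper: overlapping character n-grams, so that text without
// spaces (e.g. Chinese) still produces many tokens.
function charNgrams(string $text, int $n = 2): array
{
    $text = preg_replace('/\s+/u', '', $text) ?? '';
    $length = mb_strlen($text, 'UTF-8');
    if ($length === 0) {
        return [];
    }
    if ($length <= $n) {
        return [$text];
    }
    $grams = [];
    for ($i = 0; $i <= $length - $n; $i++) {
        $grams[] = mb_substr($text, $i, $n, 'UTF-8');
    }
    return $grams;
}

// Hypothetical helper: decide whether two fingerprints (bit strings of equal
// length, such as the 64-character output of simhash()) are near-duplicates.
function isNearDuplicate(string $hash1, string $hash2, int $threshold = 3): bool
{
    if (strlen($hash1) !== strlen($hash2)) {
        throw new InvalidArgumentException('Fingerprints must have equal length');
    }
    $distance = 0;
    for ($i = 0, $len = strlen($hash1); $i < $len; $i++) {
        if ($hash1[$i] !== $hash2[$i]) {
            $distance++;
        }
    }
    return $distance <= $threshold;
}

print_r(charNgrams('查重算法'));
```

The n-grams can be joined with spaces, for example simhash(implode(' ', charNgrams($text))), so that the whitespace-based tokenizer in simhash() sees one token per n-gram.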

Duplicate detection based on MySQL full-text indexes

For text stored in a database:

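-- Create a table with a full-text index
-- (for Chinese text, declare the index as FULLTEXT(content) WITH PARSER ngram)
CREATE TABLE documents (
    id INT AUTO_INCREMENT PRIMARY KEY,
    content TEXT,
    FULLTEXT(content)
) ENGINE=InnoDB;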

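// Query similar documents from PHP ($searchText holds the text to check)
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
// Use two distinct placeholders: a named placeholder cannot be reused
// when emulated prepares are disabled.
$stmt = $pdo->prepare("SELECT id, MATCH(content) AGAINST(:search1) AS score
                      FROM documents
                      WHERE MATCH(content) AGAINST(:search2)
                      ORDER BY score DESC LIMIT 10");
$stmt->execute([':search1' => $searchText, ':search2' => $searchText]);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);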
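The score returned by MATCH ... AGAINST is a relevance value, not a similarity percentage, so it is usually better treated as a candidate filter. One possible second step, sketched below, re-scores the candidates with similar_text(). The sketch assumes each row also carries the content column (the query above selects only id and score); rerankCandidates() and the 80% cutoff are illustrative assumptions.

```php
<?php
// Hypothetical second pass: re-score full-text candidates with similar_text().
// Assumes each row is an array with 'id' and 'content' keys.
function rerankCandidates(array $rows, string $searchText, float $minPercent = 80.0): array
{
    $matches = [];
    foreach ($rows as $row) {
        similar_text($searchText, $row['content'], $percent);
        if ($percent >= $minPercent) {
            $matches[] = ['id' => $row['id'], 'percent' => (float) $percent];
        }
    }
    // Highest similarity first.
    usort($matches, fn ($a, $b) => $b['percent'] <=> $a['percent']);
    return $matches;
}

// Sample rows standing in for a database result set.
$rows = [
    ['id' => 1, 'content' => 'hello world'],
    ['id' => 2, 'content' => 'goodbye'],
];
print_r(rerankCandidates($rows, 'hello world'));
```

Limiting the expensive pairwise comparison to the top few full-text hits keeps the cost roughly constant as the table grows.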

Duplicate detection based on TF-IDF

Term frequency (TF) and inverse document frequency (IDF) must be computed first:

function calculateTfIdf($documents) {
    $tf = [];
    $df = [];
    $idf = [];
    $tfidf = [];

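    // Compute TF (term frequency)
    foreach($documents as $docId => $document) {
        // Drop empty tokens, and start from an empty array so that a
        // blank document cannot cause a division by zero below.
        $words = preg_split('/\s+/', $document, -1, PREG_SPLIT_NO_EMPTY);
        $tf[$docId] = [];
        $wordCount = count($words);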
        foreach($words as $word) {
            if(!isset($tf[$docId][$word])) {
                $tf[$docId][$word] = 0;
            }
            $tf[$docId][$word]++;
        }
        // Normalize by the document's word count
        foreach($tf[$docId] as $word => $count) {
            $tf[$docId][$word] = $count / $wordCount;
        }
    }

    // Compute DF (number of documents containing each word)
    foreach($tf as $docId => $words) {
        foreach($words as $word => $count) {
            if(!isset($df[$word])) {
                $df[$word] = 0;
            }
            $df[$word]++;
        }
    }

    // Compute IDF (inverse document frequency)
    $totalDocs = count($documents);
    foreach($df as $word => $count) {
        $idf[$word] = log($totalDocs / $count);
    }

    // Compute TF-IDF
    foreach($tf as $docId => $words) {
        foreach($words as $word => $tfValue) {
            $tfidf[$docId][$word] = $tfValue * $idf[$word];
        }
    }

    return $tfidf;
}
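calculateTfIdf() returns one weighted term vector per document; to turn these into a duplicate score, the vectors still need to be compared. Cosine similarity is a common choice for this. A minimal sketch, with cosineSimilarity() as a hypothetical helper name:

```php
<?php
// Hypothetical helper: cosine similarity of two sparse vectors given as
// [term => weight] arrays. Returns a value in [0, 1] for non-negative weights.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $normA = sqrt(array_sum(array_map(fn ($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn ($w) => $w * $w, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // An all-zero vector has no direction to compare.
    }
    return $dot / ($normA * $normB);
}

// Hand-written vectors for illustration only.
$doc1 = ['php' => 0.5, 'text' => 0.2];
$doc2 = ['php' => 0.5, 'hash' => 0.3];
printf("%.4f\n", cosineSimilarity($doc1, $doc2));
```

Note that with idf = log(N / df), a term that occurs in every document gets weight 0. In a very small corpus, two identical documents can therefore end up with all-zero vectors, so a smoothed IDF variant may be preferable there.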

Practical recommendations