Duplicate Detection in PHP
Methods for Text Duplicate Detection in PHP
Similarity-based algorithms
PHP's built-in similar_text() and levenshtein() functions can measure how similar two texts are:
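$text1 = "This is the original text content";
$text2 = "This is the modified text content";
// Similarity percentage (similar_text() writes it into $percent by reference)
similar_text($text1, $text2, $percent);
echo "Similarity: " . round($percent, 2) . "%\n";
// Edit distance. Both functions compare bytes, not characters, so multibyte
// UTF-8 text (such as Chinese) is scored per byte; before PHP 8.0,
// levenshtein() also returned -1 for strings longer than 255 bytes.
$distance = levenshtein($text1, $text2);
$maxLen = max(strlen($text1), strlen($text2));
// Guard against division by zero when both strings are empty
$similarity = $maxLen > 0 ? (1 - $distance / $maxLen) * 100 : 100;
echo "Similarity: " . round($similarity, 2) . "%\n";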
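Because the built-in functions work on bytes, a character-level edit distance can be more meaningful for UTF-8 text. The following is a minimal sketch, assuming valid UTF-8 input; mbLevenshtein() and mbSimilarityPercent() are hypothetical helper names, not PHP built-ins.

```php
<?php
// Hypothetical helper: Levenshtein distance counted in Unicode characters
// rather than bytes. Assumes both inputs are valid UTF-8.
function mbLevenshtein(string $a, string $b): int
{
    // The /u flag makes PCRE treat the input as UTF-8, so an empty pattern
    // splits the string into individual characters.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($s);
    $m = count($t);
    if ($n === 0) {
        return $m;
    }
    if ($m === 0) {
        return $n;
    }

    // Classic dynamic programming, keeping only two rows of the table.
    $prev = range(0, $m);
    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[$m];
}

// Hypothetical helper: similarity as a percentage of the longer text's length.
function mbSimilarityPercent(string $a, string $b): float
{
    $maxLen = max(
        count(preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY)),
        count(preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY))
    );
    return $maxLen === 0 ? 100.0 : (1 - mbLevenshtein($a, $b) / $maxLen) * 100;
}

// Two substitutions plus two insertions: prints 4
echo mbLevenshtein("这是原始文本内容", "这是修改后的文本内容"), "\n";
```

The running time grows with the product of the two text lengths, so this sketch is best suited to short texts.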
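Hash-fingerprint algorithms
The SimHash algorithm generates a fixed-length fingerprint for each text; similar texts produce fingerprints that differ in only a few bits: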
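function simHash($text) {
    // Split on whitespace and drop empty tokens; text without spaces
    // (such as Chinese) must be segmented into words or n-grams first
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $hash = array_fill(0, 64, 0);
    foreach ($tokens as $token) {
        // Raw binary digest; only its first 64 bits are used
        $binary = hash('sha256', $token, true);
        for ($i = 0; $i < 64; $i++) {
            // ord() converts the byte to an integer before bit shifting
            $bit = (ord($binary[$i >> 3]) >> (7 - ($i % 8))) & 1;
            $hash[$i] += $bit ? 1 : -1;
        }
    }
    // A bit is set when more tokens voted 1 than 0 at that position
    $fingerprint = '';
    foreach ($hash as $bit) {
        $fingerprint .= $bit > 0 ? '1' : '0';
    }
    return $fingerprint;
}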
// Count the positions at which two equal-length fingerprints differ
function hammingDistance($hash1, $hash2) {
    $distance = 0;
    $length = strlen($hash1);
    for ($i = 0; $i < $length; $i++) {
        if ($hash1[$i] != $hash2[$i]) {
            $distance++;
        }
    }
    return $distance;
}
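As a usage sketch, the fingerprints of two texts can be compared against a distance threshold. Whitespace splitting yields a single token for Chinese text, so this example tokenizes into overlapping character bigrams instead. The helpers charNgrams(), simHashTokens(), and hammingBits() are hypothetical names introduced here, and the 3-bit threshold is an assumed starting point that would need tuning on real data.

```php
<?php
// Hypothetical helper: overlapping character n-grams, usable as tokens for
// text that has no whitespace between words. Assumes valid UTF-8 input.
function charNgrams(string $text, int $n = 2): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $grams = [];
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $grams[] = implode('', array_slice($chars, $i, $n));
    }
    return $grams;
}

// Same bit-voting scheme as simHash(), but taking a ready-made token list.
function simHashTokens(array $tokens): string
{
    $votes = array_fill(0, 64, 0);
    foreach ($tokens as $token) {
        $digest = hash('sha256', $token, true); // raw bytes, first 64 bits used
        for ($i = 0; $i < 64; $i++) {
            $bit = (ord($digest[$i >> 3]) >> (7 - ($i % 8))) & 1;
            $votes[$i] += $bit ? 1 : -1;
        }
    }
    $fingerprint = '';
    foreach ($votes as $v) {
        $fingerprint .= $v > 0 ? '1' : '0';
    }
    return $fingerprint;
}

// Number of differing positions between two fingerprint strings.
function hammingBits(string $a, string $b): int
{
    $d = 0;
    for ($i = 0, $len = min(strlen($a), strlen($b)); $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            $d++;
        }
    }
    return $d;
}

$first  = simHashTokens(charNgrams("这是原始文本内容,用于演示查重"));
$second = simHashTokens(charNgrams("这是原始文本内容,用于演示查重。"));
$distance = hammingBits($first, $second);
// Assumed threshold: treat fingerprints within 3 bits as near-duplicates.
echo $distance <= 3 ? "near-duplicate\n" : "different\n";
```

Very short texts yield few tokens, which makes the fingerprint unstable; the approach tends to work better on paragraph-length input or longer.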
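TF-IDF vectorization
The TF-IDF algorithm turns each text into a weighted term vector, and the cosine similarity between two vectors measures how alike the texts are. The function below builds the vectors: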
function calculateTfIdf($documents) {
    $tf = [];
    $df = [];
    $idf = [];
    $tfidf = [];
    // Term frequency (TF) and document frequency (DF) in a single pass
    foreach ($documents as $docId => $doc) {
        // Whitespace tokenization; empty tokens are dropped
        $terms = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
        $total = count($terms);
        $tf[$docId] = [];
        foreach (array_count_values($terms) as $term => $count) {
            $tf[$docId][$term] = $count / $total;
            // Each distinct term counts once per document
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    // Smoothed inverse document frequency (IDF):
    // log((1 + N) / (1 + df)) + 1 stays positive even for terms that
    // occur in every document
    $totalDocs = count($documents);
    foreach ($df as $term => $count) {
        $idf[$term] = log((1 + $totalDocs) / (1 + $count)) + 1;
    }
    // TF-IDF weight for every term of every document
    foreach ($tf as $docId => $terms) {
        $tfidf[$docId] = [];
        foreach ($terms as $term => $value) {
            $tfidf[$docId][$term] = $value * $idf[$term];
        }
    }
    return $tfidf;
}
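The cosine similarity step can be sketched as follows, assuming each document vector is a sparse term => weight array like the ones calculateTfIdf() returns. The function name cosineSimilarity() and the example weights are illustrative, and the arrow functions require PHP 7.4 or later.

```php
<?php
// Cosine similarity between two sparse vectors stored as term => weight arrays.
function cosineSimilarity(array $a, array $b): float
{
    // Only terms present in both vectors contribute to the dot product.
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));
    // An empty or all-zero vector has no direction; report no similarity.
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}

// Hand-written example weights standing in for TF-IDF output.
$doc1 = ['php' => 0.5, 'text' => 0.5];
$doc2 = ['php' => 0.5, 'hash' => 0.5];
printf("%.3f\n", cosineSimilarity($doc1, $doc2)); // prints 0.500
```

A result near 1 indicates nearly identical term distributions, while 0 means the two documents share no terms.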
MySQL full-text search
MySQL's full-text search can retrieve candidate duplicates efficiently:
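-- SQL: create a table with a full-text index
-- For Chinese, Japanese, or Korean text, declare the index as
-- FULLTEXT(content) WITH PARSER ngram
CREATE TABLE documents (
    id INT AUTO_INCREMENT PRIMARY KEY,
    content TEXT,
    FULLTEXT(content)
) ENGINE=InnoDB;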
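// PHP: query similar documents
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Each placeholder gets its own name: repeating one named parameter
// fails when emulated prepares are turned off
$stmt = $pdo->prepare("
    SELECT id, MATCH(content) AGAINST(:search_score IN NATURAL LANGUAGE MODE) AS score
    FROM documents
    WHERE MATCH(content) AGAINST(:search_filter IN NATURAL LANGUAGE MODE)
    ORDER BY score DESC
");
// Example input; replace with the text to check
$searchText = 'text to check for duplicates';
$stmt->execute([':search_score' => $searchText, ':search_filter' => $searchText]);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);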
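Performance recommendations
For large-scale duplicate detection systems, consider the following optimizations:
- Preprocess the text (remove stop words and punctuation, apply stemming)
- Cache the results of frequent queries
- Consider a dedicated search engine such as Elasticsearch
- Compare long texts chunk by chunk
- Build indexes to speed up queries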
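The preprocessing and chunked-comparison items can be sketched roughly as below. The helpers normalizeText(), chunkFingerprints(), and chunkOverlap() are hypothetical names, the 50-word chunk size is an arbitrary assumption, and the mbstring extension is assumed to be enabled. Stop-word removal and stemming are left out because they depend on the language and on external word lists.

```php
<?php
// Hypothetical helper: lowercase, strip punctuation and symbols, and
// collapse whitespace. Assumes valid UTF-8 input and the mbstring extension.
function normalizeText(string $text): string
{
    $text = mb_strtolower($text, 'UTF-8');
    // \p{P} matches Unicode punctuation, \p{S} matches symbols.
    $text = preg_replace('/[\p{P}\p{S}]+/u', ' ', $text);
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Hypothetical helper: split normalized text into fixed-size word chunks
// and fingerprint each chunk with MD5 (used for matching, not security).
function chunkFingerprints(string $text, int $size = 50): array
{
    $words = preg_split('/\s+/', normalizeText($text), -1, PREG_SPLIT_NO_EMPTY);
    $prints = [];
    foreach (array_chunk($words, $size) as $chunk) {
        $prints[] = md5(implode(' ', $chunk));
    }
    return $prints;
}

// Share of the chunks in $a that also occur somewhere in $b (0.0 to 1.0).
function chunkOverlap(string $a, string $b, int $size = 50): float
{
    $fa = chunkFingerprints($a, $size);
    if (count($fa) === 0) {
        return 0.0;
    }
    // Flip to a lookup table so each membership test is a hash lookup.
    $fb = array_flip(chunkFingerprints($b, $size));
    $hits = 0;
    foreach ($fa as $fp) {
        if (isset($fb[$fp])) {
            $hits++;
        }
    }
    return $hits / count($fa);
}

echo chunkOverlap("a b c d", "a b x y", 2), "\n"; // prints 0.5
```

Fixed-size chunks are sensitive to insertions, since one added word shifts every later chunk boundary; splitting on sentences or paragraphs is a common alternative when that matters.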
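These methods can be combined as requirements dictate: the built-in similarity functions are enough for simple checks on short texts, while SimHash or TF-IDF is recommended when higher precision is needed.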