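Duplicate Detection in PHP
Methods for Text Duplicate Detection in PHP
Text duplicate detection can be implemented in several ways. The following are some common PHP approaches: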
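String-similarity-based detection
PHP's built-in similar_text() function computes how similar two strings are and reports the result as a percentage:

    // Sample strings: "This is the first piece of text to compare" and
    // "This is the second piece of text to compare".
    $text1 = "这是要比较的第一段文本";
    $text2 = "这是要比较的第二段文本";
    // similar_text() compares bytes, not characters, so for multibyte text
    // (such as UTF-8 Chinese) the percentage is a byte-level score.
    similar_text($text1, $text2, $percent);
    echo "Similarity: " . $percent . "%";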
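Because similar_text() works on bytes, a character-level measure may suit multibyte text better. The following is a minimal sketch, assuming UTF-8 input. It swaps in a different technique, Jaccard similarity over character bigrams, rather than reproducing similar_text(). The helper names `charNgrams` and `jaccardSimilarity` are hypothetical and are not part of the original article.

```php
<?php
// Hypothetical helper: the set of overlapping character n-grams in a UTF-8 string.
function charNgrams(string $text, int $n = 2): array
{
    // The /u modifier makes PCRE split on code points instead of bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $grams = [];
    $last = count($chars) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $grams[] = implode('', array_slice($chars, $i, $n));
    }
    return array_values(array_unique($grams));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two n-gram sets (0.0 to 1.0).
function jaccardSimilarity(string $a, string $b, int $n = 2): float
{
    $setA = charNgrams($a, $n);
    $setB = charNgrams($b, $n);
    $union = count(array_unique(array_merge($setA, $setB)));
    if ($union === 0) {
        return 0.0; // both inputs are shorter than $n characters
    }
    return count(array_intersect($setA, $setB)) / $union;
}

// The two sample strings share 8 of 12 distinct bigrams.
printf("%.2f\n", jaccardSimilarity('这是要比较的第一段文本', '这是要比较的第二段文本')); // prints 0.67
```

The bigram size is an assumption to tune: larger n-grams are stricter about word order, while single characters ignore order entirely.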
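SimHash-based detection
SimHash reduces each text to a 64-bit fingerprint, so near-duplicates among large texts or large collections can be found by comparing fingerprints instead of full contents:

    function simhash($text) {
        // Whitespace tokenization; text without spaces (such as Chinese)
        // must be segmented into words or n-grams first.
        $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
        $hash = array_fill(0, 64, 0);
        foreach ($tokens as $token) {
            $tokenHash = hash('md5', $token);
            // Expand the first 16 hex digits of the MD5 digest into 64 bits.
            $binary = '';
            for ($i = 0; $i < 16; $i++) {
                $binary .= str_pad(decbin(hexdec($tokenHash[$i])), 4, '0', STR_PAD_LEFT);
            }
            // Each token votes +1 or -1 on every bit position.
            for ($i = 0; $i < 64; $i++) {
                $hash[$i] += ($binary[$i] == '1') ? 1 : -1;
            }
        }
        // A fingerprint bit is 1 where the votes sum to a positive value.
        $simhash = '';
        foreach ($hash as $bit) {
            $simhash .= ($bit > 0) ? '1' : '0';
        }
        return $simhash;
    }

    // Number of differing bits between two 64-bit fingerprints;
    // a small distance indicates similar texts.
    function hammingDistance($hash1, $hash2) {
        $distance = 0;
        for ($i = 0; $i < 64; $i++) {
            if ($hash1[$i] != $hash2[$i]) {
                $distance++;
            }
        }
        return $distance;
    }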
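The two functions above still need a tokenizer for text without spaces and a decision threshold. The sketch below is one possible arrangement, not a definitive implementation: it tokenizes by character bigrams, builds the same kind of 64-bit fingerprint from the first 8 bytes of each token's MD5 digest, and treats texts within a small Hamming distance as near-duplicates. The names `ngramTokens`, `simhashTokens`, and `isNearDuplicate` are hypothetical, and the default threshold of 3 bits is an assumption that should be tuned on real data.

```php
<?php
// Hypothetical tokenizer: overlapping character n-grams (duplicates kept,
// so frequent n-grams carry more weight in the fingerprint).
function ngramTokens(string $text, int $n = 2): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $tokens = [];
    $last = count($chars) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $tokens[] = implode('', array_slice($chars, $i, $n));
    }
    return $tokens;
}

// Compact SimHash over a token list; returns a 64-character '0'/'1' string.
function simhashTokens(array $tokens): string
{
    $weights = array_fill(0, 64, 0);
    foreach ($tokens as $token) {
        // First 8 bytes (64 bits) of the raw MD5 digest.
        $bytes = substr(md5($token, true), 0, 8);
        for ($i = 0; $i < 64; $i++) {
            // Read bit $i, most significant bit of each byte first.
            $bit = (ord($bytes[$i >> 3]) >> (7 - ($i & 7))) & 1;
            $weights[$i] += $bit ? 1 : -1;
        }
    }
    $bits = '';
    foreach ($weights as $weight) {
        $bits .= $weight > 0 ? '1' : '0';
    }
    return $bits;
}

// True when the fingerprints differ in at most $maxDistance bit positions.
function isNearDuplicate(string $a, string $b, int $maxDistance = 3): bool
{
    $fa = simhashTokens(ngramTokens($a));
    $fb = simhashTokens(ngramTokens($b));
    // XOR of '0' and '1' characters yields "\x01" exactly where they differ.
    $distance = substr_count($fa ^ $fb, "\x01");
    return $distance <= $maxDistance;
}

$text = 'the quick brown fox jumps over the lazy dog';
var_dump(isNearDuplicate($text, $text)); // bool(true): identical texts have distance 0
```

SimHash fingerprints are noisy for very short inputs, so a check like this is better suited to paragraphs or whole documents than to single sentences.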
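MySQL full-text-index-based detection
For text stored in a database, a FULLTEXT index can retrieve the documents most relevant to a given text. The score measures relevance rather than exact duplication, so the results are best treated as candidates:

    -- Create a table with a full-text index.
    -- For Chinese, Japanese, or Korean text, MySQL 5.7.6 and later offer the
    -- ngram parser: FULLTEXT(content) WITH PARSER ngram
    CREATE TABLE documents (
        id INT AUTO_INCREMENT PRIMARY KEY,
        content TEXT,
        FULLTEXT(content)
    ) ENGINE=InnoDB;

    // PHP: query similar documents ($searchText holds the text being checked)
    $pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
    // Two distinct placeholders are used because a named placeholder cannot
    // be reused when emulated prepares are disabled.
    $stmt = $pdo->prepare("SELECT id, MATCH(content) AGAINST(:search1) AS score
        FROM documents
        WHERE MATCH(content) AGAINST(:search2)
        ORDER BY score DESC LIMIT 10");
    $stmt->execute([':search1' => $searchText, ':search2' => $searchText]);
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);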
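Since the full-text score ranks relevance, a second pass can confirm which candidates are actual duplicates. The following is a hedged sketch of such a pass. It assumes the query is extended to also select the `content` column, so that each row looks like `['id' => ..., 'content' => ...]`. The function name `confirmDuplicates` and the 80% default threshold are hypothetical choices, not part of the original article.

```php
<?php
// Hypothetical second pass: keep only the candidate rows whose content is at
// least $minPercent similar to the query text, most similar first.
function confirmDuplicates(string $query, array $rows, float $minPercent = 80.0): array
{
    $matches = [];
    foreach ($rows as $row) {
        // similar_text() writes the similarity percentage into $percent.
        similar_text($query, $row['content'], $percent);
        if ($percent >= $minPercent) {
            $matches[] = ['id' => $row['id'], 'percent' => $percent];
        }
    }
    usort($matches, function ($a, $b) {
        return $b['percent'] <=> $a['percent'];
    });
    return $matches;
}

// Example with hand-written rows standing in for database results.
$rows = [
    ['id' => 1, 'content' => 'hello world'],
    ['id' => 2, 'content' => 'completely different'],
];
print_r(confirmDuplicates('hello world', $rows));
```

Running the exact comparison only on the few rows returned by the index keeps the expensive step small, which is the main reason to split the work into two stages.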
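TF-IDF-based detection
Term frequency (TF) and inverse document frequency (IDF) must be computed first; documents can then be compared through their TF-IDF vectors:

    function calculateTfIdf($documents) {
        $tf = [];
        $df = [];
        $idf = [];
        $tfidf = [];
        // Compute TF
        foreach ($documents as $docId => $document) {
            // Whitespace tokenization; segment text without spaces beforehand.
            $words = preg_split('/\s+/', $document, -1, PREG_SPLIT_NO_EMPTY);
            $wordCount = count($words);
            $tf[$docId] = []; // empty documents get an empty vector
            foreach ($words as $word) {
                if (!isset($tf[$docId][$word])) {
                    $tf[$docId][$word] = 0;
                }
                $tf[$docId][$word]++;
            }
            // Normalize by document length
            foreach ($tf[$docId] as $word => $count) {
                $tf[$docId][$word] = $count / $wordCount;
            }
        }
        // Compute DF: the number of documents that contain each word
        foreach ($tf as $docId => $words) {
            foreach ($words as $word => $count) {
                if (!isset($df[$word])) {
                    $df[$word] = 0;
                }
                $df[$word]++;
            }
        }
        // Compute IDF; a word that occurs in every document gets IDF 0
        $totalDocs = count($documents);
        foreach ($df as $word => $count) {
            $idf[$word] = log($totalDocs / $count);
        }
        // Compute TF-IDF
        foreach ($tf as $docId => $words) {
            $tfidf[$docId] = [];
            foreach ($words as $word => $tfValue) {
                $tfidf[$docId][$word] = $tfValue * $idf[$word];
            }
        }
        return $tfidf;
    }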
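The TF-IDF vectors still have to be compared to decide whether two documents are duplicates. A common choice is cosine similarity; the sketch below assumes each vector is an associative array of word => weight, as returned by calculateTfIdf() above. The function name `cosineSimilarity` is hypothetical.

```php
<?php
// Cosine similarity between two sparse vectors given as [term => weight].
// Returns 0.0 when either vector has zero length.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $normA = 0.0;
    foreach ($a as $weight) {
        $normA += $weight * $weight;
    }
    $normB = 0.0;
    foreach ($b as $weight) {
        $normB += $weight * $weight;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Hand-written vectors: no shared terms, so the similarity is zero.
var_dump(cosineSimilarity(['php' => 1.0], ['mysql' => 1.0])); // float(0)
```

With the output of calculateTfIdf(), two documents would be compared as `cosineSimilarity($tfidf[0], $tfidf[1])`. Note that with IDF defined as log(N / df), words shared by every document weigh zero, so the measure is only informative when the corpus contains more than the two documents being compared.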
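Practical recommendations
- For small-scale checks on short texts, similar_text() is the simplest option; its worst-case cost grows cubically with string length, so it does not scale to long inputs
- For large-scale document collections, SimHash or TF-IDF is recommended
- If the text is stored in a database, the database's full-text search can be used
- Consider caching computed results, such as fingerprints or vectors, to speed up repeated queries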
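The caching suggestion in the list above could look like the following minimal sketch. It keeps results in memory for the lifetime of one request; a real deployment would likely use a persistent or shared store instead. The class name `FingerprintCache` and its interface are hypothetical.

```php
<?php
// Hypothetical in-memory cache: computes a value for each distinct text once
// and reuses it on later lookups.
final class FingerprintCache
{
    private $store = [];
    private $compute;
    public $misses = 0; // number of times the computation actually ran

    public function __construct(callable $compute)
    {
        $this->compute = $compute;
    }

    public function get(string $text)
    {
        // Key by a digest of the content so that long texts make short keys.
        $key = sha1($text);
        if (!array_key_exists($key, $this->store)) {
            $this->misses++;
            $this->store[$key] = ($this->compute)($text);
        }
        return $this->store[$key];
    }
}

// md5() stands in here for an expensive fingerprint function.
$cache = new FingerprintCache('md5');
$cache->get('some document text');
$cache->get('some document text'); // served from the cache
echo $cache->misses, "\n"; // prints 1
```

Any callable can be passed to the constructor, so the same wrapper works for SimHash fingerprints or TF-IDF vectors.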
Each method has its own use cases; the choice should weigh data size, performance requirements, and accuracy needs.