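Implementing n-grams in JS
JavaScript approaches to implementing n-grams
An n-gram is a contiguous sequence of N items (characters or words) taken from a text. Extracting n-grams is a common step in natural language processing and text mining. Several implementations follow: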
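Simple character-level implementation
// Returns every character-level n-gram of `text` (each substring of length n).
function generateNgrams(text, n) {
  // Guard: n = 0 or a negative n would otherwise produce empty or wrong slices.
  if (!Number.isInteger(n) || n < 1) return [];
  const ngrams = [];
  for (let i = 0; i <= text.length - n; i++) {
    ngrams.push(text.substring(i, i + n));
  }
  return ngrams;
}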
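One caveat with the substring approach: JavaScript strings are indexed by UTF-16 code units, so characters outside the Basic Multilingual Plane (emoji, some CJK ideographs) can be cut in half. Below is a minimal sketch of a code-point-aware variant, assuming that matters for your input. The name `codePointNgrams` is hypothetical and not part of the original code.

```javascript
// Hypothetical helper: character n-grams over Unicode code points
// instead of UTF-16 code units.
function codePointNgrams(text, n) {
  if (!Number.isInteger(n) || n < 1) return [];
  // Array.from iterates a string by code point, so surrogate pairs stay together.
  const chars = Array.from(text);
  const ngrams = [];
  for (let i = 0; i <= chars.length - n; i++) {
    ngrams.push(chars.slice(i, i + n).join(''));
  }
  return ngrams;
}

console.log(codePointNgrams('hello', 2)); // [ 'he', 'el', 'll', 'lo' ]
console.log(codePointNgrams('a😀b', 2));  // [ 'a😀', '😀b' ]
```

This still treats a base letter and a combining mark as two separate items. If grapheme clusters need to stay intact, a segmentation step would be required before windowing.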
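Word-level n-grams
// Returns every word-level n-gram, with the words joined by a single space.
function wordNgrams(text, n) {
  if (!Number.isInteger(n) || n < 1) return [];
  // filter(Boolean) drops the empty tokens that leading or trailing
  // whitespace (or an empty string) would otherwise produce.
  const words = text.split(/\s+/).filter(Boolean);
  const ngrams = [];
  for (let i = 0; i <= words.length - n; i++) {
    ngrams.push(words.slice(i, i + n).join(' '));
  }
  return ngrams;
}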
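The character-level and word-level versions differ only in how the text is split into items. A rough sketch of that shared sliding-window step, factored into one helper, is shown here. The name `ngramsFromTokens` and its `separator` parameter are assumptions made for illustration.

```javascript
// Hypothetical generic helper: slide a window of size n over any token array.
function ngramsFromTokens(tokens, n, separator = ' ') {
  if (!Number.isInteger(n) || n < 1) return [];
  const ngrams = [];
  for (let i = 0; i <= tokens.length - n; i++) {
    ngrams.push(tokens.slice(i, i + n).join(separator));
  }
  return ngrams;
}

// Word-level: tokenize on whitespace.
console.log(ngramsFromTokens('the quick brown fox'.split(/\s+/), 2));
// [ 'the quick', 'quick brown', 'brown fox' ]

// Character-level: tokenize into characters and join without a separator.
console.log(ngramsFromTokens(Array.from('fox'), 2, ''));
// [ 'fo', 'ox' ]
```

Keeping tokenization outside the helper means the same window logic can be reused with whatever tokenizer suits the input.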
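Extended implementation supporting a range of N values
// Returns an object mapping each n in [minN, maxN] to its list of word n-grams.
// Object keys are strings, so result[2] and result['2'] are the same entry.
function multiNgram(text, minN = 1, maxN = 3) {
  const result = {};
  const tokens = text.split(/\s+/).filter(Boolean); // drop empty tokens
  // Start at 1 at the lowest: n = 0 would only yield empty strings.
  for (let n = Math.max(1, minN); n <= maxN; n++) {
    result[n] = [];
    for (let i = 0; i <= tokens.length - n; i++) {
      result[n].push(tokens.slice(i, i + n).join(' '));
    }
  }
  return result;
}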
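N-grams with frequency counts
// Returns an object mapping each word n-gram to the number of times it occurs.
function ngramWithFrequency(text, n) {
  // A prototype-less object avoids collisions with inherited keys: on a plain {},
  // the unigram "constructor" would start from Object.prototype.constructor.
  const ngrams = Object.create(null);
  if (!Number.isInteger(n) || n < 1) return ngrams;
  const words = text.split(/\s+/).filter(Boolean); // drop empty tokens
  for (let i = 0; i <= words.length - n; i++) {
    const gram = words.slice(i, i + n).join(' ');
    ngrams[gram] = (ngrams[gram] || 0) + 1;
  }
  return ngrams;
}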
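Frequency counts are usually consumed as a ranking. The sketch below counts word n-grams in a `Map` and returns the `k` most frequent as `[gram, count]` pairs. The name `topNgrams` and the default `k = 5` are assumptions for illustration, not part of the code above.

```javascript
// Hypothetical helper: count word n-grams and return the k most frequent.
function topNgrams(text, n, k = 5) {
  if (!Number.isInteger(n) || n < 1) return [];
  const words = text.split(/\s+/).filter(Boolean);
  const counts = new Map(); // a Map has no inherited keys to collide with
  for (let i = 0; i <= words.length - n; i++) {
    const gram = words.slice(i, i + n).join(' ');
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  // Sort by descending count. Array.prototype.sort is stable (ES2019+),
  // so n-grams with equal counts keep their first-seen order.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, k);
}

console.log(topNgrams('a b a b a c', 2, 2));
// [ [ 'a b', 2 ], [ 'b a', 2 ] ]
```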
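Improved version that handles punctuation and case
// Lowercases the text and strips punctuation before building word n-grams.
function cleanNgrams(text, n) {
  if (!Number.isInteger(n) || n < 1) return [];
  // Note: \w matches only ASCII letters, digits, and underscore, so this
  // pattern also removes accented and non-Latin characters.
  const cleaned = text.toLowerCase().replace(/[^\w\s]/g, '');
  const words = cleaned.split(/\s+/).filter(Boolean);
  const ngrams = [];
  for (let i = 0; i <= words.length - n; i++) {
    ngrams.push(words.slice(i, i + n).join(' '));
  }
  return ngrams;
}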
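These implementations can be adapted to specific needs, for example by handling other languages, adding stop-word filtering, or adding more elaborate text preprocessing steps.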
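As one possible direction for the stop-word filtering and multi-language handling mentioned above, here is a rough sketch. It assumes a runtime with Unicode property escapes in regular expressions (ES2018+). The `STOP_WORDS` list and the name `filteredNgrams` are hypothetical placeholders, and a real stop-word list would be far longer and language-specific.

```javascript
// Assumption: a tiny illustrative stop-word list, not a complete one.
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and', 'is']);

// Hypothetical helper: Unicode-aware cleanup plus stop-word filtering.
function filteredNgrams(text, n, stopWords = STOP_WORDS) {
  if (!Number.isInteger(n) || n < 1) return [];
  const words = text
    .toLowerCase()
    // \p{L} matches letters and \p{N} numbers in any script (needs the u flag),
    // so accented and non-Latin words survive the cleanup.
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .split(/\s+/)
    .filter((w) => w && !stopWords.has(w));
  const ngrams = [];
  for (let i = 0; i <= words.length - n; i++) {
    ngrams.push(words.slice(i, i + n).join(' '));
  }
  return ngrams;
}

console.log(filteredNgrams('The cat and the hat.', 2)); // [ 'cat hat' ]
console.log(filteredNgrams('Café au lait!', 2));        // [ 'café au', 'au lait' ]
```

Removing stop words before windowing makes words adjacent that were not adjacent in the source text (as in `'cat hat'`). Whether that is desirable depends on the task. Whitespace splitting also does not segment languages written without spaces, such as Chinese or Japanese, which need a dedicated tokenizer.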