Implementing Word Segmentation in JavaScript

2026-02-01 23:48:39 JavaScript

Basic concepts of word segmentation

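Word segmentation (tokenization) is the process of splitting continuous text into meaningful words or tokens. It matters especially for Chinese, where words are not separated by spaces or any other visible delimiter. In JavaScript, segmentation can be done in several ways: with built-in APIs such as Intl.Segmenter, with third-party libraries, or with a custom algorithm.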
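
As an illustration of the built-in route, here is a minimal sketch using Intl.Segmenter, which is available in recent Node.js versions and modern browsers. The helper name segmentWords, the default locale "zh", and the sample sentence are assumptions made for this example. The exact word boundaries come from the runtime's ICU data, so results can differ between environments.

```javascript
// Sketch: word segmentation with the built-in Intl.Segmenter (no third-party packages).
// segmentWords is a hypothetical helper; the locale and sample text are illustrative.
function segmentWords(text, locale = "zh") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    const words = [];
    for (const { segment, isWordLike } of segmenter.segment(text)) {
        // isWordLike is false for punctuation and whitespace segments, so skip those.
        if (isWordLike) {
            words.push(segment);
        }
    }
    return words;
}

console.log(segmentWords("你好，世界"));
console.log(segmentWords("Hello world!", "en"));
```

Filtering on isWordLike drops punctuation and spaces; remove that check if the separators themselves are needed.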

Using third-party libraries

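nodejieba
nodejieba is a Node.js binding for CppJieba, a C++ implementation of the Jieba ("结巴") Chinese word segmenter, which originated as a Python library. Install it with: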

npm install nodejieba

Example:

const nodejieba = require("nodejieba");
const result = nodejieba.cut("你好世界");
console.log(result); // ["你好", "世界"]

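kuromoji.js
kuromoji.js is a JavaScript port of the Kuromoji Japanese morphological analyzer. Its bundled dictionary is built for Japanese, so it is not a suitable choice for segmenting Chinese text. Install it with:

npm install kuromoji

Example:

const kuromoji = require("kuromoji");
kuromoji.builder({ dicPath: "node_modules/kuromoji/dict" }).build((err, tokenizer) => {
    if (err) throw err; // the dictionary failed to load
    // The dictionary is Japanese, so tokenize Japanese text.
    const tokens = tokenizer.tokenize("すもももももももものうち");
    console.log(tokens.map(t => t.surface_form)); // surface form of each token
});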

A simple custom segmentation algorithm

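For simple needs, a dictionary-based forward maximum matching algorithm is enough. At each position it tries the longest candidate substring first and shortens it until a dictionary word matches; if nothing matches, it emits a single character and moves on. Here is an example: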

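const dictionary = ["你好", "世界", "编程"];

// Forward maximum matching: at each position, try the longest candidate first.
function maxMatch(text, dict) {
    const result = [];
    let start = 0;
    while (start < text.length) {
        let found = false;
        // Candidates are capped at 5 characters, so longer dictionary words never match.
        for (let len = Math.min(text.length - start, 5); len >= 1; len--) {
            const word = text.slice(start, start + len); // slice() replaces the deprecated substr()
            if (dict.includes(word)) {
                result.push(word);
                start += len;
                found = true;
                break;
            }
        }
        if (!found) {
            // No dictionary word starts here: emit one character and move on.
            result.push(text[start]);
            start++;
        }
    }
    return result;
}
console.log(maxMatch("你好世界编程", dictionary)); // ["你好", "世界", "编程"]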
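
For larger dictionaries, one possible refinement is to store the words in a Set and derive the lookahead limit from the longest entry instead of hard-coding it. The sketch below is one way to do that, not a definitive implementation; buildMatcher and the sample word list are hypothetical names and data chosen for illustration. It also shows a known weakness of greedy matching: the longest match is not always the intended one.

```javascript
// Sketch: forward maximum matching with a Set-backed dictionary.
// buildMatcher and the sample words are hypothetical, for illustration only.
function buildMatcher(words) {
    const dict = new Set(words);
    // The longest dictionary entry bounds how far each lookup has to look ahead.
    const maxLen = words.reduce((max, w) => Math.max(max, w.length), 1);

    return function segment(text) {
        const result = [];
        let start = 0;
        while (start < text.length) {
            let len = Math.min(maxLen, text.length - start);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (len > 1 && !dict.has(text.slice(start, start + len))) {
                len--;
            }
            result.push(text.slice(start, start + len));
            start += len;
        }
        return result;
    };
}

const segment = buildMatcher(["研究", "研究生", "生命", "起源"]);
console.log(segment("研究生命起源")); // ["研究生", "命", "起源"]
```

The natural reading here is 研究 / 生命 / 起源, but the greedy pass takes 研究生 first and leaves 命 stranded. Ambiguities like this are why dictionary-only matching is usually limited to simple cases.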

Using regular expressions

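For English or other text with a predictable format, a regular expression can split words quickly. Note that \w matches only ASCII letters, digits, and the underscore, so accented letters and apostrophes act as word boundaries: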

const text = "Hello world! This is a test.";
// match() returns null when nothing matches, so fall back to an empty array.
const words = text.match(/\b\w+\b/g) || [];
console.log(words); // ["Hello", "world", "This", "is", "a", "test"]
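
When the text contains accented letters or contractions, Unicode property escapes are one alternative to \w. The following is a minimal sketch, assuming a runtime that supports the u flag with \p{...} classes; tokenizeWords is a hypothetical helper name, and treating the apostrophe as part of a word is a design choice made for this example.

```javascript
// Sketch: Unicode-aware tokenizing with property escapes (requires the "u" flag).
// tokenizeWords is a hypothetical helper name.
function tokenizeWords(text) {
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers;
    // the apostrophe keeps contractions such as "Don't" in one piece.
    return text.match(/[\p{L}\p{M}\p{N}']+/gu) ?? [];
}

console.log(tokenizeWords("Don't stop, café!")); // ["Don't", "stop", "café"]
console.log(tokenizeWords("...")); // []
```

Because the apostrophe is allowed anywhere in a token, single-quoted words keep their quotes, so strip them afterwards if that matters for your input.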