
How to tokenize text in Java

2026-03-20 14:38:07 Java

Ways to implement word segmentation in Java

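Using a third-party library (recommended). The Java ecosystem has several mature Chinese word segmentation libraries, such as HanLP, Ansj, and Jieba (Java port). They provide a range of segmentation features and support custom dictionaries.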

Taking HanLP as an example, add the Maven dependency:


<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.4</version>
</dependency>

A basic segmentation example:

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class SegmentDemo {
    public static void main(String[] args) {
        String text = "自然语言处理技术正在快速发展";
        List<Term> termList = HanLP.segment(text);
        System.out.println(termList);
    }
}

Using Java's built-in features. For simple English text, Java's built-in string methods are enough:


String text = "This is a simple example";
String[] words = text.split("\\s+"); // split on runs of whitespace
Arrays.stream(words).forEach(System.out::println);
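Splitting on whitespace leaves punctuation attached to the neighbouring word. One built-in alternative is `java.text.BreakIterator`. The sketch below is only an illustration: the class name `WordBreakDemo` and the helper `words` are hypothetical names introduced here, and the rule of keeping only segments that start with a letter or digit is an assumption about what counts as a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original article.
final class WordBreakDemo {

    // Returns the word tokens of `text`, skipping whitespace and punctuation segments.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Assumption: a "word" is any segment that begins with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("This is a simple example."));
    }
}
```

Unlike `split("\\s+")`, this keeps the trailing period out of the last token.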

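A simple approach for Chinese. If you don't need anything sophisticated, a regular expression can split the text into individual characters, which is a crude character-level substitute for real word segmentation: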

String text = "我爱自然语言处理";
String[] chars = text.split("(?<=.)");
System.out.println(Arrays.toString(chars));
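If the text may contain characters outside the Basic Multilingual Plane (rare Han characters, emoji), iterating over code points is one way to make sure each character stays in one piece. This is a minimal sketch; `CodePointSplitDemo` and `splitByCodePoint` are hypothetical names introduced here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original article.
final class CodePointSplitDemo {

    // Splits `text` into one string per Unicode code point.
    static List<String> splitByCodePoint(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitByCodePoint("我爱自然语言处理"));
    }
}
```

Like the regular-expression version, this produces characters rather than words, so it is only useful as a fallback or as input to a dictionary-based method.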

A custom segmentation algorithm. For specialized needs, you can implement a method such as maximum matching yourself:

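// Forward maximum matching: at each position take the longest dictionary word,
// falling back to a single character when nothing matches.
// Needs java.util.ArrayList, java.util.List and java.util.Set.
public List<String> maxMatch(String text, Set<String> dictionary) {
    List<String> result = new ArrayList<>();
    // The longest dictionary entry bounds the candidate window (at least 1 character).
    int maxLen = Math.max(1, dictionary.stream().mapToInt(String::length).max().orElse(1));

    while (text.length() > 0) {
        int len = Math.min(maxLen, text.length());
        String word = text.substring(0, len);

        // Shrink the candidate from the right until it is a dictionary word
        // or only one character is left.
        while (!dictionary.contains(word)) {
            if (word.length() == 1) break;
            word = word.substring(0, word.length() - 1);
        }

        result.add(word);
        // Continue with the remainder of the text.
        text = text.substring(word.length());
    }

    return result;
}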
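The `maxMatch` method above scans from left to right. A common variant scans from the end of the text instead (backward maximum matching), which can resolve some ambiguities differently. The sketch below is an illustration only: `BackwardMaxMatchDemo`, `backwardMaxMatch`, and the toy dictionary are hypothetical and not taken from any library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of the original article.
final class BackwardMaxMatchDemo {

    // Backward maximum matching: repeatedly take the longest dictionary word
    // that ends at the current position, falling back to a single character.
    static List<String> backwardMaxMatch(String text, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        int maxLen = Math.max(1, dictionary.stream().mapToInt(String::length).max().orElse(1));

        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLen);
            // Shrink the window from the left until it is a dictionary word
            // or only one character is left.
            while (end - start > 1 && !dictionary.contains(text.substring(start, end))) {
                start++;
            }
            result.add(text.substring(start, end));
            end = start;
        }

        // Words were collected from right to left, so restore reading order.
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        // Toy dictionary, chosen only to show the ambiguity.
        Set<String> dictionary = Set.of("研究", "研究生", "生命", "命", "的", "起源");
        System.out.println(backwardMaxMatch("研究生命的起源", dictionary));
        // prints [研究, 生命, 的, 起源]
    }
}
```

With the same toy dictionary, the forward method above returns [研究生, 命, 的, 起源], because it greedily takes 研究生 first. Neither direction is always right, which is one reason the dictionary-based libraries mentioned earlier are usually the better choice.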