Skip to content

Latest commit

 

History

History
79 lines (59 loc) · 2.5 KB

File metadata and controls

79 lines (59 loc) · 2.5 KB

Text Normalization System - Regrex Implementation

This is Python Implementation based on Regrex & Rule-based for convert writing words to reading words, researched and developed by Dean Ng.

python regrex

Features

  • Vietnamese text normalization
  • Special character handling
  • Number and currency normalization
  • Date format normalization
  • Support for superscript and subscript characters
  • Complex text pattern recognition
  • Unit and currency handling
  • Roman numeral processing

Installation

conda create --name venv python=3.8
pip install -r requirements.txt
  1. Set up Java environment (required for VnCoreNLP):
  • Install Java JDK (version 21 or compatible)
  • Set JAVA_HOME environment variable to your JDK installation path

Usage

from cores.normalizer import TextNormalizer

# Initialize the normalizer with VnCoreNLP model path
text_normalizer = TextNormalizer("./exps/vncorenlp/")

# Normalize text
text = "1. Những ngân hàng đang có lãi suất cho vay bình quân cao như Liên Việt, Bản Việt, Kiên Long với lãi suất từ 8,07 $ -  8,94$..."
normalized_text = text_normalizer(text)

Project Structure

text-normalization/
├── constants/         # Character sets and constants
├── cores/            # Core normalization logic
├── exps/             # Experiment configurations
├── logs/             # Log files
├── utils/            # Utility functions
├── example.py        # Usage examples
└── requirements.txt  # Project dependencies

Reference

NOTE: I may forgot repo that I use to reference for this repo, please create issue ticket and I will update it. Thank you very much

Cite

If you find this repository useful for your research, please cite:

@misc{deanng_2025,
    author = {Dean Nguyen},
    title = {Vietnamese Text Normalization},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/ducnt18121997/text-normalization}}
}