中文
PDFreformertool 是一个基于大语言模型(LLM)的 PDF 文档翻译工具,旨在解决现有翻译工具在专业场景下的不足。它支持多语言翻译,精准保留文档格式,适用于技术文档、法律合同、商务报告等复杂场景。项目利用 LLM 的高级语义处理能力,结合 pymupdf 和 pdfplumber 等库,提取 PDF 内容并生成高质量翻译结果。翻译数据存储在 MongoDB 中,同时实验性支持 HDF5(T5.py)作为替代存储方案。
English
PDFreformertool is a PDF document translation tool powered by large language models (LLM), designed to address the shortcomings of existing translation tools in professional scenarios. It supports multilingual translation with precise format preservation, suitable for technical documents, legal contracts, and business reports. Leveraging LLM's advanced semantic processing, combined with libraries like pymupdf and pdfplumber, it extracts PDF content and delivers high-quality translations. Translation data is stored in MongoDB, with experimental HDF5 support (T5.py) as an alternative.
在全球化和多语言交流需求激增的背景下,文档翻译工具的使用日益广泛。然而,现有商业工具(如 DeepL、Google Translate)在翻译质量和格式保留方面存在显著不足:
- 翻译质量:无法满足技术文档、法律合同等专业场景的语义准确性和领域化需求。
- 格式问题:复杂文档的表格错位、图片丢失、段落散乱、字体不一致等问题频发。
- 效率低下:用户需手动调整排版,浪费时间和精力。
PDFreformertool 提供了一种高效解决方案,通过 LLM 和文档处理技术,实现高质量翻译和格式精准保留。
- 精准文本提取:利用
pdfplumber和pymupdf高效提取 PDF 内容。 - 高质量翻译:通过 OpenAI 或 Azure OpenAI API 提供专业、多语言翻译。
- 格式保留:基于
python-docx和docxtpl,确保翻译后文档格式一致。 - 数据存储:支持 MongoDB 存储翻译数据,实验性支持 HDF5(
T5.py)。 - 灵活配置:通过
Tconfig.py自定义翻译主题、API 配置等。
以下是项目所需的 Python 库:
rich
docxtpl
pdf2docx
python-docx
pymongo
h5py ---> 可选
openai
pymupdf
pdfplumber
通过以下命令安装依赖:
pip install -r requirements.txt注意:需安装 Python 3.8+ 和 MongoDB。HDF5(T5.py)为实验性功能,可能不稳定。
- 安装 Python 3.8 或以上版本。
- 安装 MongoDB(或使用实验性 HDF5 存储,详见
T5.py)。 - 获取大语言模型的 API Key 和 URL(如 OpenAI 或 Azure OpenAI)。
- 克隆仓库:
git clone https://github.com/jiananlan/PDFreformertool.git cd PDFreformertool - 安装依赖:
pip install -r requirements.txt
- 安装 MongoDB(参考 MongoDB 官方文档)。
- 编辑
Tconfig.py:- 设置 LLM 的 API Key 和 URL。
- 配置翻译主题(如目标语言)。
- 若使用 Azure OpenAI(如 ChatGPT),将
enable_chatgpt设为True,并在T24.py中提供对应的 URL 和 API Key。
- 在
Tmain.py中更新 PDF 输入文件的路径为实际地址。
运行以下命令启动程序:
python Tmain.py- 输入 PDF:读取用户指定的 PDF 文件。
- 文本提取:通过
pdfplumber和pymupdf提取文本内容。 - 翻译处理:调用 LLM API 进行高质量翻译。
- 格式处理:利用
python-docx和docxtpl重构文档格式。 - 数据存储:翻译数据存储至 MongoDB 或实验性 HDF5。
- 输出:生成格式保留的翻译文档。
本项目采用 AGPL-3.0 许可证,符合 pymupdf 库要求。详情见 LICENSE 文件。
欢迎贡献代码!请遵循以下步骤:
- Fork 本仓库。
- 创建功能分支(
git checkout -b feature/YourFeature)。 - 提交更改(
git commit -m 'Add YourFeature')。 - 推送分支(
git push origin feature/YourFeature)。 - 创建 Pull Request。
代码需符合 PEP 8 规范,并附带测试。
Q: 为什么 HDF5 支持不稳定?
A: T5.py 是 MongoDB 的实验性替代方案,可能存在数据兼容性或性能问题。
Q: 支持哪些语言?
A: 支持 LLM API 提供的所有语言,具体取决于使用的模型。
Q: 如何调试运行错误?
A: 检查 Tconfig.py 中的 API 配置,确保 MongoDB 正常运行,验证 PDF 文件路径。
如有问题或建议,请在 GitHub Issues 提交。
PDFreformertool is a PDF document translation tool powered by large language models (LLM), addressing the limitations of existing translation tools in professional scenarios. It supports multilingual translation with precise format preservation, ideal for technical documents, legal contracts, and business reports. Using LLM's advanced semantic processing and libraries like pymupdf and pdfplumber, it extracts PDF content and delivers high-quality translations. Data is stored in MongoDB, with experimental HDF5 support (T5.py).
With the surge in globalization and multilingual communication, document translation tools are increasingly essential. However, mainstream commercial tools (e.g., DeepL, Google Translate) have notable shortcomings:
- Translation Quality: Inadequate for professional domains like technical documents or legal contracts due to poor semantic accuracy.
- Formatting Issues: Problems such as misaligned tables, missing images, scattered paragraphs, and inconsistent fonts.
- Inefficiency: Users must manually adjust formatting, wasting time and effort.
PDFreformertool offers an efficient solution, combining LLM and document processing to achieve high-quality translation and precise format retention.
- Accurate Text Extraction: Efficiently extracts content using
pdfplumberandpymupdf. - High-Quality Translation: Provides professional, multilingual translation via OpenAI or Azure OpenAI API.
- Format Preservation: Uses
python-docxanddocxtplto maintain document formatting. - Data Storage: Supports MongoDB for translation data, with experimental HDF5 support (
T5.py). - Flexible Configuration: Customize translation themes and API settings via
Tconfig.py.
The project requires the following Python libraries:
rich
docxtpl
pdf2docx
python-docx
pymongo
h5py ---> optional
openai
pymupdf
pdfplumber
Install dependencies using:
pip install -r requirements.txtNote: Requires Python 3.8+ and MongoDB. HDF5 (T5.py) is experimental and may be unstable.
- Install Python 3.8 or higher.
- Install MongoDB (or use experimental HDF5 storage, see
T5.py). - Obtain an API Key and URL for a large language model (e.g., OpenAI or Azure OpenAI).
- Clone the repository:
git clone https://github.com/jiananlan/PDFreformertool.git cd PDFreformertool - Install dependencies:
pip install -r requirements.txt
- Install MongoDB (refer to MongoDB official documentation).
- Edit
Tconfig.py:- Set the API Key and URL for the large language model.
- Configure the translation theme (e.g., target language).
- If using Azure OpenAI (e.g., ChatGPT), set
enable_chatgpttoTrueand provide the URL and API Key inT24.py.
- Update the PDF input file path in
Tmain.pyto the actual file location.
Run the following command to start the program:
python Tmain.py- Input PDF: Reads the specified PDF file.
- Text Extraction: Extracts text using
pdfplumberandpymupdf. - Translation Processing: Performs high-quality translation via LLM API.
- Format Processing: Reconstructs document format using
python-docxanddocxtpl. - Data Storage: Stores translation data in MongoDB or experimental HDF5.
- Output: Generates a translated document with preserved formatting.
This project is licensed under the AGPL-3.0 License, as required by pymupdf. See the LICENSE file for details.
Contributions are welcome! Follow these steps:
- Fork the repository.
- Create a feature branch (
git checkout -b feature/YourFeature). - Commit changes (
git commit -m 'Add YourFeature'). - Push to the branch (
git push origin feature/YourFeature). - Create a Pull Request.
Code should adhere to PEP 8 standards and include tests.
Q: Why is HDF5 support unstable?
A: T5.py is an experimental alternative to MongoDB and may have compatibility or performance issues.
Q: Which languages are supported?
A: Supports all languages provided by the LLM API, depending on the model used.
Q: How to debug runtime errors?
A: Verify API settings in Tconfig.py, ensure MongoDB is running, and check the PDF file path.
For questions or suggestions, open an issue on GitHub Issues to contact me.
