A clean, modular system for processing Confluence exports and uploading them to Outline via its API. It features a four-phase workflow with separation of concerns, comprehensive error handling, and resumable operations.
The system uses a clean architecture approach with four distinct phases:
- Extract ZIPs - Extract Confluence export ZIP files to input directories
- Process Input - Extract complete space structure from index.html
- Extract Content - Convert HTML to clean markdown without breadcrumbs
- API Upload - Create collections and pages with proper parent-child relationships
# Setup
git clone <repository>
cd IS
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your Outline API credentials:
# OUTLINE_API_URL=https://your-outline-instance.com/api
# OUTLINE_API_TOKEN=your_api_token_here

# Extract Confluence export ZIP files
python main.py extract-zips
- Extracts ZIP files from the zips/ directory to input/ directories
- Uses secure extraction with safety checks against zip bombs
- Creates properly named directories (e.g., Export-135853/)
- Skip this step if you already have extracted directories in input/
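
The safety checks mentioned above can be illustrated with a minimal sketch. This is not the project's zip_extractor.py; the function name `safe_extract` and both limits are assumptions chosen for illustration. It rejects archives that expand too far, members with a suspicious compression ratio, and member paths that would land outside the destination directory.

```python
import zipfile
from pathlib import Path

# Assumed limits for illustration; the project's real thresholds may differ.
MAX_TOTAL_BYTES = 2 * 1024**3   # cap on total uncompressed size
MAX_RATIO = 200                 # cap on per-file compression ratio


def safe_extract(zip_path, dest):
    """Extract zip_path into dest after checking sizes and member paths."""
    dest = Path(dest).resolve()
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.infolist()
        # Zip bomb guard 1: total declared uncompressed size.
        if sum(m.file_size for m in members) > MAX_TOTAL_BYTES:
            raise ValueError("archive expands beyond the size limit")
        for m in members:
            # Zip bomb guard 2: per-member compression ratio.
            if m.compress_size and m.file_size / m.compress_size > MAX_RATIO:
                raise ValueError(f"suspicious compression ratio: {m.filename}")
            # Path traversal guard: the resolved target must stay inside dest.
            target = (dest / m.filename).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"path escapes destination: {m.filename}")
        zf.extractall(dest)
    return dest
```

All checks run before anything is written, so a rejected archive leaves the destination untouched.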
# Extract structure from all exports in input/
python main.py process-input
- Scans input/Export-*/ directories for Confluence exports
- Extracts complete hierarchical structure from index.html files using DOM parser
- Creates {space_key}.json files in output/ (e.g., is.json, gi.json)
- Review and edit these JSON files before proceeding to next phase
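
To make the structure-extraction step concrete, here is a minimal sketch that turns a nested list of links into a page tree. It assumes index.html expresses the hierarchy as nested ul/li/a elements; the class name `HierarchyParser` and the `title`, `href`, and `children` keys are hypothetical and may not match the project's DomHierarchyParser or its JSON schema.

```python
from html.parser import HTMLParser


class HierarchyParser(HTMLParser):
    """Build a tree of pages from nested <ul>/<li>/<a> elements."""

    def __init__(self):
        super().__init__()
        self.root = {"title": None, "href": None, "children": []}
        self.stack = [self.root]   # parents of the list currently being read
        self.last = None           # most recent page; owns the next nested <ul>
        self.capturing = False
        self.href = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            # A nested list belongs to the page that precedes it.
            self.stack.append(self.last or self.stack[-1])
            self.last = None
        elif tag == "a" and dict(attrs).get("href"):
            self.capturing, self.href, self.text = True, dict(attrs)["href"], []

    def handle_data(self, data):
        if self.capturing:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.capturing:
            node = {"title": "".join(self.text).strip(),
                    "href": self.href, "children": []}
            self.stack[-1]["children"].append(node)
            self.last, self.capturing = node, False
        elif tag == "ul":
            if len(self.stack) > 1:   # tolerate stray closing tags
                self.stack.pop()
            self.last = None


def parse_hierarchy(html):
    parser = HierarchyParser()
    parser.feed(html)
    return parser.root
```

The returned dictionary can be serialized with `json.dumps` for review before the next phase.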
# Extract markdown content for all spaces
python main.py extract-content
# Or process specific spaces
python main.py extract-content --spaces is gi
- Converts HTML files to clean markdown
- Removes breadcrumbs and duplicate titles (titles are already captured by process-input)
- Populates md_content fields in JSON files
- Handles attachments and embedded content
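
Breadcrumb and duplicate-title removal can be sketched as follows. This example only collects text rather than producing markdown, and it assumes the breadcrumb trail sits in an element with id "breadcrumb-section" and that the HTML is well nested; the names `BreadcrumbStripper` and `extract_text` are hypothetical and this is not the project's SpaceProcessor.

```python
from html.parser import HTMLParser


class BreadcrumbStripper(HTMLParser):
    """Collect text chunks, ignoring everything inside the breadcrumb element."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self, breadcrumb_id="breadcrumb-section"):
        super().__init__()
        self.breadcrumb_id = breadcrumb_id   # assumed id of the breadcrumb block
        self.skip_depth = 0                  # > 0 while inside the breadcrumb
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.skip_depth:
            self.skip_depth += 1
        elif dict(attrs).get("id") == self.breadcrumb_id:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, title):
    """Return body text chunks without breadcrumbs or a leading duplicate title."""
    parser = BreadcrumbStripper()
    parser.feed(html)
    chunks = parser.chunks
    if chunks and chunks[0] == title:
        chunks = chunks[1:]   # the title is already stored in the structure
    return chunks
```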
# Upload specific spaces to Outline
python main.py api-upload --spaces is gi
# Force mode - update existing documents (bypasses 'created' status)
python main.py api-upload --spaces is gi --force
- Creates collections and pages via Outline API
- Maintains proper parent-child relationships using UUIDs
- Updates JSON files with creation status and UUIDs
- Resumable: Skips already created items if upload is interrupted
- Force mode: Updates existing documents with latest content
- Collection deduplication: Handles duplicate collection names with user interaction
- Comprehensive retry logic: Handles rate limiting and network errors with exponential backoff
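
The retry behaviour described in the last bullet can be sketched like this. It is an illustration under assumptions, not the ApiUploadManager code: `RateLimited`, `call_with_backoff`, and the default delays are hypothetical, and the `retry_after` attribute stands in for a server-provided Retry-After value.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's request function when the server answers 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after   # seconds suggested by the server, if any


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on rate limits and connection errors."""
    for attempt in range(max_attempts):
        try:
            return func()
        except (RateLimited, ConnectionError) as exc:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the last error
            # Prefer the server's hint; otherwise double the delay each attempt.
            hint = getattr(exc, "retry_after", None)
            delay = hint if hint is not None else min(max_delay,
                                                      base_delay * 2 ** attempt)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.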
# Show status of all spaces
python main.py status
# Reset upload status to retry failed uploads
python main.py reset --spaces is gi
# Clean reset - remove all processed files (keeps ZIPs)
python main.py point-zero

# 0. Place Confluence export ZIP files in zips/
cp "Confluence space export.zip" zips/
# 1. Extract ZIP files to input directories
python main.py extract-zips
# 2. Process all input directories
python main.py process-input
# 3. Review generated JSON files in output/
ls output/*.json
# 4. Extract content
python main.py extract-content
# 5. Upload to API (requires credentials)
export OUTLINE_API_URL="https://your-outline.com/api"
export OUTLINE_API_TOKEN="your-token"
python main.py api-upload --spaces is gi
# 6. Update existing documents with latest content (force mode)
python main.py api-upload --spaces is gi --force
# 7. Check final status
python main.py status

ConfluenceToOutline/
├── zips/                          # Confluence export ZIP files
│   ├── Confluence-space-export-135853.html.zip
│   └── Confluence-space-export-204041.html.zip
├── input/                         # Extracted export directories
│   ├── Export-135853/
│   │   └── IS/                    # Space directory with index.html
│   └── Export-204041/
│       └── GI/
├── output/                        # Generated JSON files
│   ├── is.json                    # Space structure + content
│   └── gi.json
├── libs/                          # Core libraries
│   ├── space_processor.py         # Main processing engine
│   ├── api_upload_manager.py      # API upload handling
│   ├── dom_hierarchy_parser.py    # HTML structure parser
│   ├── zip_extractor.py           # Safe ZIP extraction
│   └── ...
├── archive/                       # Historical data and experiments
└── main.py                        # Primary CLI interface

export OUTLINE_API_URL="https://your-outline-instance.com/api"
export OUTLINE_API_TOKEN="your-api-token-here"

Note: The system also supports the legacy OUTLINE_API_KEY environment variable for backward compatibility.
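
One way such credential resolution could work is sketched below. The function name `resolve_credentials` and the precedence order (command-line value, then OUTLINE_API_TOKEN, then the legacy OUTLINE_API_KEY) are assumptions for illustration; the project's actual order may differ.

```python
import os


def resolve_credentials(cli_url=None, cli_token=None, env=None):
    """Return (api_url, api_token), preferring CLI values over the environment."""
    env = os.environ if env is None else env
    url = cli_url or env.get("OUTLINE_API_URL")
    # Assumed precedence: the current variable wins over the legacy one.
    token = cli_token or env.get("OUTLINE_API_TOKEN") or env.get("OUTLINE_API_KEY")
    missing = [name for name, value in
               (("OUTLINE_API_URL", url), ("OUTLINE_API_TOKEN", token))
               if not value]
    if missing:
        raise ValueError("missing configuration: " + ", ".join(missing))
    return url.rstrip("/"), token
```

Accepting `env` as a parameter lets the lookup be exercised without touching the real process environment.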
python main.py api-upload \
--spaces is gi \
--api-url "https://your-outline.com/api" \
--api-token "your-token"

- Safe extraction with zip bomb protection and path traversal prevention
- Size limits to prevent excessive disk usage
- Automatic directory naming from ZIP filenames
- Preserves ZIP files for re-extraction if needed
- Complete hierarchy extraction from index.html using DOM parser
- Handles malformed HTML and complex nested structures
- Captures all 200+ pages with proper parent-child relationships
- Supports multiple spaces in single workflow
- Smart markdown conversion from HTML
- Breadcrumb removal - no more navigation clutter
- Title deduplication - titles captured from structure, not content
- Advanced attachment handling - complete image and file support with Outline compatibility
- Automatic URL parameter removal - strips ?width=760 and other Confluence query parameters
- Templated format conversion - uses clean {attachment/path} system for UUID replacement
- Perfect Outline compatibility - converts to proper /api/attachments.redirect?id=UUID format
- Unlinked attachment detection - automatically adds attachment sections for orphaned files
- Comprehensive metadata tracking - preserves content type, original names, and UUIDs
- Two-phase upload workflow - creates attachment records then uploads to secure storage
- Proper markdown image syntax - maintains alt text and sizing information
- UUID tracking for created collections and pages
- Status persistence in JSON files
- Skip already created items on retry
- Progress reporting with completion percentages
- Comprehensive error handling with detailed logging
- Advanced rate limiting with exponential backoff and server header respect
- Force mode operations for updating existing content and collections
- Interactive conflict resolution for duplicate collection names
- Clean separation of concerns for maintainability
- Flexible CLI with intuitive commands
- Centralized configuration with validation and type safety
- Environment variable support with command-line overrides
- Complete attachment workflows tested with 7+ production spaces
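
The attachment URL handling listed above (query-parameter stripping, {attachment/path} placeholders, and /api/attachments.redirect?id=UUID links) can be sketched in two passes. The function names, the regular expressions, and the assumption that attachment links start with "attachments/" are hypothetical and not taken from the project's code.

```python
import re
from urllib.parse import urlsplit, urlunsplit

LINK_RE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)(\))")     # markdown link or image
PLACEHOLDER_RE = re.compile(r"\{(attachments/[^}]+)\}")    # assumed placeholder form


def strip_query(url):
    """Drop ?width=760 and any other query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def to_placeholders(markdown):
    """Pass 1: rewrite attachment links as {attachments/...} placeholders."""
    def repl(match):
        path = strip_query(match.group(2))
        if path.startswith("attachments/"):
            return f"{match.group(1)}{{{path}}}{match.group(3)}"
        return match.group(0)   # leave external links untouched
    return LINK_RE.sub(repl, markdown)


def resolve_placeholders(markdown, uuids):
    """Pass 2: swap placeholders for redirect URLs once UUIDs are known."""
    def repl(match):
        uuid = uuids.get(match.group(1))
        if uuid is None:
            return match.group(0)   # not uploaded yet: keep the placeholder
        return f"/api/attachments.redirect?id={uuid}"
    return PLACEHOLDER_RE.sub(repl, markdown)
```

Keeping the placeholder until a UUID exists means content extraction can run before any upload has happened.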
If you have an existing setup using the old workflow:
python main.py extract-zips
python main.py process-input
python main.py extract-content --spaces is gi
python main.py api-upload --spaces is gi

# Run the test suite
python -m pytest test_data/

- Ensure space has index.html with navigation structure
- Place in input/Export-*/SPACE_KEY/ directory
- Run process-input to extract structure
- Review generated JSON in output/
- Process normally with extract-content and api-upload
- Custom parsers: Extend DomHierarchyParser in libs/
- Content processing: Modify SpaceProcessor.html_to_markdown()
- API integration: Enhance ApiUploadManager for new endpoints
"No zip files found in zips/ directory"
- Place your Confluence export ZIP files in the zips/ directory
- Ensure files have .zip extension
- Use python main.py extract-zips to process them
"No spaces found to process"
- Ensure exports are in input/Export-*/SPACE_KEY/ format
- Check that index.html exists in space directory
"Space file not found"
- Run process-input first to create JSON files
- Check output/ directory for generated files
"Failed to create collection/page"
- Verify API credentials and network connectivity
- Check Outline instance is accessible
- Review rate limiting (default 0.1s delay between requests)
- It's recommended to disable rate limiting, or to set the request limit very high, for this process.
"Upload interrupted"
- Use python main.py reset --spaces <key> to reset status
- Or continue with the same command; already created items will be skipped
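
The skip-on-retry behaviour can be illustrated with a small sketch. The JSON keys (`pages`, `status`, `outline_id`, `children`) and the function `upload_pages` are hypothetical; the project's real schema and ApiUploadManager are not shown here. The idea is to persist status after every created page so a rerun only touches what is still missing.

```python
import json
from pathlib import Path


def upload_pages(space_file, create_page):
    """Create pages that are not yet marked 'created'; return how many were made.

    create_page(title, parent_id) is supplied by the caller and returns the new
    document's UUID.
    """
    space_file = Path(space_file)
    data = json.loads(space_file.read_text(encoding="utf-8"))
    created = 0

    def visit(page, parent_id):
        nonlocal created
        if page.get("status") != "created":
            page["outline_id"] = create_page(page["title"], parent_id)
            page["status"] = "created"
            created += 1
            # Persist immediately so an interrupted run keeps its progress.
            space_file.write_text(json.dumps(data, indent=2), encoding="utf-8")
        for child in page.get("children", []):
            visit(child, page["outline_id"])   # children attach via parent UUID

    for page in data.get("pages", []):
        visit(page, None)
    return created
```

Running it a second time against the same file creates nothing, which is the behaviour relied on when continuing after an interruption.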
"Attachments not displaying correctly"
- Re-run extract-content to update attachment URLs with latest format
- Use --force mode to update existing pages: python main.py api-upload --spaces <key> --force
- Check that attachment files exist in input/Export-*/SPACE/attachments/ directories
"Images showing as broken links"
- Ensure images were successfully uploaded (check JSON for "uploaded": true)
- Verify Outline instance supports the attachment redirect endpoint
- Use force mode to refresh all attachment URLs
# Enable verbose logging
export LOG_LEVEL=DEBUG
python main.py status

GNU AFFERO GENERAL PUBLIC LICENSE

- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
For issues and questions:
- Check the troubleshooting section above
- Review logs in the terminal output
- Use python main.py status to check current state
- Open an issue in the repository