SharePoint Site Downloader

A Python tool to download entire SharePoint sites via the Microsoft Graph API. It supports both Device Code (delegated) and Client Credentials (application) authentication flows, and can generate standalone static HTML sites from the downloaded content.

License: MIT

Features

  • Generic: Works with any SharePoint Online site URL
  • Authentication: MSAL Device Code (delegated) or Client Credentials (application) flows
  • Complete Download: Recursively downloads all document libraries, Site Pages, Site Assets, Style Library, and Master Page Gallery
  • Static Site Generation: Converts downloaded SharePoint content into standalone HTML sites
  • Resilient: Auto-retries on throttling (HTTP 429/503), resumes partially downloaded files
  • Structure Preservation: Local folder tree mirrors SharePoint hierarchy
  • Image Handling: Downloads and fixes image references for offline viewing

Prerequisites

  • Python 3.9+
  • A Microsoft Entra ID App Registration with Microsoft Graph permissions

Quick Start

  1. Clone and install

```shell
git clone <repository-url>
cd sharepoint-api-download
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```
  2. Create a Microsoft Entra ID app registration

  • Go to Azure Portal → Microsoft Entra ID → App registrations → New registration
  • Name: "SharePoint Downloader"; Supported account types: single tenant (or your choice)
  • Note the Application (client) ID and Directory (tenant) ID

Important: Use the Azure Portal (portal.azure.com), not the Microsoft 365 admin center

  3. Configure API permissions

For Device Code (delegated):

  • Microsoft Graph → Delegated permissions: Sites.Read.All, Files.Read.All
  • Click "Grant admin consent" (required for org-wide sites)

For Client Credentials (application):

  • Microsoft Graph → Application permissions: Sites.Read.All, Files.Read.All
  • Add a client secret (Certificates & secrets → New client secret) and note the value
  • Click "Grant admin consent"
  4. Configure environment

Copy env.example to .env and fill in your values:

```shell
cp env.example .env
```

Edit .env with your values:

```
TENANT_ID=your-tenant-id-here
CLIENT_ID=your-client-id-here
CLIENT_SECRET=your-client-secret-here
SITE_URL=https://yourtenant.sharepoint.com/sites/YourSiteName
AUTH_FLOW=application
OUTPUT_DIR=./downloads
CONCURRENCY=4
```
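The downloader presumably reads these values via a dotenv-style loader; the sketch below is a minimal stdlib-only illustration of that parsing (`parse_env` is illustrative, not the tool's actual code):

```python
def parse_env(text: str) -> dict:
    """Minimal parser for KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

sample = """\
# Graph auth
TENANT_ID=your-tenant-id-here
CLIENT_ID=your-client-id-here

CONCURRENCY=4
"""
config = parse_env(sample)
```

In practice a library such as python-dotenv handles edge cases (export prefixes, multiline values) that this sketch ignores.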
  5. Run the downloader

Simple way (recommended):

```shell
./run.sh
```

Manual way:

```shell
# Application auth (no prompts)
python -m sharepoint_downloader.cli \
  --site-url "https://yourtenant.sharepoint.com/sites/YourSiteName" \
  --output ./downloads \
  --auth application \
  --tenant-id "$TENANT_ID" \
  --client-id "$CLIENT_ID" \
  --client-secret "$CLIENT_SECRET" \
  --generate-static

# Device auth (requires browser sign-in)
python -m sharepoint_downloader.cli \
  --site-url "https://yourtenant.sharepoint.com/sites/YourSiteName" \
  --output ./downloads \
  --auth device \
  --tenant-id "$TENANT_ID" \
  --client-id "$CLIENT_ID" \
  --generate-static
```

Static Site Generation

The --generate-static flag converts downloaded SharePoint content into a standalone HTML site:

  • Converts ASPX pages to clean HTML
  • Fixes image references to work offline
  • Creates an index page with links to all pages
  • Removes SharePoint-specific styling and dependencies
  • Generates a static_site/ directory with the standalone site
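The "fixes image references" step boils down to rewriting absolute site-relative URLs into local relative paths. A minimal sketch of that kind of rewrite (the generator's real logic lives in the package; `rewrite_refs` and the URL pattern here are illustrative assumptions):

```python
import re

def rewrite_refs(html: str, site_prefix: str) -> str:
    """Rewrite absolute site-relative asset URLs to local relative paths
    so a downloaded page renders offline. Illustrative only."""
    # e.g. src="/sites/YourSiteName/SiteAssets/logo.png" -> src="SiteAssets/logo.png"
    pattern = re.compile(r'(src|href)="' + re.escape(site_prefix) + r'/([^"]+)"')
    return pattern.sub(r'\1="\2"', html)

page = '<img src="/sites/YourSiteName/SiteAssets/logo.png">'
fixed = rewrite_refs(page, "/sites/YourSiteName")
# fixed == '<img src="SiteAssets/logo.png">'
```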
Run the CLI with --help to see all options:

```shell
python -m sharepoint_downloader.cli --help
```

Options:

  • --site-url: Full SharePoint site URL
  • --output: Local directory to write files
  • --library: Optional library name filter (can be repeated); default: all
  • --auth: device (default) or application
  • --tenant-id, --client-id, --client-secret: Auth config (can also come from env)
  • --concurrency: Parallel downloads (default 4)
  • --skip-existing: Skip files that already exist with same size
  • --generate-static: Generate standalone HTML site from downloaded content

How it works

  1. Site Resolution: Resolve site ID from the URL via GET /v1.0/sites/{hostname}:/sites/{path}
  2. Drive Discovery: List document libraries via GET /v1.0/sites/{site-id}/drives
  3. Content Traversal: Recursively enumerate folders/files via GET /v1.0/drives/{drive-id}/items/{item-id}/children
  4. File Download: Download files via GET /v1.0/drives/{drive-id}/items/{item-id}/content with retries and chunking
  5. Static Generation: Convert ASPX pages to HTML and fix asset references for offline viewing
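Step 1 maps the human-readable site URL onto Graph's hostname:path addressing. A sketch of that mapping (the function name is illustrative, not the tool's internals):

```python
from urllib.parse import urlparse

GRAPH = "https://graph.microsoft.com/v1.0"

def site_resolution_url(site_url: str) -> str:
    """Build the Graph endpoint that resolves a SharePoint site URL to a site ID."""
    parts = urlparse(site_url)
    # netloc: yourtenant.sharepoint.com; path: /sites/YourSiteName
    return f"{GRAPH}/sites/{parts.netloc}:{parts.path}"

url = site_resolution_url("https://yourtenant.sharepoint.com/sites/YourSiteName")
# "https://graph.microsoft.com/v1.0/sites/yourtenant.sharepoint.com:/sites/YourSiteName"
```

The `id` field of the JSON returned by a GET on that URL is the site ID used by the subsequent `/drives` calls.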

Troubleshooting

  • 401/403: Verify permissions and admin consent; ensure correct auth flow
  • 404 site not found: Check SITE_URL host and path
  • Throttling: The downloader auto-retries with backoff; you can lower --concurrency
  • Empty downloads: Ensure you have Sites.Read.All and Files.Read.All permissions with admin consent
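On throttling, Graph clients conventionally honor the Retry-After header when the server sends one and otherwise fall back to capped exponential backoff. A sketch of that policy (names are illustrative, not the downloader's actual internals):

```python
from typing import Optional

def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).
    Prefers the server's Retry-After value; otherwise doubles the
    base delay per attempt, capped to avoid unbounded waits."""
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * (2 ** attempt))
```

Lowering --concurrency reduces how often this path is hit in the first place.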

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
