3 changes: 2 additions & 1 deletion .claude/settings.local.json
@@ -45,7 +45,8 @@
"Bash(SKIP_COVERAGE_MINIMUMS=true AWS_REGION=us-east-1 bundle exec rspec:*)",
"Bash(env SKIP_COVERAGE_MINIMUMS=true AWS_REGION=us-east-1 bundle exec rspec:*)",
"Skill(simplecov)",
"Bash(then grep -A 5 \"covered_percent\\|app.rb\\|request_validator\\|response_builder\\|s3_url_parser\\|url_validator\\|webhook_notifier\" coverage/index.html)"
"Bash(then grep -A 5 \"covered_percent\\|app.rb\\|request_validator\\|response_builder\\|s3_url_parser\\|url_validator\\|webhook_notifier\" coverage/index.html)",
"SlashCommand(/run-prompt:*)"
],
"deny": [],
"ask": []
17 changes: 9 additions & 8 deletions CLAUDE.md
@@ -99,14 +99,14 @@ The Lambda function is configured with:

### POST /convert

Converts a PDF to images.
Converts a PDF to images and delivers them as a zip file.

**Request Body:**

```json
{
"source": "https://s3.amazonaws.com/bucket/input.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...",
"destination": "https://s3.amazonaws.com/bucket/output/?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...",
"destination": "https://s3.amazonaws.com/bucket/output.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...",
"webhook": "https://example.com/webhook",
"unique_id": "client-123"
}
@@ -119,15 +119,14 @@ Converts a PDF to images.
- **Client control**: Clients generate URLs with their own AWS credentials, maintaining data sovereignty
- **Audit trail**: All S3 access is logged under the client's AWS account

**Note on destination URL:** The destination URL should be a pre-signed PUT URL for a zip file (e.g., `output.zip`), not a folder path. The service will create a zip file containing all converted images.

**Response:**

```json
{
"message": "PDF conversion and upload completed",
"images": [
"https://s3.amazonaws.com/bucket/output/client-123-0.png?...",
"https://s3.amazonaws.com/bucket/output/client-123-1.png?..."
],
"message": "PDF conversion and zip upload completed",
"images": "https://s3.amazonaws.com/bucket/output.zip",
"unique_id": "client-123",
"status": "completed",
"pages_converted": 2,
@@ -139,4 +138,6 @@ Converts a PDF to images.
}
```

**Note:** The service processes PDFs synchronously and returns the converted images in the response. If a webhook URL is provided, a notification is also sent asynchronously (fire-and-forget) upon completion.
**Zip File Contents:** The zip file contains one PNG image per page of the PDF, named `{unique_id}-0.png`, `{unique_id}-1.png`, and so on (numbering starts at 0).

**Note:** The service processes PDFs synchronously and returns the zip file URL in the response. If a webhook URL is provided, a notification is also sent asynchronously (fire-and-forget) upon completion.
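The naming scheme described under "Zip File Contents" can be sketched from the client side as follows. This is a minimal illustration only: `expected_zip_entries` is a hypothetical helper that does not appear in this PR, and it assumes the entries are numbered from 0 up to `pages_converted - 1`, as the documented examples suggest.

```ruby
# Hypothetical client-side helper (not part of the service): lists the entry
# names a client might expect inside the zip, assuming zero-based numbering.
def expected_zip_entries(unique_id, pages_converted)
  Array.new(pages_converted) { |index| "#{unique_id}-#{index}.png" }
end

# Example: a two-page PDF converted with unique_id "client-123".
puts expected_zip_entries('client-123', 2)
```

A client could compare this list against the entries actually found in the downloaded archive to detect missing pages.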
24 changes: 14 additions & 10 deletions README.md
@@ -140,7 +140,7 @@ print(f"Authorization: Bearer {token}")

### Step 6: Test Your Deployment

Create pre-signed S3 URLs for source (PDF) and destination (images), then call the API:
Create pre-signed S3 URLs for source (PDF) and destination (zip file), then call the API:

```bash
# Example using curl (replace with your actual URLs and token)
@@ -149,12 +149,14 @@ curl -X POST https://your-api-endpoint.amazonaws.com/Prod/convert \
-H "Content-Type: application/json" \
-d '{
"source": "https://s3.amazonaws.com/your-bucket/input.pdf?X-Amz-...",
"destination": "https://s3.amazonaws.com/your-bucket/output/?X-Amz-...",
"destination": "https://s3.amazonaws.com/your-bucket/output.zip?X-Amz-...",
"webhook": "https://your-webhook-endpoint.com/notify",
"unique_id": "test-123"
}'
```

**Note:** The destination URL should be a pre-signed PUT URL for a `.zip` file, not a folder. The service will upload a single zip file containing all converted PNG images.
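Since a folder-style destination is no longer valid, a client may want to check the URL before calling the API. The sketch below is one possible check using only Ruby's standard library; `zip_destination?` is a hypothetical name, and the service's own validation (not shown in this diff) may differ.

```ruby
require 'uri'

# Hypothetical client-side check (not part of the service): returns true when
# the URL's path, ignoring the signature query string, ends in ".zip".
def zip_destination?(url)
  path = URI.parse(url).path.to_s
  path.downcase.end_with?('.zip')
rescue URI::InvalidURIError
  false
end

# The old folder-style destination fails the check; the new one passes.
puts zip_destination?('https://s3.amazonaws.com/your-bucket/output/?X-Amz-Algorithm=AWS4-HMAC-SHA256')
puts zip_destination?('https://s3.amazonaws.com/your-bucket/output.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256')
```

The check looks only at the path, so the pre-signed query parameters do not affect the result.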

For instructions on generating pre-signed S3 URLs, see the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html).

### Testing Scripts
@@ -241,14 +243,14 @@ sam delete --stack-name content_processing # Delete the deployed stack

### POST /convert

Converts a PDF to images.
Converts a PDF to images and delivers them as a zip file.

**Request Body:**

```json
{
"source": "https://s3.amazonaws.com/bucket/input.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...",
"destination": "https://s3.amazonaws.com/bucket/output/?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...",
"destination": "https://s3.amazonaws.com/bucket/output.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...",
"webhook": "https://example.com/webhook",
"unique_id": "client-123"
}
@@ -261,15 +263,14 @@ Converts a PDF to images.
- **Client control**: Clients generate URLs with their own AWS credentials, maintaining data sovereignty
- **Audit trail**: All S3 access is logged under the client's AWS account

**Note on destination URL:** The destination URL should be a pre-signed PUT URL for a zip file (e.g., `output.zip`), not a folder path. The service will create a zip file containing all converted images.

**Response:**

```json
{
"message": "PDF conversion and upload completed",
"images": [
"https://s3.amazonaws.com/bucket/output/client-123-0.png?...",
"https://s3.amazonaws.com/bucket/output/client-123-1.png?..."
],
"message": "PDF conversion and zip upload completed",
"images": "https://s3.amazonaws.com/bucket/output.zip",
"unique_id": "client-123",
"status": "completed",
"pages_converted": 2,
@@ -281,7 +282,9 @@ Converts a PDF to images.
}
```

**Note:** The service processes PDFs synchronously and returns the converted images in the response. If a webhook URL is provided, a notification is also sent asynchronously (fire-and-forget) upon completion.
**Zip File Contents:** The zip file contains one PNG image per page of the PDF, named `{unique_id}-0.png`, `{unique_id}-1.png`, and so on (numbering starts at 0).

**Note:** The service processes PDFs synchronously and returns the zip file URL in the response. If a webhook URL is provided, a notification is also sent asynchronously (fire-and-forget) upon completion.
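Because `images` changes from an array of URLs to a single zip URL string, existing clients need to adjust how they read the response. The following is a rough sketch of that handling; `zip_url_from_response` is a hypothetical client helper, and the sample body includes only the fields shown in the response example above.

```ruby
require 'json'

# Hypothetical client helper (not part of the service): extracts the zip URL
# from a /convert response body. Raises if the status is not "completed" or
# if "images" is not a string (for example, the older array format).
def zip_url_from_response(body)
  data = JSON.parse(body)
  raise ArgumentError, "conversion not completed: #{data['status']}" unless data['status'] == 'completed'

  images = data['images']
  raise ArgumentError, 'expected "images" to be a single zip URL string' unless images.is_a?(String)

  images
end

sample = {
  'message' => 'PDF conversion and zip upload completed',
  'images' => 'https://s3.amazonaws.com/bucket/output.zip',
  'unique_id' => 'client-123',
  'status' => 'completed',
  'pages_converted' => 2
}.to_json

puts zip_url_from_response(sample)
```

Failing loudly on an array value is a design choice for this sketch; a client that must support both formats could branch on the type instead.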

## Architecture

@@ -311,6 +314,7 @@ The Lambda function uses these environment variables:
- **aws-sdk-secretsmanager (~> 1)**: AWS SDK for secure key retrieval
- **json (~> 2.9)**: JSON parsing and generation
- **ruby-vips (~> 2.2)**: Ruby bindings for libvips image processing library
- **rubyzip (~> 2.3)**: Zip file creation and manipulation
- **async (~> 2.6)**: Asynchronous processing for batch uploads

### Testing
3 changes: 3 additions & 0 deletions pdf_converter/Gemfile
@@ -24,6 +24,9 @@ gem 'aws-sdk-secretsmanager', '~> 1'
# PDF to image conversion
gem 'ruby-vips', '~> 2.2'

# Zip file creation
gem 'rubyzip', '~> 2.3'

# Async processing for batch uploads
gem 'async', '~> 2.6'

2 changes: 2 additions & 0 deletions pdf_converter/Gemfile.lock
@@ -177,6 +177,7 @@ GEM
simplecov (>= 0.22.0)
tty-which (~> 0.5.0)
virtus (~> 2.0)
rubyzip (2.4.1)
sexp_processor (4.17.4)
simplecov (0.22.0)
docile (~> 1.1)
@@ -214,6 +215,7 @@ DEPENDENCIES
rubocop (~> 1.81)
ruby-vips (~> 2.2)
rubycritic (~> 4.9)
rubyzip (~> 2.3)
simplecov (~> 0.22)
webmock (~> 3.19)

26 changes: 13 additions & 13 deletions pdf_converter/app.rb
@@ -66,21 +66,21 @@ def process_pdf_conversion(request_body, start_time, response_builder)
page_count = images.size
puts "PDF converted successfully: #{page_count} pages"

# Upload images
upload_result = ImageUploader.new.upload_images_from_files(request_body['destination'], images)
return handle_failure(upload_result, response_builder, 'Image upload', output_dir) unless upload_result[:success]
# Upload images as zip file
upload_result = ImageUploader.new.upload_images_from_files(request_body['destination'], images, unique_id)
return handle_failure(upload_result, response_builder, 'Zip upload', output_dir) unless upload_result[:success]

uploaded_urls = upload_result[:uploaded_urls]
puts "Images uploaded successfully: #{uploaded_urls.size} files"
zip_url = upload_result[:zip_url]
puts "Zip file uploaded successfully: #{zip_url}"

# Send webhook notification
notify_webhook(request_body['webhook'], unique_id, uploaded_urls, page_count, start_time)
notify_webhook(request_body['webhook'], unique_id, zip_url, page_count, start_time)

# Clean up and return success
FileUtils.rm_rf(output_dir)
response_builder.success_response(
unique_id: unique_id,
uploaded_urls: uploaded_urls,
zip_url: zip_url,
page_count: page_count,
metadata: conversion_result[:metadata]
)
@@ -104,23 +104,23 @@ def handle_failure(result, response_builder, operation, output_dir = nil)
#
# @param webhook_url [String, nil] Webhook URL
# @param unique_id [String] Unique identifier
# @param uploaded_urls [Array<String>] Uploaded image URLs
# @param zip_url [String] URL of the uploaded zip file
# @param page_count [Integer] Number of pages
# @param start_time [Float] Processing start time
def notify_webhook(webhook_url, unique_id, uploaded_urls, page_count, start_time)
def notify_webhook(webhook_url, unique_id, zip_url, page_count, start_time)
return unless webhook_url

send_webhook(webhook_url, unique_id, uploaded_urls, page_count, start_time)
send_webhook(webhook_url, unique_id, zip_url, page_count, start_time)
end

# Sends webhook notification asynchronously (non-blocking).
#
# @param webhook_url [String] The URL to send the notification to
# @param unique_id [String] Unique identifier for this conversion
# @param uploaded_urls [Array<String>] Array of uploaded image URLs
# @param zip_url [String] URL of the uploaded zip file
# @param page_count [Integer] Number of pages converted
# @param start_time [Float] Start time of the conversion process
def send_webhook(webhook_url, unique_id, uploaded_urls, page_count, start_time)
def send_webhook(webhook_url, unique_id, zip_url, page_count, start_time)
notifier = WebhookNotifier.new
end_time = Time.now.to_f
processing_time_ms = ((end_time - start_time) * 1000).to_i
@@ -129,7 +129,7 @@ def send_webhook(webhook_url, unique_id, uploaded_urls, page_count, start_time)
webhook_url: webhook_url,
unique_id: unique_id,
status: 'completed',
images: uploaded_urls,
images: zip_url,
page_count: page_count,
processing_time_ms: processing_time_ms
)
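To make the new webhook arguments concrete, the sketch below assembles the same fields that `send_webhook` passes to the notifier, with `images` now holding the single zip URL. `build_webhook_payload` is a hypothetical helper written for illustration; the body that `WebhookNotifier` actually sends is not shown in this diff, so the exact wire format is an assumption.

```ruby
# Hypothetical payload builder (not part of the PR): mirrors the keyword
# arguments passed to the notifier, using the zip URL for :images.
def build_webhook_payload(unique_id:, zip_url:, page_count:, start_time:, end_time: Time.now.to_f)
  {
    unique_id: unique_id,
    status: 'completed',
    images: zip_url,
    page_count: page_count,
    # Same elapsed-time arithmetic as send_webhook: seconds to whole milliseconds.
    processing_time_ms: ((end_time - start_time) * 1000).to_i
  }
end

payload = build_webhook_payload(
  unique_id: 'client-123',
  zip_url: 'https://s3.amazonaws.com/bucket/output.zip',
  page_count: 2,
  start_time: 100.0,
  end_time: 101.5
)
puts payload
```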
95 changes: 28 additions & 67 deletions pdf_converter/app/image_uploader.rb
@@ -8,6 +8,7 @@
require 'async/semaphore'
require_relative '../lib/retry_handler'
require_relative '../lib/url_utils'
require_relative '../lib/zip_builder'

# ImageUploader handles uploading images to S3 using pre-signed URLs
# with proper error handling, retries, and concurrent upload support
@@ -47,10 +48,11 @@ def upload(url, content, content_type = 'image/png')
error_result('Invalid URL format')
rescue StandardError => e
# Provide better error message for 403 errors
if e.message.include?('403')
error_message = e.message
if error_message.include?('403')
error_result('Access denied - URL may be expired or invalid')
else
error_result("Upload failed: #{e.message}")
error_result("Upload failed: #{error_message}")
end
end

@@ -82,28 +84,41 @@ def upload_batch(urls, images, content_type = 'image/png')
end

# Sort results by index to maintain order
results.sort_by! { |r| r[:index] }
results.sort_by! { |result| result[:index] }

successful = results.count { |r| r[:success] }
successful = results.count { |result| result[:success] }
log_info("Batch upload completed: #{successful}/#{results.size} successful")

results
end

# Uploads image files to S3 destination using pre-signed URL
# @param destination_url [String] Pre-signed S3 destination URL
# Uploads image files to S3 destination as a zip file using pre-signed URL
# @param destination_url [String] Pre-signed S3 destination URL for the zip file
# @param image_paths [Array<String>] Array of image file paths
# @return [Hash] Result with :success, :uploaded_urls, :etags, or :error
def upload_images_from_files(destination_url, image_paths)
base_uri = parse_destination_url(destination_url)
image_urls, image_contents = prepare_images_for_upload(image_paths, base_uri)
# @param unique_id [String] Unique identifier for naming images in the zip
# @return [Hash] Result with :success, :zip_url, :etag, or :error
def upload_images_from_files(destination_url, image_paths, unique_id)
log_info("Creating zip file with #{image_paths.size} images")

upload_results = upload_batch(image_urls, image_contents, 'image/png')
process_upload_results(upload_results, image_urls)
# Create zip file in memory
zip_content = ZipBuilder.create_from_images(image_paths, unique_id)

log_info("Zip file created, size: #{zip_content.bytesize} bytes")

# Upload zip file to S3
upload_result = upload(destination_url, zip_content, 'application/zip')

return { success: false, error: upload_result[:error] } unless upload_result[:success]

{
success: true,
zip_url: UrlUtils.strip_query_params([destination_url]).first,
etag: upload_result[:etag]
}
rescue StandardError => e
{
success: false,
error: "Upload error: #{e.message}"
error: "Zip upload error: #{e.message}"
}
end

@@ -165,58 +180,4 @@ def log_info(message)
def log_error(message)
@logger&.error(message) || puts("ERROR: #{message}")
end

# Parses the destination URL and returns a base URI with proper path.
#
# @param destination_url [String] Destination URL
# @return [URI] Base URI with normalized path
def parse_destination_url(destination_url)
uri = URI.parse(destination_url)
uri_path = uri.path
uri.path = uri_path.end_with?('/') ? uri_path : "#{uri_path}/"
uri
end

# Prepares image URLs and contents for batch upload.
#
# @param image_paths [Array<String>] Image file paths
# @param base_uri [URI] Base URI for uploads
# @return [Array<Array>] Two arrays: URLs and contents
def prepare_images_for_upload(image_paths, base_uri)
image_urls = []
image_contents = []

image_paths.each_with_index do |image_path, index|
image_uri = base_uri.dup
image_uri.path = "#{base_uri.path}page-#{index + 1}.png"

image_urls << image_uri.to_s
image_contents << File.read(image_path, mode: 'rb')
end

[image_urls, image_contents]
end

# Processes upload results and returns success or failure hash.
#
# @param upload_results [Array<Hash>] Upload results
# @param image_urls [Array<String>] Image URLs
# @return [Hash] Result with :success, :uploaded_urls, :etags, or :error
def process_upload_results(upload_results, image_urls)
failed_uploads = upload_results.reject { |result| result[:success] }

if failed_uploads.any?
error_messages = failed_uploads.map { |result| result[:error] }.uniq.join(', ')
return {
success: false,
error: "Failed to upload #{failed_uploads.size} images: #{error_messages}"
}
end

{
success: true,
uploaded_urls: UrlUtils.strip_query_params(image_urls),
etags: upload_results.map { |result| result[:etag] }
}
end
end
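The `zip_url` returned by `upload_images_from_files` is the destination URL with its pre-signed query string removed, produced by `UrlUtils.strip_query_params`. That helper's implementation is not part of this diff, so the following is only a plausible stand-in built on Ruby's standard `URI` library, assuming it takes and returns an array of URL strings as the call site suggests.

```ruby
require 'uri'

# Hypothetical stand-in for UrlUtils.strip_query_params (the real
# implementation is not shown here): drops the query string from each URL.
def strip_query_params(urls)
  urls.map do |url|
    uri = URI.parse(url)
    uri.query = nil
    uri.to_s
  end
end

signed = 'https://s3.amazonaws.com/bucket/output.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256'
puts strip_query_params([signed]).first
```

Stripping the query keeps signature parameters out of responses, webhook payloads, and logs.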
10 changes: 6 additions & 4 deletions pdf_converter/app/jwt_authenticator.rb
@@ -40,8 +40,9 @@ def authenticate(headers)
log_debug('Authentication successful')
{ authenticated: true, payload: validation_result[:payload] }
else
log_error("Authentication failed: #{validation_result[:error]}")
{ authenticated: false, error: validation_result[:error] }
error_message = validation_result[:error]
log_error("Authentication failed: #{error_message}")
{ authenticated: false, error: error_message }
end
end

@@ -110,11 +111,12 @@ def retrieve_secret
def build_client_config
config = { region: ENV['AWS_REGION'] || 'us-east-1' }

return config unless ENV['AWS_ENDPOINT_URL']
endpoint_url = ENV['AWS_ENDPOINT_URL']
return config unless endpoint_url

# Configure for LocalStack testing environment
config.merge(
endpoint: ENV['AWS_ENDPOINT_URL'],
endpoint: endpoint_url,
access_key_id: ENV['AWS_ACCESS_KEY_ID'] || 'test',
secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'] || 'test',
ssl_verify_peer: false