Skip to content

Latest commit

 

History

History
executable file
·
456 lines (343 loc) · 10.3 KB

File metadata and controls

executable file
·
456 lines (343 loc) · 10.3 KB

Outlook Handler - Pagination Fix

The Problem

Just like Gmail, the Outlook handler had pagination limits that prevented it from processing all matching emails!

Before (Limited)

params = {
    '$filter': filter_query,
    '$top': 100  # Only first 100 emails!
}

With large mailboxes, this meant:

  • ❌ Only 100 promotions processed (even if you have 1000+)
  • ❌ Only 100 job alerts processed
  • ❌ Only 100 social media notifications processed
  • ❌ Missing thousands of emails that should be cleaned up

After (Complete)

while True:
    # Process batch of 100
    # Get next page link
    next_link = data.get('@odata.nextLink')
    if not next_link:
        break  # All pages processed!

Now processes ALL matching emails across all pages! 🎉


How Microsoft Graph Pagination Works

Microsoft Graph API uses OData pagination:

Initial Request

GET https://graph.microsoft.com/v1.0/me/messages?$top=100&$filter=...

Response Includes Next Link

{
  "value": [ /* 100 messages */ ],
  "@odata.nextLink": "https://graph.microsoft.com/v1.0/me/messages?$skip=100&$top=100&$filter=..."
}

Follow Next Link

GET https://graph.microsoft.com/v1.0/me/messages?$skip=100&$top=100&$filter=...

Continue Until No More Pages

{
  "value": [ /* last batch of messages */ ]
  // No @odata.nextLink = we're done!
}

What Was Fixed

1. Delete Emails (_delete_emails_by_filter)

Before:

# Only gets first 100 emails
response = requests.get(url, headers=headers, params=params)
messages = response.json().get('value', [])
# Process 100 messages and stop

After:

while True:
    # Get batch of 100
    if next_link:
        response = requests.get(next_link, headers=headers)
    else:
        response = requests.get(url, headers=headers, params=params)
    
    messages = data.get('value', [])
    # Process all messages in batch
    
    # Get next page
    next_link = data.get('@odata.nextLink')
    if not next_link:
        break  # All pages done!

2. Archive Emails (_archive_emails_by_filter)

Same pagination logic added to archiving operations.

3. Find Important Emails (_find_important_emails)

Now continues fetching until:

  • 100 important emails found (limit for calendar events), OR
  • No more pages available

Updated Methods

All three core methods now support pagination:

Method Before After
_delete_emails_by_filter First 100 only All matching emails
_archive_emails_by_filter First 100 only All matching emails
_find_important_emails First 50 only Up to 100 (with pagination)

Progress Tracking

Just like Gmail, you'll now see real-time progress:

Searching for old promotional emails...
  Processing batch of 100 messages (total found so far: 100)...
    Progress: 100 messages processed...
  Processing batch of 100 messages (total found so far: 200)...
    Progress: 200 messages processed...
  Processing batch of 100 messages (total found so far: 300)...
    Progress: 300 messages processed...
  Processing batch of 87 messages (total found so far: 387)...
  Would delete 387 old promotional emails

Performance Comparison

Before (Limited)

Outlook Processing Summary:
  Promotions deleted: 100        ← Hit limit!
  Job alerts deleted: 100        ← Hit limit!
  Social media deleted: 100      ← Hit limit!
  App acknowledgements: 100      ← Hit limit!

After (Complete)

Outlook Processing Summary:
  Promotions deleted: 428        ← All found!
  Job alerts deleted: 419        ← All found!
  Social media deleted: 892      ← All found!
  App acknowledgements: 156      ← All found!

API Rate Limits

Microsoft Graph Limits

  • Throttling: 10,000 requests per 10 minutes per user
  • Typical operations: 1-2 requests per email (fetch + delete/archive)
  • For 1000 emails: ~2000 requests = well under limit

Processing Time Estimates

  • Small mailbox (500 emails): 1-2 minutes
  • Medium mailbox (2000 emails): 5-10 minutes
  • Large mailbox (10,000 emails): 30-60 minutes

Rate Limit Handling

If you hit rate limits, you'll see:

Error fetching old promotions: 429

429 = Too Many Requests

Current implementation will:

  • Log the error
  • Stop processing that category
  • Continue with other categories

Future enhancement: Could add automatic retry with exponential backoff.


Batch Size

Outlook API processes in batches of 100:

'$top': 100  # Max per page for Outlook

Why 100?

  • Microsoft Graph recommended batch size
  • Good balance between:
    • Network overhead (fewer requests)
    • Memory usage (not too many at once)
    • Progress visibility (see updates frequently)

Could increase to 500?

  • No! Graph API typically limits to 100-200
  • Requests with $top > 999 will fail
  • 100 is the sweet spot

Testing the Fix

1. Run Dry-Run Mode

cd ~/email-manager
source venv/bin/activate
python email_manager.py --provider outlook

2. Watch for Pagination

Look for output like:

Processing batch of 100 messages (total found so far: 100)...
Processing batch of 100 messages (total found so far: 200)...
Processing batch of 100 messages (total found so far: 300)...

If you see this pattern, pagination is working!

3. Check Final Numbers

Before fix:

Promotions deleted: 100  ← Suspiciously round
Job alerts deleted: 100   ← Suspiciously round

After fix:

Promotions deleted: 387  ← Real number!
Job alerts deleted: 219   ← Real number!

Code Changes Summary

Added to Each Method:

  1. Next link tracking
next_link = None
total_found = 0
  1. Pagination loop
while True:
    if next_link:
        response = requests.get(next_link, headers=headers)
    else:
        response = requests.get(url, headers=headers, params=params)
    
    # Process batch
    
    next_link = data.get('@odata.nextLink')
    if not next_link:
        break
  1. Progress logging
total_found += len(messages)
logger.info(f"Processing batch of {len(messages)} (total: {total_found})...")

if processed % 100 == 0:
    logger.info(f"Progress: {processed} messages processed...")

Comparison: Gmail vs Outlook Pagination

Gmail (Uses pageToken)

results = self.service.users().messages().list(
    userId='me',
    q=query,
    maxResults=500,
    pageToken=page_token  # Gmail's pagination token
).execute()

page_token = results.get('nextPageToken')

Outlook (Uses @odata.nextLink)

response = requests.get(url, headers=headers, params={
    '$filter': filter_query,
    '$top': 100
})

data = response.json()
next_link = data.get('@odata.nextLink')  # Outlook's next page URL

Both achieve the same result: Process ALL matching emails!


Troubleshooting

Issue: Still seeing exactly 100 results

Check:

  1. Are you using the updated handler?

    grep "@odata.nextLink" ~/email-manager/outlook_handler.py

    Should return matches!

  2. Do you actually have more than 100?

    • Run query manually to verify
    • Check different date ranges

Issue: Taking too long

This is normal for large mailboxes!

  • Outlook API processes ~2-5 emails/second
  • 1000 emails = 3-8 minutes
  • Be patient and watch the progress logs

Issue: 429 Rate Limit Error

You're processing too fast!

  • Outlook has rate limits
  • Wait a few minutes
  • Run again
  • Consider processing one category at a time

Issue: Network timeout

Long-running operation timed out:

  • Your network connection dropped
  • Run again - it will continue where filters allow
  • Consider shorter date ranges

Implementation Notes

Why Not Use $skip?

You might think: "Why not use $skip to paginate?"

# DON'T DO THIS:
for skip in range(0, 10000, 100):
    params = {'$skip': skip, '$top': 100}

Problems:

  1. Inefficient: Each request still scans from the start
  2. Unreliable: Emails can be deleted mid-pagination, causing skips
  3. Not recommended: Microsoft explicitly says use @odata.nextLink

Use @odata.nextLink instead:

  1. Efficient: Server maintains cursor position
  2. Reliable: Consistent results even if data changes
  3. Recommended: Official Microsoft best practice

Why Process in Loop?

while True:
    # Process batch
    if not next_link:
        break

Alternative (BAD):

# Get all emails first, then process
all_emails = []
while next_link:
    all_emails.extend(get_batch())
# Now process all_emails

Why the loop approach is better:

  1. Memory efficient: Don't load 10,000 emails into memory
  2. Faster feedback: See progress immediately
  3. Failure recovery: If it crashes, some work is done
  4. Rate limit friendly: Can add delays between batches

Summary

Before Fix

  • ❌ Only processed first 100 emails per category
  • ❌ Missed thousands of emails in large mailboxes
  • ❌ No progress tracking
  • ❌ Suspiciously round numbers (100, 100, 100)

After Fix

  • ✅ Processes ALL matching emails
  • ✅ Handles mailboxes of any size
  • ✅ Real-time progress tracking
  • ✅ Accurate totals (387, 219, 892, etc.)

What You Get

Before: "Found 100 old promotions"
After:  "Processing batch of 100... (total: 100)
        Processing batch of 100... (total: 200)
        Processing batch of 100... (total: 300)
        Processing batch of 87... (total: 387)
        Would delete 387 old promotions"

Update Instructions

cd ~/email-manager

# Backup current file
cp outlook_handler.py outlook_handler.py.backup

# Copy updated file
cp /mnt/user-data/outputs/outlook_handler.py ~/email-manager/

# Test with dry-run
source venv/bin/activate
python email_manager.py --provider outlook

# Look for pagination in logs
grep "Processing batch" email_manager.log

# When confident
python email_manager.py --provider outlook --live

Benefits

🎉 Complete email processing - No more missed emails
🎉 Progress visibility - Know exactly what's happening
🎉 Large mailbox support - Handles thousands of emails
🎉 Better statistics - Accurate counts, not limits

Your Outlook account will now be just as thoroughly cleaned as Gmail! 🚀