Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
222 changes: 222 additions & 0 deletions PAGINATION_ANALYSIS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
# Pagination Performance Analysis for PR #33

## Summary

After investigating the pagination changes in PR #33, I've identified three performance issues that cause syncs to be slower, especially noticeable in certain scenarios.

## Issues Identified

### 1. Cache Invalidation (First Sync After Update) ⚠️

**Impact**: Severe slowdown on first sync after updating

**Root Cause**: The pagination implementation changed all API endpoint URLs by adding query parameters:
- **Before**: `/user/following`, `/users/bob/starred`
- **After**: `/user/following?page=1&per_page=100`, `/users/bob/starred?page=1&per_page=100`

**Problem**: The GitHub client uses path-based ETag caching (github/client.go:168-173). When paths change, all cached ETags become useless. This means:
- The first sync after updating to PR #33 hits GitHub's API for **every endpoint** instead of returning 304 Not Modified
- No benefit from conditional requests on first sync
- For users following 100 people, this is 1 + (3 × 100) = **301 API calls** instead of potentially just a few

**Evidence**: See cache lookup in github/client.go:168-173

### 2. Extra API Call for Exactly 100 Items ⚠️

**Impact**: Moderate slowdown for users with exactly 100 items in any category

**Root Cause**: The pagination logic in `getPaginated()` (github/client.go:301-333) continues fetching when it gets exactly `perPage` items:

```go
// If we got fewer results than per_page, this is the last page
if pageSlice.Len() < perPage {
break
}
```

**Problem**: When a result set has exactly 100 items (or any multiple of 100), the code:
1. Fetches page 1 → gets 100 items
2. Sees 100 == 100 (perPage), so continues
3. Fetches page 2 → gets 0 items
4. Only then stops

**Example Scenario**: A user who follows exactly 100 people:
- Expected: 1 API call
- Actual: 2 API calls (100% increase)

**Per-User Impact**: For each followed user, we call 3 methods:
- `GetStarredReposByUsername`
- `GetOwnedReposByUsername`
- `GetRecentEvents`

If any of these returns exactly 100 items, that's an extra API call. For 100 followed users, this could add up to 300 extra API calls in the worst case.

**Evidence**: Test in github/client_test.go:637-690 explicitly expects 2 requests for exactly 100 items (see comment on line 687)

### 3. No Incremental Caching for Multi-Page Results

**Impact**: Minor but cumulative

**Root Cause**: GitHub only returns Link headers for pagination, not stable ETags across page boundaries.

**Problem**: Users with > 100 items in any category can't benefit from conditional requests for pages beyond the first. Each page is a new URL that's never been cached.

**Example**: A user with 250 followed users:
- Page 1: `/user/following?page=1&per_page=100` (can be cached)
- Page 2: `/user/following?page=2&per_page=100` (new path, no cache benefit)
- Page 3: `/user/following?page=3&per_page=100` (new path, no cache benefit)

## Real-World Impact

### Single User Scenario (following 5 people)

**Before PR #33**:
- Fetch followed users: 1 API call
- For each of 5 users: 3 API calls
- Total: 16 API calls
- With ETag caching on subsequent syncs: potentially just 16 × 304 responses

**After PR #33** (first sync):
- Fetch followed users: 1 API call (cache miss due to path change)
- For each of 5 users: 3 API calls (all cache misses)
- Total: 16 API calls **with no cache hits** → full data transfers

**After PR #33** (subsequent syncs):
- Same as before (16 calls), but paths are now cached again
- Performance returns to normal after first sync

### User with Exactly 100 Followed Users

**Before PR #33**:
- Total API calls: ~301 (1 + 3×100)

**After PR #33**:
- Followed users endpoint: 2 calls (100 users, then check page 2)
- Per user: potentially 2 calls if starred repos = 100, 2 if owned repos = 100, 2 if events = 100
- Worst case: 2 + (6 × 100) = **602 API calls** (100% increase!)
- Typical case: 2 + (4 × 100) = **402 API calls** (33% increase)

## Proposed Fixes

### Fix #1: Preserve Cache Keys (Migration Strategy)

**Option A - Maintain Backward Compatibility**:
- Store cache entries for both old and new path formats during transition
- Check both `/user/following` and `/user/following?page=1&per_page=100` caches
- Gradually migrate over time

**Option B - Accept One-Time Cache Miss**:
- Document that first sync after update will be slower
- This is a one-time cost and resolves itself automatically

**Recommendation**: Option B with documentation. The cache miss is a one-time event and fixing it adds complexity.

### Fix #2: Optimize Exactly-100 Detection ⭐ PRIORITY

**Current Logic**:
```go
if pageSlice.Len() < perPage {
break
}
```

**Proposed Fix**: Use GitHub's Link header to detect if there's a next page, instead of assuming there is when we get exactly `perPage` items.

**Alternative Fix** (simpler): Special case handling for exactly perPage:
```go
// If we got fewer than per_page, this is the last page
if pageSlice.Len() < perPage {
break
}

// If we got exactly per_page items, peek at the next page
// but only if we haven't made too many requests
if pageSlice.Len() == perPage {
// Could check Link header here instead of making another request
// GitHub provides: Link: <url?page=2>; rel="next"
}
```

**Impact**: Reduces API calls by up to 50% for users with exactly 100 items in any category.

### Fix #3: Implement Link Header Pagination ⭐ BEST SOLUTION

**Proper Approach**: Parse GitHub's `Link` header to determine if there are more pages:

```go
// Check for Link header with rel="next"
linkHeader := resp.Header.Get("Link")
hasNext := strings.Contains(linkHeader, `rel="next"`)

if !hasNext || pageSlice.Len() == 0 {
break
}
```

**Benefits**:
- Eliminates extra API calls for exactly 100 items
- More reliable than guessing based on result count
- Standard GitHub API practice
- No assumptions about page size

**Impact**: Fixes issue #2 completely, reducing unnecessary API calls by up to 300 for users following many people.

## Recommendation

1. **Immediate**: Implement Link header pagination (Fix #3) - this is the proper solution and fixes issue #2 completely
2. **Document**: Note that first sync after update will be slower due to cache invalidation (issue #1)
3. **Consider**: Future optimization for multi-page caching (issue #3) only if it becomes a bottleneck

## Test Case to Add

```go
func TestGetPaginatedNoExtraCallForExactly100Items(t *testing.T) {
// Test that we don't make an extra API call when exactly 100 items exist
// Should use Link header to detect end of pagination
repos := make([]Repository, 100)
for i := 0; i < 100; i++ {
repos[i] = Repository{ID: int64(i), Name: fmt.Sprintf("repo%d", i)}
}

requestCount := 0
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestCount++
page := r.URL.Query().Get("page")

if page == "1" {
w.Header().Set("Content-Type", "application/json")
// No Link header with rel="next" means this is the last page
json.NewEncoder(w).Encode(repos)
} else {
t.Errorf("should not request page %s when page 1 has no next link", page)
}
}))
defer server.Close()

c := NewClient("token", WithBaseURL(server.URL))
result, err := c.GetStarredRepos(context.Background())

require.NoError(t, err)
assert.Equal(t, 100, len(result))
assert.Equal(t, 1, requestCount, "should only make 1 request when Link header indicates no next page")
}
```

## Files to Modify

1. **github/client.go**: Update `getPaginated()` to check Link header (lines 286-338)
2. **github/client.go**: Update `get()` method to return Link header or parse it
3. **github/client_test.go**: Update test expectations (line 687-689)
4. **github/client_test.go**: Add new test for Link header pagination

## API Call Comparison

| Scenario | Before PR #33 | After PR #33 (current) | After Fix |
|----------|---------------|------------------------|-----------|
| 5 followed users (first sync) | 16 | 16 | 16 |
| 5 followed users (cached) | 16 (mostly 304s) | 16 (mostly 304s after first) | 16 (mostly 304s) |
| 100 followed users, all with ~50 items each | 301 | 301 | 301 |
| 100 followed users, all with exactly 100 items | 301 | 602 | 301 |
| **100 followed users, 50 with exactly 100 items** | **301** | **451** | **301** |

**Worst case impact**: Currently adding **150 extra API calls** (50% increase) for users in the exactly-100-items scenario.
Loading
Loading