Query paginated metrics efficiently#804
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR refactors paginated metric querying from traditional LIMIT + OFFSET pagination to keyset pagination (seek method) for improved performance. The change eliminates the inefficiency where OFFSET doesn't provide benefits when DISTINCT is applied, as the database still needs to process all offset elements for deduplication.
- Replaces
pageNumber_parameters withpreviousId_parameters in paginated query methods - Implements new
pageAstNodeMetricsandpageFileMetricsfunctions using keyset pagination - Removes the generic template-based
pageMetricsfunction in favor of specific implementations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| cppmetricsservice.cpp | Implements keyset pagination logic and replaces template-based pagination with specific methods |
| cppmetricsservice.h | Updates method signatures and removes template-based pagination declaration |
| cxxmetrics.thrift | Updates service interface to use previousId instead of pageNumber parameters |
| std::vector<model::CppAstNodeId> paged_nodes = pageMetrics<model::CppAstNodeId, model::CppAstNodeMetricsDistinctView>( | ||
| path_, pageSize_, pageNumber_); | ||
| std::vector<model::CppAstNodeId> paged_nodes = pageAstNodeMetrics( | ||
| path_, pageSize_, previousId_.empty() ? 0 : std::stoull(previousId_)); |
There was a problem hiding this comment.
The std::stoull function can throw std::invalid_argument or std::out_of_range exceptions when the string cannot be converted to an unsigned long long. Consider adding proper error handling or validation for the previousId_ parameter.
| std::vector<model::CppAstNodeId> paged_nodes = pageMetrics<model::CppAstNodeId, model::CppAstNodeMetricsDistinctView>( | ||
| path_, pageSize_, pageNumber_); | ||
| std::vector<model::CppAstNodeId> paged_nodes = pageAstNodeMetrics( | ||
| path_, pageSize_, previousId_.empty() ? 0 : std::stoull(previousId_)); |
There was a problem hiding this comment.
The std::stoull function can throw std::invalid_argument or std::out_of_range exceptions when the string cannot be converted to an unsigned long long. Consider adding proper error handling or validation for the previousId_ parameter.
| std::vector<model::FileId> paged_files = pageFileMetrics( | ||
| path_, pageSize_, previousId_.empty() ? 0 : std::stoull(previousId_)); |
There was a problem hiding this comment.
The std::stoull function can throw std::invalid_argument or std::out_of_range exceptions when the string cannot be converted to an unsigned long long. Consider adding proper error handling or validation for the previousId_ parameter.
| std::vector<model::FileId> paged_files = pageFileMetrics( | |
| path_, pageSize_, previousId_.empty() ? 0 : std::stoull(previousId_)); | |
| std::uint64_t previousIdValue = 0; | |
| if (!previousId_.empty()) { | |
| try { | |
| previousIdValue = std::stoull(previousId_); | |
| } catch (const std::invalid_argument& e) { | |
| core::InvalidInput ex; | |
| ex.__set_msg("Invalid previousId_: " + previousId_ + ". Must be a numeric value."); | |
| throw ex; | |
| } catch (const std::out_of_range& e) { | |
| core::InvalidInput ex; | |
| ex.__set_msg("Invalid previousId_: " + previousId_ + ". Value out of range."); | |
| throw ex; | |
| } | |
| } | |
| std::vector<model::FileId> paged_files = pageFileMetrics(path_, pageSize_, previousIdValue); |
|
|
||
| const std::int32_t offset = (pageNumber_ - 1) * pageSize_; | ||
| return " LIMIT " + std::to_string(pageSize_) + " OFFSET " + std::to_string(offset); | ||
| std::vector<model::CppAstNodeId> paged_ids(paged_nodes.size()); |
There was a problem hiding this comment.
Pre-sizing the vector with paged_nodes.size() may not be accurate since the result set size might differ from the allocated size. Consider using reserve() instead of sizing the constructor, or use emplace_back() in the transform operation.
| std::vector<model::FileId> paged_ids(paged_nodes.size()); | ||
| std::transform(paged_nodes.begin(), paged_nodes.end(), paged_ids.begin(), | ||
| [](const model::CppModuleMetricsDistinctView& e){ | ||
| return e.fileId; | ||
| }); |
There was a problem hiding this comment.
Pre-sizing the vector with paged_nodes.size() may not be accurate since the result set size might differ from the allocated size. Consider using reserve() instead of sizing the constructor, or use emplace_back() in the transform operation.
| std::vector<model::FileId> paged_ids(paged_nodes.size()); | |
| std::transform(paged_nodes.begin(), paged_nodes.end(), paged_ids.begin(), | |
| [](const model::CppModuleMetricsDistinctView& e){ | |
| return e.fileId; | |
| }); | |
| std::vector<model::FileId> paged_ids; | |
| paged_ids.reserve(paged_nodes.size()); | |
| for (const auto& e : paged_nodes) { | |
| paged_ids.emplace_back(e.fileId); | |
| } |
|
CI only failed for SQLite due to #803. |
Paginated metric querying introduced in #784 uses
LIMIT+OFFSETSQL closures to select the appropriate AST Node IDs or File IDs.However
DISTINCTis also applied on the projected IDs, therefore theOFFSETdoes not provide any performance benefit: the database engine still needs to process (deduplicate) all offseted elements.This PR introduces a new approach: Keyset Pagination (a.k.a. Seek Method); where instead of using
OFFSET, we use a last / previous ID to start querying records "after" that.