MDEV-31956 SSD based InnoDB buffer pool extension #4510
base: 11.8
Conversation
ee7d993 to bc5b76d
Before I review this deeper, could you please fix the build and the test innodb.ext_buf_pool on Microsoft Windows?
99921cf to 64eb035
@@ -0,0 +1 @@
--innodb-buffer-pool-size=21M --innodb-extended-buffer-pool-size=1M
Typically, the local SSD is larger than the RAM, and therefore it would be more interesting to test a scenario where the buffer pool is smaller than the extended buffer pool. I understand that our current CI setup is not suitable for any performance testing. But, currently we only cover this functionality in debug-instrumented executables.
Would it be possible to write some sort of a test, or extend an existing one (such as encryption.innochecksum) with a variant that would use the minimal innodb_buffer_pool_size and a larger innodb_extended_buffer_pool_size? Currently, that test is running with innodb_buffer_pool_size=64M.
The test flushes a small number of pages into the extended buffer pool file, which is why there was no reason to set a big file size. It probably makes sense to develop a separate test where we set the minimal innodb_buffer_pool_size and a larger innodb_extended_buffer_pool_size, but what exactly are we going to test in that separate test?
I was thinking that it would be useful to mimic a typical deployment scenario where we expect some pages to be written to the local SSD and read back from there. Sure, mtr tests are limited in size and not good for performance testing. Should there be any race conditions, having a test run on multiple platforms for every push would help catch them over time.
db5856a to c78f5ac
dr-m left a comment
This is a partial review.
dr-m left a comment
Here is another incomplete review.
dr-m left a comment
Here are some more comments. I’m doing this review in batches so that you can work concurrently on addressing some of these comments. Everything seems to be relatively minor stuff.
fil_space_t *space;
if (ext_buf())
{
  ut_d(fil_space_t *debug_space=) space=
      fil_space_t::get(buf_page->id().space());
Do we really need to declare the variable space this early? As far as I can tell, it’s only truly needed if !ext_buf().
In the case when ext_buf() returns true we invoke fil_space_t::get(...), which increases the space's reference counter. We need to decrease it, and we do that at the end of IORequest::read_complete().
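For illustration, a minimal, self-contained sketch of that acquire/release pairing; Space, get(), release() and the read_complete() function below are simplified stand-ins for fil_space_t and IORequest::read_complete(), not the actual server code.

#include <atomic>
#include <cassert>

// Stand-in for fil_space_t: get() bumps a reference count that a later
// release() must drop, mirroring the pairing described above.
struct Space
{
  std::atomic<unsigned> refs{0};
  Space *get() noexcept
  {
    refs.fetch_add(1, std::memory_order_relaxed);
    return this;
  }
  void release() noexcept
  {
    unsigned prev= refs.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev > 0);
    (void) prev;
  }
};

// Models the completion handler: the reference acquired early (because the
// request targets the external buffer pool) is dropped at the very end.
void read_complete(Space &s, bool ext_buf)
{
  Space *space= nullptr;
  if (ext_buf)
    space= s.get();      // acquired early, as in the quoted diff
  /* ... perform the read-completion work ... */
  if (space)
    space->release();    // balanced at the end of the handler
}

int main()
{
  Space s;
  read_complete(s, true);
  assert(s.refs == 0);   // every get() was balanced by a release()
}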
/* We must hold buf_pool.mutex while releasing the block, so that
no other thread can access it before we have freed it. */
mysql_mutex_lock(&buf_pool.mutex);
buf_page->write_complete_release(buf_page->state());
buf_LRU_free_page(buf_page, true, ext_buf_page());
mysql_mutex_unlock(&buf_pool.mutex);
Is there any alternative to this, such as holding an exclusive latch on the buf_pool.page_hash slice?
Please add a FIXME comment to implement deferred, batched freeing, similar to the one we do in the normal LRU flushing/eviction in the page cleaner thread.
This would be an obvious performance bottleneck when there is some write activity to the external buffer pool going on. IORequest::write_complete() is expected to be called from multiple threads concurrently.
But we already have the bottleneck for temporary tables, see this code.
I added the FIXME comment.
Thank you for adding the FIXME comment. Yes, we have a similar existing bottleneck, but it only affects the temporary tablespace, which hopefully will very seldom be subjected to any LRU flushing. My expectation is that temporary tables are very short-lived, and hence any modified pages will soon be marked as freed in the tablespace, and therefore buf_page_t::flush() would normally skip any writes altogether:

buf_pool.release_freed_page(this);
return false;

This new bottleneck would affect all writes to the external buffer pool file.
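For reference, a minimal sketch of the deferred, batched freeing that the FIXME alludes to; PendingFree, add() and drain() are hypothetical names, and the only point is that a single acquisition of the global mutex (standing in for buf_pool.mutex) can free a whole batch of completed pages instead of one page per write completion.

#include <mutex>
#include <vector>

struct Page { int dummy; };

// Hypothetical holding area: write completions park pages here under a
// cheap local lock instead of taking the global mutex once per page.
struct PendingFree
{
  std::mutex m;
  std::vector<Page*> pages;

  void add(Page *p)
  {
    std::lock_guard<std::mutex> g(m);
    pages.push_back(p);
  }

  // Called periodically (e.g. by a cleaner thread): one acquisition of the
  // expensive global mutex frees the whole accumulated batch.
  template<typename Mutex, typename FreeFn>
  void drain(Mutex &global_mutex, FreeFn free_page)
  {
    std::vector<Page*> batch;
    {
      std::lock_guard<std::mutex> g(m);
      batch.swap(pages);
    }
    std::lock_guard<Mutex> g(global_mutex);
    for (Page *p : batch)
      free_page(p);      // e.g. the buf_LRU_free_page() equivalent
  }
};

int main()
{
  PendingFree pending;
  std::mutex buf_pool_mutex;   // stand-in for buf_pool.mutex
  Page p{0};
  pending.add(&p);
  pending.drain(buf_pool_mutex, [](Page*) { /* free the page */ });
}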
47b10e0 to 9a06eec
dr-m left a comment
I have yet to review the changes to buf0flu.cc, buf0lru.cc and buf0rea.cc.
dr-m left a comment
I think I only have buf0flu.cc left.
auto fio= space->io(IORequest(IORequest::READ_SYNC),
auto fio=
  init_page_result.in_ext_buffer_pool()
  ? fil_io_t{fil_system.ext_bp_io(
        *bpage, *init_page_result.ext_buf_page,
        IORequest::READ_SYNC, nullptr, len, dst),
      nullptr}
  : space->io(IORequest(IORequest::READ_SYNC),
              os_offset_t{page_id.page_no()} * len, len, dst, bpage);
*err= fio.err;
thd_wait_end(thd);
if (init_page_result.in_ext_buffer_pool())
{
  mysql_mutex_lock(&buf_pool.mutex);
  buf_pool.free_ext_page(*init_page_result.ext_buf_page);
  mysql_mutex_unlock(&buf_pool.mutex);
}
We are checking for init_page_result.ext_buf_page twice. Maybe it’s acceptable.
Thinking aloud: Could we free the block before reading it? As far as I can tell, the call buf_pool.ext_page_offset() inside fil_system.ext_bp_io() is not dereferencing the object. The answer is no, because if we freed the block first, then it could be reallocated for something else and the underlying file block could be overwritten before we get a chance to read the desired contents.
Also thinking aloud: Could we avoid the buf_pool.mutex here, and avoid having the buf_pool.ext_free list as well? We would simply identify freed blocks with an atomic acquire/release version of id_.is_corrupted() and id_.set_corrupted(). In that way, freeing blocks would be fast, but allocation would have to sweep the entire extended buffer pool until an empty slot is found. There could perhaps be a "cursor" to quickly locate the next available empty slot. Maybe you could add a FIXME or TODO comment about this somewhere, for example next to the definition of buf_pool.ext_free? This would also address my earlier comment about the new scalability bottleneck in IORequest::read_complete().
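A rough sketch of that free-list-less allocation, assuming per-slot atomic flags and a sweeping cursor; ExtSlotPool, try_alloc() and free_slot() are invented names, and the real code would presumably reuse id_.is_corrupted()/id_.set_corrupted() instead of a separate flag array.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pool of external buffer pool slots. A slot is free when its
// flag is clear; there is no free list and no global mutex.
class ExtSlotPool
{
  std::vector<std::atomic<bool>> used;
  std::atomic<size_t> cursor{0};   // hint where the next sweep starts

public:
  explicit ExtSlotPool(size_t n) : used(n) {}

  // Freeing is a single release store, as cheap as suggested above.
  void free_slot(size_t i) noexcept
  {
    used[i].store(false, std::memory_order_release);
  }

  // Allocation sweeps from the cursor until an empty slot is claimed.
  // Returns the slot index, or SIZE_MAX if the pool is full.
  size_t try_alloc() noexcept
  {
    const size_t n= used.size();
    const size_t start= cursor.load(std::memory_order_relaxed);
    for (size_t k= 0; k < n; k++)
    {
      const size_t i= (start + k) % n;
      bool expected= false;
      if (used[i].compare_exchange_strong(expected, true,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed))
      {
        cursor.store(i + 1, std::memory_order_relaxed);
        return i;
      }
    }
    return SIZE_MAX;
  }
};

int main()
{
  ExtSlotPool pool(8);
  const size_t slot= pool.try_alloc();   // claims slot 0
  if (slot != SIZE_MAX)
    pool.free_slot(slot);                // returned without any global lock
}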
*err= bpage->read_complete(*fio.node, recv_sys.recovery_on);
*err= bpage->read_complete(init_page_result.in_ext_buffer_pool()
                           ? *UT_LIST_GET_FIRST(space->chain)
                           : *fio.node,
                           recv_sys.recovery_on);
What if we did not replace IORequest::node with a union, but just added another field ext_buf_page_t *const ext_buf_page? In that way, we could avoid this change altogether.
{
  buf_pool.free_ext_page(*ext_buf_page);
  ext_buf_page= nullptr;
}
mysql_mutex_unlock(&buf_pool.mutex);
As noted elsewhere, if we removed buf_pool.ext_free, this could be done outside the critical section of buf_pool.mutex. But, since this is an unlikely code path (the extended buffer pool was recently disabled due to an I/O error), I think that we can live with this.
bpage= static_cast<buf_page_t*>(ut_zalloc_nokey(sizeof *bpage));
bpage=
    static_cast<buf_page_t *>(ut_zalloc_nokey(sizeof *bpage));

// TODO: do we need to init it for compressed pages? I think no.
page_zip_des_init(&bpage->zip);
You are right, the page_zip_des_t::clear() is redundant here, and the entire macro declaration could be removed. This is because ut_zalloc_nokey() is already zero-initializing the entire bpage. The bpage->zip.clear() would zero-initialize everything except the buffer-fix field. Can you push this code removal as a separate pull request against 10.11?
dr-m left a comment
I completed my first round of review. I would suggest that you reply to each comment that you intend to fully address, and mark those comments as resolved.
There are some "nice to have" comments that can be addressed later. I would request to leave them as unresolved, so that we can easily find them when working on improving the performance.
Many of my comments are rather trivial to address and do not prevent this from being approved. After all, this has already been subjected to stress testing, and a number of bugs have already been fixed.
union
{
  /** File descriptor */
  fil_node_t *const node_ptr= nullptr;
  /** External buffer pool page if the request is for external buffer pool
  file, i.e. the lowest bit of bpage_ptr is set, nullptr otherwise */
  ext_buf_page_t *const ext_buf_page_ptr;
};

/** Page to be written on write operation, the lowest bit shows if the
request is for external buffer pool or not */
buf_page_t *const bpage_ptr= nullptr;

public:
/** Page to be written on write operation */
buf_page_t *const bpage= nullptr;

/** Memory to be used for encrypted or page_compressed pages */
buf_tmp_buffer_t *const slot= nullptr;

/** File descriptor */
fil_node_t *const node= nullptr;

buf_page_t *bpage() const noexcept
{
  return reinterpret_cast<buf_page_t *>(
      reinterpret_cast<ptrdiff_t>(bpage_ptr) & ~ptrdiff_t(1));
};

bool ext_buf() const noexcept
{
  return reinterpret_cast<ptrdiff_t>(bpage_ptr) & 1;
}
Instead of replacing node with a union and repurposing a bit of bpage we could leave those data members alone and simply add an ext_buf_page that would normally be nullptr. The sizeof(IORequest) is only 32 bytes on 64-bit systems, including 32 bits of padding around type. It should be no problem at all to let it grow by a single pointer.
Yes, the size of IORequest is 32 bytes, the same as MAX_AIO_USERDATA_LEN, i.e. the size of the aiocb::m_userdata buffer that the IORequest object must fit into. We can, of course, increase MAX_AIO_USERDATA_LEN by one more sizeof(void*), as well as the size of the write_slots and read_slots arrays. But it will cost a memcpy.
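To visualize the trade-off, a simplified sketch of the two layouts; Node, BufPage, ExtBufPage and Slot are placeholders rather than the real InnoDB types, and the 32-byte figure assumes a 64-bit build.

#include <cstdint>

struct Node; struct BufPage; struct ExtBufPage; struct Slot;

// Roughly the current shape: four pointer-sized members plus the type,
// 32 bytes on a 64-bit system, with the external-page pointer folded into
// a union and a tag bit of the page pointer.
struct RequestCurrent
{
  union { Node *node; ExtBufPage *ext_buf_page; };
  BufPage *bpage;     // lowest bit doubles as the ext-buf tag
  Slot *slot;
  uint32_t type;      // padded up to 8 bytes
};

// The alternative: leave node and bpage untouched and add one more pointer
// that is normally nullptr.
struct RequestWithExtraField
{
  Node *node;
  BufPage *bpage;
  Slot *slot;
  ExtBufPage *ext_buf_page;
  uint32_t type;
};

static_assert(sizeof(void*) != 8 || sizeof(RequestCurrent) == 32,
              "current layout fits a 32-byte user-data area");
static_assert(sizeof(RequestWithExtraField) > sizeof(RequestCurrent),
              "the extra field grows the per-request copy");

int main() {}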
if (UNIV_UNLIKELY(type != buf_page_t::PERSISTENT) && UNIV_LIKELY(!error))
{
  if (type == buf_page_t::EXT_BUF)
  {
    ut_d(if (DBUG_IF("ib_ext_bp_count_io_only_for_t")) {
      if (fil_space_t *space= fil_space_t::get(bpage->id_.space()))
      {
        auto space_name= space->name();
        if (space_name.data() &&
            !strncmp(space_name.data(), "test/t.ibd", space_name.size()))
        {
          ++buf_pool.stat.n_pages_written_to_ebp;
        }
        space->release();
      }
    } else)
      ++buf_pool.stat.n_pages_written_to_ebp;
  }
  /* We must hold buf_pool.mutex while releasing the block, so that
  no other thread can access it before we have freed it. */
  mysql_mutex_lock(&buf_pool.mutex);
  bpage->write_complete(persistent, error, state);
  buf_LRU_free_page(bpage, true);
  bpage->write_complete(type, error, state);
Can we use a normal non-atomic counter here, and increment it inside bpage->write_complete(), inside the critical section of buf_pool.mutex? I think it should be more efficient. Also, if we had an IORequest::node always present, we could simplify the debug injection and just refer to node->space in the debug injection.
I think that we’d better assign bpage->oldest_modification() to a local variable and pass it to buf_page_t::write_complete(), instead of passing type. It’s an atomic field, therefore we should load it only once. Inside that function we should assert that the value still matches, and then identify the type of the block (0=external buffer pool, 1=impossible, 2=temporary tablespace, >2=write to persistent file) based on that.
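A small sketch of the proposed classification, deriving the block type from the oldest_modification value that the caller loaded once; the enum and helper below are hypothetical, not part of buf_page_t.

#include <cassert>
#include <cstdint>

enum class BlockKind { EXT_BUF, TEMPORARY, PERSISTENT };

// Classify a completed write from a single oldest_modification (lsn) value:
// 0 = write to the external buffer pool, 1 = impossible here,
// 2 = temporary tablespace, > 2 = write to a persistent file.
inline BlockKind classify(uint64_t oldest_modification)
{
  assert(oldest_modification != 1);
  if (oldest_modification == 0)
    return BlockKind::EXT_BUF;
  if (oldest_modification == 2)
    return BlockKind::TEMPORARY;
  return BlockKind::PERSISTENT;
}

int main()
{
  assert(classify(0) == BlockKind::EXT_BUF);
  assert(classify(2) == BlockKind::TEMPORARY);
  assert(classify(123456) == BlockKind::PERSISTENT);
}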
ut_ad(lsn
      ? lsn >= oldest_modification() || oldest_modification() == 2
      : (space->is_temporary() || space->is_being_imported()));
ut_ad(to_ext_buf ||
      (lsn ? lsn >= oldest_modification() || oldest_modification() == 2
           : (space->is_temporary() || space->is_being_imported())));
Can we assert that space->is_temporary() will never be written to_ext_buf?
space->reacquire();
if (!to_ext_buf)
  space->reacquire();
If we omitted this condition, we could guarantee that IORequest::node (see my other comments) is valid until the write has been completed. I don’t think that we should attempt to optimize for the unlikely case that the tablespace is being dropped while a write to the external buffer pool is in progress. It’s better to keep it simple.
storage/innobase/buf/buf0flu.cc
Outdated
else if (space->is_stopping_writes())
{
  space->release();
  space= nullptr;
no_space:
  if (flush_to_ebp && !bpage->oldest_modification()) {
    bpage->lock.u_unlock(true);
    buf_LRU_free_page(bpage, true);
What if we always held a space reference, also for flush_to_ebp? It is true that we are not writing to space->chain but to the external buffer pool file, but some things would be simpler if we can assume that the tablespace won’t disappear while the write is in progress.
e19de53 to 054d899
buf_dblwr_t::create(): Create the doublewrite buffer in a single atomic mini-transaction. Do not write any log records for initializing any doublewrite buffer pages, in order to avoid recovery failure with innodb_log_archive=ON starting from the very beginning.

The mtr.commit() in buf_dblwr_t::create() was observed to comprise 295 mtr_t::m_memo entries: 1 entry for the fil_system.sys_space and the rest split between page 5 (TRX_SYS) and page 0 (allocation metadata). We are nowhere near the sux_lock::RECURSIVE_MAX limit of 65535 per page descriptor.

Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Saahil Alam
(cherry picked from commit e2c63a7)
dr-m left a comment
Thank you. This looks OK to me from a correctness point of view. From the performance point of view, there are open issues, and we will need to do some additional testing and possibly further tweaks to fix some of the suspected bottlenecks.
Here are some final comments. The first one, in fil0fil.cc regarding the suggested dummy ext_bp_node, you should leave unresolved. The rest I think should be trivial and safe enough to address.
}
else
  space= node_ptr->space;

if (!buf_page)
{
  ut_ad(!srv_read_only_mode);
  ut_ad(!ext_buf());
Because if (!buf_page) implies !ext_buf(), we could reorder the conditions so that there would be slightly fewer conditional branches. Currently we have the following:
if (ext_buf())
{ /* … */ }
else
  space= node_ptr->space;
if (!buf_page)

We could refactor this as follows:
if (!buf_page)
{
  ut_ad(!ext_buf());
  space= node_ptr->space;
  /* … */
}
else if (!ext_buf())
  space= node_ptr->space;
else { /* … */ }

We might even move the assignment of space before the if, if we backed it with an additional check like this:
static_assert(offsetof(fil_node_t, space) < sizeof(ext_buf_page_t), "avoid out-of-bounds read");
space= node_ptr->space; /* this accesses a wrong union field if ext_buf() */

It might be cleaner to grow sizeof(IORequest) beyond its current limit and unconditionally allocate a node. Then we could introduce a dummy fil_node_t for the external buffer pool file, and implement the predicate differently:

bool ext_buf() const noexcept { return node == &ext_bp_node; }

space->complete_write();
if (!ext_buf())
  space->complete_write();
func_exit:
This condition could be omitted if the body of the if (ext_buf()) branch simply ended in goto func_exit.
We should call buf_page_write_complete() for an extended buffer pool page too, so we can't end the if (ext_buf()) branch with a goto.
I thought that we could duplicate that call in the ext_buf() code path. It should not involve too many instructions to call a function with 2 parameters, one of which (*this) should have already been passed to us in the same register.
In fact, we might even create a special clone of the function buf_page_write_complete() for the ext_buf() case, but I’m not suggesting that right now.
// FIXME: currently every second page is flushed, consider more
// suitable algorithm there
if (state != buf_page_t::FREED && fil_system.ext_buf_pool_enabled() &&
    !buf_pool.done_flush_list_waiters_count &&
    (ut_d(buf_pool.force_LRU_eviction_to_ebp ||)((++free_or_flush) & 1)))
I think that there should be a good chance that the address of buf_pool is already located in a register, while fil_system is not. Therefore, we’d better swap the 2nd and 3rd condition.
Should the condition state != buf_page_t::FREED actually be state < buf_page_t::UNFIXED? I don’t think it would ever make sense to write out data pages that are marked as freed (not attached to any data payload that is stored in the database), no matter if another thread might currently be holding a buffer-fix on them.
Should the condition state != buf_page_t::FREED actually be state < buf_page_t::UNFIXED?

I am sure that state != buf_page_t::FREED is equivalent to state < buf_page_t::UNFIXED. Did you mean state >= buf_page_t::UNFIXED?
I really meant state < buf_page_t::UNFIXED. The special thing about FREED is that the page has been declared as free in the tablespace, but not freed from the buf_pool.LRU yet. For persistent data pages, we must not discard pages from the buffer pool before the corresponding log record for the FREE_PAGE entry has been durably written to the log.
Perhaps you were confused by some preceding code:
auto state= bpage->state();
ut_ad(state >= buf_page_t::FREED);
ut_ad(bpage->in_LRU_list);
bool flush_to_ebp= false;
if (!bpage->oldest_modification())
{
evict:
  if (state != buf_page_t::FREED &&
      (state >= buf_page_t::READ_FIX || (~buf_page_t::LRU_MASK & state)))
    continue;

We first assert that the block descriptor is mapped to a page of a tablespace, and not in any of the non-file-page states NOT_USED, MEMORY, REMOVE_HASH. The condition after evict: only accepts exact matches of state == buf_page_t::FREED because a buffer-fix such as state == buf_page_t::FREED + 1 must protect the block from being evicted from the buf_pool.LRU and the buf_pool.page_hash.
Let me take a deeper look at this. We are not writing out anything here yet:
if (state != buf_page_t::FREED && fil_system.ext_buf_pool_enabled() &&
    !buf_pool.done_flush_list_waiters_count &&
    (ut_d(buf_pool.force_LRU_eviction_to_ebp ||)((++free_or_flush) & 1)))
{
  ut_ad(!opt_bootstrap);
  flush_to_ebp= true;
  goto flush_to_ebp;
}

The possible page write would occur later in this function:
if (flush_to_ebp)
{
  /* The page latch will be released in io callback */
  if (!bpage->flush(space, true))
  {
    buf_LRU_free_page(bpage, true);
    bpage->lock.u_unlock(true);
    ++n->evicted;
    continue;
  }
}

At the start of buf_page_t::flush() we check the bpage->state() again and skip any write, just invoking buf_pool.release_freed_page(this).
So, we know that we are not going to write anything for any state < buf_page_t::UNFIXED. Why should we enter that code path at all? Because you moved some old code into an else block, we can't change that condition so easily.
And I think you are right asking if I meant state >= buf_page_t::UNFIXED. As far as I can tell, the following could work:
if (state >= buf_page_t::UNFIXED && !buf_pool.done_flush_list_waiters_count &&
    fil_system.ext_buf_pool_enabled() &&
    (ut_d(buf_pool.force_LRU_eviction_to_ebp ||)((++free_or_flush) & 1)))
{
  ut_ad(!opt_bootstrap);
  flush_to_ebp= true;
  goto flush_to_ebp;
}
if (state != buf_page_t::FREED &&
    (state >= buf_page_t::READ_FIX || (~buf_page_t::LRU_MASK & state)))
  continue;
#if !defined(DBUG_OFF)
free_page:
#endif
buf_LRU_free_page(bpage, true);
++n->evicted;
So, in the end, the issue with the current code is that for state > buf_page_t::FREED && state < buf_page_t::UNFIXED we will unnecessarily execute some more code, instead of just doing continue above. It’s up to you if you think that this is trivial enough to address today.
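To illustrate the reasoning about FREED versus UNFIXED, a toy sketch of the two guards; the constant values are made up, and only the ordering (FREED below UNFIXED, buffer-fixes counted on top of the base state) reflects the discussion above.

#include <cassert>
#include <cstdint>

// Illustrative state encoding (not the real buf_page_t constants):
// everything at or above FREED is a file page, and values above the base
// state encode concurrent buffer-fixes.
constexpr uint32_t FREED=   0x1000;
constexpr uint32_t UNFIXED= 0x2000;

// "Clean FREED page with no buffer-fix" is an exact match, because
// FREED + 1 means one thread still holds a buffer-fix on the block.
inline bool can_evict_freed(uint32_t state) { return state == FREED; }

// The proposed guard for the external buffer pool path: only pages that
// still carry data (UNFIXED and above) are worth writing out.
inline bool worth_writing_to_ebp(uint32_t state) { return state >= UNFIXED; }

int main()
{
  assert(can_evict_freed(FREED));
  assert(!can_evict_freed(FREED + 1));       // buffer-fixed: must not evict
  assert(!worth_writing_to_ebp(FREED + 1));  // freed pages are never written
  assert(worth_writing_to_ebp(UNFIXED + 3)); // buffer-fixed but still data
}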
In one of the practical cloud MariaDB setups, a server node accesses its datadir over the network, but also has fast local SSD storage for temporary data. The content of such temporary storage is lost when the server container is destroyed. This commit uses the ephemeral fast local storage (SSD) as an extension of the portion of the InnoDB buffer pool (DRAM) that caches persistent data pages. This cache is separate from the persistent storage of data files and ib_logfile0 and is ignored during backup.

The following system variables were introduced: innodb_extended_buffer_pool_size - the size of the external buffer pool file; if it equals 0, the external buffer pool will not be used; innodb_extended_buffer_pool_path - the path to the external buffer pool file. If innodb_extended_buffer_pool_size is not equal to 0, the external buffer pool file will be created on startup.

Only clean pages will be flushed to the external buffer pool file. There is no need to flush dirty pages, as such pages will become clean after flushing, and will then be evicted when they reach the tail of the LRU list. The general idea of this commit is to flush clean pages to the external buffer pool file when they are evicted. A page can be evicted either by a transaction thread or by the background page cleaner thread. In some cases a transaction thread is waiting for the page cleaner thread to finish its job. We can't flush to the external buffer pool file when transaction threads are waiting for eviction; that would hurt performance. That's why the only case for flushing is when the page cleaner thread evicts pages in the background and there are no waiters. For this purpose the buf_pool_t::done_flush_list_waiters_count variable was introduced; we flush evicted clean pages only if the variable is zero. Clean pages are evicted in buf_flush_LRU_list_batch() to keep some amount of pages in the buffer pool's free list. That's why we flush every second page to the external buffer pool file; otherwise there might not be enough pages in the free list to let transaction threads allocate buffer pool pages without waiting for the page cleaner. This might not be a good solution, but it is enough for prototyping.

An external buffer pool page is introduced to store information in the buffer pool page hash about whether a certain page can be read from the external buffer pool file. The first several members of such a page must be the same as the members of an internal page. The external page frame must be equal to a certain value to distinguish an external page from an internal one. External buffer pages are preallocated on startup in an array of external pages. We could get rid of the frame in the external page and check whether the page's address belongs to the array to distinguish external and internal pages. There are also external page free and LRU lists. When some internal page is chosen to be flushed to the external buffer pool file, a new external page is allocated either from the head of the external free list or from the tail of the external LRU list. Both lists are protected by buf_pool.mutex. This makes sense, because a page is removed from the internal LRU list during eviction under buf_pool.mutex. Then the internal page is locked and the allocated external page is attached to the I/O request for the external buffer pool file; when the write request is completed, the internal page is replaced with the external one in the page hash, the external page is pushed to the head of the external LRU list and the internal page is unlocked.

After the external page is removed from the external free list, it is not placed in the external LRU list until write completion, so the page can't be used by other threads until the write is completed. The page hash chain get-element function has an additional template parameter that tells the function whether external pages must be ignored. We don't ignore external pages in the page hash in two cases: when some page is initialized for reading and when one is reinitialized for creating a new page. When an internal page is initialized for reading and an external page with the same page id is found in the page hash, the internal page is locked, the external page is replaced with the newly initialized internal page in the page hash chain, and the external page is removed from the external LRU list and attached to the I/O request to the external buffer pool file. When the I/O request is completed, the external page is returned to the external free list and the internal page is unlocked. So during the read, the external page is absent from both the external LRU and free lists and can't be reused. When an internal page is initialized for creating a new page and an external page with the same page id is found in the page hash, we just remove the external page from the page hash chain and the external LRU list and push it to the head of the external free list. So the external page can be used for future flushing. The pages are flushed to and read from the external buffer pool file in the same manner as they are flushed to their spaces, i.e. compressed and encrypted pages stay compressed and encrypted in the external buffer pool file.

Reviewed by: Marko Mäkelä
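To summarize the allocation path described above, a simplified, self-contained sketch: an external page is taken from the head of the free list or reclaimed from the tail of the LRU list, both under a single mutex standing in for buf_pool.mutex; the types and function names are illustrative, not the actual patch code.

#include <list>
#include <mutex>

struct ExtPage { unsigned page_no; };

// Stand-ins for the external free and LRU lists, both protected by one
// mutex that plays the role of buf_pool.mutex.
struct ExtBufPool
{
  std::mutex mtx;
  std::list<ExtPage*> free_list;   // unused slots in the extension file
  std::list<ExtPage*> lru;         // pages currently cached in the file

  // Allocate a slot for flushing an evicted clean page: prefer the head of
  // the free list, otherwise reuse the least recently used external page.
  ExtPage *allocate()
  {
    std::lock_guard<std::mutex> g(mtx);
    if (!free_list.empty())
    {
      ExtPage *p= free_list.front();
      free_list.pop_front();
      return p;
    }
    if (!lru.empty())
    {
      ExtPage *p= lru.back();      // reclaim from the LRU tail
      lru.pop_back();
      return p;
    }
    return nullptr;                // no slot available
  }

  // Only on write completion does the page become visible to other threads:
  // push it to the head of the external LRU list.
  void write_completed(ExtPage *p)
  {
    std::lock_guard<std::mutex> g(mtx);
    lru.push_front(p);
  }
};

int main()
{
  ExtBufPool pool;
  ExtPage slot{0};
  pool.free_list.push_back(&slot);
  if (ExtPage *p= pool.allocate()) // taken from the free list head
    pool.write_completed(p);       // only now reachable via the LRU list
}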
054d899 to e185715
dr-m left a comment
Thank you, this addresses the comments that I made today.
This has also been pushed as 4cf93f9 to another branch main-MDEV-31956, which is based on the intended target branch. This must still be subjected to some performance testing.