-
Notifications
You must be signed in to change notification settings - Fork 6
mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jamieNguyenNVIDIA
wants to merge
49
commits into
NVIDIA-BaseOS-6:main
Choose a base branch
from
jamieNguyenNVIDIA:free_kbytes_calc_fix
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE #11
jamieNguyenNVIDIA
wants to merge
49
commits into
NVIDIA-BaseOS-6:main
from
jamieNguyenNVIDIA:free_kbytes_calc_fix
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Ian May <ian.may@canonical.com>
…dversion" This reverts commit 47d27f2. We need to revert this to avoid regressing any modules used in Jammy. Signed-off-by: Ian May <ian.may@canonical.com>
This reverts commit b2638e9. This feature is not targeted for Jammy. Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2038972 Properties: no-test-build Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
… a pasid support BugLink: https://bugs.launchpad.net/bugs/2031320 When an iommu_domain is set to IOMMU_DOMAIN_IDENTITY, the driver would skip the allocation of a CD table and set the CONFIG field of the STE to STRTAB_STE_0_CFG_BYPASS. This works well for devices that only have one substream, i.e. PASID disabled. However, there could be a use case, for a pasid capable device, that allows bypassing the translation at the default substream while still enabling the pasid feature, which means the driver should not skip the allocation of a CD table nor simply bypass the CONFIG field. Instead, the S1DSS field should be set to STRTAB_STE_1_S1DSS_BYPASS and the SHCFG field should be set to STRTAB_STE_1_SHCFG_INCOMING. Add s1dss in struct arm_smmu_s1_cfg, to allow a configuration in the finalise() to support this use case. Also, according to "13.5 Summary of attribute/permission configuration fields" in the reference manual, the SHCFG field value is irrelevant. So, set the SHCFG field of the STE always to STRTAB_STE_1_SHCFG_INCOMING for simplification. Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Pritesh Raithatha <praithatha@nvidia.com> Acked-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Nicolin Chen <nicolinc@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Ian May <ian.may@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
…stflint and nvidia-fs Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2045591 Properties: no-test-build Signed-off-by: Ian May <ian.may@canonical.com>
The current version of nvidia-fs(2.17.4-0) is failing to compile with the 6.5 kernel. Drop inclusion until an updated version is available. Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
…rnel BugLink: https://bugs.launchpad.net/bugs/2043132 With this change, the NVMe and NVMeOF driver would be enabled to support GPUDirectStorage(GDS). The change is around nvme/nvme rdma map_data() and unmap_data(), where the IO request is first intercepted to check for GDS pages and if it is a GDS page then the request is served by GDS driver component called nvidia-fs, else the request would be served by the standard NVMe driver code Signed-off-by: Sourab Gupta <sougupta@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2043132 With this change, the NFS driver would be enabled to support GPUDirectStorage(GDS). The change is around frwr_map and frwr_unmap in the NFS driver, where the IO request is first intercepted to check for GDS pages and if it is a GDS page then the request is served by GDS driver component called nvidia-fs, else the request would be served by the standard NFS driver code. Signed-off-by: Sourab Gupta <sougupta@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
This reverts commit b2638e9. This feature is not targeted for Jammy. Signed-off-by: Brad Figg <bfigg@nvidia.com> Signed-off-by: Ian May <ian.may@canonical.com>
…ng packages that do not apply to this kernel. Signed-off-by: Brad Figg <bfigg@nvidia.com> Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com> Signed-off-by: Ian May <ian.may@canonical.com>
Add support for exposing rprovides data for standalone modules too. Switch to exposing provides as a shared debian/substvar file and use that in the templates. Signed-off-by: Brad Figg <bfigg@nvidia.com> Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com> Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com> Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2047983 Properties: no-test-build Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 Add support of "Thermal fast Sampling Period (_TFP)" for passive cooling. As per the ACPI specification (ACPI 6.5, Section 11.4.17 "_TFP (Thermal fast Sampling Period)", _TFP overrides _TSP ("Thermal Sampling Period" if both are present in a Thermal zone. Signed-off-by: Jeff Brasen <jbrasen@nvidia.com> Co-developed-by: Sumit Gupta <sumitg@nvidia.com> Signed-off-by: Sumit Gupta <sumitg@nvidia.com> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Sumit Gupta <sumitg@nvidia.com> (backported from commit a2ee758 linux-next) Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 Current implementation of processor_thermal performs software throttling in fixed steps of "20%" which can be too coarse for some platforms. We observed some performance gain after reducing the throttle percentage. Change the CPUFREQ thermal reduction percentage and maximum thermal steps to be configurable. Also, update the default values of both for Nvidia Tegra241 (Grace) SoC. The thermal reduction percentage is reduced to "5%" and accordingly the maximum number of thermal steps are increased as they are derived from the reduction percentage. Signed-off-by: Srikar Srimath Tirumala <srikars@nvidia.com> Co-developed-by: Sumit Gupta <sumitg@nvidia.com> Signed-off-by: Sumit Gupta <sumitg@nvidia.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Sumit Gupta <sumitg@nvidia.com> (cherry picked from commit 310293a linux-next) Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
…arlier BugLink: https://bugs.launchpad.net/bugs/2049537 Allocate and device assign 'struct etmv4_drvdata' earlier during the driver probe, ensuring that it can be retrieved in power management based runtime callbacks if required. This will also help in dropping iomem base address argument from the function etm4_probe() later. Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20230710062500.45147-2-anshuman.khandual@arm.com (cherry picked from commit 3095e90 linux) Signed-off-by: Tushar Dave <tdave@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 'struct etm4_drvdata' itself can carry the base address before etm4_probe() gets called. Just drop that redundant argument from etm4_probe(). Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20230710062500.45147-3-anshuman.khandual@arm.com (cherry picked from commit 4e3b9a6 linux) Signed-off-by: Tushar Dave <tdave@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 Coresight device pid can be retrieved from its iomem base address, which is stored in 'struct etm4x_drvdata'. This drops pid argument from etm4_probe() and 'struct etm4_init_arg'. Instead etm4_check_arch_features() derives the coresight device pid with a new helper coresight_get_pid(), right before it is consumed in etm4_hisi_match_pid(). Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20230710062500.45147-4-anshuman.khandual@arm.com (cherry picked from commit 5a1c709 linux) Signed-off-by: Tushar Dave <tdave@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 Add support for handling MMIO based devices via platform driver. We need to make sure that : 1) The APB clock, if present is enabled at probe and via runtime_pm ops 2) Use the ETM4x architecture or CoreSight architecture registers to identify a device as CoreSight ETM4x, instead of relying a white list of "Peripheral IDs" The driver doesn't get to handle the devices yet, until we wire the ACPI changes to move the devices to be handled via platform driver than the etm4_amba driver. Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20230710062500.45147-5-anshuman.khandual@arm.com (cherry picked from commit 73d779a linux) Signed-off-by: Tushar Dave <tdave@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 Drop ETM4X ACPI ID from the AMBA ACPI device list, and instead just move it inside the new ACPI devices list detected and used via platform driver. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Len Brown <lenb@kernel.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: linux-acpi@vger.kernel.org Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> (for ACPI specific changes) Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Tanmay Jagdale <tanmay@marvell.com> Link: https://lore.kernel.org/r/20230710062500.45147-7-anshuman.khandual@arm.com (cherry picked from commit 134124a linux) Signed-off-by: Tushar Dave <tdave@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2049537 The peer_memory_client scheme allows a driver to register with the ib_umem system that it has the ability to understand user virtual address ranges that are not compatible with get_user_pages(). For instance VMAs created with io_remap_pfn_range(), or other driver special VMA. For ranges the interface understands it can provide a DMA mapped sg_table for use by the ib_umem, allowing user virtual ranges that cannot be supported by get_user_pages() to be used as umems for RDMA. This is designed to preserve the kABI, no functions or structures are changed, only new symbols are added: ib_register_peer_memory_client ib_unregister_peer_memory_client ib_umem_activate_invalidation_notifier ib_umem_get_peer And a bitfield in struct ib_umem uses more bits. This interface is compatible with the two out of tree GPU drivers: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/amdkfd/kfd_peerdirect.c https://github.com/Mellanox/nv_peer_memory/blob/master/nv_peer_mem.c Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Ian May <ian.may@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2049653 Properties: no-test-build Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
The current calculation of min_free_kbytes only uses ZONE_DMA and ZONE_NORMAL pages,but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will also divide part of min_free_kbytes.This will cause the min watermark of ZONE_NORMAL to be too small in the presence of ZONE_MOVEABLE. __GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in __setup_per_zone_wmarks(). On my testing machine with 16GB of memory (transparent hugepage is turned off by default, and movablecore=12G is configured) The following is a comparative test data of watermark_min no patch add patch ZONE_DMA 1 8 ZONE_DMA32 151 709 ZONE_NORMAL 233 1113 ZONE_MOVABLE 1434 128 min_free_kbytes 7288 7326 Link: https://lkml.kernel.org/r/20230625031656.23941-1-liuq131@chinatelecom.cn Signed-off-by: liuq <liuq131@chinatelecom.cn> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 416ef04 linux) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Signed-off-by: Jan Vesely <jvesely@nvidia.com>
Contributor
Author
|
Internal bug: 4452840 Test case:
It will fail with the following output without this change: It should pass with this patch applied. |
Contributor
Author
|
This was broken in 6.3+ kernels that contain the following patch: |
nvidia-bfigg
pushed a commit
that referenced
this pull request
Apr 24, 2024
BugLink: https://bugs.launchpad.net/bugs/2059068 commit f546c42 upstream. [BUG] There is a bug report that, on a ext4-converted btrfs, scrub leads to various problems, including: - "unable to find chunk map" errors BTRFS info (device vdb): scrub: started on devid 1 BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096 BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056 This would lead to unrepariable errors. - Use-after-free KASAN reports: ================================================================== BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0 Read of size 8 at addr ffff8881013c9040 by task btrfs/909 CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023 Call Trace: <TASK> dump_stack_lvl+0x43/0x60 print_report+0xcf/0x640 kasan_report+0xa6/0xd0 __blk_rq_map_sg+0x18f/0x7c0 virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff] virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff] blk_mq_flush_plug_list.part.0+0x780/0x860 __blk_flush_plug+0x1ba/0x220 blk_finish_plug+0x3b/0x60 submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] __x64_sys_ioctl+0xbd/0x100 do_syscall_64+0x5d/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f47e5e0952b - Crash, mostly due to above use-after-free [CAUSE] The converted fs has the following data chunk layout: item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80 length 86016 owner 2 stripe_len 65536 type DATA|single For above logical bytenr 2214744064, it's at the chunk end (2214658048 + 86016 = 2214744064). This means btrfs_submit_bio() would split the bio, and trigger endio function for both of the two halves. However scrub_submit_initial_read() would only expect the endio function to be called once, not any more. This means the first endio function would already free the bbio::bio, leaving the bvec freed, thus the 2nd endio call would lead to use-after-free. [FIX] - Make sure scrub_read_endio() only updates bits in its range Since we may read less than 64K at the end of the chunk, we should not touch the bits beyond chunk boundary. - Make sure scrub_submit_initial_read() only to read the chunk range This is done by calculating the real number of sectors we need to read, and add sector-by-sector to the bio. Thankfully the scrub read repair path won't need extra fixes: - scrub_stripe_submit_repair_read() With above fixes, we won't update error bit for range beyond chunk, thus scrub_stripe_submit_repair_read() should never submit any read beyond the chunk. Reported-by: Rongrong <i@rong.moe> Fixes: e02ee89 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Tested-by: Rongrong <i@rong.moe> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Use min_t() to fix a compiling error due to difference types ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Portia Stephens <portia.stephens@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
nvidia-bfigg
pushed a commit
that referenced
this pull request
May 25, 2024
BugLink: https://bugs.launchpad.net/bugs/2059068 commit f546c42 upstream. [BUG] There is a bug report that, on a ext4-converted btrfs, scrub leads to various problems, including: - "unable to find chunk map" errors BTRFS info (device vdb): scrub: started on devid 1 BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096 BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056 This would lead to unrepariable errors. - Use-after-free KASAN reports: ================================================================== BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0 Read of size 8 at addr ffff8881013c9040 by task btrfs/909 CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023 Call Trace: <TASK> dump_stack_lvl+0x43/0x60 print_report+0xcf/0x640 kasan_report+0xa6/0xd0 __blk_rq_map_sg+0x18f/0x7c0 virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff] virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff] blk_mq_flush_plug_list.part.0+0x780/0x860 __blk_flush_plug+0x1ba/0x220 blk_finish_plug+0x3b/0x60 submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] __x64_sys_ioctl+0xbd/0x100 do_syscall_64+0x5d/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f47e5e0952b - Crash, mostly due to above use-after-free [CAUSE] The converted fs has the following data chunk layout: item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80 length 86016 owner 2 stripe_len 65536 type DATA|single For above logical bytenr 2214744064, it's at the chunk end (2214658048 + 86016 = 2214744064). This means btrfs_submit_bio() would split the bio, and trigger endio function for both of the two halves. However scrub_submit_initial_read() would only expect the endio function to be called once, not any more. This means the first endio function would already free the bbio::bio, leaving the bvec freed, thus the 2nd endio call would lead to use-after-free. [FIX] - Make sure scrub_read_endio() only updates bits in its range Since we may read less than 64K at the end of the chunk, we should not touch the bits beyond chunk boundary. - Make sure scrub_submit_initial_read() only to read the chunk range This is done by calculating the real number of sectors we need to read, and add sector-by-sector to the bio. Thankfully the scrub read repair path won't need extra fixes: - scrub_stripe_submit_repair_read() With above fixes, we won't update error bit for range beyond chunk, thus scrub_stripe_submit_repair_read() should never submit any read beyond the chunk. Reported-by: Rongrong <i@rong.moe> Fixes: e02ee89 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Tested-by: Rongrong <i@rong.moe> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Use min_t() to fix a compiling error due to difference types ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Portia Stephens <portia.stephens@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The current calculation of min_free_kbytes only uses ZONE_DMA and ZONE_NORMAL pages,but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will also divide part of min_free_kbytes.This will cause the min watermark of ZONE_NORMAL to be too small in the presence of ZONE_MOVEABLE.
__GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in __setup_per_zone_wmarks().
On my testing machine with 16GB of memory (transparent hugepage is turned off by default, and movablecore=12G is configured) The following is a comparative test data of watermark_min
ZONE_DMA 1 8
ZONE_DMA32 151 709
ZONE_NORMAL 233 1113
ZONE_MOVABLE 1434 128
min_free_kbytes 7288 7326
Link: https://lkml.kernel.org/r/20230625031656.23941-1-liuq131@chinatelecom.cn
Reviewed-by: "Huang, Ying" ying.huang@intel.com
(cherry picked from commit 416ef04 linux)