I want to implement a combiner for these loads and am working out the method.
; Function Attrs: convergent nounwind
define spir_kernel void @dma_loads(i32 %width, i32 %height, i8 addrspace(1)* %in, i8 addrspace(1)* %out) local_unnamed_addr #0 !kernel_arg_addr_space !3 !kernel_arg_access_qual !4 !kernel_arg_type !5 !kernel_arg_base_type !5 !kernel_arg_type_qual !6 !kernel_arg_name !7 {
%sub = add nsw i32 %height, -1
%cmp23 = icmp sgt i32 %height, 2
br i1 %cmp23, label %.lr.ph.preheader, label %._crit_edge
.lr.ph.preheader: ; preds = %0
br label %.lr.ph
._crit_edge: ; preds = %.lr.ph, %0
ret void
.lr.ph: ; preds = %.lr.ph.preheader, %.lr.ph
%y.024 = phi i32 [ %inc, %.lr.ph ], [ 1, %.lr.ph.preheader ]
%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%add = add i32 %mul, %width
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
%div = udiv <16 x i8> %call, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
%div4 = udiv <16 x i8> %call2, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
%add5 = add nuw <16 x i8> %div4, %div
%div6 = udiv <16 x i8> %call3, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
%add7 = add <16 x i8> %add5, %div6
tail call spir_func void @_Z8vstore16Dv16_hjPU3AS1h(<16 x i8> %add7, i32 %mul, i8 addrspace(1)* %out) #2
%inc = add nuw nsw i32 %y.024, 1
%cmp = icmp slt i32 %inc, %sub
br i1 %cmp, label %.lr.ph, label %._crit_edge
}
I think checking for regular intervals is the challenging part. Symbolic execution could be used for this.
%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width
%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
dma_load(i32 %x.093, i8 addrspace(1)* %in, 3 /*= rows*/, 16/*= columns*/)
%call = vpm_load
%call2 = vpm_load
%call3 = vpm_load
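The rewrite into one dma_load followed by one vpm_load per original call could be sketched as below. This is only an illustration on a toy instruction list, not VC4C code: the tuple layout, the function name combine_loads and the row-index argument passed to vpm_load are assumptions, and the sketch presumes that the base offset is already defined before the first load and that nothing between the loads writes to the same memory.

```python
def combine_loads(block, loads, base_offset, pointer, columns=16):
    """Rewrite `block` so the vload16 results in `loads` come from one DMA transfer.

    block: list of (dest, op, args) tuples in program order (hypothetical layout).
    loads: dest names of the vload16 calls to merge, ordered by address.
    """
    wanted = set(loads)
    rewritten = []
    dma_emitted = False
    for dest, op, args in block:
        if dest not in wanted:
            # Unrelated instructions are copied through unchanged.
            rewritten.append((dest, op, args))
            continue
        if not dma_emitted:
            # One transfer covering len(loads) rows of `columns` bytes each,
            # placed where the first of the merged loads used to be.
            rewritten.append(
                (None, "dma_load", (base_offset, pointer, len(loads), columns))
            )
            dma_emitted = True
        # Assumption: vpm_load takes the row index of the data it reads.
        rewritten.append((dest, "vpm_load", (loads.index(dest),)))
    return rewritten
```

Calling it with the six instructions of the excerpt above, loads ["%call", "%call2", "%call3"], base offset "%sub1" and pointer "%in" yields one dma_load with 3 rows and 16 columns, followed by three vpm_load entries.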
When we compile the following OpenCL code, which calls vload16 three times, with vc4c --asm -O3 -o dma_loads.asm dma_loads.cl, VC4C outputs the following assembly (dma_loads.txt). It contains three DMA loads, but these can be combined into one DMA load.
At each basic block in the CFG of the LLVM IR, collect the calls to vload16 (actually _Z7vload16jPU3AS1Kh), then check whether the addresses passed to vload16 are at regular intervals.
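As a rough illustration of the collection step, the sketch below scans textual LLVM IR and groups the vload16 calls by basic block. It is not how VC4C represents code internally: the regular expressions, the function name collect_vload16 and the "entry" name for the unlabeled first block are assumptions made for this example, and the patterns are only meant to fit IR shaped like the dma_loads kernel.

```python
import re
from collections import defaultdict

VLOAD16 = "_Z7vload16jPU3AS1Kh"  # mangled name of vload16 in this IR

# A basic-block label line such as ".lr.ph:" (a trailing comment is allowed).
LABEL_RE = re.compile(r"^\s*([\w.$-]+):")
# A call such as:
#   %call = tail call spir_func <16 x i8> @NAME(i32 %sub1, i8 addrspace(1)* %in) #2
CALL_RE = re.compile(
    r"^\s*(%[\w.]+)\s*=\s*(?:tail\s+)?call\b.*@" + re.escape(VLOAD16)
    + r"\(\s*i32\s+(%[\w.]+)\s*,\s*[^%]*(%[\w.]+)\s*\)"
)


def collect_vload16(ir_text):
    """Map each basic-block label to its list of (result, offset, pointer)."""
    blocks = defaultdict(list)
    current = "entry"  # hypothetical name for the unlabeled first block
    for line in ir_text.splitlines():
        label = LABEL_RE.match(line)
        if label:
            current = label.group(1)
            continue
        call = CALL_RE.match(line)
        if call:
            blocks[current].append(call.groups())
    return dict(blocks)
```

Grouping by block keeps the later interval check local: only loads that sit in the same block are considered as candidates for one combined DMA load.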
Example
Collect the vload16 calls (and their address variables).
Addresses:
%mul - %width
%mul
%mul + %width
These are at regular intervals (%width), so they can be combined (I would need to create new functions dma_load and vpm_load).
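One possible way to check the intervals is a very reduced form of symbolic evaluation: expand each address through its add/sub definitions into a linear combination of opaque values, then test whether all addresses differ by integer steps of one common symbolic stride. The sketch below is an assumption-laden illustration, not VC4C code. The names linear, difference, ratio and regular_stride are hypothetical, only add and sub are expanded (so %mul stays an opaque symbol), and the sign of the stride is not known symbolically.

```python
from fractions import Fraction

CONST = "const"  # key for the constant term of an expression


def linear(defs, value):
    """Expand `value` into {symbol: coefficient} using add/sub definitions."""
    if isinstance(value, int):
        return {CONST: value} if value else {}
    if value not in defs:
        return {value: 1}  # opaque: an argument, phi, mul result, ...
    op, lhs, rhs = defs[value]
    sign = 1 if op == "add" else -1
    out = dict(linear(defs, lhs))
    for sym, coeff in linear(defs, rhs).items():
        out[sym] = out.get(sym, 0) + sign * coeff
    return {s: c for s, c in out.items() if c != 0}


def difference(a, b):
    """Return the expression a - b."""
    out = dict(a)
    for sym, coeff in b.items():
        out[sym] = out.get(sym, 0) - coeff
    return {s: c for s, c in out.items() if c != 0}


def ratio(expr, unit):
    """Return k if expr == k * unit for a rational k, else None."""
    if not expr:
        return Fraction(0)
    if set(expr) != set(unit):
        return None
    ks = {Fraction(expr[s], unit[s]) for s in unit}
    return ks.pop() if len(ks) == 1 else None


def regular_stride(defs, offsets):
    """Return (first_offset, stride) if `offsets` are evenly spaced, else None."""
    exprs = [linear(defs, o) for o in offsets]
    diffs = [difference(e, exprs[0]) for e in exprs]
    unit = next((d for d in diffs if d), None)
    if unit is None:
        return None  # all offsets are identical
    ks = []
    for d in diffs:
        k = ratio(d, unit)
        if k is None:
            return None  # not a multiple of the candidate stride
        ks.append(k)
    order = sorted(range(len(ks)), key=ks.__getitem__)
    gaps = {ks[b] - ks[a] for a, b in zip(order, order[1:])}
    if len(gaps) != 1 or 0 in gaps:
        return None  # uneven spacing or duplicate offsets
    gap = gaps.pop()
    return offsets[order[0]], {s: c * gap for s, c in unit.items()}
```

For the example above, with defs = {"%sub1": ("sub", "%mul", "%width"), "%add": ("add", "%mul", "%width")}, the offsets %sub1, %mul and %add expand to %mul - %width, %mul and %mul + %width, and the check reports %sub1 as the first offset with a stride of one %width.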