Thank you for the contribution.
In the paper, you mentioned that DistServe does not consider preemption. During experiments/benchmarking, how do you control the request rates and the number of tokens generated per request to make sure the decode GPU doesn't hit its memory limit? Thanks.