About the training set

Hello! Thank you very much for your work! I am currently facing an issue: the training set I built following the original paper never reaches the 8000k phrases mentioned in the paper. Could you share the code used to build the training set, or the training set itself?