Confusing FullShuffle FIXME comments

Based on the code, I assume FullShuffle does a global shuffle, because it permutes the indices of all chunks and assigns them to each rank in the world size.

However, this FIXME comment seems to suggest that shuffling is only done within nodes?

FIXME: Shuffling should be done only within the nodes to benefit
from cache if the dataset doesn’t fit on the node.

Reading it backward, it's saying:
If the dataset doesn't fit on the node, then shuffling should be done only within the node.

This doesn’t make sense to me. If the dataset doesn’t fit on the node, then shouldn’t we try to distribute it across multiple nodes?

Hi @lavenderchiang
It's a good question. I think the comment is hinting at the fact that if you shuffle all indices globally and then distribute them, each process would have to download chunks that it doesn't fully use. In other words, multiple different ranks within a node would consume the same chunk file, which could lead to processes downloading the same chunks multiple times. If the dataset fits entirely on the machine, this wouldn't be an issue, because once every chunk has been downloaded you would only get cache hits. But if the dataset does not fit, then in the worst-case shuffle you would end up with a lot of chunks being downloaded, evicted, and later re-downloaded.
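To make that concrete, here is a minimal sketch (my own illustration, not the actual litdata code; the layout numbers and the round-robin distribution of indices are assumptions) of how a global shuffle scatters each chunk's samples across ranks on different nodes:

```python
import random

# Hypothetical layout (not litdata's actual numbers): 8 chunks of 4 samples
# each, streamed by 2 nodes with 2 ranks per node (world size 4).
NUM_CHUNKS, SAMPLES_PER_CHUNK = 8, 4
WORLD_SIZE, RANKS_PER_NODE = 4, 2

indices = list(range(NUM_CHUNKS * SAMPLES_PER_CHUNK))
random.Random(0).shuffle(indices)  # global shuffle over all sample indices

# Deal the shuffled indices round-robin to the ranks, then check which
# chunk files each rank (and therefore each node) has to download.
for rank in range(WORLD_SIZE):
    assigned = indices[rank::WORLD_SIZE]
    chunks = sorted({i // SAMPLES_PER_CHUNK for i in assigned})
    print(f"node {rank // RANKS_PER_NODE}, rank {rank}: needs chunks {chunks}")

# With a global shuffle, every rank typically touches nearly every chunk,
# so each node ends up downloading (almost) the whole dataset.
```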

I can't say much about the design decisions, as this is the work of Thomas Chaton (@tchaton). Feel free to ping him on Slack or Discord if you have more questions about it. It would be best to have him explain his ideas for alternative shuffling strategies.


Thank you so much for your prompt response, Adrian! I've joined Discord :smiley:

Hi @lavenderchiang

When streaming a dataset from remote storage, there is a speed gain from caching the chunks if the dataset fits on the machine, at the cost of losing some shuffling quality (we are using a quasi-random shuffle).

Right now, we assign the chunks and sub-intervals to each worker. In multi-node training, the workers belong to different nodes. To avoid re-downloading on the second epoch, we can make sure the second shuffling is done only within the chunks associated with the current node.

Example:

10 chunks and 2 nodes. Let's say we associate node 0 with chunks [0-4] and node 1 with chunks [5-9] (this assignment would be random too) for epoch 1, with the following shuffling: [2, 4, 0, 1, 3] and [5, 7, 9, 8, 6].

For the second epoch, we can re-shuffle those chunks while avoiding cache misses as much as possible: [4, 0, 1, 2, 3] and [6, 9, 7, 8, 5].
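In case a sketch helps, here is one way that idea could look in code (a minimal illustration under my own assumptions about seeding and an evenly divisible chunk count, not the actual litdata implementation):

```python
import random

def node_local_shuffle(num_chunks, num_nodes, epoch, seed=42):
    """Assign chunks to nodes once, then re-shuffle only within each node."""
    # Epoch-independent assignment: decide once which chunks live on which
    # node (this part is random, but identical across epochs).
    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)
    per_node = num_chunks // num_nodes
    assignment = [order[n * per_node:(n + 1) * per_node] for n in range(num_nodes)]

    # Epoch-dependent shuffle, restricted to each node's own chunks, so the
    # chunks a node cached during epoch 1 are exactly the ones it reads later.
    for node, node_chunks in enumerate(assignment):
        random.Random(hash((seed, epoch, node))).shuffle(node_chunks)
    return assignment

for epoch in (1, 2):
    print(f"epoch {epoch}: {node_local_shuffle(num_chunks=10, num_nodes=2, epoch=epoch)}")
```

The key point is that only the inner, per-node permutation depends on the epoch, so no chunk ever migrates between nodes after the first assignment.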

I hope it helps.

Out of curiosity, have you tried the StreamingDataset?