Dear @Hannibal046,
For 1: currently, model parallelism is supported in Lightning, but only for sequential models.
You could convert your encoder/decoder into a single `nn.Sequential` and use our model parallelism beta feature: https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#sequential-model-parallelism-with-checkpointing.
If you are working on multiple GPUs, ddp_sharded can help too, as it shards gradients across GPUs and reduces the memory footprint.
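As a rough sketch of both suggestions (the encoder/decoder layers here are hypothetical placeholders, and the exact `Trainer` flags may differ depending on your Lightning version, so please check the linked docs):

```python
# Sketch only: flatten an encoder/decoder into one nn.Sequential so the
# sequential model parallelism beta feature can split it across GPUs.
import torch.nn as nn
from pytorch_lightning import Trainer

# hypothetical stand-ins for your encoder / decoder
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
decoder = nn.Sequential(nn.Linear(256, 128))

# combine both into a single nn.Sequential, as the beta feature expects
model_sequential = nn.Sequential(*encoder, *decoder)

# alternatively, on multiple GPUs, sharded DDP reduces memory by
# sharding gradients across devices
trainer = Trainer(gpus=2, accelerator="ddp", plugins="ddp_sharded")
```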
For 2: did you use plugins="ddp_sharded"? Could you share a screenshot of the memory peak with and without it for your own model?
Best,
T.C