RuntimeError: MPS backend out of memory when training models with fastai

2024-02-28 by @tim & @ChatGPT. Model: gpt-4
tim

RuntimeError: MPS backend out of memory (MPS allocated: 58.82 GB, other allocations: 23.73 MB, max allowed: 61.20 GB). Tried to allocate 3.45 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure)

I get the above error when training models with fastai.

Code:

from fastai.vision.all import *

def label_func(x): return x.parent.name

def run():
    path = Path("screenshots/")
    fnames = get_image_files(path)
    print(f"Total Images:{len(fnames)}")


    dls = ImageDataLoaders.from_path_func(path, fnames, label_func, bs=40, num_workers=0)
    learn = vision_learner(dls, resnet18, metrics=error_rate)
    print("Loaded")
    learn.fine_tune(4, base_lr=1.0e-02)

    learn.export('model.pkl')


if __name__ == '__main__':
    run()

How can I fix it?

ChatGPT

The error you're encountering indicates that the Metal Performance Shaders (MPS) backend of PyTorch on macOS is running out of memory. This backend is used for GPU acceleration on Mac devices with Apple Silicon (M1, M1 Pro, M1 Max, M1 Ultra, M2, etc.).
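
To double-check that PyTorch can actually see the MPS device, you can use PyTorch's standard availability checks. This is just a quick diagnostic, not part of your training code:

    import torch

    # Confirm that this PyTorch build includes MPS support and that the device is usable
    print("MPS built:", torch.backends.mps.is_built())
    print("MPS available:", torch.backends.mps.is_available())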

Here are a few suggestions to address this issue:

  1. Reduce Batch Size: The batch size of 40 might be too high for your GPU's memory. Try reducing the batch size to decrease the memory required for each training step. For instance, try bs=16 or bs=8.

    dls = ImageDataLoaders.from_path_func(path, fnames, label_func, bs=16, num_workers=0)
    
  2. Use a Smaller Model or Smaller Inputs: If reducing the batch size doesn't help, try a lighter architecture or shrink the input images; for a fixed architecture, the activation maps produced by large images are usually what fills GPU memory.
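
    As a rough sketch: resnet18 is already one of the lighter standard backbones, so shrinking the input images via item_tfms often saves more memory than swapping architectures. The Resize value of 224 and bs=16 below are illustrative choices, not values from your original setup:

    dls = ImageDataLoaders.from_path_func(
        path, fnames, label_func, bs=16, num_workers=0,
        item_tfms=Resize(224),  # smaller inputs -> smaller activation maps on the GPU
    )
    learn = vision_learner(dls, resnet18, metrics=error_rate)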

  3. Free Up Memory: Make sure no other processes are consuming GPU memory. Close other applications or notebooks that might be using the GPU.

  4. Clear PyTorch Cache: torch.cuda.empty_cache() only applies to CUDA devices; for the MPS backend, recent PyTorch releases provide torch.mps.empty_cache(), which releases cached memory held by the MPS allocator.
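
    A minimal sketch, assuming a PyTorch build that ships the torch.mps module:

    import torch

    # Release cached (but currently unused) blocks held by the MPS allocator
    if hasattr(torch, "mps") and torch.backends.mps.is_available():
        torch.mps.empty_cache()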

  5. Adjust MPS Memory Settings: As the error message suggests, you can set the environment variable PYTORCH_MPS_HIGH_WATERMARK_RATIO to 0.0. This disables the upper limit for memory allocations, but beware that it may cause system instability or crashes if the system runs out of memory completely.

    You can set the environment variable in your Python script before importing any libraries:

    import os
    os.environ['PYTORCH_MPS_HIGH_WATERMARK_RATIO'] = '0.0'
    from fastai.vision.all import *
    

    Or you can set it in your shell before running the Python script:

    export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
    python your_script.py
    
  6. Monitor Memory Usage: While training, monitor the memory usage to see if your adjustments are effective in preventing out-of-memory errors.
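
    You can watch memory use in Activity Monitor, or programmatically. A small sketch, assuming a PyTorch version that exposes the torch.mps memory-introspection helpers:

    import torch

    # Rough snapshot of MPS memory use (values are reported in bytes)
    print(f"Allocated by tensors: {torch.mps.current_allocated_memory() / 1e9:.2f} GB")
    print(f"Allocated by driver:  {torch.mps.driver_allocated_memory() / 1e9:.2f} GB")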

  7. Upgrade PyTorch: Ensure you are using the latest version of PyTorch, as there could be memory management improvements in newer versions.
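
    For example, with pip (exact packages and version pins depend on your environment):

    pip install --upgrade torch torchvision fastai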

  8. System Stability: If you choose to disable the memory limit as suggested in the error by setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0, be prepared to handle potential system instability. Make sure to save your work before running the model training.

Start with the first suggestion and work your way down the list until you find a solution that works for your specific case. If you continue to have issues, you may need to consider training on a machine with more GPU memory or using a cloud-based solution.

tim

reducing batch size helped, thanks

ChatGPT

You're welcome! I'm glad to hear that reducing the batch size helped resolve the memory issue. Training deep learning models can be resource-intensive, and finding the right balance for your specific hardware is key to running your experiments smoothly.

If you encounter similar issues in the future or need to train larger models, remember to consider the other tips as well, such as optimizing model size, freeing up memory, and monitoring usage during training.

If you have any more questions or need further assistance, feel free to ask. Happy training!