Understanding and Using Model Checkpoints

During the training process, we save checkpoints that represent the current state of your model. These checkpoints are crucial for preserving training progress and allow you to resume training or use the model at various stages of its development.

Types of Checkpoints

We offer two types of checkpoints:

Regular Checkpoints
LoRA Checkpoints

Regular Checkpoints

Regular checkpoints contain the full state of the model, including all parameters.

Advantages:
- Can be used directly with vLLM for inference
- Represent the complete model state
Use Cases:
- When you need to deploy the full model
- For scenarios where you don’t need the flexibility of LoRA

LoRA (Low-Rank Adaptation) Checkpoints

LoRA checkpoints contain only the adaptations made to the base model, significantly reducing the checkpoint size.

Advantages:
- Smaller file size
- Can be combined with different base models
- Allows for mixing multiple adaptations
Use Cases:
- When you want to fine-tune a model for multiple tasks
- For efficient storage and transfer of model adaptations

Using LoRA Checkpoints

LoRA checkpoints can be utilized in two main ways:

Merged with Base Model:
- Combine the LoRA adaptations with the original base model to create a full model
- Useful when you want to deploy a single, adapted model
Used with Base Model and Multiple Adapters:
- Keep the base model separate and apply one or more LoRA adaptations at runtime
- Can be done using vLLM or the LoRAX library
- Allows for dynamic mixing of different adaptations

Checkpoint Metrics

Each checkpoint includes metrics such as eval_loss, which helps you identify the best-performing checkpoint. Lower eval_loss generally indicates better model performance on the evaluation dataset.

Downloading Checkpoints

To download a checkpoint directly to your GPU server:

Navigate to the dashboard
Find the desired checkpoint
Copy the provided CLI command
Run the command on your server to download and extract the checkpoint

The CLI command will use a pre-signed URL to securely download the checkpoint and automatically extract it for you.

Example command structure:

curl -L "https://presigned-url-to-your-checkpoint.com" | tar -xzvf -

This command downloads the checkpoint using the pre-signed URL and extracts it in one step.

Best Practices

Regularly save checkpoints during long training runs
Compare eval_loss metrics to choose the best checkpoint
For production, consider using regular checkpoints with vLLM for simplicity
For research or multi-task scenarios, leverage LoRA checkpoints for flexibility

By understanding and effectively using checkpoints, you can streamline your model development process, easily experiment with different model versions, and efficiently deploy your trained models.

Get Started

Core Concepts

Guides

Checkpoints

Understanding and Using Model Checkpoints

Types of Checkpoints

Regular Checkpoints

LoRA (Low-Rank Adaptation) Checkpoints

Using LoRA Checkpoints

Checkpoint Metrics

Downloading Checkpoints

Best Practices

Get Started

Core Concepts

Guides

​Understanding and Using Model Checkpoints

​Types of Checkpoints

​Regular Checkpoints

​LoRA (Low-Rank Adaptation) Checkpoints

​Using LoRA Checkpoints

​Checkpoint Metrics

​Downloading Checkpoints

​Best Practices

Understanding and Using Model Checkpoints

Types of Checkpoints

Regular Checkpoints

LoRA (Low-Rank Adaptation) Checkpoints

Using LoRA Checkpoints

Checkpoint Metrics

Downloading Checkpoints

Best Practices