
1140318 meeting

SSSD_CP Repository Exploration

Preface

While reading and testing the SSSD_CP repository, I first had to set up an environment, since this SSSD fork requires Linux, conda, and Docker. Because my operating system is Windows, I chose Windows Subsystem for Linux (WSL) as the working environment. During testing I found that Windows uses \r\n (commonly called CRLF) as its line ending, whereas Linux, Unix, and macOS use \n (LF). As a result, any file that had been edited on Windows could not run in WSL, because the extra \r broke file paths and caused errors.

We therefore need the Linux package dos2unix to convert the line endings in these documents and files into characters Linux accepts before the .sh scripts can run.
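
If dos2unix is not at hand, the same CRLF-to-LF conversion can be done with a few lines of Python. This is only a minimal sketch: the convert_crlf_to_lf helper and the example path are illustrative, not part of the repository, and dos2unix remains the simpler choice.

from pathlib import Path

def convert_crlf_to_lf(path: str) -> None:
    """Rewrite a text file in place, replacing Windows CRLF line endings with LF."""
    file = Path(path)
    file.write_bytes(file.read_bytes().replace(b"\r\n", b"\n"))

# Example: fix a shell script that was edited on Windows.
convert_crlf_to_lf("scripts/diffusion/training_job.sh")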

Running the Training

Below are the configuration files used to train on the Mujoco .npy dataset.

model.yaml

wavenet:
  # WaveNet model parameters
  input_channels: 14  # Number of input channels
  output_channels: 14  # Number of output channels
  residual_layers: 36  # Number of residual layers
  residual_channels: 256  # Number of channels in residual blocks
  skip_channels: 256  # Number of channels in skip connections

  # Diffusion step embedding dimensions
  diffusion_step_embed_dim_input: 128  # Input dimension
  diffusion_step_embed_dim_hidden: 512  # Middle dimension
  diffusion_step_embed_dim_output: 512  # Output dimension

  # Structured State Spaces sequence model (S4) configurations
  s4_max_sequence_length: 100  # Maximum sequence length
  s4_state_dim: 64  # State dimension
  s4_dropout: 0.0  # Dropout rate
  s4_bidirectional: true  # Whether to use bidirectional layers
  s4_use_layer_norm: true  # Whether to use layer normalization

diffusion:
  # Diffusion model parameters
  T: 200  # Number of diffusion steps
  beta_0: 0.0001  # Initial beta value
  beta_T: 0.02  # Final beta value

Here I only changed input_channels and output_channels so that the input and output match the number of features in the dataset, and changed s4_max_sequence_length to match the number of time points.
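
Before editing these values, it can help to confirm the dataset's dimensions directly. The sketch below assumes the training array is laid out as (samples, time steps, features); the path is the one used in training.yaml below.

import numpy as np

# Inspect the training array before setting input_channels/output_channels
# and s4_max_sequence_length in model.yaml.
data = np.load("/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD/datasets/Mujoco/train_mujoco.npy")
print(data.shape)
# Assumed layout: (samples, time steps, features), so the last dimension should
# match input_channels/output_channels (14) and the middle dimension should
# match s4_max_sequence_length (100).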

training.yaml

# Training configuration
batch_size: 80  # Batch size
output_directory: "/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/results/checkpoint"  # Output directory for checkpoints and logs
ckpt_iter: "max"  # Checkpoint mode (max or min)
iters_per_ckpt: 1000  # Checkpoint frequency (number of epochs)
iters_per_logging: 1000  # Log frequency (number of iterations)
n_iters: 60000  # Maximum number of iterations
learning_rate: 0.0002  # Learning rate

# Additional training settings
only_generate_missing: true  # Generate missing values only
use_model: 2  # Model to use for training
masking: "forecast"  # Masking strategy for missing values
missing_k: 24  # Number of missing values

# Data paths
data:
  train_path: "/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD/datasets/Mujoco/train_mujoco.npy"  # Path to training data

Below is the terminal output from a successful run of the training script.

user@LAPTOP-KOPTLCHM:/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP$ conda activate sssd
(sssd) user@LAPTOP-KOPTLCHM:/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP$ ./scripts/diffusion/training_job.sh -m configs/model.yaml -t configs/training.yaml
Script is running from: /mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP/scripts/diffusion
Intializing conda
Activating Conda Env: sssd
[Execution - Training]
/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP/scripts/diffusion/train.py --model_config configs/model.yaml --training_config configs/training.yaml
2025-03-16 16:51:31,034 - sssd.utils.logger - INFO - Model spec: {'wavenet': {'input_channels': 14, 'output_channels': 14, 'residual_layers': 36, 'residual_channels': 256, 'skip_channels': 256, 'diffusion_step_embed_dim_input': 128, 'diffusion_step_embed_dim_hidden': 512, 'diffusion_step_embed_dim_output': 512, 's4_max_sequence_length': 100, 's4_state_dim': 64, 's4_dropout': 0.0, 's4_bidirectional': True, 's4_use_layer_norm': True}, 'diffusion': {'T': 200, 'beta_0': 0.0001, 'beta_T': 0.02}}
2025-03-16 16:51:31,034 - sssd.utils.logger - INFO - Training spec: {'batch_size': 80, 'output_directory': '/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/results/checkpoint', 'ckpt_iter': 'max', 'iters_per_ckpt': 1000, 'iters_per_logging': 1000, 'n_iters': 60000, 'learning_rate': 0.0002, 'only_generate_missing': True, 'use_model': 2, 'masking': 'forecast', 'missing_k': 24, 'data': {'train_path': '/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD/datasets/Mujoco/train_mujoco.npy'}}
2025-03-16 16:51:31,190 - sssd.utils.logger - INFO - Using 1 GPUs!
2025-03-16 16:51:31,287 - sssd.utils.logger - INFO - Output directory /mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/results/checkpoint/T200_beta00.0001_betaT0.02
2025-03-16 16:51:42,974 - sssd.utils.logger - INFO - Current time: 2025-03-16 16:51:42
2025-03-16 16:51:44,226 - sssd.utils.logger - INFO - No valid checkpoint model found, start training from initialization.
2025-03-16 16:51:44,227 - sssd.utils.logger - INFO - Start the 1 iteration
  3%|███▌                                                                                                                   | 3/100 [01:46<56:43, 35.09s/it

Note that this project must use Python 3.10.13 (to match the PyTorch version used in the project), and that Miniconda and Docker must be installed beforehand. You also need to run the /envs/conda/build_conda_env.sh and /envs/vm/install_on_ubuntu.sh scripts to set up the execution environment.

Remember to activate the conda sssd environment before running, otherwise the script will fail:

(base) user@LAPTOP-KOPTLCHM:/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP$ conda deactivate
user@LAPTOP-KOPTLCHM:/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP$ ./scripts/diffusion/training_job.sh -m configs/model.yaml -t configs/training.yaml
Script is running from: /mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP/scripts/diffusion
/mnt/d/Code/sssd_cp_learning_and_testing/learning_and_testing/SSSD_CP/scripts/diffusion/../../envs/conda/utils.sh: line 12: conda: command not found
No Conda environment found matching

Tip: If the script will not run, or it reports that the sssd module cannot be found, try exiting and re-entering WSL, then re-activating the conda sssd environment.

Observations

Here, missing_k is defined as:

missing_k (int): The number of the last elements to be predicted.

Looking at /sssd/core/utils.py, we can see that handling of missing values has been added, as shown below:

import torch


def get_mask_forecast(sample: torch.Tensor, k: int) -> torch.Tensor:
    """
    Get mask of same segments (black-out missing) across channels based on k.

    Args:
        sample (torch.Tensor): Tensor of shape [# of samples, # of channels].
        k (int): Number of missing values.

    Returns:
        torch.Tensor: Mask of sample's shape where 0's indicate missing values to be imputed, and 1's indicate preserved values.
    """
    mask = torch.ones_like(sample)  # Initialize mask with all ones

    # Calculate the indices of missing values
    s_nan = torch.arange(mask.shape[0] - k, mask.shape[0])

    # Apply mask for each channel
    for channel in range(mask.shape[1]):
        mask[s_nan, channel] = 0

    return mask
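
As a quick sanity check, the sketch below applies get_mask_forecast to a small toy tensor (the shape is illustrative only). The last k rows are zeroed across every channel, matching the docstring above: those are the positions the forecast masking asks the model to impute.

import torch

# Toy example: 5 rows (e.g. time points) and 2 channels.
sample = torch.randn(5, 2)
mask = get_mask_forecast(sample, k=2)
print(mask)
# tensor([[1., 1.],
#         [1., 1.],
#         [1., 1.],
#         [0., 0.],
#         [0., 0.]])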

Training Results

Because a single iteration already takes about an hour to train (on my computer), the full training has not finished yet.

https://raw.githubusercontent.com/Josh-test-lab/website-assets-repository/refs/heads/main/posts/1140318%20meeting/To%20be%20continued.jpg
To be continued

Runtime Environment

  • Operating system: Windows 11 24H2
  • Subsystem: Windows Subsystem for Linux (Ubuntu)
  • Miniconda
  • Docker
  • CUDA 12.8 driver (the code runs CUDA 12.1)
  • Programming language: Python 3.10.13 for Linux
