1141007 meeting

Published 2025-10-05, updated 2025-10-06

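This experiment modifies the training loop of the original SSSD model so that autoFRK is evaluated during the iterations and its output is used to compute the error that adjusts the model. The workflow is as follows: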

Flowchart

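Because of technical constraints in the current implementation, the SSSD prediction is first computed in Python and saved to a .npy file, which an R script then reads and processes. Crossing the file-format and language boundary means torch can no longer track the operations or build the computation graph, so no gradient can be computed and the SSSD parameter update fails. To work around this, a fully connected layer (implemented in the code as a learnable scale and bias) is added after the autoFRK layer, so that the SSSD prediction is mapped to the autoFRK imputation result through a linear operation, as shown below: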

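This workaround is not a long-term solution. Passing the prediction through one more transformation introduces additional loss, which can work against the SSSD parameter update. A torch version of autoFRK should therefore be built as soon as possible so that the computation graph can be traced end to end; only then can the error computation be guaranteed to drive the parameter update.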
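
The gradient problem described above can be written out with the chain rule. This is only a sketch, and the notation is not from the original post: $f_\theta$ stands for the SSSD network, $g$ for autoFRK, $\hat g$ for the surrogate layer with scale $a$ and bias $b$, and $\ell$ for the loss against the observed data $y$.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \ell\bigl(g(f_\theta(x)),\, y\bigr), \\
\frac{\partial \mathcal{L}}{\partial \theta}
  &= \frac{\partial \ell}{\partial g}\,
     \frac{\partial g}{\partial f_\theta}\,
     \frac{\partial f_\theta}{\partial \theta}
  && \text{(requires } \partial g / \partial f_\theta \text{, which is lost across the .npy / R boundary)}, \\
\hat g(u) &= a \odot u + b, \\
\frac{\partial \hat g}{\partial u} &= \operatorname{diag}(a)
  && \text{(a differentiable stand-in for } \partial g / \partial f_\theta \text{)}.
\end{aligned}
```

Under this reading, the surrogate can only pass a gradient on to $\theta$ if its input $u = f_\theta(x)$ still carries a computation graph.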

Modifications

The code changed in this round is listed below:

/configs/training.yaml

# Training configuration
batch_size: 80  # Batch size
output_directory: "./results/NYISO_4/Zone/NYISO_4_MILLWD_test"  # Output directory for checkpoints and logs
ckpt_iter: "max"  # Checkpoint iteration to resume from ("max" = latest saved checkpoint)
iters_per_ckpt: 1000  # Checkpoint frequency (number of iterations)
iters_per_logging: 1000  # Log frequency (number of iterations)
n_iters: 60000  # Maximum number of iterations
learning_rate: 0.0002  # Learning rate

# Additional training settings
only_generate_missing: true  # Generate missing values only
use_model: 2  # Model to use for training
masking: "forecast"  # Masking strategy for missing values
missing_k: 24  # Number of missing values

# Data paths
data:
  train_path: "./datasets/NYISO/test-normalization/MILLWD_train.npy"  # Path to training data, for known locations

# autoFRK config
enable_spatial_prediction: true  # Enable spatial prediction step
n_cores: 4  # Number of CPU cores to use (int)
autoFRK_period: 100  # Frequency of autoFRK updates (in how many iterations)
location_path: "./datasets/NYISO/test-normalization/MILLWD_known_location.npy"  # Path to known locations
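
The four autoFRK keys above are consumed directly by the trainer, and `autoFRK_period` is used as a modulus, so a missing or zero value would only fail once training has started. A minimal sketch of an early check, assuming the YAML has already been loaded into a dict; `check_autofrk_config` is a hypothetical helper and not part of the repository, and the integer-only rule for `n_cores` is an assumption based on the `(int)` comment in the config:

```python
from typing import Any, Dict, List


def _is_positive_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 1


def check_autofrk_config(cfg: Dict[str, Any]) -> List[str]:
    """Return the problems found in the autoFRK-related keys (empty list if none)."""
    problems = []
    if not isinstance(cfg.get("enable_spatial_prediction"), bool):
        problems.append("enable_spatial_prediction must be a boolean")
    if not _is_positive_int(cfg.get("n_cores")):
        problems.append("n_cores must be a positive integer")
    if not _is_positive_int(cfg.get("autoFRK_period")):
        problems.append("autoFRK_period must be a positive integer")
    location_path = cfg.get("location_path")
    if not isinstance(location_path, str) or not location_path.endswith(".npy"):
        problems.append("location_path must be a path to a .npy file")
    return problems


if __name__ == "__main__":
    config = {
        "enable_spatial_prediction": True,
        "n_cores": 4,
        "autoFRK_period": 100,
        "location_path": "./datasets/NYISO/test-normalization/MILLWD_known_location.npy",
    }
    print(check_autofrk_config(config))
```

Returning a list of messages rather than raising on the first problem lets all configuration mistakes be reported in one pass.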

/scripts/diffusion/train.py

import argparse
import os
from typing import Optional, Union

import torch
import yaml

from sssd.core.model_specs import MODEL_PATH_FORMAT, setup_model
from sssd.data.utils import get_dataloader
from sssd.training.trainer import DiffusionTrainer
from sssd.utils.logger import setup_logger
from sssd.utils.utils import calc_diffusion_hyperparams, display_current_time

LOGGER = setup_logger()


def fetch_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-m",
        "--model_config",
        type=str,
        default="configs/model.yaml",
        help="Model configuration",
    )
    parser.add_argument(
        "-t",
        "--training_config",
        type=str,
        default="configs/training.yaml",
        help="Training configuration",
    )
    return parser.parse_args()


def setup_output_directory(
    model_config: dict,
    training_config: dict,
) -> str:
    # Build output directory
    local_path = MODEL_PATH_FORMAT.format(
        T=model_config["diffusion"]["T"],
        beta_0=model_config["diffusion"]["beta_0"],
        beta_T=model_config["diffusion"]["beta_T"],
    )
    output_directory = os.path.join(training_config["output_directory"], local_path)

    if not os.path.isdir(output_directory):
        os.makedirs(output_directory)
        os.chmod(output_directory, 0o775)
    LOGGER.info("Output directory %s", output_directory)
    return output_directory


def run_job(
    model_config: dict,
    training_config: dict,
    device: Optional[Union[torch.device, str]],
) -> None:
    output_directory = setup_output_directory(model_config, training_config)
    dataloader = get_dataloader(
        training_config["data"]["train_path"],
        batch_size=training_config.get("batch_size"),
        device=device,
    )

    diffusion_hyperparams = calc_diffusion_hyperparams(
        **model_config["diffusion"], device=device
    )
    net = setup_model(training_config["use_model"], model_config, device)

    LOGGER.info(display_current_time())
    trainer = DiffusionTrainer(
        dataloader=dataloader,
        diffusion_hyperparams=diffusion_hyperparams,
        net=net,
        device=device,
        output_directory=output_directory,
        ckpt_iter=training_config.get("ckpt_iter"),
        n_iters=training_config.get("n_iters"),
        iters_per_ckpt=training_config.get("iters_per_ckpt"),
        iters_per_logging=training_config.get("iters_per_logging"),
        learning_rate=training_config.get("learning_rate"),
        only_generate_missing=training_config.get("only_generate_missing"),
        masking=training_config.get("masking"),
        missing_k=training_config.get("missing_k"),
        batch_size=training_config.get("batch_size"),
        enable_spatial_prediction=training_config.get("enable_spatial_prediction", True),  # New
        n_cores=training_config.get("n_cores"),  # New
        autoFRK_period=training_config.get("autoFRK_period"),  # New
        location_path=os.path.abspath(training_config["location_path"]),  # New
        logger=LOGGER,
    )
    trainer.train()

    LOGGER.info(display_current_time())


if __name__ == "__main__":
    args = fetch_args()

    with open(args.model_config, "rt") as f:
        model_config = yaml.safe_load(f.read())
    with open(args.training_config, "rt") as f:
        training_config = yaml.safe_load(f.read())

    LOGGER.info(f"Model spec: {model_config}")
    LOGGER.info(f"Training spec: {training_config}")

    if torch.cuda.device_count() > 0:
        LOGGER.info(f"Using {torch.cuda.device_count()} GPUs!")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    run_job(model_config, training_config, device)

/sssd/training/trainer.py

import logging
import os
from typing import Any, Dict, Optional, Union
import subprocess  # New
import numpy as np  # New
import yaml  # New

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm

from sssd.core.model_specs import MASK_FN
from sssd.training.utils import training_loss
from sssd.utils.logger import setup_logger
from sssd.utils.utils import find_max_epoch, sampling  # New

LOGGER = setup_logger()


class DiffusionTrainer:
    """
    Train Diffusion Models

    Args:
        dataloader (DataLoader): The training dataloader.
        diffusion_hyperparams (Dict[str, Any]): Hyperparameters for the diffusion process.
        net (nn.Module): The neural network model to be trained.
        device (torch.device): The device to be used for training.
        output_directory (str): Directory to save model checkpoints.
        ckpt_iter (Union[int, str]): The checkpoint iteration to be loaded; 'max' selects the maximum iteration.
        n_iters (int): Number of iterations to train.
        iters_per_ckpt (int): Number of iterations to save checkpoint.
        iters_per_logging (int): Number of iterations to save training log and compute validation loss.
        learning_rate (float): Learning rate for training.
        only_generate_missing (int): Option to generate missing portions of the signal only.
        masking (str): Type of masking strategy: 'mnr' for Missing Not at Random, 'bm' for Blackout Missing, 'rm' for Random Missing.
        missing_k (int): K missing time steps for each feature across the sample length.
        batch_size (int): Size of each training batch.
        enable_spatial_prediction (bool): Whether to run the autoFRK spatial prediction step.
        n_cores (Union[int, str]): Number of CPU cores passed to the autoFRK R script.
        autoFRK_period (int): Run the spatial prediction step every this many iterations.
        location_path (str): Path to the .npy file holding the known locations.
        logger (Optional[logging.Logger]): Logger object for logging, defaults to None.
    """

    def __init__(
        self,
        dataloader: DataLoader,
        diffusion_hyperparams: Dict[str, Any],
        net: nn.Module,
        device: Optional[Union[torch.device, str]],
        output_directory: str,
        ckpt_iter: Union[str, int],
        n_iters: int,
        iters_per_ckpt: int,
        iters_per_logging: int,
        learning_rate: float,
        only_generate_missing: int,
        masking: str,
        missing_k: int,
        batch_size: int,
        enable_spatial_prediction: bool,  # New
        n_cores: Union[int, str],  # New
        autoFRK_period: int,  # New
        location_path: str,  # New
        logger: Optional[logging.Logger] = None,
    ) -> None:
        self.dataloader = dataloader
        self.diffusion_hyperparams = diffusion_hyperparams
        self.net = nn.DataParallel(net).to(device)
        self.device = device
        self.output_directory = output_directory
        self.ckpt_iter = ckpt_iter
        self.n_iters = n_iters
        self.iters_per_ckpt = iters_per_ckpt
        self.iters_per_logging = iters_per_logging
        self.learning_rate = learning_rate
        self.only_generate_missing = only_generate_missing
        self.masking = masking
        self.missing_k = missing_k
        self.writer = SummaryWriter(f"{output_directory}/log")
        self.batch_size = batch_size
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=self.learning_rate)
        self.enable_spatial_prediction = enable_spatial_prediction  # New
        self.n_cores = n_cores  # New
        self.autoFRK_period = autoFRK_period  # New
        self.location_path = location_path  # New
        self.real_data = self.dataloader.dataset.tensors[0].to(self.device)  # New
        self.real_data_shape = self.real_data.shape  # New
        self.logger = logger or LOGGER

        if self.masking not in MASK_FN:
            raise KeyError(f"Please enter a correct masking, but got {self.masking}")

    def _load_checkpoint(self) -> None:
        if self.ckpt_iter == "max":
            self.ckpt_iter = find_max_epoch(self.output_directory)
        if self.ckpt_iter >= 0:
            try:
                model_path = os.path.join(
                    self.output_directory, f"{self.ckpt_iter}.pkl"
                )
                checkpoint = torch.load(model_path, map_location="cpu")

                self.net.load_state_dict(checkpoint["model_state_dict"])
                if "optimizer_state_dict" in checkpoint:
                    self.optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

                self.logger.info(
                    f"Successfully loaded model at iteration {self.ckpt_iter}"
                )
            except Exception as e:
                self.ckpt_iter = -1
                self.logger.error(f"No valid checkpoint model found. Error: {e}")
        else:
            self.ckpt_iter = -1
            self.logger.info(
                "No valid checkpoint model found, start training from initialization."
            )

    def _save_model(self, n_iter: int) -> None:
        if n_iter > 0 and n_iter % self.iters_per_ckpt == 0:
            torch.save(
                {
                    "model_state_dict": self.net.state_dict(),
                    "optimizer_state_dict": self.optimizer.state_dict(),
                },
                os.path.join(self.output_directory, f"{n_iter}.pkl"),
            )

    def _update_mask(self, batch: torch.Tensor) -> torch.Tensor:
        transposed_mask = MASK_FN[self.masking](batch[0], self.missing_k)
        return (
            transposed_mask.permute(1, 0)
            .repeat(batch.size()[0], 1, 1)
            .to(self.device, dtype=torch.float32)
        )
    
    def _sssd_prediction_step(self,
                              ) -> torch.Tensor:
        
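        # SSSD prediction
        # NOTE: sampling runs under torch.no_grad(), so the returned tensor
        # carries no autograd graph back to self.net.
        LOGGER.info("Start SSSD prediction step")
        all_generated = []
        with torch.no_grad():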
            for (batch,) in tqdm(self.dataloader, desc=f"{self.n_iter}-th predicting TS"):
                batch = batch.to(self.device)
                mask = self._update_mask(batch)
                batch = batch.permute(0, 2, 1)

                generated_series = (
                    sampling(
                        net=self.net,
                        size=batch.shape,
                        diffusion_hyperparams=self.diffusion_hyperparams,
                        cond=batch,
                        mask=mask,
                        only_generate_missing=self.only_generate_missing,
                        device=self.device,
                    )
                )

                all_generated.append(generated_series)
                
            sssd_prediction = torch.cat(all_generated, dim=0).permute(1, 2, 0)
            del all_generated
        return sssd_prediction

    def _autoFRK_step(self,
                      sssd_prediction,
                      ) -> torch.Tensor:

        # autoFRK
        LOGGER.info(f"Start autoFRK inference step")
        ## config paths
        sssd_pred_save_path = "sssd_prediction.npy"
        autoFRK_config_path = "autoFRK_config.yaml"
        autoFRK_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "utils", "autoFRK.R")
        autoFRK_result_path = "autoFRK_result.npy"

        ## save sssd prediction and location config
        sssd_prediction = sssd_prediction.detach().cpu().numpy()
        np.save(sssd_pred_save_path, sssd_prediction)
        del sssd_prediction
        autoFRK_config = {
            'n_cores': self.n_cores,  # key must match config$n_cores in autoFRK.R
            'known_location_path': self.location_path
        }
        with open(autoFRK_config_path, "w") as f:
            yaml.dump(autoFRK_config, f)

        ## run autoFRK
        subprocess.run(["Rscript", autoFRK_path],
                       stdout=subprocess.DEVNULL,   # discard standard output
                       #stderr=subprocess.DEVNULL    # discard error output
                       check=True,                  # raise if the R script exits with an error
                       )

        ## load autoFRK inference result
        autoFRK_result = np.load(autoFRK_result_path).transpose(2, 1, 0).astype(np.float32)
        autoFRK_result = torch.from_numpy(autoFRK_result).to(self.device)

        ## clean up
        if os.path.exists(sssd_pred_save_path):
            os.remove(sssd_pred_save_path)
        if os.path.exists(autoFRK_config_path):
            os.remove(autoFRK_config_path)
        if os.path.exists(autoFRK_result_path):
            os.remove(autoFRK_result_path)


        # return
        if autoFRK_result.shape != self.real_data_shape:
            raise ValueError(f"Shape mismatch: autoFRK_result {autoFRK_result.shape} != real_data {self.real_data_shape}")

        return autoFRK_result
    
    def _autoFRK_surrogate_layer(self,
                                 sssd_prediction: torch.Tensor,
                                 autoFRK_result: torch.Tensor,
                                 epochs: int = 0,
                                 lr: float = 1e-3,
                                 element_wise: bool = True,
                                 loss_function: nn.Module = nn.MSELoss(),
                                 ) -> torch.Tensor:
        """
        Surrogate layer: fit sssd_prediction to autoFRK_result with a learnable scale and bias.

        Args:
            sssd_prediction: Tensor or list of Tensors, shape (V,T,L)
            autoFRK_result: Tensor, shape (V,T,L), used as the supervision target
            epochs: int, number of fine-tuning iterations for the surrogate
            lr: float, learning rate of the Adam optimizer
            element_wise: bool, if True each element gets its own scale/bias
            loss_function: torch loss function, MSELoss by default

        Returns:
            result: Tensor, shape (V,T,L), the surrogate approximation of autoFRK_result
        """
        LOGGER.info("Start autoFRK surrogate step")
        # -----------------------------
        # Use the last batch, or the tensor itself
        # -----------------------------
        last_sssd = sssd_prediction[-1] if isinstance(sssd_prediction, list) else sssd_prediction

        # -----------------------------
        # Check that the shapes match real_data_shape
        # -----------------------------
        if (last_sssd.shape != self.real_data_shape) or (self.real_data_shape != autoFRK_result.shape):
            msg = (
                f"Shape mismatch: sssd_prediction {last_sssd.shape}, "
                f"expected {self.real_data_shape}, "
                f"autoFRK_result {autoFRK_result.shape}"
            )
            LOGGER.error(msg)
            raise ValueError(msg)

        V, T, L = self.real_data_shape

        # -----------------------------
        # Initialize the surrogate parameters (once)
        # -----------------------------
        if not hasattr(self, 'surrogate_scale'):
            if element_wise:
                # one scale/bias per element
                self.surrogate_scale = nn.Parameter(torch.ones((V, T, L), device=self.device))
                self.surrogate_bias = nn.Parameter(torch.zeros((V, T, L), device=self.device))
            else:
                # one scale/bias per slice along the first axis
                self.surrogate_scale = nn.Parameter(torch.ones(V, device=self.device))
                self.surrogate_bias = nn.Parameter(torch.zeros(V, device=self.device))

        # -----------------------------
        # Surrogate forward function
        # -----------------------------
        def surrogate_forward(x: torch.Tensor) -> torch.Tensor:
            if element_wise:
                # element-wise scale/bias
                return x * self.surrogate_scale + self.surrogate_bias
            else:
                # per-slice scale/bias (broadcast)
                return x * self.surrogate_scale[:, None, None] + self.surrogate_bias[:, None, None]

        # -----------------------------
        # Closed-form initialization (kept out of the autograd graph)
        # -----------------------------
        x = last_sssd.detach()
        y = autoFRK_result.detach()

        if element_wise:
            # Solve y = a*x + b separately for every element.
            # The fit is exact: a = y/x and b = 0 where x != 0; a = 1 and b = y where x == 0.
            a = torch.ones_like(x)
            b = torch.zeros_like(x)
            # avoid division by zero
            mask = x != 0
            a[mask] = y[mask] / x[mask]
            b = y - a * x
        else:
            # Least-squares scale/bias for each slice along the first axis
            x_flat = x.reshape(V, -1)
            y_flat = y.reshape(V, -1)
            x_mean = x_flat.mean(dim=1, keepdim=True)
            y_mean = y_flat.mean(dim=1, keepdim=True)
            cov = ((x_flat - x_mean) * (y_flat - y_mean)).sum(dim=1)
            var = ((x_flat - x_mean)**2).sum(dim=1) + 1e-8
            a = cov / var
            b = (y_mean.squeeze() - a * x_mean.squeeze())

        with torch.no_grad():
            self.surrogate_scale.copy_(a.to(self.device))
            self.surrogate_bias.copy_(b.to(self.device))

        # -----------------------------
        # Fine-tuning (stays differentiable w.r.t. scale/bias)
        # -----------------------------
        if epochs > 0:
            optimizer = torch.optim.Adam([self.surrogate_scale, self.surrogate_bias], lr=lr)
            with tqdm(range(epochs), desc="[autoFRK surrogate]") as pbar:
                for _ in pbar:
                    optimizer.zero_grad()
                    surrogate_out = surrogate_forward(last_sssd)
                    loss = loss_function(surrogate_out, autoFRK_result)
                    loss.backward()
                    optimizer.step()
                    pbar.set_postfix({"Loss": f"{loss.item():.6f}"})

        # -----------------------------
        # Forward pass (stays differentiable w.r.t. scale/bias)
        # -----------------------------
        result = surrogate_forward(last_sssd)
        return result

    def _train_per_epoch(self) -> torch.Tensor:
        # Defined once, outside the loop, so it also exists for the spatial step below
        loss_function = nn.MSELoss()

        # SSSD training
        for (batch,) in tqdm(self.dataloader, desc=f"{self.n_iter}-th   training TS"):
            batch = batch.to(self.device)
            mask = self._update_mask(batch)
            loss_mask = ~mask.bool()

            batch = batch.permute(0, 2, 1)
            assert batch.size() == mask.size() == loss_mask.size()

            self.optimizer.zero_grad()
            loss = training_loss(
                model=self.net,
                loss_function=loss_function,
                training_data=(batch, batch, mask, loss_mask),
                diffusion_parameters=self.diffusion_hyperparams,
                generate_only_missing=self.only_generate_missing,
                device=self.device,
            )
            loss.backward()
            self.optimizer.step()

        if self.enable_spatial_prediction and self.n_iter % self.autoFRK_period == 0:
            LOGGER.info(f"Iteration {self.n_iter}: Start Spatial Prediction step")
            sssd_prediction = self._sssd_prediction_step()
            autoFRK_result = self._autoFRK_step(sssd_prediction=sssd_prediction)
            autoFRK_surrogate = self._autoFRK_surrogate_layer(sssd_prediction=sssd_prediction.permute(2, 1, 0),
                                                              autoFRK_result=autoFRK_result,
                                                              loss_function=loss_function,
                                                              epochs=50,
                                                              lr=1e-3
                                                              )

            # compute loss
            # NOTE: sssd_prediction was generated under torch.no_grad(), so this
            # backward pass only reaches the surrogate scale/bias, not self.net.
            self.optimizer.zero_grad()
            loss = loss_function(
                autoFRK_surrogate,
                self.real_data
            )

            # update model
            loss.backward()
            self.optimizer.step()
            LOGGER.info(f"Iteration {self.n_iter}: Spatial Prediction step done, loss: {loss.item()}")

        return loss

    def train(self) -> None:
        self._load_checkpoint()

        n_iter_start = (
            self.ckpt_iter + 2 if self.ckpt_iter == -1 else self.ckpt_iter + 1
        )
        self.logger.info(f"Start the {n_iter_start} iteration")

        for n_iter in range(n_iter_start, self.n_iters + 1):
            self.n_iter = n_iter
            loss = self._train_per_epoch()
            self.writer.add_scalar("Train/Loss", loss.item(), n_iter)
            if n_iter % self.iters_per_logging == 0:
                self.logger.info(f"Iteration: {n_iter} \tLoss: {loss.item()}")
            self._save_model(n_iter)
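
The closed-form initialization inside `_autoFRK_surrogate_layer` can be illustrated without torch. The sketch below uses plain Python lists in place of tensors; `fit_scale_bias` mirrors the `element_wise=False` branch (a least-squares line per slice) and `fit_element_wise` mirrors the `element_wise=True` branch. Both function names are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple


def fit_scale_bias(x: List[float], y: List[float], eps: float = 1e-8) -> Tuple[float, float]:
    """Least-squares fit of y ~ a * x + b over one slice."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    var = sum((xi - x_mean) ** 2 for xi in x) + eps  # eps avoids division by zero
    a = cov / var
    b = y_mean - a * x_mean
    return a, b


def fit_element_wise(x: List[float], y: List[float]) -> Tuple[List[float], List[float]]:
    """One (a, b) pair per element: a = y / x where x != 0, otherwise a = 1; b = y - a * x."""
    a = [yi / xi if xi != 0 else 1.0 for xi, yi in zip(x, y)]
    b = [yi - ai * xi for xi, yi, ai in zip(x, y, a)]
    return a, b


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2 * x + 1

    a, b = fit_scale_bias(x, y)
    print(f"per-slice fit: a={a:.4f}, b={b:.4f}")

    a_el, b_el = fit_element_wise(x, y)
    reconstruction = [ai * xi + bi for ai, xi, bi in zip(a_el, x, b_el)]
    print("element-wise reconstruction:", reconstruction)
```

With one scale and bias per element, the fit reproduces the target at every position, so with the default `element_wise=True` the surrogate output matches `autoFRK_result` from the start and the fine-tuning epochs have almost nothing left to reduce.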

/sssd/utils/autoFRK.R

##########################################
# This script predicts values at the unknown locations from the SSSD output; the returned values are used to compute the loss that updates the time series parameters.
# version: 1141003
# author: Yao-Chih Hsu
##########################################

# library
library(reticulate)
library(dplyr)
library(yaml)
library(parallel)
library(autoFRK)
library(foreach)
np <- import("numpy")

# load config
path <- "autoFRK_config.yaml"
config <- yaml.load_file(path)

# load data
sssd_pred_path  <- "sssd_prediction.npy"
sssd_prediction <- np$load(sssd_pred_path)
sssd_prediction <- py_to_r(sssd_prediction)

# load locations
known_location_path   <- config$known_location_path
known_locs            <- np$load(known_location_path)
known_locs            <- py_to_r(known_locs)

# load n_cores
n_cores <- min(config$n_cores, max(1, detectCores() - 1))

# initial parameters
locs <- known_locs  # for testing, use known locations as unknown locations
n_locs <- if (!is.null(dim(locs))) dim(locs)[1] else length(locs)
result_shape  <- c(dim(sssd_prediction)[1:2], n_locs)
result        <- array(NA, dim = result_shape)
mrts_basis    <- NULL

# get MRTS basis
variable_index  <- 1L
time_index      <- 1L
temp_data       <- sssd_prediction[variable_index, time_index, ]
if (length(temp_data) != nrow(known_locs)) {
  stop("Length of temp_data does not match number of known locations!")
}

cat("Calculating first model to get MRTS basis...\n")
model <- autoFRK(data = temp_data, loc = known_locs)
mrts_basis <- model$G
cat("Got MRTS basis\n")

# save first result
pred <- predict.FRK(object = model, newloc = locs)
result[variable_index, time_index, ] <- pred$pred.value

# expand tasks
tasks <- expand.grid(
  variable = 1:dim(sssd_prediction)[1],
  ts       = 1:dim(sssd_prediction)[2]
) %>%
  filter(!(variable == variable_index & ts == time_index))

# parallel
cat("Calculating...\n")
if (.Platform$OS.type == "windows") {
  # Windows → doParallel (socket cluster)
  library(doParallel)
  cl <- makeCluster(n_cores)
  registerDoParallel(cl)
} else {
  # Linux/macOS → doMC (fork, shared memory)
  library(doMC)
  registerDoMC(cores = n_cores)
}

# compute all tasks
tryCatch({
  results <- foreach(i = 1:nrow(tasks),
                     .packages = c("autoFRK")
                     ) %dopar% {
    variable_index  <- tasks$variable[i]
    time_index  <- tasks$ts[i]
    temp_data <- sssd_prediction[variable_index, time_index, ]

    # calculate with MRTS basis
    model <- autoFRK(data = temp_data, loc = known_locs, G = mrts_basis)
    pred  <- predict.FRK(object = model, newloc = locs)
    
    # return results
    list(variable = variable_index,
         ts       = time_index,
         pred     = pred$pred.value
         )
  }
}, finally = {
  if (.Platform$OS.type == "windows") {
    stopCluster(cl)
  }
  registerDoSEQ()
})

# results
if (!exists("results")) stop("Parallel computation failed: 'results' does not exist.")
for (res in results) {
  result[res$variable, res$ts, ] <- res$pred
}

# save and back to Python
path <- "autoFRK_result.npy"
result <- r_to_py(result)
np$save(path, result)
cat("Finish calculated autoFRK\n")
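
The array changes axis order several times between `trainer.py` and `autoFRK.R`, which is easy to get wrong. The following sketch traces only the shapes, using plain tuples. The axis meanings (locations, time steps, variables) are an interpretation of the code above, `permute_shape` and `trace_shapes` are hypothetical helpers, and the sizes in the usage example are made up.

```python
from typing import Dict, Sequence, Tuple


def permute_shape(shape: Sequence[int], order: Sequence[int]) -> Tuple[int, ...]:
    """Shape after torch.Tensor.permute(*order) or numpy.transpose(order)."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError(f"{order} is not a permutation of the axes of {shape}")
    return tuple(shape[axis] for axis in order)


def trace_shapes(n_locs: int, n_time: int, n_vars: int) -> Dict[str, Tuple[int, ...]]:
    """Follow the array shape from self.real_data through R and back."""
    real_data = (n_locs, n_time, n_vars)            # self.real_data in the trainer
    sampled = permute_shape(real_data, (0, 2, 1))   # batch.permute(0, 2, 1) before sampling
    to_r = permute_shape(sampled, (1, 2, 0))        # .permute(1, 2, 0), saved to sssd_prediction.npy
    from_r = (to_r[0], to_r[1], n_locs)             # result_shape in autoFRK.R (locs == known_locs)
    back = permute_shape(from_r, (2, 1, 0))         # np.load(...).transpose(2, 1, 0)
    return {"real_data": real_data, "to_r": to_r, "from_r": from_r, "back": back}


if __name__ == "__main__":
    for name, shape in trace_shapes(n_locs=5, n_time=24, n_vars=3).items():
        print(f"{name:>9}: {shape}")
```

The shape check at the end of `_autoFRK_step` passes only because the R script currently predicts at the known locations themselves; predicting at a different number of locations would change the last axis of `from_r` and make the comparison with `real_data_shape` fail.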

Conclusion

Recomputing autoFRK at every iteration would increase training time substantially, so autoFRK is recomputed only once every 100 iterations here.
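
To make the cost of this schedule concrete, the sketch below lists the iterations at which the spatial step fires, using the same condition as the trainer (`n_iter % autoFRK_period == 0`, with iterations counted from 1). `spatial_step_iterations` is a hypothetical helper written for this illustration.

```python
from typing import List


def spatial_step_iterations(n_iters: int, period: int, start: int = 1) -> List[int]:
    """Iterations in [start, n_iters] at which the autoFRK step would run."""
    if period < 1:
        raise ValueError("period must be a positive integer")
    return [n for n in range(start, n_iters + 1) if n % period == 0]


if __name__ == "__main__":
    runs = spatial_step_iterations(n_iters=60000, period=100)
    print(f"autoFRK runs: {len(runs)} (first at {runs[0]}, last at {runs[-1]})")
```

With the values from `training.yaml` (`n_iters: 60000`, `autoFRK_period: 100`), autoFRK runs on 1% of the iterations, and a run resumed from a checkpoint keeps the same schedule because the condition depends only on the absolute iteration number.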

The final prediction results, with autoFRK integrated into training, are as follows:

Metric	All Locs & All Time	Known Locs & All Time	Unknown Locs & All Time	All Locs & Future	Known Locs & Future	Unknown Locs & Future	All Locs & Past	Known Locs & Past	Unknown Locs & Past
MSPE	5.819705e+00	5.865499e+00	5.637017e+00	9.913197e+00	9.843337e+00	1.019189e+01	5.650992e+00	5.701553e+00	5.449289e+00
RMSPE	2.412406e+00	2.421879e+00	2.374240e+00	3.148523e+00	3.137409e+00	3.192474e+00	2.377182e+00	2.387793e+00	2.334371e+00
MSPE%	1.217137e+06	9.774202e+05	2.173443e+06	1.990447e+06	2.138321e+06	1.400531e+06	1.185266e+06	9.295740e+05	2.205298e+06
RMSPE%	1.103240e+03	9.886456e+02	1.474260e+03	1.410832e+03	1.462300e+03	1.183440e+03	1.088699e+03	9.641442e+02	1.485025e+03
MAPE	1.732901e+00	1.736590e+00	1.718185e+00	2.333113e+00	2.322068e+00	2.377173e+00	1.708163e+00	1.712459e+00	1.691025e+00
MAPE%	4.062724e+05	4.127679e+05	3.803601e+05	5.223571e+05	5.654363e+05	3.505013e+05	4.014880e+05	4.064757e+05	3.815907e+05

The following results use the SSSD model alone for prediction.

Metric	All Locs & All Time	Known Locs & All Time	Unknown Locs & All Time	All Locs & Future	Known Locs & Future	Unknown Locs & Future	All Locs & Past	Known Locs & Past	Unknown Locs & Past
MSPE	5.838370e+00	5.878368e+00	5.678806e+00	1.041096e+01	1.031577e+01	1.079069e+01	5.649912e+00	5.695481e+00	5.468121e+00
RMSPE	2.416272e+00	2.424535e+00	2.383025e+00	3.226602e+00	3.211818e+00	3.284918e+00	2.376954e+00	2.386521e+00	2.338401e+00
MSPE%	1.207812e+06	9.691622e+05	2.159860e+06	1.923852e+06	2.061953e+06	1.372927e+06	1.178301e+06	9.241232e+05	2.192293e+06
RMSPE%	1.099005e+03	9.844604e+02	1.469646e+03	1.387030e+03	1.435950e+03	1.171720e+03	1.085496e+03	9.613132e+02	1.480639e+03
MAPE	1.735390e+00	1.738232e+00	1.724053e+00	2.384845e+00	2.371213e+00	2.439223e+00	1.708623e+00	1.712144e+00	1.694577e+00
MAPE%	4.046595e+05	4.109202e+05	3.796836e+05	4.924676e+05	5.326206e+05	3.322853e+05	4.010405e+05	4.059044e+05	3.816371e+05
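
The post does not define the six metrics, so the sketch below uses assumed, conventional definitions: squared and absolute prediction errors, and the same errors expressed as a percentage of the true value. These definitions are an assumption, although they agree with the square-root relationship visible in both tables (for example, RMSPE 2.412406 is the square root of MSPE 5.819705). `prediction_errors` is a hypothetical helper.

```python
import math
from typing import Dict, Sequence


def prediction_errors(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Absolute-scale and percentage-scale prediction errors (assumed definitions)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    err = [t - p for t, p in zip(y_true, y_pred)]
    # Percentage errors divide by the true value, so they are undefined when it is 0
    # and become very large when it is close to 0.
    pct = [100.0 * e / t for e, t in zip(err, y_true)]
    mspe = sum(e * e for e in err) / n
    mspe_pct = sum(q * q for q in pct) / n
    return {
        "MSPE": mspe,
        "RMSPE": math.sqrt(mspe),
        "MSPE%": mspe_pct,
        "RMSPE%": math.sqrt(mspe_pct),
        "MAPE": sum(abs(e) for e in err) / n,
        "MAPE%": sum(abs(q) for q in pct) / n,
    }


if __name__ == "__main__":
    for name, value in prediction_errors([1.0, 2.0, 4.0], [1.5, 2.0, 3.0]).items():
        print(f"{name:>7}: {value:.4f}")
```

Under these definitions a single true value near zero can dominate the percentage metrics, which may explain why they are several orders of magnitude larger than the absolute metrics in the tables.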

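Comparing the two sets of results, dynamically integrating autoFRK into the SSSD training iterations (the new model) gives a small improvement in the absolute error metrics (MSPE, RMSPE, MAPE) overall, with the clearest gain in the future period. The percentage metrics do not follow the same pattern.

Overall accuracy (All Locs & All Time)
The new model is slightly lower than the SSSD-only model on the main error metrics: MSPE 5.8197 vs. 5.8384, RMSPE 2.4124 vs. 2.4163, and MAPE 1.7329 vs. 1.7354. All three differences are below 0.5%, so over the full temporal and spatial range the effect of autoFRK on the overall fit is small.
Future period (Future)
The new model's MSPE falls from 1.0411e+01 to 9.9132e+00 and its RMSPE from 3.2266 to 3.1485 (All Locs & Future), which suggests better generalization in temporal extrapolation.
For unknown locations (Unknown Locs & Future), RMSPE and MAPE fall by about 2.8% and 2.5% respectively (MSPE by about 5.5%), which suggests that the spatial correction from autoFRK helps the model capture the structure of unobserved regions.
Past period (Past) and known locations (Known Locs)
Changes for the past period and for known locations over all time are small (below 1% on the absolute metrics) and mixed in sign: for example, All Locs & Past MSPE is 5.6510 for the new model vs. 5.6499 for SSSD only, while Known Locs & All Time MSPE is 5.8655 vs. 5.8784. Integrating autoFRK therefore shows no clear sign of overfitting or of shifting the fit on already observed data.
Percentage metrics (MSPE%, RMSPE%, MAPE%)
The percentage metrics move in the opposite direction: they are slightly higher for the new model in every column except MAPE% for Unknown Locs & Past (3.815907e+05 vs. 3.816371e+05). Their magnitudes (about 1e+03 to 1e+06) suggest that they are dominated by observations close to zero, so they should be interpreted with caution.
Overall
The hybrid SSSD model that recomputes autoFRK every 100 iterations lowers the absolute prediction error, most visibly for the future period and for unknown locations, while limiting the additional computation to one autoFRK run per 100 iterations.
This suggests that the spatial-correlation constraint and the loss smoothing provided by autoFRK may have a positive effect on SSSD training.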
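
The relative changes for the Unknown Locs & Future column can be recomputed directly from the two tables. In the sketch below the values are copied from the tables (new model first, SSSD only second); `relative_change_pct` and the dictionary name are hypothetical.

```python
from typing import Dict, Tuple


def relative_change_pct(new: float, baseline: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent (negative = lower error)."""
    return 100.0 * (new - baseline) / baseline


# (new model with autoFRK, SSSD only), column "Unknown Locs & Future"
UNKNOWN_FUTURE: Dict[str, Tuple[float, float]] = {
    "MSPE": (1.019189e+01, 1.079069e+01),
    "RMSPE": (3.192474e+00, 3.284918e+00),
    "MAPE": (2.377173e+00, 2.439223e+00),
    "MAPE%": (3.505013e+05, 3.322853e+05),
}

if __name__ == "__main__":
    for metric, (new, baseline) in UNKNOWN_FUTURE.items():
        print(f"{metric:>6}: {relative_change_pct(new, baseline):+.2f}%")
```

Keeping the sign makes it easy to see that the absolute metrics decrease for this column while MAPE% increases.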

References

Zhu, X., Xiong, Y., Wu, M., et al. (2023). Weather2K: A Multivariate Spatio-Temporal Benchmark Dataset for Meteorological Forecasting Based on Real-Time Observation Data from Ground Weather Stations. International Conference on Artificial Intelligence and Statistics, PMLR, 2704-2722.
Lopez Alcaraz, J., & Strodthoff, N. (2022). Diffusion-based time series imputation and forecasting with structured state space models. Transactions on Machine Learning Research. Retrieved from https://openreview.net/forum?id=hHiIbk7ApW
SSSD (2022). GitHub. Retrieved from https://github.com/AI4HealthUOL/SSSD
SSSD_CP (2024). GitHub. Retrieved from https://github.com/egpivo/SSSD_CP
Tzeng, S., & Huang, H. C. (2018). Resolution Adaptive Fixed Rank Kriging. Technometrics, 60(2), 198-208. Retrieved from https://doi.org/10.1080/00401706.2017.1345701
autoFRK (2024). GitHub. Retrieved from https://github.com/egpivo/autoFRK
