
PointBEV


Template and module setup

The code is based on the template from: https://github.com/ashleve/lightning-hydra-template

PyTorch Lightning reference: Pytorch Lightning 完全攻略

train.py main:

@hydra.main(version_base="1.3", config_path="../configs", config_name="train.yaml")
def main(cfg: DictConfig) -> Optional[float]:
    # apply extra utilities
    # (e.g. ask for tags if none are provided in cfg, print cfg tree, etc.)
    utils.modif_config_based_on_flags(cfg)

    utils.extras(cfg)

    # train the model
    metric_dict, _ = train(cfg)

    # safely retrieve metric value for hydra-based hyperparameter optimization
    metric_value = utils.get_metric_value(
        metric_dict=metric_dict, metric_name=cfg.get("optimized_metric")
    )

    # return optimized metric
    return metric_value

Setting up debug mode

utils.modif_config_based_on_flags(cfg) (in pointbev/utils/launch) reads the flags from train.yaml: debug: false, val_sparse: false.

Available flags:

  • debug: use the mini dataset and debug mode.
  • val_sparse: use sparse validation mode.

There is a subtlety here: the code only passes cfg into the function, and the function then modifies cfg in place.

For example cfg.train = True, with no return value. Is cfg actually modified? Answer: yes. With a plain immutable variable it would not be. For example:

a = 1
def fun(a):
    a = 2
fun(a)
print(a) # 1

But with OmegaConf, the loaded config is a mutable dict-like object, so modifications made inside the function are visible to the caller:

from omegaconf import OmegaConf

config = OmegaConf.load("config.yaml")
print(config)
# {'model': {'name': 'resnet', 'num_layers': 18}, 'training': {'batch_size': 64, 'learning_rate': 0.001}}

def fun(cfg):
    cfg.model.name = 'train'

fun(config)
print(config)
# {'model': {'name': 'train', 'num_layers': 18}, 'training': {'batch_size': 64, 'learning_rate': 0.001}}
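
One related caveat: Hydra usually loads configs in struct mode, where adding a key that does not exist raises an error; that is why enforce_tags (shown later) wraps the assignment in open_dict. A minimal sketch (the tags key here is just an example):

from omegaconf import OmegaConf, open_dict

cfg = OmegaConf.create({"model": {"name": "resnet"}})
OmegaConf.set_struct(cfg, True)  # struct mode: unknown keys are rejected

# cfg.tags = ["dev"]  # would raise a ConfigAttributeError here

with open_dict(cfg):  # temporarily relax struct mode
    cfg.tags = ["dev"]

print(cfg.tags)  # ['dev']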

Suppressing warnings

    # disable python warnings
    if cfg.extras.get("ignore_warnings"):
        log.info("Disabling python warnings! <cfg.extras.ignore_warnings=True>")
        warnings.filterwarnings("ignore")

A common situation in Python is that the code runs fine but prints warnings, which can be quite annoying. So how do we control the warning output? It is simple: Python emits warnings by calling the warn() function defined in the warnings module, and the warning filter controls whether a matching warning message is shown:

import warnings
warnings.filterwarnings('ignore')

  • "error": turn matching warnings into exceptions
  • "ignore": never show matching warnings
  • "always": always show matching warnings
  • "default": show the first occurrence of a matching warning for each location where it is issued
  • "module": show the first occurrence of a matching warning for each module
  • "once": show only the first occurrence of a matching warning, regardless of location
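
filterwarnings can also target a specific category or message rather than silencing everything:

import warnings

# silence only deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# escalate a specific message into a hard error
warnings.filterwarnings("error", message=".*overflow.*")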

Executing only once across processes

    # prompt user to input tags from command line if none are provided in the config
    if cfg.extras.get("enforce_tags"):
        log.info("Enforcing tags! <cfg.extras.enforce_tags=True>")
        rich_utils.enforce_tags(cfg, save_to_file=True)

@rank_zero_only
def enforce_tags(cfg: DictConfig, save_to_file: bool = False) -> None:
    """Prompts user to input tags from command line if no tags are provided in config."""

    if not cfg.get("tags"):
        if "id" in HydraConfig().cfg.hydra.job:
            raise ValueError("Specify tags before launching a multirun!")

        log.warning("No tags provided in config. Prompting user to input tags...")
        tags = Prompt.ask("Enter a list of comma separated tags", default="dev")
        tags = [t.strip() for t in tags.split(",") if t != ""]

        with open_dict(cfg):
            cfg.tags = tags

        log.info(f"Tags: {cfg.tags}")

    if save_to_file:
        with open(Path(cfg.paths.output_dir, "tags.log"), "w") as file:
            rich.print(cfg.tags, file=file)

@rank_zero_only (from pytorch_lightning.utilities import rank_zero_only):

  • Function that can be used as a decorator to enable a function/method being called only on global rank 0.
  • In other words, in multi-process (DDP) runs the decorated function executes only once, on the rank-0 process.
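
A minimal usage sketch (the function name here is illustrative):

from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def print_banner(msg: str) -> None:
    # Under DDP this prints once, on global rank 0, not once per process.
    print(f"==== {msg} ====")

print_banner("starting training")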

Controlling exceptions with a decorator

@utils.task_wrapper
def train(cfg: DictConfig) -> Tuple[dict, dict]:

Execution time: it runs when train is called, and it controls what happens when the task function fails. This wrapper is useful for:

  • making sure loggers are closed even if the task function raises an exception (prevents failures during multiruns)
  • saving the exception to a .log file
  • marking the run as failed with a dedicated file in the logs/ folder (so we can find and rerun it later), as sketched below
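
A minimal sketch of the pattern (close_loggers is a stand-in for the template's logger-closing helper):

import functools
import logging
from omegaconf import DictConfig

log = logging.getLogger(__name__)

def close_loggers() -> None:
    # stand-in: in the template this flushes/closes loggers, e.g. wandb
    pass

def task_wrapper(task_func):
    @functools.wraps(task_func)
    def wrap(cfg: DictConfig):
        try:
            metric_dict, object_dict = task_func(cfg=cfg)
        except Exception:
            log.exception("Task failed")  # the traceback lands in the .log file
            raise                         # re-raise so the run is marked as failed
        finally:
            close_loggers()               # always runs, even after an exception
        return metric_dict, object_dict
    return wrap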

Preprocessing

Instantiating classes from the config

datamodule: LightningDataModule = hydra.utils.instantiate(cfg.data)

hydra.utils.instantiate automatically instantiates the classes or functions declared in a config file: given config parameters (or other attributes) it creates an instance, resolving the config entries into constructor arguments and returning the result. For example (note the _target_ key, which holds the import path of the class):

from hydra.utils import instantiate

class MyClass:
    def __init__(self, arg1, arg2):
        self.arg1 = arg1
        self.arg2 = arg2

config = {'_target_': '__main__.MyClass', 'arg1': 'value1', 'arg2': 'value2'}
obj = instantiate(config)

And in this code:

  • datamodule is the returned instance, <pointbev.data.datamodule.nuscenes_loader.NuScenesDatamodule object at 0x7f00a18fd350>
  • the colon in datamodule: LightningDataModule is a type annotation that only makes the code more readable; it has no effect at runtime.
  • hydra.utils.instantiate(cfg.data) instantiates the parameters in cfg.data into the NuScenesDatamodule object.

In the same way, the model class is instantiated later on:

model: LightningModule = hydra.utils.instantiate(cfg.model)

Preparing the dataset

    if cfg.get("train"):
        log.info("Starting training!")
        trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.ckpt.path)
        # jump -> /pointbev/data/datamodule/nuscenes_loader.py line 145 setup in class NuScenesDatamodule(pl.LightningDataModule)

The dataset is provided by subclassing pl.LightningDataModule, which supplies the training, validation, and test data:

from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size  # used by the dataloaders below

    def prepare_data(self):
        # Download / split the data here.
        # In distributed training this is only called on 1 GPU/TPU (cuda:0).
        pass

    def setup(self, stage):
        # Define the datasets here (train/val/test split).
        # Called on every process in DDP; `stage` marks the current phase.
        # MyDataset stands for a user-defined torch Dataset.
        if stage == 'fit' or stage is None:
            self.train_dataset = MyDataset(self.train_file_path, self.train_file_num, transform=None)
            self.val_dataset = MyDataset(self.val_file_path, self.val_file_num, transform=None)
        if stage == 'test' or stage is None:
            self.test_dataset = MyDataset(self.test_file_path, self.test_file_num, transform=None)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=1, shuffle=True)

Reference: Pytorch Lightning框架:使用笔记【LightningModule、LightningDataModule、Trainer、ModelCheckpoint】

Some variables worth recording:

  • grid: {'xbound': [-50.0, 50.0, 0.5], 'ybound': [-50.0, 50.0, 0.5], 'zbound': [-10.0, 10.0, 20.0], 'dbound': [4.0, 45.0, 1.0]} stores the range and step size of the x, y, z (and depth) dimensions.

Dataset layout and data processing

The data is stored in self.ixes of class TemporalNuScenesDataset(NuScenesDataset); for example, looking at self.ixes[0]:

{'token': 'b5989651183643369174912bc5641d3b', 'timestamp': 1538984233547259, 'prev': '', 'next': '0bb62a68055249e381b039bf54b0ccf8', 'scene_token': '325cef682f064c55a255f2625c533b75', 'data': {'RADAR_FRONT': '438f2c308c3a48c088b8867c5ea16578', 'RADAR_FRONT_LEFT': '1d9d5e24ff5947ad8365ed42b1ddca32', 'RADAR_FRONT_RIGHT': '8167f5bc42f54a0dbe8de50a6fa224c5', 'RADAR_BACK_LEFT': '384563e2f56c4220bb0d41e420c57634', 'RADAR_BACK_RIGHT': '2183371cd78b4cfe9d49934cba271e2a', 'LIDAR_TOP': '65e07a70e6b5404a87bf34e49d4c0924', 'CAM_FRONT': '524d443c501a4f98a14508c3fb6f6de3', 'CAM_FRONT_RIGHT': 'd6f89460954c43d39ed7c9ac91ab03d0', 'CAM_BACK_RIGHT': 'b3e53998db124133bb9cd832d78d2b11', 'CAM_BACK': 'fd183c135b1f41ea8eb7a3df78d0b1ff', 'CAM_FRONT_LEFT': 'd6986708c5084569bf7a636968070602', 'CAM_BACK_LEFT': '4552459a83ac4259b7592c8d7c87248f'}, 'anns': ['204f3df0559a4f5da0f5f523f680e230', 'c12f1f4ddc44407390f10b007cc24654', ...]}

It contains this frame's token (frame name), timestamp, prev (previous frame), next (next frame), scene_token (scene name), data (radar, lidar and camera data, again as tokens), and anns (annotation-box tokens).
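
Since every sample links to its neighbors through prev/next, a scene can be walked frame by frame. A minimal sketch with the nuscenes-devkit (the dataroot path is illustrative):

from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/nuscenes', verbose=False)

# Start from the first sample of a scene and follow the `next` pointers.
scene = nusc.scene[0]
token = scene['first_sample_token']
while token:
    sample = nusc.get('sample', token)  # same record structure as self.ixes[i]
    print(sample['timestamp'], sample['token'])
    token = sample['next']              # empty string at the end of the scene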

Controlling the learning rate

For learning-rate decay strategies, see: pytorch必须掌握的的4种学习率衰减策略

How do you set learning-rate decay in pytorch_lightning? Exactly as in plain PyTorch, with essentially no changes: just override configure_optimizers():

# Set up the optimizer
def configure_optimizers(self):
    learning_rate = 1e-3
    weight_decay = 1e-6  # L2 regularization coefficient
    # Suppose there are two networks, an encoder and a decoder.
    # Note: each param group must use the key 'params'.
    optimizer = optim.Adam(
        [{'params': self.encoder.parameters()}, {'params': self.decoder.parameters()}],
        lr=learning_rate, weight_decay=weight_decay)
    # With a single network it is even more direct:
    # optimizer = optim.Adam(self.parameters(), lr=learning_rate, weight_decay=weight_decay)
    # Here the learning rate is halved after 2000 epochs and then kept fixed
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2000], gamma=0.5)
    optim_dict = {'optimizer': optimizer, 'lr_scheduler': scheduler}
    return optim_dict

Dataloader

The basics:

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.traindata,……
        )

After loading, execution jumps to on_after_batch_transfer, which can be overridden to alter the batch, or apply batched augmentations, after it has been transferred to the device. Example:

def on_after_batch_transfer(self, batch, dataloader_idx):
    # gpu_transforms stands for user-defined augmentations that run on the GPU
    batch['x'] = gpu_transforms(batch['x'])
    return batch

Forward pass

First we enter the forward function of PointBeV(Network) (path: pointbev/models/sampled.py):

  • imgs: image input, shape 1×1×6×3×224×480, i.e. B×T×N×C×H×W
  • rots: camera-to-ego rotation matrices, shape 1×1×6×3×3, i.e. B×T×N×(3×3)
  • trans: camera-to-ego translation vectors, shape 1×1×6×3×1, i.e. B×T×N×(3×1)
  • intrins: camera-to-image intrinsic matrices, shape 1×1×6×3×3, i.e. B×T×N×(3×3)
  • bev_aug: augmentation matrices used to shift the BEV around the ego position, shape 1×1×4×4, i.e. B×T×(4×4)
  • B is the batch size, T the number of temporal frames, N the number of cameras (usually 6).

 def forward(self, imgs, rots, trans, intrins, bev_aug, egoTin_to_seq, **kwargs):
        (
            dict_shape,
            dict_vox,
            dict_img,
            dict_mat,
        ) = self._common_init_backneck_prepare_vt(
            imgs, rots, trans, intrins, bev_aug, egoTin_to_seq
        )
        # dict_img: 1 6 128 28 60

        sampling_imgs = {
            "lidar": kwargs.get("lidar_img", None),
            "hdmap": kwargs.get("hdmap", None),
        }
        out, masks, tracks = self.forward_coarse_and_fine(
            dict_img,
            dict_mat,
            dict_shape,
            dict_vox,
            sampling_imgs,
        )
        # out: binimg: 1 1 1 200 200  offsets: 1 1 2 200 200  centerness: 1 1 1 200 200
        dict_out = {"bev": out}
        dict_out["masks"] = {"bev": masks}
        dict_out["tracks"] = tracks

        return dict_out

Image feature extraction

The first function, _common_init_backneck_prepare_vt, lives in the common Network(nn.Module) class and packs up some basic data:

Shared among models:

  • dictionary initialization
  • forward backbone and neck
  • preparation of the view transformation
  • extension of the input for the temporal models

img_feats = self.forward_backneck(imgs)

def forward_backneck(self, imgs):
    # Backbone and Neck
    btn = imgs.shape[:3]
    imgs = self._prepare_backneck(imgs)  # i.e. imgs = rearrange(imgs, "b t n c h w -> (b t n) c h w")
    imgs_feats = self.neck(self.backbone(imgs))
    imgs_feats = self._arrange_backneck(btn, imgs_feats)  # i.e. img_feats = rearrange(img_feats, "(b t n) c h w -> (b t) n c h w")
    return imgs_feats

Where:

  • self.backbone: EfficientNet(Backbone); outputs: torch.Size([6, 56, 28, 60]) and torch.Size([6, 160, 14, 30])
  • self.neck: AGPNeck(Backbone); output: torch.Size([6, 128, 28, 60])

AGPNeck(
  (align_res_layer): AlignRes(
    (layers): ModuleList(
      (0): Identity()
      (1): Upsample(scale_factor=2.0, mode='bilinear')
    )
  )
  (prepare_c_layer): PrepareChannel(
    (layers): Sequential(
      (0): Conv2d(216, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
      (5): ReLU(inplace=True)
    )
    (tail): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
  )
)

  • align_res_layer: brings layers of different resolutions up to the same resolution. Of the earlier torch.Size([6, 56, 28, 60]) and torch.Size([6, 160, 14, 30]), the first is unchanged and the second is upsampled 2×, giving torch.Size([6, 56, 28, 60]) and torch.Size([6, 160, 28, 60]).
  • group_method: how the upsampled layers are gathered. group_method=lambda x: torch.cat(x, dim=1) concatenates them along channels: torch.Size([6, 216, 28, 60])
  • prepare_c_layer: changes the channels of the grouped layers to align with the network; see the sketch below. Output: torch.Size([6, 128, 28, 60])
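
Putting the three steps together, a minimal sketch of the neck's data flow (shapes taken from the printout above; a simplification, not the actual AGPNeck code):

import torch
import torch.nn as nn

# Two backbone feature maps at different strides.
f1 = torch.randn(6, 56, 28, 60)
f2 = torch.randn(6, 160, 14, 30)

# align_res_layer: bring both maps to the same resolution.
up = nn.Upsample(scale_factor=2.0, mode='bilinear')
f2 = up(f2)                        # -> (6, 160, 28, 60)

# group_method: concatenate along the channel dimension.
x = torch.cat([f1, f2], dim=1)     # -> (6, 216, 28, 60)

# prepare_c_layer: project to the network's common channel width.
proj = nn.Conv2d(216, 128, kernel_size=3, padding=1, bias=False)
print(proj(x).shape)               # torch.Size([6, 128, 28, 60])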

BEVFormer CamProjector

def forward(self, dict_mat, dict_shape, dict_vox)

Inspect the contents and shapes of the input dictionaries:

print([f'{key}: {dict_mat[key].shape}' for key in dict_mat.keys()])
print([f'{key}: {dict_vox[key].shape}' for key in dict_vox.keys() if dict_vox[key] is not None])

Inputs:

  • dict_mat: ['rots: torch.Size([1, 6, 3, 3])', 'trans: torch.Size([1, 6, 3, 1])', 'intrins: torch.Size([1, 6, 3, 3])', 'bev_aug: torch.Size([1, 1, 4, 4])', 'egoTin_to_seq: torch.Size([1, 1, 4, 4])']
  • dict_vox: ['vox_coords: torch.Size([1, 3, 200, 200, 8])', 'vox_idx: torch.Size([1, 3, 200, 200, 8])']

PointBeV BEV feature generation

out, masks, tracks = self.forward_coarse_and_fine(
    dict_img, dict_mat, dict_shape, dict_vox, sampling_imgs,
)

Inside it:

dict_vox.update(self.projector(dict_mat, dict_shape, dict_vox))

bev_feats, mask, vox_idx = self.view_transform(
    img_feats,
    dict_vox,
)

kwargs = {
    "feats": bev_feats,
    "indices": vox_idx,
    "spatial_shape": self.coord_selector.spatial_range[:2],
    "batch_size": b * t,
    "from_dense": False,
}
bev_feats = decoder(**kwargs)

bev_feats = self.forward_temporal(bev_feats, dict_shape)

# Heads
dict_out = self.forward_heads(bev_feats, dict_shape)
mask_dict = {k: mask for k in dict_out.keys()}
return dict_out, mask_dict, bev_feats

GridSampleVT

  • Inputs:
    • image features: torch.Size([1, 6, 128, 28, 60])
    • dict_vox, which contains: dict_keys(['vox_feats', 'vox_valid', 'vox_coords', 'voxcam_coords', 'vox_idx'])
  • Outputs:
    • BEV features: torch.Size([40000, 128])
    • mask: torch.Size([1, 1, 1, 200, 200])
    • vox_idx: torch.Size([40000, 3])

GridSampleVT(
  (coordembd): PositionalEncodingMap(
    (layer): MLP(
      (act): GELU(approximate='none')
      (layers): ModuleList(
        (0): Linear(in_features=48, out_features=256, bias=True)
        (1): Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=128, bias=True)
      )
    )
  )
  (compressor): MLP(
    (act): ReLU(inplace=True)
    (layers): ModuleList(
      (0): Conv2d(1024, 128, kernel_size=(1, 1), stride=(1, 1))
      (1-3): 3 x Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
    )
  )
)
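
As the name suggests, the view transform samples image features at projected voxel coordinates. A minimal sketch of the underlying torch.nn.functional.grid_sample mechanics (shapes are illustrative; not the module's exact code):

import torch
import torch.nn.functional as F

feats = torch.randn(6, 128, 28, 60)        # per-camera image features (N, C, H, W)

# Projected voxel coordinates, normalized to [-1, 1] as grid_sample expects
# (random stand-ins here; in the model they come from the projector).
grid = torch.rand(6, 40000, 1, 2) * 2 - 1  # (N, n_points, 1, 2), (x, y) order

sampled = F.grid_sample(feats, grid, align_corners=False)
print(sampled.shape)                       # torch.Size([6, 128, 40000, 1])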

Decoder: SparseUNet

Location: pointbev/models/autoencoder/sparse_resnet.py

Inputs:

  • BEV features: torch.Size([40000, 128])
  • vox_idx: torch.Size([40000, 3])

Output: BEV features, as SparseConvTensor[shape=torch.Size([40000, 128])]

Head: BEVConvHead

Location: pointbev/models/heads/convn.py

Input: SparseConvTensor[shape=torch.Size([40000, 128])]

Output: a dictionary containing:

  • 'binimg': SparseConvTensor[shape=torch.Size([40000, 1])]
  • 'offsets': SparseConvTensor[shape=torch.Size([40000, 2])]
  • 'centerness': SparseConvTensor[shape=torch.Size([40000, 1])]

This part still needs more study; I have not fully understood it yet.

Backpropagation and post-processing

Loss computation

losses, loss = self._common_step_losses(preds, batch)

Loss function (multi-task):

ModuleDict(
  (binimg): ModuleDict(
    (T0_P0): BCELoss(
      (loss_fn): BCEWithLogitsLoss()
    )
  )
  (centerness): SpatialLoss()
  (offsets): SpatialLoss()
)
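
The binimg branch trains against a binary occupancy image with BCEWithLogitsLoss, which operates on raw logits (the sigmoid is fused into the loss). A quick self-contained reminder:

import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()

logits = torch.randn(1, 1, 200, 200)                  # raw output, no sigmoid
target = (torch.rand(1, 1, 200, 200) > 0.9).float()   # binary occupancy map

print(loss_fn(logits, target))  # scalar loss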

Code:

def _common_step_losses(self, preds, batch):
    losses = {}
    total_loss = 0.0

    def _update_total_loss(total_loss, loss, name, weighting):
        (weight, uncertainty) = weighting(name)
        return total_loss + loss * weight + uncertainty

    update_total_loss = partial(_update_total_loss, weighting=self.weighting)

    # Pipeline losses
    keys = self.dict_losses.keys()
    bev_losses = self.dict_losses["bev"] if "bev" in keys else None

    # Masks: 0 to remove, 1 to keep.
    dict_masks = self._get_masks(preds, batch)

    # Single element:
    # -> Centerness, Offsets
    for l_dict, l_pip, l_key, pred_key, target_key, l_mask, l_bool in zip(
        [bev_losses, bev_losses],
        ["bev", "bev"],
        ["centerness", "offsets"],
        ["centerness", "offsets"],
        ["centerness", "offsets"],
        [dict_masks["centerness"], dict_masks["binimg"]],
        [self.with_centr_offs, self.with_centr_offs],
    ):
        if not l_bool:
            continue

        l_bev_loss = l_dict[l_key]
        l_pred = preds[l_pip][pred_key]
        # ! Trace only present
        l_target = batch[target_key][:, -1:]

        loss = l_bev_loss(l_pred, l_target, l_mask)
        name = f"{l_pip}/{l_key}"
        losses.update({name: loss})
        total_loss = update_total_loss(total_loss, loss, name)

    # -> Dictionaries:
    # Binimg, HDMap
    for l_key, pred_key, target_key, l_mask, l_bool in zip(
        ["binimg", "hdmap"],
        ["binimg", "hdmap"],
        ["binimg", "hdmap"],
        [dict_masks["binimg"], None],
        [self.with_binimg, self.with_hdmap],
    ):
        if not l_bool:
            continue
        l_bev_losses = bev_losses[l_key]
        l_preds = preds["bev"][pred_key]
        l_targets = batch[target_key]

        for k, l in l_bev_losses.items():
            loss = l(l_preds, l_targets, l_mask)
            name = f"bev/{l_key}/{k}"
            losses.update({name: loss})
            total_loss = update_total_loss(total_loss, loss, name)
    return losses, total_loss / len(losses)
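
The weighting(name) call returns a (weight, uncertainty) pair per loss name, which suggests learnable uncertainty-based multi-task weighting in the spirit of Kendall et al. (2018). A minimal sketch of that idea, with one learnable log-variance per task (my reading of the pattern, not the verbatim PointBeV code):

import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """One learnable log-variance per loss; returns (weight, regularizer)."""

    def __init__(self, names):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {n: nn.Parameter(torch.zeros(())) for n in names}
        )

    def forward(self, name):
        log_var = self.log_vars[name]
        # weight = exp(-log_var); log_var itself is added as a penalty so the
        # model cannot silence a task by inflating its uncertainty.
        return torch.exp(-log_var), log_var

weighting = UncertaintyWeighting(['centerness', 'offsets'])
weight, uncertainty = weighting('offsets')
loss = torch.tensor(0.5)
total_loss = loss * weight + uncertainty  # mirrors _update_total_loss above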

debug

debug_hooks

A hard-to-read piece of code:

def execute_once(is_hook=True):
    def decorator(f):
        dict_cnt = {}

        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            nonlocal dict_cnt
            if is_hook:
                module = args[0]
                name = module.__class__.__name__
            else:
                name = f.__qualname__
            if name not in dict_cnt.keys():
                dict_cnt[name] = 1
                return f(*args, **kwargs)

        return wrapper
    return decorator

Step by step:

  • First, def execute_once(is_hook=True): defines a function execute_once with one parameter, is_hook, defaulting to True. This parameter decides how the decorated function's identity is determined.
  • The final return decorator means execute_once returns the decorator function, so execute_once itself is a decorator factory.
  • One level in, def decorator(f): defines a nested function decorator that takes a function f as its argument; this f is the function being decorated.
  • return wrapper hands back the wrapping function wrapper, which replaces f.
  • dict_cnt = {} defines a dictionary inside decorator that records which function names have already executed.
  • @functools.wraps(f) is a decorator that copies the metadata of the original f (function name, docstring, etc.) onto the wrapper.
  • def wrapper(*args, **kwargs): defines the wrapper, accepting arbitrary positional arguments *args and keyword arguments **kwargs.
  • nonlocal dict_cnt declares dict_cnt as non-local, allowing wrapper to modify the dict_cnt in decorator's scope.
  • if name not in dict_cnt.keys(): checks whether name has not been seen yet; dict_cnt[name] = 1 then records it in the dictionary.
  • return f(*args, **kwargs) executes the original f, and returns its result, only on that first occurrence.

In summary, this code defines a decorator: with is_hook=True it ensures that each module class runs the hook only once; with is_hook=False it ensures each function runs only once in the whole program. This is implemented by tracking executions of each function (or method) in the dict_cnt dictionary.

@execute_once()
@rank_zero_only
def debug_hook(module, input, output):
    """Note: torch hooks do not work with kwargs passed as inputs."""
    print("Class:", module.__class__.__name__)

    print("Inputs:")
    print_shape(input)

    print("Outputs:")
    print_shape(output)

    torch.cuda.reset_peak_memory_stats()
    print()

This function prints the debug-hook output. @execute_once() applies the decorator above, making sure the hook fires only once per class (or once per program), and @rank_zero_only is the Lightning utility that restricts execution to the rank-0 process.

So how does the code register this hook to print the state of the network?

from pointbev.utils.debug import debug_hook
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_forward_hook(debug_hook)
        return

    def forward(self, x, return_all=False):
        raise NotImplementedError()

In other words, Backbone has the debug_hook attached, and the EfficientNet(Backbone) we use inherits from Backbone. What, then, is register_forward_hook in this code? It is PyTorch's hook mechanism; see: 【PyTorch】 register_forward_hook()简单用法
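
A self-contained example of register_forward_hook, independent of PointBeV:

import torch
import torch.nn as nn

def shape_hook(module, inputs, output):
    # Called after every forward() of the module it is registered on.
    print(module.__class__.__name__,
          [t.shape for t in inputs], '->', output.shape)

layer = nn.Linear(8, 4)
handle = layer.register_forward_hook(shape_hook)

layer(torch.randn(2, 8))  # Linear [torch.Size([2, 8])] -> torch.Size([2, 4])
handle.remove()           # detach the hook when done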

Environment installation

Problem: when installing the author's sparse-gs library: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

Solution:

Edit lib/python3.11/site-packages/torch/utils/cpp_extension.py inside the anaconda environment: change ['ninja','-v'] on line 1858 to ['ninja','--version'], and change 'ninja --version' on line 1636 to 'ninja -v'.

Problem: when installing the author's sparse-gs library: error: command '/usr/local/cuda-11.3/bin/nvcc' failed with exit code 1

Suspected cause: CUDA version mismatch. The CUDA used by Python is 11.8 while the system has 11.3. nvcc itself is probably not the culprit, since the error persisted even after installing the 11.8 nvcc.

Solution: switch to a dev machine whose CUDA version is 11.8.

Also add support for 11.8 after lines 60 and 72 of lib/python3.11/site-packages/torch/utils/cpp_extension.py: '11.8': ((6, 0, 0), (12, 0)) and '11.8': (MINIMUM_CLANG_VERSION, (14, 0)). I seem to have run into a similar problem before, and it likewise only installed successfully on an 11.8 machine.