Text data
Text is split into words, each word is mapped to an index, and a sentence becomes a vector of indices. When the data enters the model, the index vector has to be turned into a matrix, i.e. each word becomes a vector. At prediction time, the output vectors have to be turned back into indices so they can be matched against the words in the dictionary.

If the index-to-vector mapping is random, then going back from a random vector to an index introduces a large error. Instead, a set of vector parameters can be trained before the data enters the model, so that index -- vector parameters -- index is a one-to-one correspondence: an index, say index=2, corresponds to vector a, and whenever vector a appears it is immediately identified as index=2.

Simplifying further, we can directly solve for a parameter matrix: func(index) ~ func(param_mat) ~ one-hot vector ~ index of the maximum value. The model takes the index as input; param_mat carries both the index information and the patterns the model has learned, and it can be turned back into a one-hot vector through func.
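A minimal sketch of this round trip (the variable names are illustrative, nothing here beyond torch/nn is a fixed API): nn.Embedding maps an index to a vector, and at prediction time a vector of per-class scores is mapped back to an index with argmax.

import torch
from torch import nn

dict_count = 3                                  # vocabulary of 3 words: indices 0, 1, 2
embed = nn.Embedding(num_embeddings=dict_count, embedding_dim=dict_count)

idx = torch.tensor([2])                         # a single word index
vec = embed(idx)                                # index -> vector, shape (1, 3)

# at prediction time the model emits a score vector per position;
# argmax turns it back into an index that can be looked up in the dictionary
scores = torch.tensor([[0.1, 0.2, 0.9]])
pred_idx = scores.argmax(dim=-1)                # tensor([2])
print(vec.shape, pred_idx)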
Cross-entropy example on text
import torch
from torch import nn

dict_count = 3   # three words in the vocabulary, indices 0, 1, 2
seq_len = 5
batch_size = 2

# `low` (inclusive) and `high` (exclusive)
data = torch.randint(low=0, high=dict_count, size=(batch_size, seq_len))
data
tensor([[0, 0, 2, 1, 2],
        [0, 2, 0, 0, 0]])
label = data
label.shape
torch.Size([2, 5])

# encode X
embed = nn.Embedding(num_embeddings=dict_count, embedding_dim=dict_count, padding_idx=0)
X = embed(data)
Now compute the loss between X and label.
Note that X and label do not have the same shape; the cross-entropy loss handles this automatically, but the first dimension must match. In this usage the input is kept to at most 2 dimensions, so anything higher is flattened first. The vector X[0] is treated as corresponding to the one-hot vector obtained from y[0].
X = X.reshape(-1, dict_count)
label = label.reshape(-1)
print(X.shape, label.shape)   # torch.Size([10, 3]) torch.Size([10])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(X, label)
loss
tensor(1.3767, grad_fn=<NllLossBackward0>)
torch.nn.CrossEntropyLoss()

From the docs: "This criterion computes the cross entropy loss between input logits and target."

Init signature:
nn.CrossEntropyLoss(
    weight: Optional[torch.Tensor] = None,
    size_average=None,
    ignore_index: int = -100,
    reduce=None,
    reduction: str = 'mean',
    label_smoothing: float = 0.0,
) -> None

Usage:
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(input, target)
input: the model output (logits)
target: the labels

input

There is nothing special to watch out for in input; the important part is target, discussed below.

From the docs: "The `input` is expected to contain the unnormalized logits for each class (which do `not` need to be positive or sum to 1, in general). `input` has to be a Tensor of size :math:`(C)` for unbatched input, :math:`(minibatch, C)` or :math:`(minibatch, C, d_1, d_2, ..., d_K)` with :math:`K \geq 1` for the `K`-dimensional case. The last being useful for higher dimension inputs, such as computing cross entropy loss per-pixel for 2D images."

So the shape of the input to the cross-entropy loss can be:
- [C]: no batch dimension, only the feature dimension, i.e. one row of data, a single sample. This feature dimension is the size of the model's last layer and also the size of the label; for example the model outputs [0.1, 0.9] and the loss is computed against the label [0, 1]. This is the standard form, translated directly from the mathematical definition of cross entropy.
- [batch_size, C]: the loss can also be computed over a batch.
- [batch_size, C, H, W]: the typical layout of 2D images; this format can be fed to the cross-entropy loss directly as well.

target

From the docs: "The `target` that this criterion expects should contain either: - Class indices in the range :math:`[0, C)` where :math:`C` is the number of classes;"

Suppose the number of classes is C = 3; then

for i in range(3):
    print(i)

prints 0, 1, 2. In other words, with 3 classes the labels should be encoded as [0, 1, 2]. This is the "index label" form of cross entropy used below.
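To make the shape rules concrete, a small sketch (all values are made up): the input can be (C), (minibatch, C) or (minibatch, C, d_1, ..., d_K), and the target holds class indices in [0, C). Note the unbatched (C)-shaped input requires a reasonably recent PyTorch version.

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()
C = 3

# one unbatched sample: input of shape (C), target a scalar class index
x1 = torch.randn(C)
y1 = torch.tensor(1)
print(loss_fn(x1, y1))

# a batch: input (N, C), target (N,)
x2 = torch.randn(4, C)
y2 = torch.randint(0, C, (4,))
print(loss_fn(x2, y2))

# per-pixel loss for 2D images: input (N, C, H, W), target (N, H, W)
x3 = torch.randn(2, C, 5, 5)
y3 = torch.randint(0, C, (2, 5, 5))
print(loss_fn(x3, y3))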
loss = loss_fn(y_pred, y), where y is a 1-D list

import os
from tpf import pkl_save, pkl_load

BASE_DIR = "/wks/datasets/hotel_reader"
file_path = os.path.join(BASE_DIR, 'data_pkl/word.pkl')
X_train, y_train, X_test, y_test, words_set, word2idx, idx2word = pkl_load(file_path)

# vocabulary size
dict_len = len(words_set)

# sequence length
seq_len = 512

# training set: pad every review with <PAD>, then truncate to seq_len words
X_train1 = []
for x in X_train:
    temp = x + ["<PAD>"] * seq_len
    X_train1.append(temp[:seq_len])

# test set
X_test1 = []
for x in X_test:
    temp = x + ["<PAD>"] * seq_len
    X_test1.append(temp[:seq_len])

"""
Words to indices
"""
# training set
X_train2 = []
for x in X_train1:
    temp = []
    for word in x:
        idx = word2idx[word] if word in word2idx else word2idx["<UNK>"]
        temp.append(idx)
    X_train2.append(temp)

# test set
X_test2 = []
for x in X_test1:
    temp = []
    for word in x:
        idx = word2idx[word] if word in word2idx else word2idx["<UNK>"]
        temp.append(idx)
    X_test2.append(temp)

"""
Build the datasets
"""
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch
from torch import nn
from torch.nn import functional as F

class MyDataSet(Dataset):
    def __init__(self, X=X_train2, y=y_train):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        y = self.y[idx]
        return torch.tensor(data=x).long(), torch.tensor(data=y).long()

train_dataset = MyDataSet(X=X_train2, y=y_train)
train_dataloader = DataLoader(dataset=train_dataset, shuffle=True, batch_size=128)

test_dataset = MyDataSet(X=X_test2, y=y_test)
test_dataloader = DataLoader(dataset=test_dataset, shuffle=False, batch_size=256)

train_dataset[0][0][:7]
tensor([ 5004, 10135, 20586,  9530, 16808, 14513,  9530])

train_dataset[0][1]
1

for X, y in test_dataloader:
    print(X.shape, y.shape)
    break
torch.Size([256, 512]) torch.Size([256])

In a binary classification problem, loss = loss_fn(y_pred, y), and y is a 1-D list.
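To show where loss = loss_fn(y_pred, y) would sit in this pipeline, here is a hypothetical minimal classifier (the architecture is purely illustrative, not the original model, and it assumes dict_len and train_dataloader from the snippet above): it embeds the index sequence, averages over the sequence, and emits 2 logits per review, while y stays a 1-D tensor of class indices.

class TextClassifier(nn.Module):
    """A made-up minimal model, only to illustrate the loss call."""
    def __init__(self, dict_len, embed_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(dict_len, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):              # x: (batch_size, seq_len) word indices
        out = self.embed(x)            # (batch_size, seq_len, embed_dim)
        out = out.mean(dim=1)          # average pooling over the sequence
        return self.fc(out)            # (batch_size, num_classes) logits

model = TextClassifier(dict_len)
loss_fn = nn.CrossEntropyLoss()
for X, y in train_dataloader:
    y_pred = model(X)                  # shape (128, 2)
    loss = loss_fn(y_pred, y)          # y is 1-D, shape (128,)
    break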
The label y has type long; classes are represented as integers.

"""
Build the dataset
"""
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch
from torch import nn
from torch.nn import functional as F

class MyDataSet(Dataset):
    def __init__(self, X, y):
        """Build the dataset"""
        self.X = X
        self.y = y.reshape(-1)
        # self.y = y.reshape(-1, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        y = self.y[idx]
        return torch.tensor(data=x).float(), torch.tensor(data=y).long()
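A quick usage check with made-up data to confirm the dtypes this Dataset returns (float features, long labels):

X_demo = np.random.rand(10, 4)                   # 10 samples, 4 float features
y_demo = np.random.randint(0, 2, size=(10, 1))   # 0/1 labels; reshape(-1) flattens them

ds = MyDataSet(X_demo, y_demo)
x0, y0 = ds[0]
print(x0.dtype, y0.dtype)                        # torch.float32 torch.int64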
import torch
from torch import nn
import numpy as np

input = torch.tensor([0, 0, 1], dtype=torch.float32, requires_grad=True)
# target = torch.tensor(2, dtype=torch.long)             # 0.5514
target = torch.tensor([0, 1, 0], dtype=torch.float32)    # 1.5514
# target = torch.tensor(0, dtype=torch.long)             # 1.5514

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output
tensor(1.5514, grad_fn=<DivBackward1>)
import torch
from torch import nn
import numpy as np

input = torch.tensor([0, 0, 1], dtype=torch.float32, requires_grad=True)
# target = torch.tensor(2, dtype=torch.long)             # 0.5514
# target = torch.tensor([0, 1, 0], dtype=torch.float32)  # 1.5514
target = torch.tensor(0, dtype=torch.long)               # 1.5514

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output
tensor(1.5514, grad_fn=<NllLossBackward0>)

This example shows that when the target is an integer class index, cross entropy implicitly treats it as the corresponding one-hot label; no manual conversion is needed, and the result matches the one-hot target exactly.
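Both numbers can be checked by hand: with logits [0, 0, 1], the softmax probability of class 0 is e^0 / (e^0 + e^0 + e^1) and that of class 2 is e^1 / (e^0 + e^0 + e^1); the loss is the negative log of the probability of the target class.

import math

logits = [0, 0, 1]
denom = sum(math.exp(v) for v in logits)   # e^0 + e^0 + e^1
p_class0 = math.exp(logits[0]) / denom     # ~0.2119
p_class2 = math.exp(logits[2]) / denom     # ~0.5761
print(-math.log(p_class0))                 # ~1.5514
print(-math.log(p_class2))                 # ~0.5514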
reduction: str = 'mean'
Every sample/label pair has its own loss. When the input is a batch, cross entropy by default returns the mean of the per-sample losses. The reduce parameter is deprecated; use reduction instead, whose default value is 'mean'. Here batch_size = N: in the docs' notation, :math:`x` is the input, :math:`y` is the target, :math:`w` is the weight, :math:`C` is the number of classes, and :math:`N` spans the minibatch dimension.
Computing the loss for a batch
import torch
from torch import nn
import numpy as np

# batch of two samples -> loss 1.0514
x = [[0, 0, 1], [0, 0, 1]]
y = [[0, 1, 0], [0, 0, 1]]

# single sample -> 1.5514
# x = [[0, 0, 1]]
# y = [[0, 1, 0]]

# single sample -> 0.5514
# x = [[0, 0, 1]]
# y = [[0, 0, 1]]

input = torch.tensor(x, dtype=torch.float32, requires_grad=True)
target = torch.tensor(y, dtype=torch.float32)

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output

(0.5514 + 1.5514) / 2 = 1.0514. Every sample/label pair has its own loss; when the input is a batch, cross entropy by default returns the mean of the per-sample losses.
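Passing reduction='none' makes the averaging visible; reusing input and target from the batch example above, it returns each sample's loss instead of their mean.

loss_none = nn.CrossEntropyLoss(reduction='none')
per_sample = loss_none(input, target)      # tensor([1.5514, 0.5514])
print(per_sample, per_sample.mean())       # the mean is 1.0514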
Converting to index labels gives the same result
import torch
from torch import nn
import numpy as np

# loss 1.0514
x = [[0, 0, 1], [0, 0, 1]]
y = [[0, 1, 0], [0, 0, 1]]
y = [1, 2]   # the class indices matching the one-hot rows above; 3 classes in total

input = torch.tensor(x, dtype=torch.float32, requires_grad=True)
target = torch.tensor(y, dtype=torch.long)

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output
tensor(1.0514, grad_fn=<NllLossBackward0>)
Deriving the cross-entropy formula from the PyTorch CrossEntropyLoss documentation
Probabilities for each class; useful when labels beyond a single class per minibatch item are required, such as for blended labels, label smoothing, etc. The unreduced (i.e. with :attr:`reduction` set to ``'none'``) loss for this case can be described as:
.. math:: \ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - \sum_{c=1}^C w_c \log \frac{\exp(x_{n,c})}{\sum_{i=1}^C \exp(x_{n,i})} y_{n,c}
where :math:`x` is the input, :math:`y` is the target, :math:`w` is the weight, :math:`C` is the number of classes, and :math:`N` spans the minibatch dimension as well as :math:`d_1, ..., d_k` for the `K`-dimensional case.
The fraction inside the formula above is just softmax: take the exponential of every element of x and divide by the sum of the exponentials, so the results sum to 1. What it really expresses is that the probabilities sum to 1. For example, in a binary classification problem the classes are [happens, does not happen] or [yes, no]; the total number of classes C is the full set, and an item that is not in one class must be in the other. Softmax is applied to the model output. Why not apply it to the labels as well? Because the labels are fixed; you encoded them yourself, so which class they belong to is settled and immediately computable. The model output, on the other hand, can be all over the place: you hope the model outputs 0.9, close to your label 1, but it may well output -0.7, which directly violates the basic requirement that a probability be non-negative and that the probabilities sum to 1. What to do? The usual fix is to pass the output through softmax, turning the model's raw differences between classes into differences that obey probability theory.
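A quick illustration of that normalization (the numbers are made up): raw logits may be negative, but softmax maps them to non-negative values that sum to 1.

import torch

logits = torch.tensor([-0.7, 0.9])         # raw model output, not probabilities
probs = torch.softmax(logits, dim=-1)      # roughly tensor([0.1680, 0.8320])
print(probs, probs.sum())                  # non-negative, sums to 1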
x_n is a vector, the n-th sample in the batch, with C dimensions (C is the number of classes). y_n is the label, also a vector with C dimensions. In particular, if the labels are one-hot encoded (which in practice is usually the case), only one dimension of y_n is 1 and all the others are 0. For example, in a binary classification problem the one-hot encoding looks like this: "happens" is the vector [1, 0] and "does not happen" is [0, 1]; the two are completely symmetric in distance. Since 0 times anything is still 0, the sum over the C dimensions only needs the dimension where the label is 1. Ignoring the weight for now, the formula above simplifies to the expression below.
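This is exactly the class-index form of the loss in the PyTorch documentation, with the weight term dropped:

.. math:: l_n = - \log \frac{\exp(x_{n, y_n})}{\sum_{i=1}^C \exp(x_{n,i})}

where :math:`y_n` is now the index of the 1 in the one-hot label.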
In this way the cross-entropy loss of each sample against its label becomes the information content (self-information) of the model's output for the true class. The goal of training becomes driving this information content down; once it is small enough that the correct class is easy to pick out, the goal is reached. x_n is a vector, and x_{n,c} is the component of x at the index where y_n is 1, because the other positions are multiplied by 0 and drop out. P(x_n) is therefore just the softmax probability that the component at the label-1 index accounts for within x_n.
It is useful when training a classification problem with `C` classes. If provided, the optional argument :attr:`weight` should be a 1D `Tensor` assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
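A sketch of the weight argument with arbitrary numbers: in an imbalanced 3-class problem, the rarer classes can be given larger weights so that their errors contribute more to the loss.

import torch
from torch import nn

# hypothetical class weights: class 0 is common, classes 1 and 2 are rare
class_weights = torch.tensor([0.2, 1.0, 1.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 3)                 # batch of 4 samples, 3 classes
labels = torch.randint(0, 3, (4,))
print(loss_fn(logits, labels))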