# Shared

## Functions

|   Item   |                                                   PyTorch                                                   |                                     TensorFlow                                    |                                                     Others                                                     |
| :------: | :---------------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------: |
|   clip   |                   [`torch.clamp`](https://pytorch.org/docs/stable/torch.html#torch.clamp)                   | [`tf.clip_by_value`](https://www.tensorflow.org/api_docs/python/tf/clip_by_value) |                                                                                                                |
| one\_hot | [`torch.nn.functional.one_hot`](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) |       [`tf.one_hot`](https://www.tensorflow.org/api_docs/python/tf/one_hot)       |                                                                                                                |
|  Linear  |                  [`Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)                 |  [`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) |                   TF Lattice [`Linear`](https://www.tensorflow.org/lattice/api_docs/python/tfl/layers/Linear)                  |
|  einsum  |                   [`einsum`](https://pytorch.org/docs/stable/generated/torch.einsum.html)                   |          [`einsum`](https://www.tensorflow.org/api_docs/python/tf/einsum)         | [Introduction](https://www.youtube.com/watch?v=pkVwUVEHmfI) [Usage](https://theaisummer.com/einsum-attention/) |
|   where  |                    [`where`](https://pytorch.org/docs/stable/generated/torch.where.html)                    |           [`where`](https://www.tensorflow.org/api_docs/python/tf/where)          |                                                                                                                |
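
For a quick side-by-side feel of the rows above, here is a minimal sketch (assuming both frameworks are installed; results noted in comments):

```python
import tensorflow as tf
import torch

# clip: same semantics, different names
torch.clamp(torch.tensor([-2.0, 0.5, 3.0]), min=0.0, max=1.0)     # tensor([0.0, 0.5, 1.0])
tf.clip_by_value(tf.constant([-2.0, 0.5, 3.0]), 0.0, 1.0)         # [0.0, 0.5, 1.0]

# one_hot: integer class indices -> one-hot vectors
torch.nn.functional.one_hot(torch.tensor([0, 2]), num_classes=3)  # [[1, 0, 0], [0, 0, 1]]
tf.one_hot([0, 2], depth=3)                                       # same rows, as float32
```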

### Comprehension

#### Softmax

Softmax rescales the input tensor along a given dim so that the elements along that dim lie in the range [0, 1] and sum to 1. The softmax function is defined as: $$\mathrm{Softmax}(x_i)=\frac{\exp(x_i)}{\sum_j \exp(x_j)}$$
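
As a quick numeric check (a minimal sketch comparing the formula against PyTorch's built-in):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
manual = torch.exp(x) / torch.exp(x).sum()  # tensor([0.0900, 0.2447, 0.6652]), sums to 1
builtin = torch.softmax(x, dim=-1)
assert torch.allclose(manual, builtin)
```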

Softmax is typically used in classification, where the last dim holds one score per label/class; softmax turns those scores into a probability distribution, magnifying the differences between them. For example:

```python
import torch
from torch import nn
from transformers import BertModel

class BertClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.bert = BertModel(config)
        self.dense = nn.Linear(config.hidden_size, config.num_labels)  # FC layer projecting to label scores
        self.pred = nn.Softmax(dim=-1)  # rescale label scores into probabilities
```

## Loss

### What it really does

Quote from [albanD](https://discuss.pytorch.org/u/albanD):

> when you do loss.backward(), it is a shortcut for **`loss.backward(torch.Tensor([1]))`**. This is only valid if loss is a tensor containing a single element. DataParallel returns to you the partial loss that was computed on each gpu, so you usually want to do loss.backward(torch.Tensor([1, 1])) or loss.sum().backward(). Both will have the exact same behaviour.

Ref: [Loss.backward() raises error ‘grad can be implicitly created only for scalar outputs’](https://discuss.pytorch.org/t/loss-backward-raises-error-grad-can-be-implicitly-created-only-for-scalar-outputs/12152)
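
A minimal sketch of the two equivalent options from the quote (hypothetical tensors, just to contrast scalar and non-scalar backward):

```python
import torch

x = torch.ones(2, requires_grad=True)
scalar_loss = (3 * x).sum()    # single-element tensor
scalar_loss.backward()         # shortcut for scalar_loss.backward(torch.tensor(1.0))
print(x.grad)                  # tensor([3., 3.])

y = torch.ones(2, requires_grad=True)
partial_losses = 3 * y         # non-scalar, e.g. one partial loss per GPU
partial_losses.backward(torch.ones_like(partial_losses))  # same as partial_losses.sum().backward()
print(y.grad)                  # tensor([3., 3.])
```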

### NLL & CrossEntropy

`sparse` means the `targets` are integer class indices, while the variants without `sparse` expect `one_hot`-encoded targets.

```python
  tf.nn.sparse_softmax_cross_entropy_with_logits(labels=ground_truth, logits=prediction)
= tf.keras.losses.sparse_categorical_crossentropy(ground_truth, prediction, from_logits=True)
= nn.NLLLoss(reduction="none")(nn.LogSoftmax(dim=-1)(prediction), ground_truth)
= torch.nn.CrossEntropyLoss(reduction="none")(prediction, ground_truth)
```
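
As a quick numeric check of the chain above (a minimal sketch with hypothetical logits; both frameworks must be installed):

```python
import numpy as np
import tensorflow as tf
import torch
from torch import nn

logits = np.array([[2.0, 0.5], [0.3, 1.7]])
labels = np.array([0, 1])

tf_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits).numpy()
torch_loss = nn.CrossEntropyLoss(reduction="none")(torch.tensor(logits), torch.tensor(labels)).numpy()
# both are approximately [0.2014, 0.2204]
```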

The Keras version defaults to `from_logits=False`. Note that `from_logits=False` only means the input has already been through a softmax layer (taken over `axis=1` for 2-D inputs and `axis=2` for 3-D inputs); the `log` is still applied inside `sparse_categorical_crossentropy`.

```python
  epsilon = 1e-7

  tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=False)
= torch.nn.CrossEntropyLoss(reduction="none")(torch.log(torch.clamp(torch.tensor(y_pred), epsilon, 1 - epsilon)), torch.tensor(y_true))
```

From the [source code](https://github.com/keras-team/keras/blob/d8fcb9d4d4dad45080ecfdd575483653028f8eda/keras/backend.py#L5167) we can see that when `from_logits==False`, Keras applies a `tf.math.log` (after clipping), so adding the same `log` on the PyTorch side aligns the two losses.
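
A minimal numeric check of this alignment (hypothetical probabilities that have already been through softmax):

```python
import numpy as np
import tensorflow as tf
import torch

y_true = np.array([1, 0])
y_pred = np.array([[0.1, 0.9], [0.8, 0.2]])  # rows already sum to 1
epsilon = 1e-7

keras_loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=False).numpy()
torch_loss = torch.nn.CrossEntropyLoss(reduction="none")(
    torch.log(torch.clamp(torch.tensor(y_pred), epsilon, 1 - epsilon)),
    torch.tensor(y_true),
).numpy()
# both are approximately [0.1054, 0.2231], i.e. -log(0.9) and -log(0.8)
```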

Because `CrossEntropyLoss` hard-codes `dim=1` as the class dimension, when computing a batch loss you need to permute the input so that `dim=1` holds the categories:

```python
import torch

predictions = torch.randn(4, 10, 5)     # (batch_size, feature, category)
targets = torch.randint(0, 5, (4, 10))  # (batch_size, feature)
loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
loss = loss_fn(predictions.permute(0, 2, 1), targets).mean(dim=1)  # per-sample loss, shape (batch_size,)
```

Ref: [torch.nn.CrossEntropyLoss over Multiple Batches](https://stackoverflow.com/questions/70483124/torch-nn-crossentropyloss-over-multiple-batches)

## Accuracy

### Categorical Accuracy

The fraction of predictions whose `argmax` matches the label, averaged over the entire dataset:

```python
torch.eq(torch.argmax(prediction, dim=-1), labels).view(-1).float().mean()
```
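
For instance, with a hypothetical batch where the second prediction is wrong:

```python
import torch

prediction = torch.tensor([[0.1, 0.2, 0.7],   # argmax = 2, label = 2 -> correct
                           [0.8, 0.1, 0.1]])  # argmax = 0, label = 1 -> wrong
labels = torch.tensor([2, 1])
accuracy = torch.eq(prediction.argmax(dim=-1), labels).float().mean()
print(accuracy)  # tensor(0.5000)
```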

## Alignment

* Calculate similarity using cosine distance ![Cosine Similarity](https://github.com/CookieLau/my_gitbook/blob/master/01-Framework/01-Tensorflow/assets/cosine.png)

  ```python
  import numpy as np
  from sklearn.metrics.pairwise import cosine_similarity, paired_distances

  x = np.array([[0.26304135, 0.91725843, 0.61099966, 0.40816231, 0.93606288, 0.52462691]])
  print(x)
  y = np.array([[0.03756129, 0.50223667, 0.66529424, 0.57392135, 0.20479857, 0.27286363]])
  print(y)
  # cosine similarity
  simi = cosine_similarity(x, y)
  print('cosine similarity:', simi)
  # cosine distance = 1 - cosine similarity
  dist = paired_distances(x, y, metric='cosine')
  print('cosine distance:', dist)
  ```

Reference: [Batch computation of cosine distance in Python](https://blog.csdn.net/tszupup/article/details/107942261)

## Model

[convert\_bert\_original\_tf\_checkpoint\_to\_pytorch](https://huggingface.co/docs/transformers/converting_tensorflow_models)

The main function used is [`load_tf_weights_in_bert`](https://github.com/huggingface/transformers/blob/fa322474060beb3673cf5a3e39ccd3c8ad57ecd3/src/transformers/models/bert/modeling_bert.py#L109); the steps are as follows (a condensed sketch follows the list):

1. create a freshly initialized PyTorch model whose parameters will be filled/replaced from the TF checkpoint;
2. collect all saved parameter names and values from the checkpoint;
3. use `getattr` to index into the corresponding submodule of the PyTorch model;
4. fill the data with `pointer.data = torch.from_numpy(array)`.
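
The real `load_tf_weights_in_bert` also remaps TF variable names onto the PyTorch module tree (e.g. `kernel` becomes `weight`, with a transpose) and skips optimizer state. The sketch below keeps only the skeleton of steps 2–4 and assumes, hypothetically, that checkpoint variable names already match the module attribute path:

```python
import numpy as np
import tensorflow as tf
import torch

def load_tf_weights_sketch(model: torch.nn.Module, tf_checkpoint_path: str):
    # Step 2: collect all saved parameter names (and shapes) from the checkpoint.
    for name, _shape in tf.train.list_variables(tf_checkpoint_path):
        array = tf.train.load_variable(tf_checkpoint_path, name)
        # Step 3: use getattr to index into the matching submodule/parameter.
        pointer = model
        for part in name.split("/"):
            pointer = getattr(pointer, part)
        # Step 4: fill the parameter data in place.
        pointer.data = torch.from_numpy(np.asarray(array))
    return model
```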
