Keras binary_crossentropy vs categorical_crossentropy performance?
I am trying to train a CNN to classify text by topic. When I use binary cross-entropy I get ~80% accuracy; with categorical cross-entropy I get ~50% accuracy.
I don't understand why. It's a multi-class problem; doesn't that mean I have to use categorical cross-entropy, and that the binary cross-entropy results are meaningless?
model.add(embedding_layer)
model.add(Dropout(0.25))
# convolution layers
model.add(Conv1D(nb_filter=32,
                 filter_length=4,
                 border_mode='valid',
                 activation='relu'))
model.add(MaxPooling1D(pool_length=2))
# dense layers
model.add(Flatten())
model.add(Dense(256))
model.add(Dropout(0.25))
model.add(Activation('relu'))
# output layer
model.add(Dense(len(class_id_index)))
model.add(Activation('softmax'))
Then I compile it using categorical_crossentropy as the loss function, like this:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
or
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The reason for this apparent performance discrepancy between categorical and binary cross-entropy is what @xtof54 has already reported in his answer, namely:
the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels
I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.
This behavior is not a bug; the underlying reason is a rather subtle and undocumented issue in how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation. In other words, while your first compilation option
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
is valid, your second one:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
will not produce what you expect, and the reason is not the use of binary cross-entropy itself (which, at least in principle, is an absolutely valid loss function).
Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross-entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in binary_accuracy, and this is what it returns, while in fact you are interested in categorical_accuracy.
Let's verify that this is indeed the case, using the MNIST CNN example in Keras, modified as follows:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # WRONG way
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=2, # only 2 epochs, for demonstration purposes
verbose=1,
validation_data=(x_test, y_test))
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.9975801164627075
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001
score[1]==acc
# False
To remedy this, i.e. to indeed use binary cross-entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy
in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])
In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.98580000000000001
# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001
score[1]==acc
# True
System setup:
Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4
UPDATE: After my post, I discovered that this issue had already been identified in this answer.
It all depends on the type of classification problem you are dealing with. There are three main categories:
- binary classification (two target classes)
- multi-class classification (more than two exclusive targets)
- multi-label classification (more than two non-exclusive targets), in which multiple target classes can be on at the same time
In the first case, binary cross-entropy should be used, and targets are typically encoded as scalar 0/1 values (or, equivalently, as one-hot vectors of length two).
In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.
In the last case, binary cross-entropy should be used and targets should be encoded as multi-hot vectors. Each output neuron (or unit) is considered as a separate random binary variable, so the likelihood of the entire vector of outputs is the product of the likelihoods of the single binary variables; the loss is therefore the sum (in practice, the mean) of the binary cross-entropies of the single output units.
Binary cross-entropy is defined as (for one sample with C output units, y the target vector and p the predicted probabilities):
binary_crossentropy = -(1/C) * sum_c [ y_c * log(p_c) + (1 - y_c) * log(1 - p_c) ]
and categorical cross-entropy is defined as:
categorical_crossentropy = -sum_c [ y_c * log(p_c) ]
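To make the two definitions concrete, here is a minimal NumPy sketch of both losses for a single one-hot sample (an illustration only, not the Keras implementation):
import numpy as np

def binary_crossentropy(y_true, y_pred):
    # mean of the per-unit binary cross-entropies
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_crossentropy(y_true, y_pred):
    # only the entry where y_true == 1 contributes
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot target, true class is index 1
y_pred = np.array([0.1, 0.7, 0.1, 0.1])   # predicted probabilities

print(binary_crossentropy(y_true, y_pred))       # ~0.168
print(categorical_crossentropy(y_true, y_pred))  # ~0.357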
I came across an "inverted" issue: I was getting good results with categorical_crossentropy (with 2 classes) and poor results with binary_crossentropy. It seems the problem was the wrong activation function. The correct settings were (sketched in code below):
- for binary_crossentropy: sigmoid activation, scalar target
- for categorical_crossentropy: softmax activation, one-hot encoded target
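A minimal sketch of these two pairings in Keras (layer sizes, input_dim, and the number of classes below are placeholder assumptions, not values from the question):
from keras.models import Sequential
from keras.layers import Dense

# binary_crossentropy: a single sigmoid unit, trained against scalar 0/1 targets
binary_model = Sequential([
    Dense(64, activation='relu', input_dim=100),
    Dense(1, activation='sigmoid'),
])
binary_model.compile(loss='binary_crossentropy', optimizer='adam',
                     metrics=['accuracy'])

# categorical_crossentropy: one softmax unit per class, one-hot encoded targets
n_classes = 10
categorical_model = Sequential([
    Dense(64, activation='relu', input_dim=100),
    Dense(n_classes, activation='softmax'),
])
categorical_model.compile(loss='categorical_crossentropy', optimizer='adam',
                          metrics=['accuracy'])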
It's a really interesting case. Actually, in your setup the following statement is true:
binary_crossentropy = len(class_id_index) * categorical_crossentropy
This means that, up to a constant multiplicative factor, your losses are equivalent. The weird behaviour you are observing during the training phase might be an example of the following phenomenon:
- At the beginning, the most frequent class dominates the loss, so the network learns to predict mostly this class for every example.
- After it has learnt the most frequent pattern, it starts discriminating among the less frequent classes. But when you are using adam, the learning rate has a much smaller value than it had at the beginning of training (because of the nature of this optimizer). This makes training slower and makes it less likely that your network will, e.g., leave a poor local minimum.
That's why this constant factor might help in the case of binary_crossentropy: after many epochs, the learning rate value is greater than in the categorical_crossentropy case. I usually restart training (and the learning phase) a few times when I notice such behaviour, and/or adjust the class weights using the following pattern:
class_weight = 1 / class_frequency
This makes the loss from the less frequent classes balance the influence of the dominant class's loss at the beginning of training and throughout the rest of the optimization process.
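For illustration, here is a small sketch of how such weights could be computed and handed to Keras (the labels below are made up; the class_weight argument of fit() takes a dict mapping class index to weight):
import numpy as np

# Hypothetical integer labels for a 3-class problem dominated by class 0
y_train_int = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])

counts = np.bincount(y_train_int)                         # samples per class: [6, 2, 1]
class_weight = {i: 1.0 / c for i, c in enumerate(counts)}
print(class_weight)   # {0: 0.1666..., 1: 0.5, 2: 1.0}

# Then pass it to training, e.g.:
# model.fit(x_train, y_train, class_weight=class_weight, ...)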
EDIT:
Actually, I checked that even though mathematically:
binary_crossentropy = len(class_id_index) * categorical_crossentropy
should hold, in the case of keras it is not true, because keras automatically normalizes all outputs to sum up to 1. This is the actual reason behind the weird behaviour, since in the multi-class case such normalization harms training.
After commenting on @Marcin's answer, I checked one of my students' code more carefully, where I found the same weird behavior, even after only 2 epochs! (So @Marcin's explanation was not very likely in my case.)
And I found that the answer is actually very simple: the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels. You can check that by recomputing the accuracy yourself (first call the Keras method predict and then count the number of correct answers it returns): you get the true accuracy, which is much lower than the Keras evaluate one.
A simple example under a multi-class setting to illustrate:
suppose you have 4 classes (one-hot encoded) and below is just one prediction:
true_label = [0, 1, 0, 0]
predicted_label = [0, 0, 1, 0]
When using categorical_crossentropy, the accuracy is just 0; it only cares about whether you get the concerned class right.
However, when using binary_crossentropy, the accuracy is calculated over all classes, so it would be 50% for this prediction. The final result will be the mean of the individual accuracies in both cases.
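A quick NumPy sketch of the two accuracy computations for this prediction (an illustration of the idea, not the actual Keras source):
import numpy as np

true_label = np.array([0, 1, 0, 0])
predicted_label = np.array([0, 0, 1, 0])

# categorical accuracy: only the argmax (the predicted class) matters
categorical_acc = float(np.argmax(true_label) == np.argmax(predicted_label))
print(categorical_acc)   # 0.0

# binary accuracy: each of the 4 entries is checked independently and averaged
binary_acc = np.mean(true_label == np.round(predicted_label))
print(binary_acc)        # 0.5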
It is recommended to use categorical_crossentropy for multi-class problems (classes are mutually exclusive) and binary_crossentropy for multi-label problems.
As it is a multi-class problem, you have to use categorical_crossentropy; binary cross-entropy will produce bogus results, and most likely will only evaluate the first two classes.
50% for a multi-class problem can be quite good, depending on the number of classes. If you have n classes, then 100/n is roughly the performance you would get by outputting a random class.
Take a look at the equations: you will find that binary cross-entropy not only punishes label = 1, predicted = 0, but also label = 0, predicted = 1.
Categorical cross-entropy, however, only penalizes the entry with label = 1, i.e. it only punishes the true class when its predicted probability is low. That's why the assumption is made that there is only ONE positive label.
When using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the class of the sample).
You are passing a target array of shape (x-dim, y-dim) while using categorical_crossentropy as the loss. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:
from keras.utils import to_categorical
y_binary = to_categorical(y_int)
Alternatively, you can use the loss function sparse_categorical_crossentropy instead, which does expect integer targets:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
binary_crossentropy(y_target, y_predict) doesn't have to be applied only to binary classification problems.
In the source code of binary_crossentropy(), the TensorFlow function nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output) is actually used. And in the documentation it says:
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
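For instance, a minimal multi-label sketch along those lines (layer sizes, input_dim, and the number of labels are placeholder assumptions):
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

n_labels = 5
model = Sequential([
    Dense(64, activation='relu', input_dim=100),
    Dense(n_labels, activation='sigmoid'),   # one independent unit per label
])
model.compile(loss='binary_crossentropy', optimizer='adam')

# Targets are multi-hot: several labels may be "on" at the same time,
# e.g. a picture containing both an elephant and a dog.
y_example = np.array([[1, 0, 0, 1, 0]])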