fastai MultiLabel Classification using Kfold Cross Validation
July 18, 2020
#Blog18
I have written as Kaggle Public Notebook if you like please upvote.
The problem I have considered is Multi Label classification. In addition to having multiple labels in each image, the other challenge in this problem is the existence of rare classes and combinations of different classes. So in this situation normal split or random split doesnt work because you can end up putting rare cases in the validation set and your model will never learn about them. The stratification present in the scikit-learn is also not equipped to deal with multilabel targets.
I have specifically choosen this problem because we may learn some techniques on the way, which we otherwise would not have thought of.
There may be better or easy way of doing kfold cross validation but I have done it keeping in mind how to implement using fastai, so if you know some better way so please mail or tweet the idea, i will try to implement and give you credit.
Install all the necessary libraries
I am using fastai2 so import that.
!pip install -q fastai2
Cross Validation
Cross-validation, how I see it, is the idea of minimizing randomness from one split by makings n folds, each fold containing train and validation splits. You train the model on each fold, so you have n models. Then you take average predictions from all models, which supposedly give us more confidence in results. These we will see in following code. I found iterative-stratification package that provides scikit-learn compatible cross validators with stratification for multilabel data.
My opinion:
In my opinion it’s more important to make one right split, especially because CV takes n times more to train. Then why did I do it??
I wanted to explore classification using cross validation using fastai, which I didn’t find many resources to learn. So if I write this blog it may help people.
fastai has no cross validation split(may be) in their library to work like other functions they provide. It may be because cross validation takes time, so may be it not that useful.
But still in this condition I feel its worth exploring using fastai.
so what is stratification??
The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation
!pip install -q iterative-stratification
from fastai2.vision.all import *
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
Here dataset is of Zero to GANs - Human Protein Classification inclass jovian.ml hosted competition
path = Path('../input/jovian-pytorch-z2g/Human protein atlas')
train_df = pd.read_csv(path/'train.csv')
train_df['Image'] = train_df['Image'].apply(str) + ".png"
train_df['Image'] = "../input/jovian-pytorch-z2g/Human protein atlas/train/" + train_df['Image']
train_df.head()
The method I use here is if we have column called fold and with fold number it would be helpfull to split data using that.
fastai has IndexSplitter in datablock api so this would be helpful.
strat_kfold = MultilabelStratifiedKFold(n_splits=3, random_state=42, shuffle=True)
train_df['fold'] = -1
for i, (_, test_index) in enumerate(strat_kfold.split(train_df.Image.values, train_df.iloc[:,1:].values)):
train_df.iloc[test_index, -1] = i
train_df.head()
train_df.fold.value_counts().plot.bar();
DataBlock
now that data is in dataframe and also folds are also defined for cross validation, we will build dataloaders, for which we will use datablock.
If you want to learn how fastai datablock see my blog series Make code Simple with DataBlock api
we will create a function get_data to create dataloader.
get_data uses fold to split data to be used for cross validation using IndexSplitter. for multiLabel problem compared to single only extra thing to be done is to add MultiCategoryBlock in blocks, this is how fastai makes it easy to work.
def get_data(fold=0, size=224,bs=32):
return DataBlock(blocks=(ImageBlock,MultiCategoryBlock),
get_x=ColReader(0),
get_y=ColReader(1, label_delim=' '),
splitter=IndexSplitter(train_df[train_df.fold == fold].index),
item_tfms=[FlipItem(p=0.5),Resize(512,method='pad')],
batch_tfms=[*aug_transforms(size=size,do_flip=True, flip_vert=True, max_rotate=180.0, max_lighting=0.6,max_warp=0.1, p_affine=0.75, p_lighting=0.75,xtra_tfms=[RandomErasing(p=0.5,sh=0.1, min_aspect=0.2,max_count=2)]),Normalize],
).dataloaders(train_df, bs=bs)
metrics
Since this is multi label problem normal accuracy function wont work, so we have accuracy_multi. fastai has this which we can directly use in metrics but I wanted to know how that works so took code of it.
def accuracy_multi(inp, targ, thresh=0.5, sigmoid=True):
"Compute accuracy when `inp` and `targ` are the same size."
if sigmoid: inp = inp.sigmoid()
return ((inp>thresh)==targ.bool()).float().mean()
F_score is way of evaluation for this competition so used this.
def F_score(output, label, threshold=0.2, beta=1):
prob = output > threshold
label = label > threshold
TP = (prob & label).sum(1).float()
TN = ((~prob) & (~label)).sum(1).float()
FP = (prob & (~label)).sum(1).float()
FN = ((~prob) & label).sum(1).float()
precision = torch.mean(TP / (TP + FP + 1e-12))
recall = torch.mean(TP / (TP + FN + 1e-12))
F2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall + 1e-12)
return F2.mean(0)
Gathering test set
test_df = pd.read_csv('../input/jovian-pytorch-z2g/submission.csv')
tstpng = test_df.copy()
tstpng['Image'] = tstpng['Image'].apply(str) + ".png"
tstpng['Image'] = "../input/jovian-pytorch-z2g/Human protein atlas/test/" + tstpng['Image']
tstpng.head()
Training
I have used technique called mixup, its a data augmentation technique.
In fastai Mixup is callback, and this Callback is used to apply MixUp data augmentation to your training. to know more read this
I have tried this first time, but this technique didnot improve my result in this problem. It usually improves accuracy after 80 epochs but I have trained for 20 epoches. so there was no difference in accuracy without it. so you can ignore this.
But to know about how mixup works is good, I will separate blog on this, so follow my twitter for updates.
mixup = MixUp(0.3)
gc is for garbage collection
import gc
I have created 3 folds where I simply get the data from a particular fold, create a model, add metrics, I have used resnet34. And that’s the whole training process. I just trained model on each fold and saved predictions for the test set.
I have used a technique called progressive resizing.
this is very simple: start training using small images, and end training using large images. Spending most of the epochs training with small images, helps training complete much faster. Completing training using large images makes the final accuracy much higher. this approach is called progressive resizing.
we should use the fine_tune
method after we resize our images to get our model to learn to do something a little bit different from what it has learned to do before.
I have used cbs=EarlyStoppingCallback(monitor='valid_loss')
so that model doesnot overfit.
append all prediction to list so that we use it later.
I have run the model for less epochs to see code works and show result, or stopped model in between(it took so much time)
This method gave me F_score of .77
and accuracy of >91%
so you can try.
My Purpose here is to write blog and explain how to approach and how code works.
If GPU is out of memory delete learner and empty cuda cache done in last line of code.
all_preds = []
for i in range(3):
dls = get_data(i,256,64)
learn = cnn_learner(dls, resnet34, metrics=[partial(accuracy_multi, thresh=0.2),partial(F_score, threshold=0.2)],cbs=mixup).to_fp16()
learn.fit_one_cycle(10, cbs=EarlyStoppingCallback(monitor='valid_loss'))
learn.dls = get_data(i,512,32)
learn.fine_tune(10,cbs=EarlyStoppingCallback(monitor='valid_loss'))
tst_dl = learn.dls.test_dl(tstpng)
preds, _ = learn.get_preds(dl=tst_dl)
all_preds.append(preds)
del learn
torch.cuda.empty_cache()
gc.collect()
stack all the prediction stored in list and average the values.
subm = pd.read_csv("../input/jovian-pytorch-z2g/submission.csv")
preds = np.mean(np.stack(all_preds), axis=0)
You should have list of labels which we get using vocab.
k = dls.vocab
preds[0]
I found threshold of 0.2 works good for my code.
then all the labels predicted above 0.2 are labels of that image using vocab.
thresh=0.2
labelled_preds = [' '.join([k[i] for i,p in enumerate(pred) if p > thresh]) for pred in preds]
put them in Labels column
test_df['Label']=labelled_preds
this step is to submit result to kaggle.
test_df.to_csv('submission.csv',index=False)
I have written as Kaggle Public Notebook if you like please upvote.
Thank you for reading:)