Practical Deep Learning in Life - Plant Pathology 2020
Solving a Kaggle competition
Why practical, again?
In the previous article I shared my approach to engaging with life. The title contains "practical" to emphasize that approach - it's fulfilling to learn or do something by solving a real problem rather than an artificial one. Imagine you want to start jogging - until you put your sneakers on and start moving, you are not jogging. You can have goals, desires and expectations about jogging, you can learn the theory behind it and watch videos, but none of this is the same as jogging. You can gain lots of knowledge, but until you practice, that knowledge is mostly a set of narrow representations.
That was a big lesson for me - to engage directly with the thing I want to do. I can give the same example about Deep Learning - I could go through all the popular and interesting courses, solve all of Kaggle's knowledge or kudos competitions, do the homework. None of that really matters to me, because the real world - paid competitions and projects - has real problems and constraints. They engage you and show you what you will actually be dealing with: building good communication and relationships with clients, handling failures and mistakes, listening to feedback, making your shortcomings visible, marketing yourself. Many things you wouldn't otherwise know about.
My Ink
I am writing in a Jupyter notebook which I'll publish through fastpages. I'm adding these lines last, and I make no excuses for inconsistencies. I want you to know that after wrapping the whole thing up I realized that I enjoy writing stories like the previous one more than explaining technical stuff. Partly because it's already been explained by someone else, much better. Partly because I'm not that interested in it - I'm interested in sharing my experience, but THE CODE is mostly dry facts.
About the notebook itself: the whole time it feels like I'm trying to fit something here that doesn't fit. Maybe I'm not doing it the right way. Maybe there's no right way. It also requires some technical preparation, which is a contrast with writing on medium.com directly - a couple of paragraphs, a proofread, and you're done.
The Competition
Let's jump into one of the most practical things in Deep Learning - paid Kaggle competitions. I'll share my solution to Plant Pathology 2020 - FGVC7, which put me in the top 23%.
The goal is to build a model which predicts whether tree leaves are healthy, have scab, rust or a combination. It made me wonder why there are so few classes - I'm sure there are many more diseases than just two. For a given photo like this we have to predict one label:
My Approach
I always like to start with something really simple, make it public and build up from it. So first I wrote this notebook, which served as the base.
I often check other people's work to see if there's something I don't know, especially work done with fastai. If I use something of theirs in mine, I put the source URL at the top. It's my way of showing gratitude. Many people optimize for getting their work upvoted, which seems to be about ego to me. I don't want to please their ego or mine, I want to be authentic and helpful.
Thanks to this guy for the metrics and k-folds:
https://www.kaggle.com/lextoumbourou/plant-pathology-2020-eda-training-fastai2
Fastai2 docs
#hide_output
!pip install -q git+https://github.com/fastai/fastai2
!pip install -q git+https://github.com/fastai/fastcore
!pip install -q iterative-stratification
I ran this notebook in Colab Pro because training took around 10 hours on a 16GB GPU - a Kaggle notebook run is limited to 6 hours.
I have a paid plan for Google Drive where I store the data - which actually allowed me to process 500GB and train a model for the DeepFake challenge. The downside is that when the number of files gets bigger (350k), drive mounting stops working. But here it was just 3642 images.
Colab Pro is cheap and easy to use.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
!mkdir /root/.kaggle
# imports and folders
from fastai2.vision.all import *
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
root_dir = Path("/content/gdrive/My Drive/")
path = root_dir/"kaggle/Plants/"
path.mkdir(parents=True, exist_ok=True)
# downloading data from kaggle
# !cp "{root_dir}/kaggle/kaggle.json" /root/.kaggle/kaggle.json
# import kaggle
# !kaggle competitions download -c plant-pathology-2020-fgvc7 -p "{path}/data" -q
# !unzip -q '{path}/data/plant-pathology-2020-fgvc7.zip' -d '{path}'
len((path/"images").ls())
The labels are stored in a CSV file and, for some reason, one-hot encoded, which misrepresents the problem. I think most people who read the description for the first time thought it was a multi-label problem (where you need to predict multiple labels per photo).
train_df = pd.read_csv(path/"train.csv")
train_df.head()
Preparation for k-folds
It's the first competition where I used cross-validation. It's important to make the right train, valid and test split. The model trains on data from the train split and validates - calculates the loss and prints metrics - on the valid split. Metrics show us how good the model is.
But the test split is something the model only makes predictions on. Kaggle competitions always have the test split prepared - you don't see its labels, only a public score after submission and a private one after the challenge ends. In fact, the private score is calculated on a part of the test set whose score you never see until the end - the reason is to penalize overfitting. On real projects, like Canadian housing price prediction, I do the splits myself, and I have the labels.
Cross-validation, as I see it, is the idea of minimizing the randomness of a single split by making n folds, each fold containing its own train and validation splits. You train the model on each fold, so you end up with n models. Then you average the predictions from all the models, which supposedly gives us more confidence in the results.
In my opinion it's more important to make one right split, especially because CV takes n times longer to train. But on Kaggle CV yields slightly better scores, so the environment encourages people to use it. You can read more about splits and CV in How (and why) to create a good validation set.
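To make that concrete, here is a toy sketch of the averaging step with made-up numbers (not the competition code - the real version appears at the end of the notebook):
import numpy as np
# Toy example: predictions from 5 fold models for 3 test images and 4 classes.
fold_preds = [np.random.rand(3, 4) for _ in range(5)]
# The cross-validation ensemble is just the average probability across folds.
ensemble_preds = np.mean(np.stack(fold_preds), axis=0)
ensemble_preds.shape  # (3, 4)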
strat_kfold = MultilabelStratifiedKFold(n_splits=5, random_state=42, shuffle=True)
train_df['fold'] = -1

for i, (_, test_index) in enumerate(strat_kfold.split(train_df.image_id.values, train_df.iloc[:, 1:].values)):
    train_df.iloc[test_index, -1] = i
train_df.head()
5 folds, around 350 images each.
train_df.fold.value_counts().plot.bar();
train_df.query("image_id == 'Train_5'")
get_image_files(path/"images")[5]
Because there's only one label per row, I transform each row from one-hot encoding to a plain label name.
train_df.iloc[0, 1:][train_df.iloc[0, 1:] == 1].index[0]
# I keep the labels here because I can forget the order. Fuckin up the order fucks up your results.
# LABEL_COLS = ['healthy', 'multiple_diseases', 'rust', 'scab']
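As a side note, the same one-hot-to-name conversion can be done for the whole dataframe in one go with pandas idxmax. This is just an illustrative alternative, not what the data block below uses:
# Illustrative alternative: idxmax returns the column name holding the maximum
# value in each row, which for one-hot rows is exactly the label name.
LABEL_COLS = ['healthy', 'multiple_diseases', 'rust', 'scab']
train_df[LABEL_COLS].idxmax(axis=1).head()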
The batch size affects GPU memory usage: the bigger it is, the faster the training. I picked the largest one that uses all the available memory.
# BS = 100
BS = 8
That's a fastai data block - the thing which helps to load, label, split and transform the data. It can be read as:
I want something for image categorization,
which I read from images folder using names in my dataframe,
label from the same dataframe,
split using the folds I defined earlier,
resize to the size I want,
and use transforms to help model generalize.
def get_data(fold=0, size=224):
    return DataBlock(blocks=(ImageBlock, CategoryBlock),
                     get_x=ColReader(0, pref=path/"images", suff=".jpg"),
                     get_y=lambda o: o.iloc[1:][o.iloc[1:] == 1].index[0],
                     splitter=IndexSplitter(train_df[train_df.fold == fold].index),
                     item_tfms=Resize(size),
                     batch_tfms=aug_transforms(flip_vert=True),
                     ).dataloaders(train_df, bs=BS)
dls = get_data()
A batch is a pack of data in GPU memory on which computations are done at once. Since the batch size is 8, it means 8 photos are in memory at once. Here they are, with labels. The data is ready.
# dls = dblock.dataloaders(train_df, bs=BS)
dls.show_batch()
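A quick sanity check I like (not in the original run): grab one batch as a tensor and look at its shape.
# One batch from the training dataloader: 8 images of 3 x 224 x 224, plus 8 labels.
xb, yb = dls.one_batch()
xb.shape, yb.shape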
Each Kaggle challenge has its own evaluation metric. Here it's ROC AUC, which has a scary full name - area under the receiver operating characteristic curve. Making things simple is a virtue and a very rare trait in science generally and in AI in particular.
I have no idea why they chose this metric. I won't explain what it means; I think this guy did a really good job. To me it looks like a metric that accounts for all those TP, FP, TN, FN predictions. I'm not going to explain those either.
from sklearn.metrics import roc_auc_score
def roc_auc(preds, targs, labels=range(4)):
    # One-hot encode targets
    targs = np.eye(4)[targs]
    return np.mean([roc_auc_score(targs[:, i], preds[:, i]) for i in labels])
def healthy_roc_auc(*args):
    return roc_auc(*args, labels=[0])

def multiple_diseases_roc_auc(*args):
    return roc_auc(*args, labels=[1])

def rust_roc_auc(*args):
    return roc_auc(*args, labels=[2])

def scab_roc_auc(*args):
    return roc_auc(*args, labels=[3])
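A quick check of these helpers with made-up, perfect predictions - every AUC should come out as 1.0:
# Made-up targets and matching one-hot "probabilities", purely for checking the helpers.
dummy_targs = np.array([0, 1, 2, 3, 2, 3])
dummy_preds = np.eye(4)[dummy_targs]
roc_auc(dummy_preds, dummy_targs), healthy_roc_auc(dummy_preds, dummy_targs)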
The commented-out code is my experiments. Fastai has lots of state-of-the-art pieces; I tried them one at a time on a smaller dataset and checked whether they helped train a better model.
CutMix is a technique that mixes two photos - it cuts a piece out of one and pastes it onto another. With this we create more combinations, i.e. create new data programmatically without actually taking photos. We make the data more plentiful.
# from fastai2.callback.cutmix import CutMix
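If I had kept it, the callback would be wired into the learner roughly like this (a sketch of the usage, not part of the final model):
# Sketch: CutMix is a callback, so it plugs into cbs when building the learner.
# learn = cnn_learner(dls, resnet152, metrics=error_rate, cbs=CutMix(1.0)).to_fp16()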
And here I tried a weighted loss - a loss that penalizes mistakes on underrepresented classes more heavily. Simply put, the number of labels is uneven - we have far fewer photos with multiple diseases. The loss the model optimizes is just an average, so we can end up in a situation where performance is wonderful on all labels but terrible on multiple diseases. Something everybody wants to be aware of. With a weighted loss we pretend this particular class is more important than the others by multiplying its loss, so the model tries to optimize it more diligently.
In practice, creating per-class metrics and deciding when to stop training based on them worked better here.
# loss_func = CrossEntropyLossFlat(weight=tensor([1, 1.5, 1, 1]))
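For completeness, here is one way such weights could be derived from the label counts instead of being hand-picked - an illustrative sketch, not what went into the final model:
# Illustrative sketch: weight each class by inverse frequency, normalised so the mean weight is 1.
counts = train_df[['healthy', 'multiple_diseases', 'rust', 'scab']].sum().values
class_weights = tensor(counts.sum() / (len(counts) * counts)).float()
# weighted_loss = CrossEntropyLossFlat(weight=class_weights)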
That's all the model code, and you can compare it with pure PyTorch and TensorFlow 2 implementations. Man, these people love to code.
I simply get the data for a particular fold, create a model and add the metrics. I believe resnet152 is the biggest resnet in PyTorch. The larger the model, the larger its capacity to learn the differences we care about. I also tried the biggest efficientnet, just because I saw it somewhere, but I got the best results with this one.
metric = partial(AccumMetric, flatten=False)
def get_learner(fold, size=224):
    dls = get_data(fold, size)
    return cnn_learner(dls, resnet152,
                       metrics=[
                           error_rate,
                           metric(healthy_roc_auc),
                           metric(multiple_diseases_roc_auc),
                           metric(rust_roc_auc),
                           metric(scab_roc_auc),
                           metric(roc_auc)],
                       # cbs=MixUp(0.5),
                       # loss_func=LabelSmoothingCrossEntropy(),
                       ).to_fp16()
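About the "stopping training based on metrics" I mentioned earlier: in fastai this can be done with tracker callbacks that monitor a metric, roughly like this (a sketch of the idea, not the exact setup I ran):
# Sketch: keep the best weights by ROC AUC and stop if it hasn't improved for a while.
# learn = get_learner(0)
# learn.fine_tune(30, 3e-3, freeze_epochs=3, cbs=[
#     SaveModelCallback(monitor='roc_auc', comp=np.greater),
#     EarlyStoppingCallback(monitor='roc_auc', comp=np.greater, patience=5),
# ])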
Many things have been written about the learning rate. It's one of the things fastai takes care of, so I chose the default, which works in 95% of cases.
lr = 3e-3
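If you want to pick it yourself instead of trusting the default, fastai's learning rate finder is the usual tool; shown here only for reference, I stuck with 3e-3:
# Optional: run the LR finder on one fold's learner to sanity-check the default.
# learn = get_learner(0)
# learn.lr_find()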
And that's the whole training process. I trained 5 models, one per fold, and saved their predictions for the test set. I also used a bigger image size - 450x800 - which improved my results compared with 224x224. Unsurprisingly, larger images mean more data to learn from.
The important thing is to find something to do in the meantime and not fall victim to constantly checking how it's going.
test_df = pd.read_csv(path/"test.csv")
test_df.head()
all_preds = []
for i in range(5):
    learn = get_learner(i, (450, 800))
    learn.fine_tune(30, lr, freeze_epochs=3)

    tst_dl = learn.dls.test_dl(test_df)
    preds, _ = learn.get_preds(dl=tst_dl)
    all_preds.append(preds)
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(15, 10))
subm = pd.read_csv(path/"sample_submission.csv")
preds = np.mean(np.stack(all_preds), axis=0)
subm.iloc[:, 1:] = preds
subm.to_csv("submission.csv", index=False)
pd.read_csv("submission.csv")
!kaggle competitions submit -c plant-pathology-2020-fgvc7 -f submission.csv -m "450x800"
The practicality I like is that it didn't take me much time to get initial results, and then I built up from there. I also like that fastai took care of most things, which is clearly visible in comparison with other notebooks.
I think being in the top 23% is a good result. There were 1317 participants in total, and the first-place score is 0.98445. Good scores - anything greater than 0.9 - go down to around 1000th place, and mine is 295th with 0.96892.