Commit 32ccb7f8 authored by Raakesh

Preprocessing

parent cb7033a9
%% Cell type:markdown id: tags:
## Preprocessing CSVs for training
%% Cell type:markdown id: tags:
![](https://www.rsna.org/-/media/Images/RSNA/Menu/logo_sml.ashx?w=100&la=en&hash=9619A8238B66C7BA9692C1FC3A5C9E97C24A06E1)
%% Cell type:markdown id: tags:
Are you working a lot with data generators (for example Keras' `.flow_from_dataframe`) and competing in the [RSNA Intracranial Hemorrhage 2019 competition](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection)?
I've created a function that builds a simple preprocessed DataFrame with an ImageID column and one column for each label in the competition ('epidural', 'intraparenchymal', 'intraventricular', 'subarachnoid', 'subdural', 'any').
I also made a function that translates your predictions back into the correct submission format.
If you also want the metadata as CSV files, check out [this Kaggle kernel](https://www.kaggle.com/carlolepelaars/converting-dicom-metadata-to-csv-rsna-ihd-2019).
%% Cell type:markdown id: tags:
## Preparation
%% Cell type:code id: tags:
``` python
# We will only need OS and Pandas for this one
import os
import pandas as pd
# Path names
BASE_PATH = "../input/rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection/"
TRAIN_PATH = BASE_PATH + 'stage_2_train.csv'
TEST_PATH = BASE_PATH + 'stage_2_sample_submission.csv'
# All labels that we have to predict in this competition
targets = ['epidural', 'intraparenchymal',
'intraventricular', 'subarachnoid',
'subdural', 'any']
```
%% Cell type:code id: tags:
``` python
# File sizes and specifications
print('\n# Files and file sizes')
for file in os.listdir(BASE_PATH)[2:]:
    print('{}| {} MB'.format(file.ljust(30),
                             str(round(os.path.getsize(BASE_PATH + file) / 1000000, 2))))
```
%% Output
# Files and file sizes
stage_2_train | 26.59 MB
stage_2_train.csv | 119.7 MB
%% Cell type:markdown id: tags:
## Preprocessing CSVs
%% Cell type:code id: tags:
``` python
train_df = pd.read_csv(TRAIN_PATH)
# Split once from the right to drop the label suffix
# (n is passed by keyword, which pandas 2.0+ requires)
train_df['ImageID'] = train_df['ID'].str.rsplit('_', n=1).str[0] + '.png'
label_lists = train_df.groupby('ImageID')['Label'].apply(list)
```
%% Cell type:code id: tags:
``` python
train_df[train_df['ImageID'] == 'ID_0002081b6.png']
```
%% Output
ID Label ImageID
770232 ID_0002081b6_epidural 0 ID_0002081b6.png
770233 ID_0002081b6_intraparenchymal 1 ID_0002081b6.png
770234 ID_0002081b6_intraventricular 0 ID_0002081b6.png
770235 ID_0002081b6_subarachnoid 0 ID_0002081b6.png
770236 ID_0002081b6_subdural 0 ID_0002081b6.png
770237 ID_0002081b6_any 1 ID_0002081b6.png
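%% Cell type:markdown id: tags:
The rows above all share one image identifier and differ only in the label suffix. As a minimal sketch of that split on two toy IDs (the variable names `ids`, `parts`, `image_ids` and `label_names` are made up for this illustration), a single right-hand split separates the two pieces:
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy IDs in the same "ID_<hash>_<label>" shape as the competition CSV.
ids = pd.Series(["ID_0002081b6_epidural", "ID_0002081b6_any"])

# Split once from the right: everything before the last underscore is the
# image identifier, everything after it is the label name.
parts = ids.str.rsplit("_", n=1, expand=True)
image_ids = parts[0] + ".png"
label_names = parts[1]

print(image_ids.tolist())    # ['ID_0002081b6.png', 'ID_0002081b6.png']
print(label_names.tolist())  # ['epidural', 'any']
```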
%% Cell type:code id: tags:
``` python
def prepare_df(path, train=False, nrows=None):
    """
    Prepare a Pandas DataFrame for fitting neural network models.
    Returns a DataFrame with an ImageID column and
    one column per label type (filled with the 'Label' values).
    The `train` argument is currently unused.
    """
    df = pd.read_csv(path, nrows=nrows)
    # Get ImageID and label type for pivoting
    df['ImageID'] = df['ID'].str.rsplit('_', n=1).str[0] + '.png'
    df['type'] = df['ID'].str.split("_", n=3, expand=True)[2]
    # Drop exact duplicate rows (pivot raises on repeated ImageID/type pairs),
    # then pivot to one row per image
    new_df = df[['Label', 'ImageID', 'type']].drop_duplicates().pivot(index='ImageID',
                                                                      columns='type',
                                                                      values='Label').reset_index()
    return new_df
```
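%% Cell type:markdown id: tags:
To make the pivot step easier to follow, here is a small sketch on hand-written rows. The frame `long_df` and its values are invented for illustration and use only two of the six labels. The last row is an exact duplicate, which is the case `drop_duplicates()` guards against, since `pivot` raises a `ValueError` when an index/column pair appears twice.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical long-format rows for two images (not real competition data).
# The final row repeats the one before it.
long_df = pd.DataFrame({
    "ImageID": ["ID_a.png", "ID_a.png", "ID_b.png", "ID_b.png", "ID_b.png"],
    "type":    ["any", "epidural", "any", "epidural", "epidural"],
    "Label":   [1, 0, 0, 0, 0],
})

wide_df = (
    long_df.drop_duplicates()                                # remove the repeat
    .pivot(index="ImageID", columns="type", values="Label")  # one row per image
    .reset_index()
)
# Optional: drop the leftover "type" name on the column axis.
wide_df.columns.name = None

print(wide_df)  # columns: ImageID, any, epidural
```
%% Cell type:markdown id: tags:
The pivoted columns come out in alphabetical order, so select them with an explicit list such as `targets` whenever order matters.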
%% Cell type:code id: tags:
``` python
# Convert dataframes to preprocessed format
train_df = prepare_df(TRAIN_PATH, train=True)
test_df = prepare_df(TEST_PATH)
```
%% Cell type:code id: tags:
``` python
print('Training data: ')
display(train_df.head())
print('Test data: ')
test_df.head()
```
%% Output
Training data:
Test data:
type           ImageID  any  epidural  intraparenchymal  intraventricular  subarachnoid  subdural
0     ID_000000e27.png  0.5       0.5               0.5               0.5           0.5       0.5
1     ID_000009146.png  0.5       0.5               0.5               0.5           0.5       0.5
2     ID_00007b8cb.png  0.5       0.5               0.5               0.5           0.5       0.5
3     ID_000134952.png  0.5       0.5               0.5               0.5           0.5       0.5
4     ID_000176f2a.png  0.5       0.5               0.5               0.5           0.5       0.5
%% Cell type:code id: tags:
``` python
# Save to CSV
train_df.to_csv('clean_train_df.csv', index=False)
test_df.to_csv('clean_test_df.csv', index=False)
```
%% Cell type:markdown id: tags:
## Creating submission file
%% Cell type:code id: tags:
``` python
def create_submission_file(IDs, preds):
    """
    Creates a submission file for Kaggle when given image IDs and predictions
    IDs: A list of all image IDs (extensions will be cut off)
    preds: A list of lists containing all predictions for each image,
           in the same order as `targets`
    Returns a DataFrame that has the correct format for this competition
    """
    sub_dict = {'ID': [], 'Label': []}
    # Create a row for each ID / Label combination
    for i, ID in enumerate(IDs):
        ID = ID.split('.')[0]  # Remove extension such as .png
        sub_dict['ID'].extend([f"{ID}_{target}" for target in targets])
        sub_dict['Label'].extend(preds[i])
    return pd.DataFrame(sub_dict)
```
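%% Cell type:markdown id: tags:
For large frames, the same reshaping can probably be done without a Python loop. The following is only a sketch of one alternative using `DataFrame.melt`; the helper name `wide_to_submission` and the toy predictions are assumptions for illustration, not part of the notebook above.
%% Cell type:code id: tags:
```python
import pandas as pd

targets = ["epidural", "intraparenchymal", "intraventricular",
           "subarachnoid", "subdural", "any"]

# Hypothetical wide-format predictions for two images.
wide_df = pd.DataFrame({
    "ImageID": ["ID_a.png", "ID_b.png"],
    **{t: [0.1, 0.9] for t in targets},
})

def wide_to_submission(wide_df, targets):
    """Hypothetical helper: reshape one row per image into ID/Label rows."""
    long_df = wide_df.melt(id_vars="ImageID", value_vars=targets,
                           var_name="type", value_name="Label")
    # Strip the file extension, then append the label name.
    stem = long_df["ImageID"].str.rsplit(".", n=1).str[0]
    long_df["ID"] = stem + "_" + long_df["type"]
    return long_df[["ID", "Label"]]

sub_df = wide_to_submission(wide_df, targets)
print(len(sub_df))  # 12 rows: 2 images x 6 labels
```
%% Cell type:markdown id: tags:
Note that `melt` emits rows label by label rather than image by image, so the row order differs from `create_submission_file`; sort by `ID` if a particular order is needed.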
%% Cell type:code id: tags:
``` python
# Finalize submission files
train_sub_df = create_submission_file(train_df['ImageID'], train_df[targets].values)
test_sub_df = create_submission_file(test_df['ImageID'], test_df[targets].values)
```
%% Cell type:code id: tags:
``` python
print('Back to the original submission format:')
train_sub_df.head(6)
```
%% Output
Back to the original submission format:
ID Label
0 ID_000012eaf_epidural 0
1 ID_000012eaf_intraparenchymal 0
2 ID_000012eaf_intraventricular 0
3 ID_000012eaf_subarachnoid 0
4 ID_000012eaf_subdural 0
5 ID_000012eaf_any 0
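%% Cell type:markdown id: tags:
One way to gain confidence in the round trip is to rebuild the long format from the pivoted frame and compare it with the input. The sketch below does this on a tiny invented frame with a shortened label list; `original`, `wide`, `rebuilt` and `round_trip_ok` are illustrative names only.
%% Cell type:code id: tags:
```python
import pandas as pd

targets = ["epidural", "any"]  # shortened list for the toy example

# Hypothetical long-format input in the competition's ID/Label layout.
original = pd.DataFrame({
    "ID": ["ID_a_epidural", "ID_a_any", "ID_b_epidural", "ID_b_any"],
    "Label": [0, 1, 0, 0],
})

# Long -> wide: split the ID, then pivot to one row per image.
parts = original["ID"].str.rsplit("_", n=1, expand=True)
wide = (
    original.assign(ImageID=parts[0] + ".png", type=parts[1])
    .pivot(index="ImageID", columns="type", values="Label")
    .reset_index()
)

# Wide -> long: image by image, labels in the order given by `targets`.
rows = []
for _, row in wide.iterrows():
    stem = row["ImageID"].split(".")[0]
    rows.extend((f"{stem}_{t}", row[t]) for t in targets)
rebuilt = pd.DataFrame(rows, columns=["ID", "Label"])

# Align on ID and check that every label survived unchanged.
merged = original.merge(rebuilt, on="ID", suffixes=("_orig", "_rebuilt"))
round_trip_ok = bool(
    len(merged) == len(original)
    and (merged["Label_orig"] == merged["Label_rebuilt"]).all()
)
print(round_trip_ok)  # True
```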