Commit 32ccb7f8 authored by Raakesh

Preprocessing

parent cb7033a9
%% Cell type:markdown id: tags:
## Preprocessing CSVs for training
%% Cell type:markdown id: tags:
![](https://www.rsna.org/-/media/Images/RSNA/Menu/logo_sml.ashx?w=100&la=en&hash=9619A8238B66C7BA9692C1FC3A5C9E97C24A06E1)
%% Cell type:markdown id: tags:
Are you working a lot with data generators (for example Keras' `.flow_from_dataframe`) and competing in the [RSNA Intracranial Hemorrhage Detection 2019 competition](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection)?
I've created a function that builds a simple preprocessed DataFrame with one column for the ImageID and one column for each label in the competition ('epidural', 'intraparenchymal', 'intraventricular', 'subarachnoid', 'subdural', 'any').
I also made a function that translates your predictions into the correct submission format.
If you are also interested in getting the metadata as CSV files, you can check out [this Kaggle kernel](https://www.kaggle.com/carlolepelaars/converting-dicom-metadata-to-csv-rsna-ihd-2019).
%% Cell type:markdown id: tags:
## Preparation
%% Cell type:code id: tags:
``` python
# We will only need OS and Pandas for this one
import os
import pandas as pd
# Path names
BASE_PATH = "../input/rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection/"
TRAIN_PATH = BASE_PATH + 'stage_2_train.csv'
TEST_PATH = BASE_PATH + 'stage_2_sample_submission.csv'
# All labels that we have to predict in this competition
targets = ['epidural', 'intraparenchymal',
'intraventricular', 'subarachnoid',
'subdural', 'any']
```
%% Cell type:code id: tags:
``` python
# File sizes and specifications
print('\n# Files and file sizes')
# Note: [2:] skips the first two entries that os.listdir returns (in arbitrary order)
for file in os.listdir(BASE_PATH)[2:]:
    print('{}| {} MB'.format(file.ljust(30),
                             str(round(os.path.getsize(BASE_PATH + file) / 1000000, 2))))
```
%% Output
# Files and file sizes
stage_2_train | 26.59 MB
stage_2_train.csv | 119.7 MB
%% Cell type:markdown id: tags:
## Preprocessing CSVs
%% Cell type:code id: tags:
``` python
train_df = pd.read_csv(TRAIN_PATH)
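# Everything before the last underscore identifies the image
# (n is passed by keyword; recent pandas versions no longer accept it positionally)
train_df['ImageID'] = train_df['ID'].str.rsplit('_', n=1).map(lambda x: x[0]) + '.png'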
label_lists = train_df.groupby('ImageID')['Label'].apply(list)
```
%% Cell type:code id: tags:
``` python
train_df[train_df['ImageID'] == 'ID_0002081b6.png']
```
%% Output
ID Label ImageID
770232 ID_0002081b6_epidural 0 ID_0002081b6.png
770233 ID_0002081b6_intraparenchymal 1 ID_0002081b6.png
770234 ID_0002081b6_intraventricular 0 ID_0002081b6.png
770235 ID_0002081b6_subarachnoid 0 ID_0002081b6.png
770236 ID_0002081b6_subdural 0 ID_0002081b6.png
770237 ID_0002081b6_any 1 ID_0002081b6.png
%% Cell type:code id: tags:
``` python
def prepare_df(path, train=False, nrows=None):
    """
    Prepare Pandas DataFrame for fitting neural network models
    Returns a DataFrame with an ImageID column and one column per label
    (the `train` argument is currently unused)
    """
    df = pd.read_csv(path, nrows=nrows)
    # Get ImageID and type for pivoting
    df['ImageID'] = df['ID'].str.rsplit('_', n=1).map(lambda x: x[0]) + '.png'
    df['type'] = df['ID'].str.split("_", n=3, expand=True)[2]
    # Create new DataFrame by pivoting (exact duplicate rows are dropped first)
    new_df = (df[['Label', 'ImageID', 'type']]
              .drop_duplicates()
              .pivot(index='ImageID', columns='type', values='Label')
              .reset_index())
    return new_df
```
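%% Cell type:markdown id: tags:
To make the long-to-wide reshaping easier to follow, here is a minimal sketch of the same pivot on a tiny in-memory table. The image IDs, label values, and the shortened label set are made up for illustration; only the reshaping steps mirror `prepare_df`.
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy stand-in for the competition CSV: one row per image/label pair.
# IDs and label values are invented for this example.
long_df = pd.DataFrame({
    'ID': ['ID_aaa_epidural', 'ID_aaa_any',
           'ID_bbb_epidural', 'ID_bbb_any',
           'ID_bbb_any'],  # exact duplicate of the previous row
    'Label': [0, 1, 0, 0, 0],
})

# Everything before the last underscore is the image, the rest is the label type
parts = long_df['ID'].str.rsplit('_', n=1, expand=True)
long_df['ImageID'] = parts[0] + '.png'
long_df['type'] = parts[1]

# Drop exact duplicates first, otherwise pivot raises on repeated index/column pairs
wide_df = (long_df[['Label', 'ImageID', 'type']]
           .drop_duplicates()
           .pivot(index='ImageID', columns='type', values='Label')
           .reset_index())
print(wide_df)
```
The result has one row per image and one column per label type, with the label columns in alphabetical order because `pivot` sorts them.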
%% Cell type:code id: tags:
``` python
# Convert dataframes to preprocessed format
train_df = prepare_df(TRAIN_PATH, train=True)
test_df = prepare_df(TEST_PATH)
```
%% Cell type:code id: tags:
``` python
print('Training data: ')
display(train_df.head())
print('Test data: ')
test_df.head()
```
%% Output
Training data:
Test data:
type ImageID any epidural intraparenchymal intraventricular \
0 ID_000000e27.png 0.5 0.5 0.5 0.5
1 ID_000009146.png 0.5 0.5 0.5 0.5
2 ID_00007b8cb.png 0.5 0.5 0.5 0.5
3 ID_000134952.png 0.5 0.5 0.5 0.5
4 ID_000176f2a.png 0.5 0.5 0.5 0.5
type subarachnoid subdural
0 0.5 0.5
1 0.5 0.5
2 0.5 0.5
3 0.5 0.5
4 0.5 0.5
%% Cell type:code id: tags:
``` python
# Save to CSV
train_df.to_csv('clean_train_df.csv', index=False)
test_df.to_csv('clean_test_df.csv', index=False)
```
%% Cell type:markdown id: tags:
## Creating submission file
%% Cell type:code id: tags:
``` python
def create_submission_file(IDs, preds):
    """
    Creates a submission file for Kaggle when given image IDs and predictions
    IDs: A list of all image IDs (Extensions will be cut off)
    preds: A list of lists containing all predictions for each image,
           in the same order as the global `targets` list
    Returns a DataFrame that has the correct format for this competition
    """
    sub_dict = {'ID': [], 'Label': []}
    # Create a row for each ID / Label combination
    for i, ID in enumerate(IDs):
        ID = ID.split('.')[0]  # Remove extension such as .png
        sub_dict['ID'].extend([f"{ID}_{target}" for target in targets])
        sub_dict['Label'].extend(preds[i])
    return pd.DataFrame(sub_dict)
```
%% Cell type:code id: tags:
``` python
# Finalize submission files
train_sub_df = create_submission_file(train_df['ImageID'], train_df[targets].values)
test_sub_df = create_submission_file(test_df['ImageID'], test_df[targets].values)
```
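%% Cell type:markdown id: tags:
One way to sanity-check the two conversions is a round trip on a small table: pivot long to wide, expand wide back to long, and compare with the input. The sketch below inlines simplified versions of both steps so it runs on its own. The IDs, label values, and the two-label `labels` list are assumptions made for the example, not competition data.
%% Cell type:code id: tags:
```python
import pandas as pd

labels = ['epidural', 'any']  # shortened, hypothetical label list for the toy example

# Made-up long-format table in the competition's ID/Label layout
original = pd.DataFrame({
    'ID': ['ID_aaa_epidural', 'ID_aaa_any', 'ID_bbb_epidural', 'ID_bbb_any'],
    'Label': [0, 1, 1, 1],
})

# Long -> wide (same idea as prepare_df)
parts = original['ID'].str.rsplit('_', n=1, expand=True)
wide = (original.assign(ImageID=parts[0] + '.png', type=parts[1])
        .pivot(index='ImageID', columns='type', values='Label')
        .reset_index())

# Wide -> long (same idea as create_submission_file)
rows = {'ID': [], 'Label': []}
for image_id, preds in zip(wide['ImageID'], wide[labels].values):
    stem = image_id.split('.')[0]  # remove the .png extension
    rows['ID'].extend(f"{stem}_{label}" for label in labels)
    rows['Label'].extend(preds)
restored = pd.DataFrame(rows)

# Compare after sorting by ID so that row order does not affect the check
round_trip_ok = (restored.sort_values('ID').reset_index(drop=True)
                 .equals(original.sort_values('ID').reset_index(drop=True)))
print(round_trip_ok)
```
Selecting `wide[labels]` matters here: the pivoted columns come out in alphabetical order, so the predictions must be reordered to match the label list used to build the IDs.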
%% Cell type:code id: tags:
``` python
print('Back to the original submission format:')
train_sub_df.head(6)
```
%% Output
Back to the original submission format:
ID Label
0 ID_000012eaf_epidural 0
1 ID_000012eaf_intraparenchymal 0
2 ID_000012eaf_intraventricular 0
3 ID_000012eaf_subarachnoid 0
4 ID_000012eaf_subdural 0
5 ID_000012eaf_any 0