Commit 16f398a6 authored by Raakesh

diacom to csv

parent 32ccb7f8
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Preprocessing CSV's for training ## Converting DICOM metadata to CSV files
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
![](https://www.rsna.org/-/media/Images/RSNA/Menu/logo_sml.ashx?w=100&la=en&hash=9619A8238B66C7BA9692C1FC3A5C9E97C24A06E1) ## Table Of Contents
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Are you working a lot with Data Generators (for example Keras' ".flow_from_dataframe") and competing in the [RSNA Intercranial Hemorrhage 2019 competition](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection)? - [Dependencies](#1)
- [Preparation](#2)
I've created a function that creates a simple preprocessed DataFrame with a column for ImageID and a column for each label in the competition. ('epidural', 'intraparenchymal', 'intraventricular', 'subarachnoid', 'subdural', 'any') - [Metadata](#3)
- [Type Conversion](#4)
I also made a function which translates your predictions into the correct submission format. - [Merge and Save](#5)
- [Final Check](#6)
If you are interested in getting the metadata as CSV files also you can check out [this Kaggle kernel](https://www.kaggle.com/carlolepelaars/converting-dicom-metadata-to-csv-rsna-ihd-2019).
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Preparation ## Dependencies <a id="1"></a>
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# We will only need OS and Pandas for this one # Standard libraries
import os import os
import pandas as pd import gc
import pydicom # For accessing DICOM files
# Path names import numpy as np
BASE_PATH = "../input/rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection/" import pandas as pd
TRAIN_PATH = BASE_PATH + 'stage_2_train.csv' import random as rn
TEST_PATH = BASE_PATH + 'stage_2_sample_submission.csv' from tqdm import tqdm
# All labels that we have to predict in this competition # Visualization
targets = ['epidural', 'intraparenchymal', import matplotlib.pyplot as plt
'intraventricular', 'subarachnoid', import matplotlib.image as mpimg
'subdural', 'any']
# Paths
KAGGLE_DIR = '../input/rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection/'
IMG_PATH_TRAIN = KAGGLE_DIR + 'stage_2_train/'
IMG_PATH_TEST = KAGGLE_DIR + 'stage_2_test/'
TRAIN_CSV_PATH = KAGGLE_DIR + 'stage_2_train.csv'
TEST_CSV_PATH = KAGGLE_DIR + 'stage_2_sample_submission.csv'
# Seed for reproducibility
seed = 1234
np.random.seed(seed)
rn.seed(seed)
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# File sizes and specifications # File sizes and specifications
print('\n# Files and file sizes') print('\n# Files and file sizes')
for file in os.listdir(BASE_PATH)[2:]: for file in os.listdir(KAGGLE_DIR)[2:]:
print('{}| {} MB'.format(file.ljust(30), print('{}| {} MB'.format(file.ljust(30),
str(round(os.path.getsize(BASE_PATH + file) / 1000000, 2)))) str(round(os.path.getsize(KAGGLE_DIR + file) / 1000000, 2))))
``` ```
%% Output %% Cell type:markdown id: tags:
# Files and file sizes ## Preparation <a id="2"></a>
stage_2_train | 26.59 MB
stage_2_train.csv | 119.7 MB
%% Cell type:markdown id: tags: %% Cell type:code id: tags:
## Preprocessing CSV's ``` python
# Load in raw datasets
train_df = pd.read_csv(TRAIN_CSV_PATH)
test_df = pd.read_csv(TEST_CSV_PATH)
# For convenience, collect sub type and separate PatientID as new features
for df in [train_df, test_df]:
df['Sub_type'] = df['ID'].str.split("_", n = 3, expand = True)[2]
df['PatientID'] = df['ID'].str.split("_", n = 3, expand = True)[1]
```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
train_df = pd.read_csv(TRAIN_PATH) # All filenames for train and test images
train_df['ImageID'] = train_df['ID'].str.rsplit('_', 1).map(lambda x: x[0]) + '.png' train_images = os.listdir(IMG_PATH_TRAIN)
label_lists = train_df.groupby('ImageID')['Label'].apply(list) test_images = os.listdir(IMG_PATH_TEST)
``` ```
%% Cell type:markdown id: tags:
## Metadata <a id="3"></a>
%% Cell type:markdown id: tags:
The [pydicom](https://pydicom.github.io/pydicom/stable/getting_started.html) library allows us to conveniently read in DICOM files and access different values from the file. The actual image can be found in "pixel_array".
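The metadata cells in this notebook read each value by attribute name with `getattr`. A minimal sketch of that pattern, extended with a default for attributes that are absent from a file, could look like the following. It assumes nothing about the real data: `SimpleNamespace` objects stand in for the datasets returned by `pydicom.dcmread`, the values are invented, and `collect_metadata` is a hypothetical helper name, not part of pydicom.

```python
from types import SimpleNamespace


def collect_metadata(dicom_objects, columns, missing=None):
    """Collect one list of values per column.

    `missing` is stored whenever an object lacks the attribute,
    instead of raising AttributeError.
    """
    collected = {col: [] for col in columns}
    for obj in dicom_objects:
        for col in columns:
            collected[col].append(getattr(obj, col, missing))
    return collected


# Stand-ins for DICOM datasets (hypothetical values).
# The second object deliberately has no PixelSpacing attribute.
fake_dicoms = [
    SimpleNamespace(Rows=512, Columns=512, PixelSpacing=[0.488, 0.488]),
    SimpleNamespace(Rows=512, Columns=512),
]

meta = collect_metadata(fake_dicoms, ['Rows', 'Columns', 'PixelSpacing'])
print(meta['PixelSpacing'])
```

Keeping `None` for absent attributes, rather than converting everything to strings first, lets a later `fillna` call see the gaps once the dictionary is turned into a DataFrame.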
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
train_df[train_df['ImageID'] == 'ID_0002081b6.png'] print('Example of all data in a single DICOM file:\n')
example_dicom = pydicom.dcmread(IMG_PATH_TRAIN + train_images[0])
print(example_dicom)
``` ```
%% Output %% Cell type:code id: tags:
ID Label ImageID ``` python
770232 ID_0002081b6_epidural 0 ID_0002081b6.png # All columns for which we want to collect information
770233 ID_0002081b6_intraparenchymal 1 ID_0002081b6.png meta_cols = ['BitsAllocated','BitsStored','Columns','HighBit',
770234 ID_0002081b6_intraventricular 0 ID_0002081b6.png 'Modality','PatientID','PhotometricInterpretation',
770235 ID_0002081b6_subarachnoid 0 ID_0002081b6.png 'PixelRepresentation','RescaleIntercept','RescaleSlope',
770236 ID_0002081b6_subdural 0 ID_0002081b6.png 'Rows','SOPInstanceUID','SamplesPerPixel','SeriesInstanceUID',
770237 ID_0002081b6_any 1 ID_0002081b6.png 'StudyID','StudyInstanceUID','ImagePositionPatient',
'ImageOrientationPatient','PixelSpacing']
```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
def prepare_df(path, train=False, nrows=None): # Initialize dictionaries to collect the metadata
""" col_dict_train = {col: [] for col in meta_cols}
Prepare Pandas DataFrame for fitting neural network models col_dict_test = {col: [] for col in meta_cols}
Returns a Dataframe with two columns
ImageID and Labels (list of all labels for an image)
"""
df = pd.read_csv(path, nrows=nrows)
# Get ImageID and type for pivoting
df['ImageID'] = df['ID'].str.rsplit('_', 1).map(lambda x: x[0]) + '.png'
df['type'] = df['ID'].str.split("_", n = 3, expand = True)[2]
# Create new DataFrame by pivoting
new_df = df[['Label', 'ImageID', 'type']].drop_duplicates().pivot(index='ImageID',
columns='type',
values='Label').reset_index()
return new_df
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Convert dataframes to preprocessed format # Get values for training images
train_df = prepare_df(TRAIN_PATH, train=True) for img in tqdm(train_images):
test_df = prepare_df(TEST_PATH) dicom_object = pydicom.dcmread(IMG_PATH_TRAIN + img)
for col in meta_cols:
col_dict_train[col].append(str(getattr(dicom_object, col)))
# Store all information in a DataFrame
meta_df_train = pd.DataFrame(col_dict_train)
del col_dict_train
gc.collect()
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
print('Training data: ') # Get values for test images
display(train_df.head()) for img in tqdm(test_images):
dicom_object = pydicom.dcmread(IMG_PATH_TEST + img)
for col in meta_cols:
col_dict_test[col].append(str(getattr(dicom_object, col)))
print('Test data: ') # Store all information in a DataFrame
test_df.head() meta_df_test = pd.DataFrame(col_dict_test)
del col_dict_test
gc.collect()
``` ```
%% Output %% Cell type:markdown id: tags:
Training data: ## Type Conversion <a id="4"></a>
%% Cell type:markdown id: tags:
Above we used a bit of a hacky solution by converting all metadata to string values. Now we will convert all features back to proper types.
All numeric features will be converted to float types. We will keep all categorical features as string types.
Test data: The 'WindowCenter' and 'WindowWidth' features were rather odd, as they contained int, float and list values. For now I have skipped these features, but I may add them to this kernel later. Feel free to share code that handles this data conveniently.
type ImageID any epidural intraparenchymal intraventricular \ The features 'ImagePositionPatient', 'ImageOrientationPatient' and 'PixelSpacing' are stored as lists. In order to easily access these features we create a new column for every value in the list.
0 ID_000000e27.png 0.5 0.5 0.5 0.5
1 ID_000009146.png 0.5 0.5 0.5 0.5
2 ID_00007b8cb.png 0.5 0.5 0.5 0.5
3 ID_000134952.png 0.5 0.5 0.5 0.5
4 ID_000176f2a.png 0.5 0.5 0.5 0.5
type subarachnoid subdural We fill missing values with values that are outside the range of the feature (-9999).
0 0.5 0.5
1 0.5 0.5
2 0.5 0.5
3 0.5 0.5
4 0.5 0.5
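As a sketch of how a list stored as a string can be split into separate float values, the helper below uses `ast.literal_eval` from the standard library, which only parses Python literals and is a safer alternative to `eval`. It assumes the stored strings are valid Python list literals such as `'[0.5, 0.5]'`; `split_list_feature` is a hypothetical name, and the sample values are invented.

```python
import ast


def split_list_feature(value, size, fill=-9999.0):
    """Parse a string such as '[-125.0, -115.9, 64.5]' into `size` floats.

    Missing values (None or NaN) become a list of `fill` values.
    """
    if value is None or value != value:  # NaN is the only value unequal to itself
        return [fill] * size
    parsed = ast.literal_eval(str(value))
    if len(parsed) != size:
        raise ValueError(f"expected {size} values, got {len(parsed)}")
    # float() also accepts numeric strings, e.g. when elements are quoted
    return [float(v) for v in parsed]


position = split_list_feature('[-125.0, -115.9, 64.5]', size=3)
spacing = split_list_feature(None, size=2)
print(position, spacing)
```

Applied to a DataFrame column, the result of `df['PixelSpacing'].map(lambda v: split_list_feature(v, 2))` could then be unpacked into one new column per position.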
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Save to CSV # Specify numeric columns
train_df.to_csv('clean_train_df.csv', index=False) num_cols = ['BitsAllocated', 'BitsStored','Columns','HighBit', 'Rows',
test_df.to_csv('clean_test_df.csv', index=False) 'PixelRepresentation', 'RescaleIntercept', 'RescaleSlope', 'SamplesPerPixel']
``` ```
%% Cell type:markdown id: tags: %% Cell type:code id: tags:
## Creating submission file ``` python
# Split to get proper PatientIDs
meta_df_train['PatientID'] = meta_df_train['PatientID'].str.split("_", n = 3, expand = True)[1]
meta_df_test['PatientID'] = meta_df_test['PatientID'].str.split("_", n = 3, expand = True)[1]
# Convert all numeric cols to floats
for col in num_cols:
meta_df_train[col] = meta_df_train[col].fillna(-9999).astype(float)
meta_df_test[col] = meta_df_test[col].fillna(-9999).astype(float)
```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
def create_submission_file(IDs, preds): # Hacky solution for multi features
""" for df in [meta_df_train, meta_df_test]:
Creates a submission file for Kaggle when given image ID's and predictions # ImagePositionPatient
ipp1 = []
ipp2 = []
ipp3 = []
for value in df['ImagePositionPatient'].fillna('[-9999,-9999,-9999]').values:
value_list = eval(value)
ipp1.append(float(value_list[0]))
ipp2.append(float(value_list[1]))
ipp3.append(float(value_list[2]))
df['ImagePositionPatient_1'] = ipp1
df['ImagePositionPatient_2'] = ipp2
df['ImagePositionPatient_3'] = ipp3
IDs: A list of all image IDs (Extensions will be cut off) # ImageOrientationPatient
preds: A list of lists containing all predictions for each image iop1 = []
iop2 = []
iop3 = []
iop4 = []
iop5 = []
iop6 = []
# Fill missing values and collect all Image Orientation information
for value in df['ImageOrientationPatient'].fillna('[-9999,-9999,-9999,-9999,-9999,-9999]').values:
value_list = eval(value)
iop1.append(float(value_list[0]))
iop2.append(float(value_list[1]))
iop3.append(float(value_list[2]))
iop4.append(float(value_list[3]))
iop5.append(float(value_list[4]))
iop6.append(float(value_list[5]))
df['ImageOrientationPatient_1'] = iop1
df['ImageOrientationPatient_2'] = iop2
df['ImageOrientationPatient_3'] = iop3
df['ImageOrientationPatient_4'] = iop4
df['ImageOrientationPatient_5'] = iop5
df['ImageOrientationPatient_6'] = iop6
Returns a DataFrame that has the correct format for this competition # Pixel Spacing
""" ps1 = []
sub_dict = {'ID': [], 'Label': []} ps2 = []
# Create a row for each ID / Label combination # Fill missing values and collect all pixel spacing features
for i, ID in enumerate(IDs): for value in df['PixelSpacing'].fillna('[-9999,-9999]').values:
ID = ID.split('.')[0] # Remove extension such as .png value_list = eval(value)
sub_dict['ID'].extend([f"{ID}_{target}" for target in targets]) ps1.append(float(value_list[0]))
sub_dict['Label'].extend(preds[i]) ps2.append(float(value_list[1]))
return pd.DataFrame(sub_dict) df['PixelSpacing_1'] = ps1
df['PixelSpacing_2'] = ps2
``` ```
%% Cell type:markdown id: tags:
## Merge and Save <a id="5"></a>
%% Cell type:markdown id: tags:
This metadata will only be useful if we can connect it to specific images. To make sure every value ends up in the correct row, we merge on the PatientID feature. An inner join will not work, since our DataFrame with metadata contains a lot of rows that are not in the original DataFrame. Instead, the metadata DataFrame is kept as the left side of a `how='left'` merge so that all of its rows survive, and a few columns from the original DataFrame are then copied over.
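One way to check how many rows actually found a partner in a merge like this is the `indicator` option of `DataFrame.merge`. The sketch below runs on two small invented frames, not on the competition data; the column values are assumptions made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the metadata and label DataFrames (hypothetical values)
meta = pd.DataFrame({'PatientID': ['a1', 'b2', 'c3'],
                     'Rows': [512.0, 512.0, 256.0]})
labels = pd.DataFrame({'PatientID': ['a1', 'b2'],
                       'Label': [0, 1]})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = meta.merge(labels, how='left', on='PatientID', indicator=True)

# Rows that exist in the metadata but have no matching label row
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(f"{len(unmatched)} metadata row(s) had no match")
```

A large number of `left_only` rows would suggest that the key columns in the two frames do not hold the same kind of identifier.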
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Finalize submission files # Merge DataFrames
train_sub_df = create_submission_file(train_df['ImageID'], train_df[targets].values) train_df_merged = meta_df_train.merge(train_df, how='left', on='PatientID')
test_sub_df = create_submission_file(test_df['ImageID'], test_df[targets].values) train_df_merged['ID'] = train_df['ID']
train_df_merged['Label'] = train_df['Label']
train_df_merged['Sub_type'] = train_df['Sub_type']
test_df_merged = meta_df_test.merge(test_df, how='left', on='PatientID')
test_df_merged['ID'] = test_df['ID']
test_df_merged['Label'] = test_df['Label']
test_df_merged['Sub_type'] = test_df['Sub_type']
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
print('Back to the original submission format:') # Save to CSV
train_sub_df.head(6) train_df_merged.to_csv('stage_2_train_with_metadata.csv', index=False)
test_df_merged.to_csv('stage_2_test_with_metadata.csv', index=False)
``` ```
%% Output %% Cell type:markdown id: tags:
## Final Check <a id="6"></a>
Back to the original submission format: %% Cell type:code id: tags:
ID Label ``` python
0 ID_000012eaf_epidural 0 # Final check on the new dataset
1 ID_000012eaf_intraparenchymal 0 print('Training Data:')
2 ID_000012eaf_intraventricular 0 display(train_df_merged.head(3))
3 ID_000012eaf_subarachnoid 0 display(train_df_merged.tail(3))
4 ID_000012eaf_subdural 0 print('Testing Data:')
5 ID_000012eaf_any 0 display(test_df_merged.head(3))
display(test_df_merged.tail(3))
```