Commit 26159253 authored by Raakesh

Replace preprocessing-csv-rsna-ih-2019-19023.ipynb

parent 16f398a6
%% Cell type:markdown id: tags:
## Converting DICOM metadata to CSV files and preprocessing CSVs for training
%% Cell type:markdown id: tags:
## Table Of Contents
![](https://www.rsna.org/-/media/Images/RSNA/Menu/logo_sml.ashx?w=100&la=en&hash=9619A8238B66C7BA9692C1FC3A5C9E97C24A06E1)
%% Cell type:markdown id: tags:
- [Dependencies](#1)
- [Preparation](#2)
- [Metadata](#3)
- [Type Conversion](#4)
- [Merge and Save](#5)
- [Final Check](#6)
Are you working a lot with data generators (for example Keras' `.flow_from_dataframe`) and competing in the [RSNA Intracranial Hemorrhage 2019 competition](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection)?
I've created a function that builds a simple preprocessed DataFrame with a column for ImageID and a column for each label in the competition ('epidural', 'intraparenchymal', 'intraventricular', 'subarachnoid', 'subdural', 'any').
I also made a function which translates your predictions into the correct submission format.
If you are also interested in getting the metadata as CSV files, check out [this Kaggle kernel](https://www.kaggle.com/carlolepelaars/converting-dicom-metadata-to-csv-rsna-ihd-2019).
%% Cell type:markdown id: tags:
## Dependencies <a id="1"></a>
%% Cell type:code id: tags:
``` python
# Standard libraries
import os
import gc
import random as rn
# Third-party libraries
import pydicom  # For accessing DICOM files
import numpy as np
import pandas as pd
from tqdm import tqdm
# Visualization
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# Paths
KAGGLE_DIR = '../input/rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection/'
BASE_PATH = KAGGLE_DIR
IMG_PATH_TRAIN = KAGGLE_DIR + 'stage_2_train/'
IMG_PATH_TEST = KAGGLE_DIR + 'stage_2_test/'
TRAIN_CSV_PATH = KAGGLE_DIR + 'stage_2_train.csv'
TEST_CSV_PATH = KAGGLE_DIR + 'stage_2_sample_submission.csv'
TRAIN_PATH = TRAIN_CSV_PATH
TEST_PATH = TEST_CSV_PATH
# Seed for reproducibility
seed = 1234
np.random.seed(seed)
rn.seed(seed)
# All labels that we have to predict in this competition
targets = ['epidural', 'intraparenchymal',
           'intraventricular', 'subarachnoid',
           'subdural', 'any']
```
%% Cell type:code id: tags:
``` python
# File sizes and specifications
print('\n# Files and file sizes')
for file in os.listdir(KAGGLE_DIR)[2:]:
    print('{}| {} MB'.format(file.ljust(30),
                             str(round(os.path.getsize(KAGGLE_DIR + file) / 1000000, 2))))
```
%% Cell type:markdown id: tags:
## Preparation <a id="2"></a>
%% Cell type:code id: tags:
``` python
# Load in raw datasets
train_df = pd.read_csv(TRAIN_CSV_PATH)
test_df = pd.read_csv(TEST_CSV_PATH)
# For convenience, collect sub type and separate PatientID as new features
for df in [train_df, test_df]:
    df['Sub_type'] = df['ID'].str.split("_", n=3, expand=True)[2]
    df['PatientID'] = df['ID'].str.split("_", n=3, expand=True)[1]
```
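%% Cell type:markdown id: tags:
The split logic above works because every ID has the fixed form `ID_<hash>_<subtype>`. A minimal self-contained check of that logic, using two made-up rows rather than the real CSV:
%% Cell type:code id: tags:
``` python
import pandas as pd

# Toy IDs in the competition's "ID_<hash>_<subtype>" format
df = pd.DataFrame({'ID': ['ID_0002081b6_epidural', 'ID_0002081b6_any']})
parts = df['ID'].str.split('_', n=3, expand=True)
df['Sub_type'] = parts[2]
df['PatientID'] = parts[1]
print(df[['PatientID', 'Sub_type']].values.tolist())
```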
%% Cell type:code id: tags:
``` python
# All filenames for train and test images
train_images = os.listdir(IMG_PATH_TRAIN)
test_images = os.listdir(IMG_PATH_TEST)
```
%% Output
# Files and file sizes
stage_2_train                 | 26.59 MB
stage_2_train.csv             | 119.7 MB
%% Cell type:markdown id: tags:
## Metadata <a id="3"></a>
The [pydicom](https://pydicom.github.io/pydicom/stable/getting_started.html) library allows us to conveniently read in DICOM files and access different values from the file. The actual image can be found in "pixel_array".
## Preprocessing CSVs
%% Cell type:code id: tags:
``` python
print('Example of all data in a single DICOM file:\n')
example_dicom = pydicom.dcmread(IMG_PATH_TRAIN + train_images[0])
print(example_dicom)
```
%% Cell type:code id: tags:
``` python
# Build an ImageID column and collect all labels per image
train_df = pd.read_csv(TRAIN_PATH)
train_df['ImageID'] = train_df['ID'].str.rsplit('_', n=1).map(lambda x: x[0]) + '.png'
label_lists = train_df.groupby('ImageID')['Label'].apply(list)
```
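%% Cell type:markdown id: tags:
As a sanity check, the `rsplit`/`groupby` pattern can be exercised on a few synthetic rows (hypothetical image hashes and labels, not taken from the real CSV):
%% Cell type:code id: tags:
``` python
import pandas as pd

# Synthetic long-format labels: one row per (image, sub type) pair
toy = pd.DataFrame({
    'ID': ['ID_abc_epidural', 'ID_abc_any', 'ID_def_epidural', 'ID_def_any'],
    'Label': [0, 1, 1, 1],
})
# Cut off the sub type suffix to recover one ImageID per row
toy['ImageID'] = toy['ID'].str.rsplit('_', n=1).map(lambda x: x[0]) + '.png'
label_lists = toy.groupby('ImageID')['Label'].apply(list)
print(label_lists.to_dict())
```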
%% Cell type:code id: tags:
``` python
# All columns for which we want to collect information
meta_cols = ['BitsAllocated', 'BitsStored', 'Columns', 'HighBit',
             'Modality', 'PatientID', 'PhotometricInterpretation',
             'PixelRepresentation', 'RescaleIntercept', 'RescaleSlope',
             'Rows', 'SOPInstanceUID', 'SamplesPerPixel', 'SeriesInstanceUID',
             'StudyID', 'StudyInstanceUID', 'ImagePositionPatient',
             'ImageOrientationPatient', 'PixelSpacing']
```
%% Cell type:code id: tags:
``` python
# Show all label rows for one example image
train_df[train_df['ImageID'] == 'ID_0002081b6.png']
```
%% Output
ID Label ImageID
770232 ID_0002081b6_epidural 0 ID_0002081b6.png
770233 ID_0002081b6_intraparenchymal 1 ID_0002081b6.png
770234 ID_0002081b6_intraventricular 0 ID_0002081b6.png
770235 ID_0002081b6_subarachnoid 0 ID_0002081b6.png
770236 ID_0002081b6_subdural 0 ID_0002081b6.png
770237 ID_0002081b6_any 1 ID_0002081b6.png
%% Cell type:code id: tags:
``` python
# Initialize dictionaries to collect the metadata
col_dict_train = {col: [] for col in meta_cols}
col_dict_test = {col: [] for col in meta_cols}

def prepare_df(path, train=False, nrows=None):
    """
    Prepare a Pandas DataFrame for fitting neural network models.

    Returns a DataFrame with an ImageID column
    and one binary label column per hemorrhage sub type.
    """
    df = pd.read_csv(path, nrows=nrows)
    # Get ImageID and type for pivoting
    df['ImageID'] = df['ID'].str.rsplit('_', n=1).map(lambda x: x[0]) + '.png'
    df['type'] = df['ID'].str.split("_", n=3, expand=True)[2]
    # Create new DataFrame by pivoting
    new_df = df[['Label', 'ImageID', 'type']].drop_duplicates().pivot(index='ImageID',
                                                                      columns='type',
                                                                      values='Label').reset_index()
    return new_df
```
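%% Cell type:markdown id: tags:
The heart of `prepare_df` is the `pivot` call. On a toy long-format frame (synthetic hashes, only two of the six sub types) it reshapes like this:
%% Cell type:code id: tags:
``` python
import pandas as pd

# Long format: one row per (image, sub type) pair
toy = pd.DataFrame({
    'ID': ['ID_abc_epidural', 'ID_abc_any', 'ID_xyz_epidural', 'ID_xyz_any'],
    'Label': [0, 1, 1, 1],
})
toy['ImageID'] = toy['ID'].str.rsplit('_', n=1).map(lambda x: x[0]) + '.png'
toy['type'] = toy['ID'].str.split('_', n=3, expand=True)[2]
# Wide format: one row per image, one column per sub type
wide = toy[['Label', 'ImageID', 'type']].drop_duplicates().pivot(
    index='ImageID', columns='type', values='Label').reset_index()
print(wide)
```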
%% Cell type:code id: tags:
``` python
# Get values for training images
for img in tqdm(train_images):
    dicom_object = pydicom.dcmread(IMG_PATH_TRAIN + img)
    for col in meta_cols:
        col_dict_train[col].append(str(getattr(dicom_object, col)))
# Store all information in a DataFrame
meta_df_train = pd.DataFrame(col_dict_train)
del col_dict_train
gc.collect()
# Convert dataframes to preprocessed format
train_df = prepare_df(TRAIN_PATH, train=True)
test_df = prepare_df(TEST_PATH)
```
%% Cell type:code id: tags:
``` python
# Get values for test images
for img in tqdm(test_images):
    dicom_object = pydicom.dcmread(IMG_PATH_TEST + img)
    for col in meta_cols:
        col_dict_test[col].append(str(getattr(dicom_object, col)))
# Store all information in a DataFrame
meta_df_test = pd.DataFrame(col_dict_test)
del col_dict_test
gc.collect()
print('Training data: ')
display(train_df.head())
print('Test data: ')
test_df.head()
```
%% Output
Training data: 
Test data: 
type            ImageID  any  epidural  intraparenchymal  intraventricular  subarachnoid  subdural
0      ID_000000e27.png  0.5       0.5               0.5               0.5           0.5       0.5
1      ID_000009146.png  0.5       0.5               0.5               0.5           0.5       0.5
2      ID_00007b8cb.png  0.5       0.5               0.5               0.5           0.5       0.5
3      ID_000134952.png  0.5       0.5               0.5               0.5           0.5       0.5
4      ID_000176f2a.png  0.5       0.5               0.5               0.5           0.5       0.5
%% Cell type:markdown id: tags:
## Type Conversion <a id="4"></a>
Above we used a bit of a hacky solution by converting all metadata to string values. Now we will convert all features back to proper types.
All numeric features will be converted to float types. We will keep all categorical features as string types.
The 'WindowCenter' and 'WindowWidth' features were rather odd, as they contain a mix of int, float and list values. For now we skip these features, but we may add them to this kernel later. Feel free to share code to conveniently handle this data.
The features 'ImagePositionPatient', 'ImageOrientationPatient' and 'PixelSpacing' are stored as lists. In order to easily access these features we create a new column for every value in the list.
We fill missing values with a value that is outside the range of each feature (-9999).
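%% Cell type:markdown id: tags:
The list-to-columns idea can be sketched on a toy column first. This sketch uses `ast.literal_eval` (a safer stand-in for the `eval` call used in the later cells) and made-up spacing values:
%% Cell type:code id: tags:
``` python
import ast
import pandas as pd

# Toy 'PixelSpacing' values stored as strings, as in the metadata DataFrames
df = pd.DataFrame({'PixelSpacing': ['[0.48, 0.48]', None]})
# Fill missing values with an out-of-range sentinel, then parse the lists
parsed = df['PixelSpacing'].fillna('[-9999, -9999]').map(ast.literal_eval)
df['PixelSpacing_1'] = parsed.map(lambda v: float(v[0]))
df['PixelSpacing_2'] = parsed.map(lambda v: float(v[1]))
print(df[['PixelSpacing_1', 'PixelSpacing_2']].values.tolist())
```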
%% Cell type:code id: tags:
``` python
# Specify numeric columns
num_cols = ['BitsAllocated', 'BitsStored', 'Columns', 'HighBit', 'Rows',
            'PixelRepresentation', 'RescaleIntercept', 'RescaleSlope', 'SamplesPerPixel']
# Save the preprocessed label DataFrames to CSV
train_df.to_csv('clean_train_df.csv', index=False)
test_df.to_csv('clean_test_df.csv', index=False)
```
%% Cell type:code id: tags:
``` python
# Split to get proper PatientIDs
meta_df_train['PatientID'] = meta_df_train['PatientID'].str.split("_", n=3, expand=True)[1]
meta_df_test['PatientID'] = meta_df_test['PatientID'].str.split("_", n=3, expand=True)[1]
# Convert all numeric cols to floats
for col in num_cols:
    meta_df_train[col] = meta_df_train[col].fillna(-9999).astype(float)
    meta_df_test[col] = meta_df_test[col].fillna(-9999).astype(float)
```
%% Cell type:markdown id: tags:
## Creating submission file
%% Cell type:code id: tags:
``` python
# Hacky solution for multi features
for df in [meta_df_train, meta_df_test]:
    # ImagePositionPatient
    ipp1, ipp2, ipp3 = [], [], []
    for value in df['ImagePositionPatient'].fillna('[-9999,-9999,-9999]').values:
        value_list = eval(value)
        ipp1.append(float(value_list[0]))
        ipp2.append(float(value_list[1]))
        ipp3.append(float(value_list[2]))
    df['ImagePositionPatient_1'] = ipp1
    df['ImagePositionPatient_2'] = ipp2
    df['ImagePositionPatient_3'] = ipp3
    # Fill missing values and collect all Image Orientation information
    iop1, iop2, iop3, iop4, iop5, iop6 = [], [], [], [], [], []
    for value in df['ImageOrientationPatient'].fillna('[-9999,-9999,-9999,-9999,-9999,-9999]').values:
        value_list = eval(value)
        iop1.append(float(value_list[0]))
        iop2.append(float(value_list[1]))
        iop3.append(float(value_list[2]))
        iop4.append(float(value_list[3]))
        iop5.append(float(value_list[4]))
        iop6.append(float(value_list[5]))
    df['ImageOrientationPatient_1'] = iop1
    df['ImageOrientationPatient_2'] = iop2
    df['ImageOrientationPatient_3'] = iop3
    df['ImageOrientationPatient_4'] = iop4
    df['ImageOrientationPatient_5'] = iop5
    df['ImageOrientationPatient_6'] = iop6
    # Fill missing values and collect all pixel spacing features
    ps1, ps2 = [], []
    for value in df['PixelSpacing'].fillna('[-9999,-9999]').values:
        value_list = eval(value)
        ps1.append(float(value_list[0]))
        ps2.append(float(value_list[1]))
    df['PixelSpacing_1'] = ps1
    df['PixelSpacing_2'] = ps2
```
%% Cell type:code id: tags:
``` python
def create_submission_file(IDs, preds):
    """
    Creates a submission file for Kaggle when given image IDs and predictions.

    IDs: A list of all image IDs (extensions will be cut off)
    preds: A list of lists containing all predictions for each image

    Returns a DataFrame that has the correct format for this competition
    """
    sub_dict = {'ID': [], 'Label': []}
    # Create a row for each ID / Label combination
    for i, ID in enumerate(IDs):
        ID = ID.split('.')[0]  # Remove extension such as .png
        sub_dict['ID'].extend([f"{ID}_{target}" for target in targets])
        sub_dict['Label'].extend(preds[i])
    return pd.DataFrame(sub_dict)
```
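%% Cell type:markdown id: tags:
To see the format `create_submission_file` produces without the real data, here is a self-contained rerun of its core loop on two images with made-up probabilities:
%% Cell type:code id: tags:
``` python
import pandas as pd

targets = ['epidural', 'intraparenchymal', 'intraventricular',
           'subarachnoid', 'subdural', 'any']

def create_submission_file(IDs, preds):
    """Expand every image into six ID_<hash>_<subtype> rows."""
    sub_dict = {'ID': [], 'Label': []}
    for i, ID in enumerate(IDs):
        ID = ID.split('.')[0]  # Remove extension such as .png
        sub_dict['ID'].extend([f"{ID}_{target}" for target in targets])
        sub_dict['Label'].extend(preds[i])
    return pd.DataFrame(sub_dict)

# Two images, six made-up probabilities each
sub = create_submission_file(['ID_abc.png', 'ID_xyz.png'],
                             [[0.1] * 6, [0.9] * 6])
print(sub)
```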
%% Cell type:markdown id: tags:
## Merge and Save <a id="5"></a>
%% Cell type:markdown id: tags:
This metadata will only be useful if we can connect it to specific images. To make sure every value is in the correct row we can conveniently merge on the PatientID feature. However, an inner or left join will not work since our DataFrame with metadata contains a lot of rows that are not in the original DataFrame. Joining on the right and using a few columns from the original DataFrame will do the trick.
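%% Cell type:markdown id: tags:
A toy sketch (hypothetical PatientIDs) of how the join direction decides which rows survive: an inner join drops metadata rows without a matching label row, while keeping the metadata frame on the left with `how='left'` preserves them all:
%% Cell type:code id: tags:
``` python
import pandas as pd

# Hypothetical metadata for 3 patients; labels exist for only 2 of them
meta = pd.DataFrame({'PatientID': ['a', 'b', 'c'], 'Rows': [512, 512, 512]})
labels = pd.DataFrame({'PatientID': ['a', 'b'], 'any': [0, 1]})
inner = meta.merge(labels, how='inner', on='PatientID')  # drops patient 'c'
left = meta.merge(labels, how='left', on='PatientID')    # keeps all 3 metadata rows
print(len(inner), len(left))  # 2 3
```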
%% Cell type:code id: tags:
``` python
# Merge DataFrames
train_df_merged = meta_df_train.merge(train_df, how='left', on='PatientID')
train_df_merged['ID'] = train_df['ID']
train_df_merged['Label'] = train_df['Label']
train_df_merged['Sub_type'] = train_df['Sub_type']
test_df_merged = meta_df_test.merge(test_df, how='left', on='PatientID')
test_df_merged['ID'] = test_df['ID']
test_df_merged['Label'] = test_df['Label']
test_df_merged['Sub_type'] = test_df['Sub_type']
# Finalize submission files
train_sub_df = create_submission_file(train_df['ImageID'], train_df[targets].values)
test_sub_df = create_submission_file(test_df['ImageID'], test_df[targets].values)
```
%% Cell type:code id: tags:
``` python
# Save to CSV
train_df_merged.to_csv('stage_2_train_with_metadata.csv', index=False)
test_df_merged.to_csv('stage_2_test_with_metadata.csv', index=False)
print('Back to the original submission format:')
train_sub_df.head(6)
```
%% Cell type:markdown id: tags:
## Final Check <a id="6"></a>
%% Cell type:code id: tags:
``` python
# Final check on the new dataset
print('Training Data:')
display(train_df_merged.head(3))
display(train_df_merged.tail(3))
print('Testing Data:')
display(test_df_merged.head(3))
display(test_df_merged.tail(3))
```
%% Output
Back to the original submission format:
ID Label
0 ID_000012eaf_epidural 0
1 ID_000012eaf_intraparenchymal 0
2 ID_000012eaf_intraventricular 0
3 ID_000012eaf_subarachnoid 0
4 ID_000012eaf_subdural 0
5 ID_000012eaf_any 0