{"cells":[{"metadata":{},"cell_type":"markdown","source":"## Converting DICOM metadata to CSV files","execution_count":null},{"metadata":{},"cell_type":"markdown","source":"## Table Of Contents","execution_count":null},{"metadata":{},"cell_type":"markdown","source":"- [Dependencies](#1)\n- [Preparation](#2)\n- [Metadata](#3)\n- [Type Conversion](#4)\n- [Merge and Save](#5)\n- [Final Check](#6)","execution_count":null},{"metadata":{},"cell_type":"markdown","source":"## Dependencies <a id=\"1\"></a>","execution_count":null},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# Standard libraries\nimport os\nimport gc\nimport pydicom  # For accessing DICOM files\nimport numpy as np\nimport pandas as pd\nimport random as rn\nfrom tqdm import tqdm\n\n# Visualization\nimport matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\n\n# Paths\nKAGGLE_DIR = '../input/rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection/'\nIMG_PATH_TRAIN = KAGGLE_DIR + 'stage_2_train/'\nIMG_PATH_TEST = KAGGLE_DIR + 'stage_2_test/'\nTRAIN_CSV_PATH = KAGGLE_DIR + 'stage_2_train.csv'\nTEST_CSV_PATH = KAGGLE_DIR + 'stage_2_sample_submission.csv'\n\n# Seed for reproducibility\nseed = 1234\nnp.random.seed(seed)\nrn.seed(seed)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","trusted":true,"_kg_hide-input":true},"cell_type":"code","source":"# File sizes and specifications\nprint('\\n# Files and file sizes')\nfor file in os.listdir(KAGGLE_DIR)[2:]:\n    print('{}| {} MB'.format(file.ljust(30),\n                             str(round(os.path.getsize(KAGGLE_DIR + file) / 1000000, 2))))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Preparation <a id=\"2\"></a>","execution_count":null},{"metadata":{"trusted":true},"cell_type":"code","source":"# Load in
raw datasets\ntrain_df = pd.read_csv(TRAIN_CSV_PATH)\ntest_df = pd.read_csv(TEST_CSV_PATH)\n# For convenience, collect the subtype and separate PatientID as new features\nfor df in [train_df, test_df]:\n    df['Sub_type'] = df['ID'].str.split(\"_\", n = 3, expand = True)[2]\n    df['PatientID'] = df['ID'].str.split(\"_\", n = 3, expand = True)[1]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# All filenames for train and test images\ntrain_images = os.listdir(IMG_PATH_TRAIN)\ntest_images = os.listdir(IMG_PATH_TEST)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Metadata <a id=\"3\"></a>","execution_count":null},{"metadata":{},"cell_type":"markdown","source":"The [pydicom](https://pydicom.github.io/pydicom/stable/getting_started.html) library lets us read DICOM files and access their data elements as attributes. The image itself is available through the \"pixel_array\" attribute.","execution_count":null},{"metadata":{"_kg_hide-input":true,"trusted":true},"cell_type":"code","source":"print('Example of all data in a single DICOM file:\\n')\nexample_dicom = pydicom.dcmread(IMG_PATH_TRAIN + train_images[0])\nprint(example_dicom)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# All columns for which we want to collect information\nmeta_cols = ['BitsAllocated','BitsStored','Columns','HighBit',\n             'Modality','PatientID','PhotometricInterpretation',\n             'PixelRepresentation','RescaleIntercept','RescaleSlope',\n             'Rows','SOPInstanceUID','SamplesPerPixel','SeriesInstanceUID',\n             'StudyID','StudyInstanceUID','ImagePositionPatient',\n             'ImageOrientationPatient','PixelSpacing']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Initialize dictionaries to collect the metadata\ncol_dict_train = {col: [] for col in meta_cols}\ncol_dict_test = {col: [] for col in
meta_cols}","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_kg_hide-output":true},"cell_type":"code","source":"# Get values for training images\nfor img in tqdm(train_images):\n    # Only the header is needed, so skip reading the pixel data\n    dicom_object = pydicom.dcmread(IMG_PATH_TRAIN + img, stop_before_pixels=True)\n    for col in meta_cols:\n        col_dict_train[col].append(str(getattr(dicom_object, col)))\n\n# Store all information in a DataFrame\nmeta_df_train = pd.DataFrame(col_dict_train)\ndel col_dict_train\ngc.collect()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_kg_hide-output":true},"cell_type":"code","source":"# Get values for test images\nfor img in tqdm(test_images):\n    # Only the header is needed, so skip reading the pixel data\n    dicom_object = pydicom.dcmread(IMG_PATH_TEST + img, stop_before_pixels=True)\n    for col in meta_cols:\n        col_dict_test[col].append(str(getattr(dicom_object, col)))\n\n# Store all information in a DataFrame\nmeta_df_test = pd.DataFrame(col_dict_test)\ndel col_dict_test\ngc.collect()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Type Conversion <a id=\"4\"></a>","execution_count":null},{"metadata":{},"cell_type":"markdown","source":"Above, we took a shortcut and stored every metadata value as a string. We now convert each feature back to an appropriate type.\n\nAll numeric features are converted to floats. All categorical features stay as strings.\n\n'WindowCenter' and 'WindowWidth' are awkward because their values can be ints, floats, or lists. I skipped these features for now, but I may add them to this kernel later. Feel free to share code that handles them conveniently.\n\nThe features 'ImagePositionPatient', 'ImageOrientationPatient' and 'PixelSpacing' are stored as lists. To make them easy to access, we create a new column for every value in the list.
\n\nWe fill missing values with a value that lies outside the range of the feature (-9999).\n","execution_count":null},{"metadata":{"trusted":true},"cell_type":"code","source":"# Specify numeric columns\nnum_cols = ['BitsAllocated', 'BitsStored','Columns','HighBit', 'Rows',\n            'PixelRepresentation', 'RescaleIntercept', 'RescaleSlope', 'SamplesPerPixel']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_kg_hide-input":false,"_kg_hide-output":true},"cell_type":"code","source":"# Split to get proper PatientIDs\nmeta_df_train['PatientID'] = meta_df_train['PatientID'].str.split(\"_\", n = 3, expand = True)[1]\nmeta_df_test['PatientID'] = meta_df_test['PatientID'].str.split(\"_\", n = 3, expand = True)[1]\n\n# Convert all numeric cols to floats\nfor col in num_cols:\n    meta_df_train[col] = meta_df_train[col].fillna(-9999).astype(float)\n    meta_df_test[col] = meta_df_test[col].fillna(-9999).astype(float)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_kg_hide-input":false,"_kg_hide-output":true},"cell_type":"code","source":"# Expand list-valued features into one float column per component\nimport ast\n\nmulti_cols = [('ImagePositionPatient', 3),\n              ('ImageOrientationPatient', 6),\n              ('PixelSpacing', 2)]\nfor df in [meta_df_train, meta_df_test]:\n    for col, n_values in multi_cols:\n        # Fill missing values with a list of -9999 placeholders\n        filled = df[col].fillna(str([-9999] * n_values))\n        # Parse the stringified lists without executing arbitrary code\n        parsed = [ast.literal_eval(value) for value in filled.values]\n        for i in range(n_values):\n            df['{}_{}'.format(col, i + 1)] = [float(v[i]) for v in parsed]","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Merge and Save <a id=\"5\"></a>","execution_count":null},{"metadata":{},"cell_type":"markdown","source":"This metadata is only useful if we can connect it to specific images. To make sure every value ends up in the correct row, we merge on the PatientID feature. However, an inner or left join will not work since our DataFrame with metadata contains a lot of rows that are not in the original DataFrame.
Joining on the right and using a few columns from the original DataFrame will do the trick.","execution_count":null},{"metadata":{"trusted":true},"cell_type":"code","source":"# Merge DataFrames\ntrain_df_merged = meta_df_train.merge(train_df, how='left', on='PatientID')\ntrain_df_merged['ID'] = train_df['ID']\ntrain_df_merged['Label'] = train_df['Label']\ntrain_df_merged['Sub_type'] = train_df['Sub_type']\ntest_df_merged = meta_df_test.merge(test_df, how='left', on='PatientID')\ntest_df_merged['ID'] = test_df['ID']\ntest_df_merged['Label'] = test_df['Label']\ntest_df_merged['Sub_type'] = test_df['Sub_type']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Save to CSV\ntrain_df_merged.to_csv('stage_2_train_with_metadata.csv', index=False)\ntest_df_merged.to_csv('stage_2_test_with_metadata.csv', index=False)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Final Check <a id=\"6\"></a>","execution_count":null},{"metadata":{"_kg_hide-input":true,"trusted":true},"cell_type":"code","source":"# Final check on the new dataset\nprint('Training Data:')\ndisplay(train_df_merged.head(3))\ndisplay(train_df_merged.tail(3))\nprint('Testing Data:')\ndisplay(test_df_merged.head(3))\ndisplay(test_df_merged.tail(3))","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4}