Patient Selection for Diabetes Drug Testing

Utilizing a synthetic Diabetes patient dataset, we will create a deep learning model trained on EHR data (Electronic Health Records) to find suitable patients for testing a new Diabetes drug.
health
deep-learning
electronic-health-records
Author

Pranath Fernando

Published

February 6, 2022

1 Introduction

EHR data is becoming a key source of real-world evidence (RWE) for the pharmaceutical industry and regulators to make decisions on clinical trials.

For this project, we have a groundbreaking diabetes drug that is ready for clinical trial testing. It is a unique and sensitive drug that must be administered over at least 5-7 days in hospital, with frequent monitoring and testing, and with patient medication-adherence training via a mobile application. We have been provided with a patient dataset from a client partner and are tasked with building a predictive model that can identify which types of patients the company should focus on when testing this drug. Target patients are people who are likely to be in hospital for this length of time anyway, so that administering the drug and monitoring the patient will not incur significant additional costs.

In order to achieve our goal we must build a regression model that can predict the estimated hospitalization time for a patient and use this to select/filter patients for this study.

2 Approach

Utilizing a synthetic dataset (augmented and denormalized at the line level) built off of the UCI Diabetes readmission dataset, we will build a regression model that predicts the expected number of days in hospital, and then convert this to a binary prediction of whether to include or exclude that patient from the clinical trial.

This project will demonstrate the importance of building the right data representation at the encounter level, with appropriate filtering and preprocessing/feature engineering of key medical code sets. We will also analyze and interpret the model for biases across key demographic groups.

3 Dataset

Due to healthcare PHI regulations (HIPAA, HITECH), there are a limited number of publicly available datasets, and some datasets require training and approval. So, for the purpose of this study, we are using a modified version of a dataset from UC Irvine.

4 Dataset Loading and Schema Review

import numpy as np
import pandas as pd

dataset_path = "./data/final_project_dataset.csv"
df = pd.read_csv(dataset_path)
# Show first few rows
df.head()
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty primary_diagnosis_code other_diagnosis_codes number_outpatient number_inpatient number_emergency num_lab_procedures number_diagnoses num_medications num_procedures ndc_code max_glu_serum A1Cresult change readmitted
0 2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ? Pediatrics-Endocrinology 250.83 ?|? 0 0 0 41 1 1 0 NaN None None No NO
1 149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ? ? 276 250.01|255 0 0 0 59 9 18 0 68071-1701 None None Ch >30
2 64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ? ? 648 250|V27 2 1 0 11 6 13 5 0378-1110 None None No NO
3 500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ? ? 8 250.43|403 0 0 0 44 7 16 1 68071-1701 None None Ch NO
4 16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ? ? 197 157|250 0 0 0 51 5 8 0 0049-4110 None None Ch NO

4.1 Determine Level of Dataset (Line or Encounter)

Given that there are only 101766 unique encounter_id values yet 143424 rows, the dataset appears to be at the line level: a single encounter can span several rows.

We would also want to aggregate on the primary_diagnosis_code, as there is only one of these per encounter. By aggregating on these 3 columns, we can create an encounter-level dataset.
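The check described above can be sketched as a comparison between the row count and the number of unique encounter ids. This is a minimal illustration on a toy dataframe with made-up values; the helper name is_line_level is hypothetical and not part of the project code.

```python
import pandas as pd

# Toy frame with hypothetical values: encounter 2 has two medication lines.
toy_df = pd.DataFrame({
    "encounter_id": [1, 2, 2, 3],
    "patient_nbr": [10, 11, 11, 10],
    "ndc_code": ["0087-6060", "0049-4110", "0378-1110", None],
})

def is_line_level(df, encounter_col="encounter_id"):
    """Return True when at least one encounter spans more than one row."""
    return len(df) > df[encounter_col].nunique()

print(len(toy_df), toy_df["encounter_id"].nunique())
print(is_line_level(toy_df))
```

On the real dataset the same comparison would be between 143424 rows and 101766 unique encounter_id values.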

5 Analyze Dataset

# Look at range of values & key stats for numerical columns
numerical_feature_list = ['time_in_hospital',  'number_outpatient', 'number_inpatient', 'number_emergency', 'num_lab_procedures', 'number_diagnoses', 'num_medications', 'num_procedures' ]
df[numerical_feature_list].describe()
time_in_hospital number_outpatient number_inpatient number_emergency num_lab_procedures number_diagnoses num_medications num_procedures
count 143424.000000 143424.000000 143424.000000 143424.000000 143424.000000 143424.000000 143424.000000 143424.000000
mean 4.490190 0.362429 0.600855 0.195086 43.255745 7.424434 16.776035 1.349021
std 2.999667 1.249295 1.207934 0.920410 19.657319 1.924872 8.397130 1.719104
min 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000
25% 2.000000 0.000000 0.000000 0.000000 32.000000 6.000000 11.000000 0.000000
50% 4.000000 0.000000 0.000000 0.000000 44.000000 8.000000 15.000000 1.000000
75% 6.000000 0.000000 1.000000 0.000000 57.000000 9.000000 21.000000 2.000000
max 14.000000 42.000000 21.000000 76.000000 132.000000 16.000000 81.000000 6.000000
# Define utility functions
def create_cardinality_feature(df):
    num_rows = len(df)
    random_code_list = np.arange(100, 1000, 1)
    return np.random.choice(random_code_list, num_rows)

def count_unique_values(df, cat_col_list):
    # Copy the slice so that adding a column does not touch the original dataframe
    cat_df = df[cat_col_list].copy()
    # Add a synthetic feature with high cardinality
    cat_df['principal_diagnosis_code'] = create_cardinality_feature(cat_df)
    val_df = pd.DataFrame({'columns': cat_df.columns,
                       'cardinality': cat_df.nunique() } )
    return val_df

categorical_feature_list = [ 'race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty', 'primary_diagnosis_code', 'other_diagnosis_codes','ndc_code', 'max_glu_serum', 'A1Cresult', 'change', 'readmitted']

categorical_df = count_unique_values(df, categorical_feature_list)
categorical_df
columns cardinality
race race 6
gender gender 3
age age 10
weight weight 10
payer_code payer_code 18
medical_specialty medical_specialty 73
primary_diagnosis_code primary_diagnosis_code 717
other_diagnosis_codes other_diagnosis_codes 19374
ndc_code ndc_code 251
max_glu_serum max_glu_serum 4
A1Cresult A1Cresult 4
change change 2
readmitted readmitted 3
principal_diagnosis_code principal_diagnosis_code 900

5.1 Analysis key findings

  • The ndc_code field has a large number of missing values (23460)
  • num_lab_procedures and num_medications appear to be roughly normally distributed
  • Fields with high cardinality are medical_specialty, primary_diagnosis_code, other_diagnosis_codes, ndc_code, and principal_diagnosis_code. This is because there are many thousands of these codes, corresponding to the many disease and diagnosis sub-classes that exist in the medical field.
  • The distribution of the age field is approximately normal, which we would expect. The distribution of the gender field is roughly uniform, discounting the very small number of Unknown/Invalid cases. Again this is not surprising: the distribution of genders in the general population is also roughly equal, so this seems to be a representative sample of the general population.

6 Reduce Dimensionality of the NDC Code Feature

NDC codes are a common format for representing the wide variety of drugs that are prescribed for patient care in the United States. The challenge is that many codes map to the same or a similar drug. We are provided with an NDC drug lookup file (https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ndc_lookup_table.csv) derived from the National Drug Codes List site (https://ndclist.com/).

We can use this file to come up with a way to reduce the dimensionality of this field and create a new field in the dataset called “generic_drug_name” in the output dataframe.

#NDC code lookup file
ndc_code_path = "./medication_lookup_tables/final_ndc_lookup_table"
ndc_code_df = pd.read_csv(ndc_code_path)
# Check first few rows
ndc_code_df.head()
NDC_Code Proprietary Name Non-proprietary Name Dosage Form Route Name Company Name Product Type
0 0087-6060 Glucophage Metformin Hydrochloride Tablet, Film Coated Oral Bristol-myers Squibb Company Human Prescription Drug
1 0087-6063 Glucophage XR Metformin Hydrochloride Tablet, Extended Release Oral Bristol-myers Squibb Company Human Prescription Drug
2 0087-6064 Glucophage XR Metformin Hydrochloride Tablet, Extended Release Oral Bristol-myers Squibb Company Human Prescription Drug
3 0087-6070 Glucophage Metformin Hydrochloride Tablet, Film Coated Oral Bristol-myers Squibb Company Human Prescription Drug
4 0087-6071 Glucophage Metformin Hydrochloride Tablet, Film Coated Oral Bristol-myers Squibb Company Human Prescription Drug
# Check for duplicate NDC_Code's
ndc_code_df[ndc_code_df.duplicated(subset=['NDC_Code'])]
NDC_Code Proprietary Name Non-proprietary Name Dosage Form Route Name Company Name Product Type
263 0781-5634 Pioglitazone Hydrochloride And Glimepiride Pioglitazone Hydrochloride And Glimepiride Tablet Oral Sandoz Inc Human Prescription Drug
264 0781-5635 Pioglitazone Hydrochloride And Glimepiride Pioglitazone Hydrochloride And Glimepiride Tablet Oral Sandoz Inc Human Prescription Drug
# Remove duplicates
ndc_code_df = ndc_code_df.drop(ndc_code_df.index[[263,264]])
ndc_code_df[ndc_code_df.duplicated(subset=['NDC_Code'])]
NDC_Code Proprietary Name Non-proprietary Name Dosage Form Route Name Company Name Product Type
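The next section uses a dataframe called reduce_dim_df, whose construction is not shown here. One plausible way to build it, sketched under the assumption that the generic name is taken from the lookup table's "Non-proprietary Name" column, is a left merge on the NDC code. The function name reduce_dimension_ndc_sketch and the toy data are hypothetical.

```python
import pandas as pd

def reduce_dimension_ndc_sketch(df, ndc_df):
    """Map each ndc_code to its non-proprietary (generic) drug name."""
    lookup = (ndc_df[["NDC_Code", "Non-proprietary Name"]]
              .drop_duplicates(subset=["NDC_Code"])
              .rename(columns={"NDC_Code": "ndc_code",
                               "Non-proprietary Name": "generic_drug_name"}))
    # A left merge keeps every line of the original dataset, including
    # rows whose ndc_code is missing (generic_drug_name stays NaN there).
    return df.merge(lookup, on="ndc_code", how="left")

# Toy data: two different NDC codes that map to the same generic drug.
toy_lines = pd.DataFrame({"encounter_id": [1, 2, 3],
                          "ndc_code": ["0087-6060", "0087-6063", None]})
toy_lookup = pd.DataFrame({
    "NDC_Code": ["0087-6060", "0087-6063"],
    "Non-proprietary Name": ["Metformin Hydrochloride"] * 2,
})
toy_reduced = reduce_dimension_ndc_sketch(toy_lines, toy_lookup)
print(toy_reduced)
```

In the toy example two distinct codes collapse into a single generic_drug_name value, which is the dimensionality reduction this section is after.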

7 Select First Encounter for each Patient

In order to simplify the aggregation of data for the model, we will only select the first encounter for each patient in the dataset. This is to reduce the risk of data leakage of future patient encounters and to reduce complexity of the data transformation and modeling steps. We will assume that sorting in numerical order on the encounter_id provides the time horizon for determining which encounters come before and after another.

from student_utils import select_first_encounter
first_encounter_df = select_first_encounter(reduce_dim_df)
# unique patients in transformed dataset
unique_patients = first_encounter_df['patient_nbr'].nunique()
print("Number of unique patients:{}".format(unique_patients))

# unique encounters in transformed dataset
unique_encounters = first_encounter_df['encounter_id'].nunique()
print("Number of unique encounters:{}".format(unique_encounters))

original_unique_patient_number = reduce_dim_df['patient_nbr'].nunique()
# number of unique patients should be equal to the number of unique encounters and patients in the final dataset
assert original_unique_patient_number == unique_patients
assert original_unique_patient_number == unique_encounters
  • Number of unique patients:71518
  • Number of unique encounters:71518
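The body of select_first_encounter lives in student_utils and is not shown. A minimal sketch of the idea, assuming that the smallest encounter_id per patient is the earliest encounter (as stated above) and that encounter ids are unique across patients, could look like this. The function name and toy values are hypothetical.

```python
import pandas as pd

def select_first_encounter_sketch(df, patient_col="patient_nbr",
                                  encounter_col="encounter_id"):
    """Keep every line that belongs to each patient's earliest encounter."""
    first_ids = df.groupby(patient_col)[encounter_col].min()
    return df[df[encounter_col].isin(first_ids)].reset_index(drop=True)

# Patient 10 has encounters 5 (two lines) and 9; patient 11 has encounter 7.
toy_encounters = pd.DataFrame({
    "encounter_id": [5, 5, 9, 7],
    "patient_nbr": [10, 10, 10, 11],
    "generic_drug_name": ["Metformin Hydrochloride", "Insulin Human",
                          "Glipizide", None],
})
toy_first = select_first_encounter_sketch(toy_encounters)
print(toy_first)
```

The result is still at the line level (encounter 5 keeps both of its lines), which is why the aggregation step in the next section is still needed.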

8 Aggregate Dataset to Right Level for Modelling

To make it simpler, we are creating dummy columns for each unique generic drug name and adding those as input features to the model.

exclusion_list = ['generic_drug_name']
grouping_field_list = [c for c in first_encounter_df.columns if c not in exclusion_list]
agg_drug_df, ndc_col_list = aggregate_dataset(first_encounter_df, grouping_field_list, 'generic_drug_name')
assert len(agg_drug_df) == agg_drug_df['patient_nbr'].nunique() == agg_drug_df['encounter_id'].nunique()
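The aggregate_dataset helper is not shown in this post. As a rough sketch of the dummy-column idea described above, one could one-hot encode the drug name per line and then take the maximum per group, so that each encounter ends up as a single row with one 0/1 column per drug. The function name and toy data below are hypothetical, and the real helper may differ in its details.

```python
import pandas as pd

def aggregate_dataset_sketch(df, grouping_field_list, array_field):
    """Collapse line-level rows into one row per group with drug dummy columns."""
    # Rows with a missing drug name get zeros in every dummy column.
    dummies = pd.get_dummies(df[array_field]).astype(int)
    dummies.columns = [str(c).replace(" ", "_") for c in dummies.columns]
    dummy_col_list = list(dummies.columns)
    combined = pd.concat([df[grouping_field_list], dummies], axis=1)
    # dropna=False keeps groups whose key columns contain missing values.
    agg_df = (combined.groupby(grouping_field_list, dropna=False)[dummy_col_list]
              .max()
              .reset_index())
    return agg_df, dummy_col_list

toy_lines_df = pd.DataFrame({
    "encounter_id": [1, 1, 2],
    "patient_nbr": [10, 10, 11],
    "generic_drug_name": ["Metformin Hydrochloride", "Insulin Human", None],
})
toy_agg, toy_drug_cols = aggregate_dataset_sketch(
    toy_lines_df, ["encounter_id", "patient_nbr"], "generic_drug_name")
print(toy_agg)
```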

9 Prepare Fields and Cast Dataset

9.1 Feature Selection

# Look at counts for payer_code categories
ax = sns.countplot(x="payer_code", data=agg_drug_df)

[Figure: count plot of payer_code categories]
# Look at counts for weight categories
ax = sns.countplot(x="weight", data=agg_drug_df)

[Figure: count plot of weight categories]

From the category counts above, we can see that for payer_code, while there are many unknown values (i.e. ‘?’), there are still many values for other payer codes, and these may prove useful predictors for our target variable. For weight, there are so few known values (almost all are the unknown ‘?’ code) that this feature is unlikely to be very helpful for predicting our target variable.

# Selected features
required_demo_col_list = ['race', 'gender', 'age']
student_categorical_col_list = [ "change", "readmitted", "payer_code", "medical_specialty", "primary_diagnosis_code", "other_diagnosis_codes", "max_glu_serum", "A1Cresult",  "admission_type_id", "discharge_disposition_id", "admission_source_id"] + required_demo_col_list + ndc_col_list
student_numerical_col_list = ["number_outpatient", "number_inpatient", "number_emergency", "num_lab_procedures", "number_diagnoses", "num_medications", "num_procedures"]
PREDICTOR_FIELD = 'time_in_hospital'
def select_model_features(df, categorical_col_list, numerical_col_list, PREDICTOR_FIELD, grouping_key='patient_nbr'):
    selected_col_list = [grouping_key] + [PREDICTOR_FIELD] + categorical_col_list + numerical_col_list
    # Select from the dataframe passed in, not from the global agg_drug_df
    return df[selected_col_list]
selected_features_df = select_model_features(agg_drug_df, student_categorical_col_list, student_numerical_col_list,
                                            PREDICTOR_FIELD)

9.2 Preprocess Dataset - Casting and Imputing

We will cast and impute the dataset before splitting so that we do not have to repeat these steps for each partition in the next step. There could be deeper analysis into which features to impute and how to impute them, but for the sake of time we take the general strategy of imputing zero for numerical features and the placeholder string 'nan' for categorical features.

processed_df = preprocess_df(selected_features_df, student_categorical_col_list,
        student_numerical_col_list, PREDICTOR_FIELD, categorical_impute_value='nan', numerical_impute_value=0)
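The preprocess_df helper is not shown here. A minimal sketch of the cast-and-impute step, assuming categorical fields are cast to strings and numerical fields and the target to floats, might look like the following. The function name and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess_sketch(df, categorical_col_list, numerical_col_list, predictor,
                      categorical_impute_value="nan", numerical_impute_value=0):
    """Cast columns to model-friendly types and fill missing values."""
    out = df.copy()
    out[predictor] = out[predictor].astype(float)
    for c in categorical_col_list:
        # Work on object dtype so the string placeholder can always be inserted.
        out[c] = out[c].astype(object).fillna(categorical_impute_value).astype(str)
    for c in numerical_col_list:
        out[c] = out[c].fillna(numerical_impute_value).astype(float)
    return out

toy_raw = pd.DataFrame({"time_in_hospital": [3, 9],
                        "payer_code": ["MC", None],
                        "num_procedures": [1.0, np.nan]})
toy_processed = preprocess_sketch(toy_raw, ["payer_code"], ["num_procedures"],
                                  "time_in_hospital")
print(toy_processed)
```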

10 Split Dataset into Train, Validation, and Test Partitions

In order to prepare the data for training and evaluating a deep learning model, we will split the dataset into three partitions, with the validation partition used for optimizing the model hyperparameters during training. A key requirement is that data must not accidentally leak across partitions.

We will split the input dataset into three partitions (train, validation, test) with the following requirements:

  • Approximately 60%/20%/20% train/validation/test split
  • Randomly sample different patients into each data partition
  • Take care that a patient’s data is not in more than one partition, so that we avoid possible data leakage
  • Take care that the total number of unique patients across the splits equals the total number of unique patients in the original dataset
  • Total number of rows in original dataset = sum of rows across all three dataset partitions
from student_utils import patient_dataset_splitter
d_train, d_val, d_test = patient_dataset_splitter(processed_df, 'patient_nbr')
  • Total number of unique patients in train = 32563
  • Total number of unique patients in validation = 10854
  • Total number of unique patients in test = 10854
  • Training partition has a shape = (32563, 43)
  • Validation partition has a shape = (10854, 43)
  • Test partition has a shape = (10854, 43)
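The requirements listed above can be sketched by shuffling the unique patient ids and cutting that list 60/20/20, then selecting rows by id. This is only an illustration of the technique; the function name split_by_patient, the seed and the toy data are assumptions, and the actual patient_dataset_splitter in student_utils may differ.

```python
import numpy as np
import pandas as pd

def split_by_patient(df, patient_key="patient_nbr", train_frac=0.6,
                     val_frac=0.2, seed=0):
    """Split rows into train/validation/test so no patient spans two partitions."""
    rng = np.random.default_rng(seed)
    patients = rng.permutation(df[patient_key].unique())
    n_train = int(round(train_frac * len(patients)))
    n_val = int(round(val_frac * len(patients)))
    id_groups = (patients[:n_train],
                 patients[n_train:n_train + n_val],
                 patients[n_train + n_val:])
    return tuple(df[df[patient_key].isin(ids)].reset_index(drop=True)
                 for ids in id_groups)

# Toy data: 10 patients with 2 rows each.
toy_patients = pd.DataFrame({"patient_nbr": np.repeat(np.arange(10), 2),
                             "time_in_hospital": np.arange(20) % 14 + 1})
toy_train, toy_val, toy_test = split_by_patient(toy_patients)

# The same checks as in the requirements list above.
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy_patients)
assert set(toy_train["patient_nbr"]).isdisjoint(toy_val["patient_nbr"])
assert set(toy_train["patient_nbr"]).isdisjoint(toy_test["patient_nbr"])
assert set(toy_val["patient_nbr"]).isdisjoint(toy_test["patient_nbr"])
```

Splitting on patient ids rather than on rows is what keeps one patient's data out of more than one partition.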

11 Demographic Representation Analysis of Split

After the split, we should check the distribution of key features/groups and make sure that there are representative samples across the partitions.

11.1 Label Distribution Across Partitions

Are the histogram distribution shapes similar across partitions?

show_group_stats_viz(processed_df, PREDICTOR_FIELD)

[Figure: distribution of time_in_hospital in the full dataset]
show_group_stats_viz(d_train, PREDICTOR_FIELD)

[Figure: distribution of time_in_hospital in the training partition]
show_group_stats_viz(d_test, PREDICTOR_FIELD)

[Figure: distribution of time_in_hospital in the test partition]

11.2 Demographic Group Analysis

We should check that our partitions/splits of the dataset are similar in terms of their demographic profiles.

# Full dataset before splitting
patient_demo_features = ['race', 'gender', 'age', 'patient_nbr']
patient_group_analysis_df = processed_df[patient_demo_features].groupby('patient_nbr').head(1).reset_index(drop=True)
show_group_stats_viz(patient_group_analysis_df, 'gender')

[Figure: gender distribution in the full dataset]
# Training partition
show_group_stats_viz(d_train, 'gender')

[Figure: gender distribution in the training partition]
# Test partition
show_group_stats_viz(d_test, 'gender')

[Figure: gender distribution in the test partition]

12 Convert Dataset Splits to TF Dataset

# Convert dataset from Pandas dataframes to TF dataset
batch_size = 128
diabetes_train_ds = df_to_dataset(d_train, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_val_ds = df_to_dataset(d_val, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_test_ds = df_to_dataset(d_test, PREDICTOR_FIELD, batch_size=batch_size)
# We use this sample of the dataset to show transformations later
diabetes_batch = next(iter(diabetes_train_ds))[0]
def demo(feature_column, example_batch):
    # Apply a single feature column to a sample batch and print the result
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch))

13 Create Features

13.1 Create Categorical Features with TF Feature Columns

Before we can create the TF categorical features, we must first create the vocab files with the unique values for a given field that are from the training dataset.

# Build Vocabulary for Categorical Features
vocab_file_list = build_vocab_files(d_train, student_categorical_col_list)
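The build_vocab_files helper is not shown. As a rough sketch of what such a step involves, one could write one text file per categorical column containing the unique training values, one per line. The function name, the file naming scheme and the toy data below are assumptions for illustration only.

```python
import os
import tempfile
import pandas as pd

def build_vocab_files_sketch(train_df, categorical_col_list, vocab_dir):
    """Write one text file per categorical column, one unique value per line."""
    os.makedirs(vocab_dir, exist_ok=True)
    vocab_file_list = []
    for col in categorical_col_list:
        path = os.path.join(vocab_dir, col + "_vocab.txt")
        values = sorted(train_df[col].astype(str).unique())
        with open(path, "w") as f:
            f.write("\n".join(values))
        vocab_file_list.append(path)
    return vocab_file_list

toy_train_df = pd.DataFrame({"race": ["Caucasian", "Asian", "Caucasian"],
                             "gender": ["Female", "Male", "Male"]})
toy_vocab_dir = tempfile.mkdtemp()
toy_vocab_files = build_vocab_files_sketch(toy_train_df, ["race", "gender"],
                                           toy_vocab_dir)
print(toy_vocab_files)
```

Building the vocabulary from the training partition only means that values seen solely in the validation or test partitions do not influence it.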

13.2 Create Categorical Features with Tensorflow Feature Column API

from student_utils import create_tf_categorical_feature_cols
tf_cat_col_list = create_tf_categorical_feature_cols(student_categorical_col_list)
test_cat_var1 = tf_cat_col_list[0]
print("Example categorical field:\n{}".format(test_cat_var1))
demo(test_cat_var1, diabetes_batch)

13.3 Create Numerical Features with TF Feature Columns

from student_utils import create_tf_numeric_feature
def calculate_stats_from_train_data(df, col):
    # Mean and sample standard deviation of a column in the training data
    mean = df[col].mean()
    std = df[col].std()
    return mean, std

def create_tf_numerical_feature_cols(numerical_col_list, train_df):
    tf_numeric_col_list = []
    for c in numerical_col_list:
        mean, std = calculate_stats_from_train_data(train_df, c)
        tf_numeric_feature = create_tf_numeric_feature(c, mean, std)
        tf_numeric_col_list.append(tf_numeric_feature)
    return tf_numeric_col_list
tf_cont_col_list = create_tf_numerical_feature_cols(student_numerical_col_list, d_train)
test_cont_var1 = tf_cont_col_list[0]
print("Example continuous field:\n{}\n".format(test_cont_var1))
demo(test_cont_var1, diabetes_batch)

14 Build Deep Learning Regression Model with Sequential API and TF Probability Layers

14.1 Use DenseFeatures to combine features for model

Now that we have prepared categorical and numerical features using Tensorflow’s Feature Column API, we can combine them into a dense vector representation for the model. Below we will create this new input layer, which we will call ‘claim_feature_layer’.

claim_feature_columns = tf_cat_col_list + tf_cont_col_list
claim_feature_layer = tf.keras.layers.DenseFeatures(claim_feature_columns)

14.2 Build Sequential API Model from DenseFeatures and TF Probability Layers

def build_sequential_model(feature_layer):
    model = tf.keras.Sequential([
        feature_layer,
        tf.keras.layers.Dense(150, activation='relu'),
        tf.keras.layers.Dense(200, activation='relu'),# New
        tf.keras.layers.Dense(75, activation='relu'),
        tfp.layers.DenseVariational(1+1, posterior_mean_field, prior_trainable),
        tfp.layers.DistributionLambda(
            lambda t:tfp.distributions.Normal(loc=t[..., :1],
                                             scale=1e-3 + tf.math.softplus(0.01 * t[...,1:])
                                             )
        ),
    ])
    return model

def build_diabetes_model(train_ds, val_ds,  feature_layer,  epochs=5, loss_metric='mse'):
    model = build_sequential_model(feature_layer)
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    model.compile(optimizer=opt, loss=loss_metric, metrics=[loss_metric])
    #model.compile(optimizer='rmsprop', loss=loss_metric, metrics=[loss_metric])
    #early_stop = tf.keras.callbacks.EarlyStopping(monitor=loss_metric, patience=3)     
    history = model.fit(train_ds, validation_data=val_ds,
                        #callbacks=[early_stop],
                        epochs=epochs)
    return model, history
diabetes_model, history = build_diabetes_model(diabetes_train_ds, diabetes_val_ds,  claim_feature_layer,  epochs=10)

14.3 Show Model Uncertainty Range with TF Probability

Now that we have trained a model with TF Probability layers, we can extract the mean and standard deviation for each prediction.

feature_list = student_categorical_col_list + student_numerical_col_list
diabetes_x_tst = dict(d_test[feature_list])
diabetes_yhat = diabetes_model(diabetes_x_tst)
preds = diabetes_model.predict(diabetes_test_ds)
from student_utils import get_mean_std_from_preds
m, s = get_mean_std_from_preds(diabetes_yhat)

14.4 Show Prediction Output

prob_outputs = {
    "pred": preds.flatten(),
    "actual_value": d_test['time_in_hospital'].values,
    "pred_mean": m.numpy().flatten(),
    "pred_std": s.numpy().flatten()
}
prob_output_df = pd.DataFrame(prob_outputs)
prob_output_df.head()
pred actual_value pred_mean pred_std
0 3.587955 3.0 4.673843 0.693749
1 5.007016 2.0 4.673843 0.693749
2 4.809363 9.0 4.673843 0.693749
3 5.003417 2.0 4.673843 0.693749
4 5.346958 8.0 4.673843 0.693749
prob_output_df.describe()
pred actual_value pred_mean pred_std
count 10854.000000 10854.000000 10854.000000 10854.000000
mean 4.376980 4.429888 4.673843 0.693749
std 0.908507 3.002044 0.000000 0.000000
min 0.976290 1.000000 4.673843 0.693749
25% 3.755292 2.000000 4.673843 0.693749
50% 4.382993 4.000000 4.673843 0.693749
75% 5.002859 6.000000 4.673843 0.693749
max 7.529900 14.000000 4.673843 0.693749

14.5 Convert Regression Output to Classification Output for Patient Selection

from student_utils import get_student_binary_prediction
student_binary_prediction = get_student_binary_prediction(prob_output_df, 'pred')
student_binary_prediction.value_counts()
  • 0:8137
  • 1:2717
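The get_student_binary_prediction helper is not shown. Since the actual labels in the next section are defined as time_in_hospital >= 5, a natural sketch is to apply the same 5-day threshold to the predicted value. The function name below is hypothetical; the five prediction values are the first rows of the prediction output table above.

```python
import pandas as pd

def binary_prediction_sketch(df, col, threshold=5):
    """Return 1 where the predicted stay is at least `threshold` days, else 0."""
    return (df[col] >= threshold).astype(int)

# First five predictions from the prediction output table above
toy_preds = pd.DataFrame(
    {"pred": [3.587955, 5.007016, 4.809363, 5.003417, 5.346958]})
toy_binary = binary_prediction_sketch(toy_preds, "pred")
print(toy_binary.tolist())
```

For these five rows the thresholded values agree with the score column shown in section 14.6.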

14.6 Add Binary Prediction to Test Dataframe

The student_binary_prediction output holds binary labels, which we can add to a dataframe, both to visualize the predictions more easily and to prepare the data for the Aequitas toolkit. Aequitas requires a binary label for the prediction (the ‘score’ field) and for the actual value (the ‘label_value’ field).

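def add_pred_to_test(test_df, pred_np, demo_col_list):
    # Work on a copy so the original test partition is left unchanged
    test_df = test_df.copy()
    for c in demo_col_list:
        test_df[c] = test_df[c].astype(str)
    test_df['score'] = pred_np
    # Actual label: 1 if the patient stayed at least 5 days, else 0
    test_df['label_value'] = test_df['time_in_hospital'].apply(lambda x: 1 if x >=5 else 0)
    return test_df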

pred_test_df = add_pred_to_test(d_test, student_binary_prediction, ['race', 'gender'])
pred_test_df[['patient_nbr', 'gender', 'race', 'time_in_hospital', 'score', 'label_value']].head()
patient_nbr gender race time_in_hospital score label_value
0 122896787 Male Caucasian 3.0 0 0
1 102598929 Male Caucasian 2.0 1 0
2 80367957 Male Caucasian 9.0 0 1
3 6721533 Male Caucasian 2.0 1 0
4 104346288 Female Caucasian 8.0 1 1

15 Model Evaluation Metrics

Now it is time to use the newly created binary labels in the ‘pred_test_df’ dataframe to evaluate the model with some common classification metrics. We will create a summary report of the model's performance and give the ROC AUC, F1 score (weighted), and class precision and recall scores.

# Accuracy and ROC AUC on the binary labels
from sklearn.metrics import accuracy_score, roc_auc_score
y_true = pred_test_df['label_value'].values
y_pred = pred_test_df['score'].values
accuracy_score(y_true, y_pred)
  • 0.5627418463239359
roc_auc_score(y_true, y_pred)
  • 0.5032089104088319
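The introduction to this section also mentions precision, recall and F1, which the two calls above do not print. As a rough cross-check, they can be derived from the confusion counts. The counts used below are obtained by summing the Female and Male rows of the Aequitas crosstab in section 16.1 (for example tp = 593 + 457); the function name binary_metrics is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # For hard 0/1 predictions the ROC curve has a single operating
        # point, so ROC AUC reduces to the mean of recall and specificity.
        "roc_auc": (recall + specificity) / 2,
    }

# Female + Male rows of the crosstab in section 16.1
metrics = binary_metrics(tp=593 + 457, fp=820 + 847,
                         fn=1675 + 1404, tn=2631 + 2427)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The accuracy and ROC AUC computed this way reproduce the two values printed above (about 0.5627 and 0.5032). For the positive class, precision comes out at about 0.39 and recall at about 0.25.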

Precision-recall tradeoff - The aim is to identify the patients who are right for the trial with the fewest mistakes (precision), while also identifying as many of them as possible (recall). With an accuracy of about 0.56 and a ROC AUC of about 0.50, the current model is only marginally better than chance at this.

Areas of improvement - We could look to engineer new features that might help us better predict our target patients.

16 Evaluating Potential Model Biases with Aequitas Toolkit

16.1 Prepare Data For Aequitas Bias Toolkit

Using the gender and race fields, we will prepare the data for the Aequitas Toolkit.

# Aequitas
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.plotting import Plot
from aequitas.bias import Bias
from aequitas.fairness import Fairness

ae_subset_df = pred_test_df[['race', 'gender', 'score', 'label_value']]
ae_df, _ = preprocess_input_df(ae_subset_df)
g = Group()
xtab, _ = g.get_crosstabs(ae_df)
absolute_metrics = g.list_absolute_metrics(xtab)
clean_xtab = xtab.fillna(-1)
aqp = Plot()
b = Bias()
  • model_id, score_thresholds 1 {‘rank_abs’: [2717]}
absolute_metrics = g.list_absolute_metrics(xtab)
xtab[[col for col in xtab.columns if col not in absolute_metrics]]
model_id score_threshold k attribute_name attribute_value pp pn fp fn tn tp group_label_pos group_label_neg group_size total_entities
0 1 binary 0/1 2717 race ? 86 240 56 85 155 30 115 211 326 10854
1 1 binary 0/1 2717 race AfricanAmerican 491 1530 291 592 938 200 792 1229 2021 10854
2 1 binary 0/1 2717 race Asian 15 60 10 16 44 5 21 54 75 10854
3 1 binary 0/1 2717 race Caucasian 2030 6038 1249 2298 3740 781 3079 4989 8068 10854
4 1 binary 0/1 2717 race Hispanic 52 141 35 48 93 17 65 128 193 10854
5 1 binary 0/1 2717 race Other 43 128 26 40 88 17 57 114 171 10854
6 1 binary 0/1 2717 gender Female 1413 4306 820 1675 2631 593 2268 3451 5719 10854
7 1 binary 0/1 2717 gender Male 1304 3831 847 1404 2427 457 1861 3274 5135 10854

16.2 Reference Group Selection

# Test reference group with Caucasian Male
bdf = b.get_disparity_predefined_groups(clean_xtab,
                    original_df=ae_df,
                    ref_groups_dict={'race':'Caucasian', 'gender':'Male'
                                     },
                    alpha=0.05,
                    check_significance=False)


f = Fairness()
fdf = f.get_group_value_fairness(bdf)
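To make the disparity metric more concrete, the false positive rate (FPR) per group can be recomputed by hand from the crosstab in section 16.1 as fp / (fp + tn), and the disparity as the ratio of each group's FPR to that of the reference group. The sketch below does this with plain pandas rather than Aequitas; the helper name fpr_disparity is hypothetical and the counts are copied from the race rows of the crosstab.

```python
import pandas as pd

# fp and tn counts per race, copied from the crosstab in section 16.1
race_counts = pd.DataFrame(
    {"fp": [56, 291, 10, 1249, 35, 26],
     "tn": [155, 938, 44, 3740, 93, 88]},
    index=["?", "AfricanAmerican", "Asian", "Caucasian", "Hispanic", "Other"],
)

def fpr_disparity(counts, ref_group):
    """False positive rate per group and its ratio to the reference group."""
    fpr = counts["fp"] / (counts["fp"] + counts["tn"])
    return pd.DataFrame({"fpr": fpr, "fpr_disparity": fpr / fpr[ref_group]})

race_fpr = fpr_disparity(race_counts, ref_group="Caucasian")
print(race_fpr.round(3))
```

By construction the reference group has a disparity of 1.0. On these counts the Asian group's FPR is 10/54, roughly 0.19, based on only 54 negative cases, so it should be read with the small sample size in mind.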

16.3 Race and Gender Bias Analysis for Patient Selection

# Plot two metrics
# Is there significant bias in your model for either race or gender?
fpr_disparity1 = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='race')

[Figure: FPR disparity by race]

We notice that while for most races there is no significant indication of bias, there is an indication that Asians are less likely to be identified by the model, based on the 0.4 disparity in relation to the Caucasian reference group.

fpr_disparity2 = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='gender')

[Figure: FPR disparity by gender]

With gender, there does not seem to be any significant indication of bias.

16.4 Fairness Analysis Example - Relative to a Reference Group

# Reference group fairness plot
fpr_fairness = aqp.plot_fairness_group(fdf, group_metric='fpr', title=True)

[Figure: FPR fairness by group, relative to the reference group]

Here again we can see that there appears to be significant disparity, with the Asian race being under-represented with a magnitude of 0.19.
