How to Prepare Image Datasets and Models

This guide walks you through preparing image datasets and models for testing on AI Verify.

To test models that take images as input, you need the following, each of which is covered in this guide:

Dataset: Folder of images for testing
Annotated Ground Truth Dataset: DataFrame containing file names of the images, along with their ground truth labels
Model: Pipeline that processes image file paths before feeding into the final estimator

If you would like to follow along with this guide, you may download the relevant files via this link.

1. Dataset Preparation

AI Verify is able to process images stored in a folder. As such, you may prepare your testing data as a folder of images.

An example of the required folder structure:

└── raw_fashion_image_10
    ├── 0.png
    ├── 1.png
    ├── 2.png
        ...
    ├── 7.png        
    ├── 8.png
    └── 9.png

When the folder is uploaded, AI Verify converts it into a pandas DataFrame with a single column, 'image_directory', containing the file paths to the images. Keep this in mind, as the model pipeline must accept input in this form.

	image_directory
0	/home/documents/aiverify/uploads/raw_fashion_image_10/0.png
1	/home/documents/aiverify/uploads/raw_fashion_image_10/1.png
2	/home/documents/aiverify/uploads/raw_fashion_image_10/2.png
...	...
7	/home/documents/aiverify/uploads/raw_fashion_image_10/7.png
8	/home/documents/aiverify/uploads/raw_fashion_image_10/8.png
9	/home/documents/aiverify/uploads/raw_fashion_image_10/9.png

2. Annotated Ground Truth Dataset

While the test dataset can be uploaded as a folder, as detailed in 1. Dataset Preparation, an annotated ground truth dataset must be uploaded alongside it. This dataset maps each image file name to its ground truth label.

This section shows an example of how to prepare this dataset: import the relevant libraries, then load the DataFrame containing the labels for the test dataset.

First, import the relevant libraries:

import pickle, os
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from os.path import join
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Load the pickled DataFrame of test labels and name its single column 'labels'
with open('./data/pickle_pandas_fashion_mnist_test_labels.sav', 'rb') as f:
    test_labels = pickle.load(f)
test_labels = test_labels.rename(columns = {0:'labels'})
display(test_labels)

	labels
0	9
1	2
2	1
3	1
4	6
5	1
6	4
7	6
8	5
9	7

Next, create a DataFrame containing the file names of the images that these labels map to. In this example, the order of the labels in test_labels corresponds to the ascending numeric order of the files in the folder containing the test images.

test_dir_path = './data/raw_fashion_image_10/'
file_names = []

# Collect file names in ascending numeric order (0.png, 1.png, ..., 9.png);
# this assumes every file in the folder has an integer name
for i in sorted(os.listdir(test_dir_path),
                key=lambda i: int(os.path.splitext(os.path.basename(i))[0])):
    file_names.append(Path(i).name)

# Build the DataFrame once, after all file names have been collected
file_names_df = pd.DataFrame(file_names, columns = ["file_name"])

display(file_names_df)

	file_name
0	0.png
1	1.png
2	2.png
3	3.png
4	4.png
5	5.png
6	6.png
7	7.png
8	8.png
9	9.png
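The numeric sort above calls int() on every file name, so a stray file such as '.DS_Store' in the folder would raise a ValueError. As a minimal sketch (the helper name list_numbered_images and the accepted extensions are illustrative assumptions, not part of AI Verify), the listing step could skip anything that is not a numbered image:

```python
import os
import tempfile


def list_numbered_images(folder, extensions=(".png", ".jpg", ".jpeg")):
    """Return image file names whose stem is an integer, in ascending numeric order."""
    numbered = []
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        # Keep only files with an accepted extension and a purely numeric stem
        if ext.lower() in extensions and stem.isdecimal():
            numbered.append((int(stem), name))
    return [name for _, name in sorted(numbered)]


# Demonstration on a temporary folder containing some non-image clutter
with tempfile.TemporaryDirectory() as tmp:
    for name in ["10.png", "2.png", "0.png", ".DS_Store", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    ordered = list_numbered_images(tmp)

print(ordered)  # ['0.png', '2.png', '10.png']
```

Sorting on the integer value rather than the string keeps '10.png' after '2.png', which matters once the folder holds more than ten images.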

Create the annotated dataset by joining file_names_df and test_labels column-wise. Because pd.concat with axis = 1 aligns rows by index, both DataFrames must share the same index (here, the default 0 to 9). The result is the annotated ground truth dataset required by AI Verify: one column contains the file names, and the other contains the ground truth labels.

# Combine the file names and labels column-wise, then save the result
annotated_ground_truth = pd.concat((file_names_df,test_labels), axis = 1)
with open('./data/pickle_pandas_fashion_mnist_annotated_labels_10.sav', 'wb') as f:
    pickle.dump(annotated_ground_truth, f)
display(annotated_ground_truth)

	file_name	labels
0	0.png	9
1	1.png	2
2	2.png	1
3	3.png	1
4	4.png	6
5	5.png	1
6	6.png	4
7	7.png	6
8	8.png	5
9	9.png	7
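Before uploading, it may be worth checking that the annotated dataset and the image folder agree with each other. The following is a sketch only: check_annotations is a hypothetical helper, and the checks it runs are suggestions rather than validation rules taken from AI Verify.

```python
import os
import tempfile

import pandas as pd


def check_annotations(annotated, image_dir, name_column="file_name"):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # NaN values usually mean the two DataFrames were concatenated with different indexes
    if annotated.isna().any().any():
        problems.append("missing values (file names and labels may be misaligned)")
    if annotated[name_column].duplicated().any():
        problems.append("duplicate file names")
    on_disk = set(os.listdir(image_dir))
    listed = set(annotated[name_column].dropna())
    if listed - on_disk:
        problems.append(f"files not found in folder: {sorted(listed - on_disk)}")
    if on_disk - listed:
        problems.append(f"files without a label: {sorted(on_disk - listed)}")
    return problems


# Demonstration with three empty placeholder files standing in for images
with tempfile.TemporaryDirectory() as tmp:
    for name in ["0.png", "1.png", "2.png"]:
        open(os.path.join(tmp, name), "w").close()

    good = pd.DataFrame({"file_name": ["0.png", "1.png", "2.png"], "labels": [9, 2, 1]})
    bad = pd.DataFrame({"file_name": ["0.png", "1.png"], "labels": [9, 2]})

    good_problems = check_annotations(good, tmp)
    bad_problems = check_annotations(bad, tmp)

print(good_problems)
print(bad_problems)
```

In this guide's example, the call would be check_annotations(annotated_ground_truth, test_dir_path).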

3. Model Preparation (Example: Scikit-learn Pipeline)

To test image models with AI Verify, the model must likewise take a pandas DataFrame of image file paths as input. This means the model has to be a pipeline that loads and processes the images before passing them to the final estimator, as seen in the example below.

Step 1: Creating a DataFrame of directories

For the folders of images that you have on hand, convert them into pandas DataFrames with a column named 'image_directory' containing the file paths.

In this example, the user has a folder (./data/raw_fashion_image_train/) containing the images used for training the model.

train_dir_path = './data/raw_fashion_image_train/'
train_dirs = []

for i in sorted(os.listdir(train_dir_path), 
                key=lambda i: int(os.path.splitext(os.path.basename(i))[0])):
    train_dirs.append(train_dir_path + i)

train_df = pd.DataFrame(train_dirs,columns = ['image_directory'])

print("DataFrame for training dataset:")
display(train_df)

DataFrame for training dataset:

	image_directory
0	./data/raw_fashion_image_train/0.png
1	./data/raw_fashion_image_train/1.png
2	./data/raw_fashion_image_train/2.png
3	./data/raw_fashion_image_train/3.png
4	./data/raw_fashion_image_train/4.png
...	...
995	./data/raw_fashion_image_train/995.png
996	./data/raw_fashion_image_train/996.png
997	./data/raw_fashion_image_train/997.png
998	./data/raw_fashion_image_train/998.png
999	./data/raw_fashion_image_train/999.png

1000 rows × 1 columns

Step 2: Loading the training labels

In this example, the user has a saved file, 'pickle_pandas_fashion_mnist_train_labels.sav', containing the labels for the images in the training dataset above.

with open('./data/pickle_pandas_fashion_mnist_train_labels.sav', 'rb') as f:
    train_labels = pickle.load(f)
display(train_labels)

	0
0	9
1	0
2	0
3	3
4	0
...	...
995	7
996	3
997	3
998	9
999	8

1000 rows × 1 columns

Step 3: Training a custom pipeline

With the training dataset and labels prepared, you may now define and train a custom pipeline that loads the images from their file paths, processes them, and makes predictions with the final estimator.

import numpy as np
import pandas as pd
from PIL import Image

class imageProcessingStage():
    """Pipeline stage that loads images from file paths and flattens them."""

    def __init__(self, dir_column):
        # Name of the DataFrame column holding the image file paths
        self.dir_column = dir_column
    
    def transform(self, X, y=None):
        """Load each image, divide its pixel values by 255 and flatten it into one row."""
        images = []
        # Expected image dimensions; reshape fails for images of any other shape
        height, width, channel = 100, 100, 3
        X_ = X.copy()
        for image_path in X_[self.dir_column]:
            image_array = np.array(Image.open(image_path)) / 255.
            image_array = image_array.reshape(height*width*channel)
            images.append(image_array)
        return pd.DataFrame(images)

    def fit(self, X, y=None):
        # Stateless stage: nothing is learned from the training data
        return self
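The transform above hard-codes 100 x 100 x 3, so reshape raises a ValueError for any image with a different size or number of channels. If your images are not already uniform, one option is to normalise them while loading. The sketch below is an optional variation, not part of the original pipeline; load_image_vector is a hypothetical helper, and resizing changes the pixel data, so the same step would need to be applied to both training and test images.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image


def load_image_vector(path, width=100, height=100):
    """Return one image as a flat float vector of length height * width * 3."""
    with Image.open(path) as img:
        # Force three channels and a fixed size so that every row has the same length
        resized = img.convert("RGB").resize((width, height))
        pixels = np.asarray(resized).astype(float) / 255.0
    return pixels.reshape(height * width * 3)


with tempfile.TemporaryDirectory() as tmp:
    # A 28x28 grayscale image cannot be reshaped to 100*100*3 without conversion
    sample_path = Path(tmp) / "0.png"
    Image.new("L", (28, 28), color=128).save(sample_path)
    vector = load_image_vector(sample_path)

print(vector.shape)
```

Inside imageProcessingStage.transform, a helper like this would replace the two lines that open and reshape each image.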

pipe = Pipeline([
    ('preprocess images', imageProcessingStage(dir_column = 'image_directory')),
    ('model',  LogisticRegression())])

Training the pipeline:

# Flatten the single-column label DataFrame into the 1-D array scikit-learn expects
pipe.fit(train_df, train_labels.values.ravel())

Pipeline(steps=[('preprocess images',
                 <__main__.imageProcessingStage object at 0x00000219E05F6DA0>),
                ('model', LogisticRegression())])
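It can help to confirm that a fitted pipeline accepts the same kind of input AI Verify supplies after upload: a DataFrame whose only column, 'image_directory', holds file paths. The self-contained sketch below illustrates that contract on synthetic data. FlattenImages and make_images are hypothetical stand-ins, the 8 x 8 images are generated on the fly, and inheriting from BaseEstimator and TransformerMixin is a choice made here rather than something the original class does.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from PIL import Image
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FlattenImages(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for imageProcessingStage: file paths in, pixel rows out."""

    def __init__(self, dir_column):
        self.dir_column = dir_column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        rows = []
        for path in X[self.dir_column]:
            with Image.open(path) as img:
                rows.append(np.asarray(img).astype(float).ravel() / 255.0)
        return pd.DataFrame(rows)


def make_images(folder, count):
    """Write tiny 8x8 grayscale images: even-numbered are black, odd-numbered are white."""
    paths = []
    for i in range(count):
        path = Path(folder) / f"{i}.png"
        Image.new("L", (8, 8), color=255 * (i % 2)).save(path)
        paths.append(str(path))
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = make_images(tmp, 6)
    labels = [i % 2 for i in range(6)]

    # Same form of input as the uploaded folder: one 'image_directory' column of paths
    df = pd.DataFrame(paths, columns=["image_directory"])

    sketch_pipe = Pipeline([
        ("preprocess images", FlattenImages(dir_column="image_directory")),
        ("model", LogisticRegression()),
    ])
    sketch_pipe.fit(df, labels)
    predictions = sketch_pipe.predict(df)

print(list(predictions))
```

For the pipeline in this guide, the equivalent check would be to build a DataFrame of the test image paths under an 'image_directory' column and call pipe.predict on it.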

Save the trained pipeline:

with open('pipeline/multiclass_classification_image_mnist_fashion/fashion_mnist_lr_pipeline.sav', 'wb') as f:
    pickle.dump(pipe, f)

To test this model, upload a model folder containing:

A Python file defining the custom classes used in the pipeline (i.e. imageProcessingStage in this example, saved as 'fashionCustomClass.py'). Tip: Remember to include the relevant library imports.
The trained pipeline file (i.e. 'fashion_mnist_lr_pipeline.sav' in this example)

An example of a pipeline model folder structure:

└── multiclass_classification_image_mnist_fashion
    ├── fashion_mnist_lr_pipeline.sav
    └── fashionCustomClass.py

In summary, for this example, users may upload the following for testing:

Data: 'data/raw_fashion_image_10'
Ground Truth Dataset / Annotated Ground Truth Path: 'data/pickle_pandas_fashion_mnist_annotated_labels_10.sav'; Select Ground Truth: labels; Name of column containing image file names: file_name
Model: 'pipeline/multiclass_classification_image_mnist_fashion' (the model should be uploaded as a folder, as it is a pipeline)

Alternatively, users can also test fairness on images with our sample data and model. The data, ground truth dataset and model can be found here:

Data: 'data/small_test'
Ground Truth Dataset / Annotated Ground Truth Path: 'data/pickle_pandas_annotated_labels_50.sav'; Select Ground Truth: gender; Name of column containing image file names: image_directory
Model: 'pipeline/bc_image_face'
Sensitive Feature: race