Ore Mining Analysis

Data Source

This is an open dataset about an iron ore mining process, published on Kaggle by a single provider. The columns describe the process variables used in a flotation plant to purify the iron ore flow by removing its main impurity, silica. For the full description, see: Ore mining

Objectives

The main goal is to use this data to predict how much impurity is in the ore concentrate. Since this impurity is only measured every hour, predicting the silica (impurity) content of the concentrate gives engineers early information to act on. They can then take corrective action in advance (reducing the impurity where needed) and also help the environment, since lowering the silica in the concentrate reduces the amount of ore that goes to tailings.

Data workflow

Here are the main components I added for this project:

  1. Memory Saving
  2. Timeseries Data Integrity
  3. Exploratory Data Analysis (EDA)
  4. ML model (LightGBM)

Memory Saving

This notebook was originally implemented and experimented with on Kaggle, since that is where the data lives and it makes quick testing and exploration easy. For a dataset this size it is therefore worth reducing memory usage: it guards against running out of RAM and speeds up processing when the notebook is run on Kaggle.

import numpy as np
import pandas as pd

def reduce_mem_usage(df, verbose=True):
    # Downcast each numeric column to the smallest dtype that can represent its value range
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # Values that would fit in float16 are kept as float32 to avoid losing precision
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float32)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
df = pd.read_csv('/kaggle/input/quality-prediction-in-a-mining-process/MiningProcess_Flotation_Plant_Database.csv',decimal=",",parse_dates=["date"],infer_datetime_format=True)
df = reduce_mem_usage(df)
Mem. usage decreased to 70.33 Mb (47.9% reduction)
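
As a quick check (not part of the original output), the resulting dtypes can be inspected to confirm the downcasting:

# How many columns ended up in each downcast dtype
df.dtypes.value_counts()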

Features outline

df.head()
date %_Iron_Feed %_Silica_Feed Starch_Flow Amina_Flow Ore_Pulp_Flow Ore_Pulp_pH Ore_Pulp_Density Flotation_Column_01_Air_Flow Flotation_Column_02_Air_Flow Flotation_Column_03_Air_Flow Flotation_Column_04_Air_Flow Flotation_Column_05_Air_Flow Flotation_Column_06_Air_Flow Flotation_Column_07_Air_Flow Flotation_Column_01_Level Flotation_Column_02_Level Flotation_Column_03_Level Flotation_Column_04_Level Flotation_Column_05_Level Flotation_Column_06_Level Flotation_Column_07_Level %_Iron_Concentrate %_Silica_Concentrate
0 2017-03-10 01:00:00 55.200001 16.98 3019.530029 557.434021 395.713013 10.0664 1.74 249.214005 253.235001 250.576004 295.096008 306.399994 250.225006 250.884003 457.395996 432.962006 424.954010 443.558014 502.255005 446.369995 523.343994 66.910004 1.31
1 2017-03-10 01:00:00 55.200001 16.98 3024.409912 563.965027 397.382996 10.0672 1.74 249.718994 250.531998 250.862000 295.096008 306.399994 250.136993 248.994003 451.890991 429.559998 432.938995 448.085999 496.363007 445.921997 498.075012 66.910004 1.31
2 2017-03-10 01:00:00 55.200001 16.98 3043.459961 568.054016 399.667999 10.0680 1.74 249.740997 247.873993 250.313004 295.096008 306.399994 251.345001 248.070999 451.239990 468.927002 434.609985 449.687988 484.411011 447.825989 458.566986 66.910004 1.31
3 2017-03-10 01:00:00 55.200001 16.98 3047.360107 568.664978 397.938995 10.0689 1.74 249.917007 254.487000 250.048996 295.096008 306.399994 250.421997 251.147003 452.441010 458.165009 442.864990 446.209991 471.411011 437.690002 427.669006 66.910004 1.31
4 2017-03-10 01:00:00 55.200001 16.98 3033.689941 558.166992 400.253998 10.0697 1.74 250.203003 252.136002 249.895004 295.096008 306.399994 249.983002 248.927994 452.441010 452.899994 450.523010 453.670013 462.597992 443.682007 425.678986 66.910004 1.31
  • Goal is to predict % Silica Concentrate
  • Silica Concentrate is the impurity in the iron ore which needs to be removed
  • The value of modeling this target is that the impurity takes at least an hour to measure, so a prediction can be available much earlier

Timeseries Data Integrity

For this task the data clearly needs to be treated as a time series: the measurements are continuous in time, and the objective stated above is to predict the future impurity trend from the current state of the process.

# Set datetime column as index
dt_df = df.set_index('date')
# Find hourly timestamps that are missing from the data
all_hours = pd.Series(data=pd.date_range(start=dt_df.index.min(), end=dt_df.index.max(), freq='H'))
mask = all_hours.isin(dt_df.index.values)
all_hours[~mask]
149   2017-03-16 06:00:00
150   2017-03-16 07:00:00
151   2017-03-16 08:00:00
152   2017-03-16 09:00:00
153   2017-03-16 10:00:00
              ...        
462   2017-03-29 07:00:00
463   2017-03-29 08:00:00
464   2017-03-29 09:00:00
465   2017-03-29 10:00:00
466   2017-03-29 11:00:00
Length: 318, dtype: datetime64[ns]

Therefore, we keep only the data from 2017-03-29 12:00:00 onward.

dt_df = dt_df.loc["2017-03-29 12:00:00":]

According to the dataset's publisher, most variables are sampled every 20 seconds. If that is correct, there should be 180 measurements for each hourly timestamp.

hourly_counts = dt_df.groupby(dt_df.index).count()["%_Silica_Concentrate"]
hourly_counts.value_counts()
180    3947
179       1
Name: %_Silica_Concentrate, dtype: int64
hourly_counts[hourly_counts < 180]
date
2017-04-10    179
Name: %_Silica_Concentrate, dtype: int64
day = dt_df.loc['2017-04-10']
day.groupby(day.index).count()['%_Silica_Concentrate']
date
2017-04-10 00:00:00    179
2017-04-10 01:00:00    180
2017-04-10 02:00:00    180
2017-04-10 03:00:00    180
2017-04-10 04:00:00    180
2017-04-10 05:00:00    180
2017-04-10 06:00:00    180
2017-04-10 07:00:00    180
2017-04-10 08:00:00    180
2017-04-10 09:00:00    180
2017-04-10 10:00:00    180
2017-04-10 11:00:00    180
2017-04-10 12:00:00    180
2017-04-10 13:00:00    180
2017-04-10 14:00:00    180
2017-04-10 15:00:00    180
2017-04-10 16:00:00    180
2017-04-10 17:00:00    180
2017-04-10 18:00:00    180
2017-04-10 19:00:00    180
2017-04-10 20:00:00    180
2017-04-10 21:00:00    180
2017-04-10 22:00:00    180
2017-04-10 23:00:00    180
Name: %_Silica_Concentrate, dtype: int64

One 20-second record is missing from the 2017-04-10 00:00:00 hour. We fill it by duplicating the latest available row and appending the copy under that same hourly timestamp, so that every hour again contains 180 rows.

df_before = dt_df.copy().loc[:'2017-04-10 00:00:00']
df_after = dt_df.copy().loc['2017-04-10 01:00:00':]
new_date = pd.to_datetime('2017-04-10 00:00:00')
new_data = pd.DataFrame(df_before[-1:].values, index=[new_date], columns=df_before.columns)
df_before = pd.concat([df_before,new_data],axis=0)

dt_df = pd.concat([df_before, df_after])
dt_df = dt_df.reset_index(drop=True)  # assign back so the integer index is actually applied

dt_df["duration"] = 20
dt_df.loc[0,"duration"] = 0
dt_df.duration = dt_df.duration.cumsum()

dt_df['Date_with_seconds'] = pd.Timestamp("2017-03-29 12:00:00") + pd.to_timedelta(dt_df['duration'], unit='s')

dt_df = dt_df.set_index("Date_with_seconds")
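
As a sanity check on the rebuilt index (my addition), consecutive timestamps should now be exactly 20 seconds apart with no gaps:

# Every consecutive pair of timestamps should differ by exactly 20 seconds
(dt_df.index.to_series().diff().dropna() == pd.Timedelta(seconds=20)).all()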

EDA

Correlation Heatmap

fig = plt.figure(figsize=(18,18))

sns.heatmap(dt_df.corr(), annot=True, cmap='viridis')

(Figure: correlation heatmap of all features)

  • First, the % silica concentrate is negatively correlated with the % iron concentrate, which makes sense: they are the desired and undesired substances of the process (purity and impurity).
  • Secondly, the air flow in column 1 is generally positively correlated with the air flows in columns 2-7, except for column 5, which raises a question worth digging into. We can also see that the larger the air flow, the lower the flotation level in the corresponding column.
  • Thirdly, the heatmap suggests that ore pulp flow and starch flow have little influence on the final product (% iron and % silica concentrate). On top of the fact that both are positively correlated with the air flows measured in the columns, we can verify the underlying collinearity via the eigenvalues of the correlation matrix (a quick check is sketched below).
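
A minimal sketch of that eigenvalue check (my addition), working on the correlation matrix of the process variables after dropping the two concentrate columns and the helper duration column; eigenvalues near zero (equivalently, a large condition number) indicate collinear feature groups:

# Eigenvalues of the feature correlation matrix; values close to zero signal collinearity
feat_corr = dt_df.drop(columns=["%_Iron_Concentrate", "%_Silica_Concentrate", "duration"]).corr()
eigvals = np.sort(np.linalg.eigvalsh(feat_corr))
print(eigvals[:5])               # smallest eigenvalues
print(eigvals[-1] / eigvals[0])  # condition number of the correlation matrix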

Feature Histograms: distributions

plt.figure(figsize=(20,20),dpi=200)
for i , n in enumerate(dt_df.columns.to_list()):
    plt.subplot(6,4,i+1)
    ax = sns.histplot(data=dt_df, x=n, kde=False, bins=20)#, multiple="stack")
    plt.title(f"Histograma {n}", fontdict={"fontsize":14})
    plt.xlabel("")
    plt.ylabel(ax.get_ylabel(), fontdict={"fontsize":12})
    if i not in [0,4,8,12,16,20,24]:
        plt.ylabel("")
    

plt.tight_layout();

(Figure: histograms of each feature)

Feature Boxplots: distribution

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

z = pd.DataFrame(scaler.fit_transform(dt_df), columns=dt_df.columns, index=dt_df.index)
z = z.melt()

plt.figure(figsize=(14,6),dpi=100)
sns.boxplot(x=z["variable"], y=z["value"]);
plt.xticks(rotation=90);
plt.xlabel("");
plt.title("Boxplots Univariados");

(Figure: univariate boxplots of the standardized features)

Timeseries Trend and Seasonality

from statsmodels.tsa.seasonal import seasonal_decompose

df_h = dt_df.resample('H').first()
df_h.index.names = ['Date']
result = seasonal_decompose(df_h["%_Silica_Concentrate"][:300])

def decompose_plot(result):

    fig,axes = plt.subplots(ncols=1,nrows=4,figsize=(12,7))

    axes[0].plot(result.observed.index, result.observed.values, color='tab:cyan')
    axes[0].set_ylabel("Observed",fontdict={"size":10,},labelpad=10)
    axes[0].tick_params(axis="x", rotation=45)

    axes[1].plot(result.trend.index, result.trend.values, color='tab:cyan')
    axes[1].set_ylabel("Trend",fontdict={"size":10,},labelpad=10)

    axes[2].plot(result.seasonal.index, result.seasonal.values, color='tab:cyan')
    axes[2].set_ylabel("Seasonality",fontdict={"size":10,},labelpad=10)

    axes[3].plot(result.resid.index, result.resid.values, color='tab:cyan')
    axes[3].set_ylabel("Residuals",fontdict={"size":10,},labelpad=10)
    
    for n in range(0,4):
        axes[n].autoscale(axis="both",tight=True)

    fig.suptitle("Seasonal Decompose",fontsize=15)
    plt.tight_layout()

decompose_plot(result)

(Figure: seasonal decomposition of %_Silica_Concentrate: observed, trend, seasonality, residuals)

Supervised Machine Learning with Time-Lagged Features

dt_df.columns
Index(['%_Iron_Feed', '%_Silica_Feed', 'Starch_Flow', 'Amina_Flow',
       'Ore_Pulp_Flow', 'Ore_Pulp_pH', 'Ore_Pulp_Density',
       'Flotation_Column_01_Air_Flow', 'Flotation_Column_02_Air_Flow',
       'Flotation_Column_03_Air_Flow', 'Flotation_Column_04_Air_Flow',
       'Flotation_Column_05_Air_Flow', 'Flotation_Column_06_Air_Flow',
       'Flotation_Column_07_Air_Flow', 'Flotation_Column_01_Level',
       'Flotation_Column_02_Level', 'Flotation_Column_03_Level',
       'Flotation_Column_04_Level', 'Flotation_Column_05_Level',
       'Flotation_Column_06_Level', 'Flotation_Column_07_Level',
       '%_Iron_Concentrate', '%_Silica_Concentrate', 'duration'],
      dtype='object')
list_cols = dt_df.columns.to_list()  # include all explanatory variables

list_cols.remove("%_Silica_Concentrate")
list_cols.remove("%_Iron_Concentrate")

# Resample the original df every 15 minutes.
df_15 = dt_df.resample("15min").first()
df_15 = df_15.drop("%_Iron_Concentrate", axis=1)

window_size = 3

# Taking advantage of the time factor, I add lagged explanatory variables at 15min, 30min, and 45min, respectively
# (these are the ones that were updated every 20s)

for col_name in list_cols:
    for i in range(window_size):
        df_15[f"{col_name} ({-15*(i+1)}mins)"] = df_15[f"{col_name}"].shift(periods=i+1)

# Resample from 15min to 1hour
df_h = df_15.resample('H').first()

# Here I change the name of my index.
df_h.index.names = ['Date']
# Add lagged values of the target variable as explanatory variables, lagged by 1, 2, and 3 hours respectively.
col_name = "%_Silica_Concentrate"
window_size = 3

for i in range(window_size):
    df_h[f"{col_name} ({-i-1}h)"] = df_h[f"{col_name}"].shift(periods=i+1)

# Make the target variable the first column.
x = df_h.columns.to_list()
x.remove("%_Silica_Concentrate")
x.insert(0, "%_Silica_Concentrate")
df_h = df_h[x]

df_h = df_h.dropna()  # Remove rows with missing values due to the shift().
df_h = df_h.astype("float32")  # Convert data to float32 for faster computation.
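
As a quick bookkeeping check on the feature engineering above (my addition): df_15 starts with 23 columns (the 24 columns of dt_df minus %_Iron_Concentrate), the 22 lagged variables contribute 22 × 3 extra columns, and the target gains 3 hourly lags, so df_h should end up with 23 + 66 + 3 = 92 columns.

# Expected width: 23 base columns + 22*3 lagged 15-min features + 3 lagged hourly targets = 92
print(df_h.shape)
print([c for c in df_h.columns if "(-15mins)" in c][:3])   # a few of the generated lag columns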

I create performance-reporting functions for the model, a tracker to record metrics across experiments, and a function that performs the train/test split while respecting temporal order.

Helper Functions

def plot_time_series(timesteps, values, start=0, end=None, label=None):
    """
    Plots timesteps (a series of points in time) against values (a series of values across timesteps).

    Parameters
    ----------
    timesteps : array of timestep values
    values : array of values across time
    start : where to start the plot (setting a value will index from start of timesteps & values)
    end : where to end the plot (similar to start but for the end)
    label : label to show on plot about values, default None 
    """
  # Plot the series
    sns.lineplot(x=timesteps[start:end], y=values[start:end], label=label)
    plt.xlabel("Date")
    plt.ylabel("% Silica Concentrate")
    if label:
        plt.legend(fontsize=14) # make label bigger
    plt.grid(True)

def timeseries_report_model(y_test, model_preds, tracker_df="none", model_name="model_unknown", seasonality=1, naive=False):
    mae = round(mean_absolute_error(y_test, model_preds), 4)
    rmse = round(mean_squared_error(y_test, model_preds) ** 0.5, 4)
    mase = round(mean_absolute_scaled_error(y_test, model_preds, seasonality, naive),4)
    r2 = round(r2_score(y_test, model_preds), 4)
    mape = round(mean_absolute_percentage_error(y_test, model_preds), 4)

    print("MAE: ", mae)
    print("RMSE :", rmse)
    print("MASE :", mase)
    print("R2 :", r2)
    print("MAPE :", mape)

    if isinstance(tracker_df, pd.core.frame.DataFrame):
        tracker_df.loc[tracker_df.shape[0]] = [
            model_name, mae, rmse, mase, r2, mape]
    else:
        pass
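
The reporting function above relies on two helpers that are defined elsewhere in the notebook and not shown in this section: timeseries_models_tracker_df and mean_absolute_scaled_error. A minimal sketch of what they might look like, assuming a standard MASE computed against a one-step (seasonal) naive forecast on the same series:

def timeseries_models_tracker_df():
    # Empty tracker; timeseries_report_model appends one row of metrics per experiment
    return pd.DataFrame(columns=["model", "MAE", "RMSE", "MASE", "R2", "MAPE"])

def mean_absolute_scaled_error(y_true, y_pred, seasonality=1, naive=False):
    # MAE of the model scaled by the MAE of a seasonal naive forecast on the same series;
    # the naive flag is kept only to match the call signature used later
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    naive_mae = np.mean(np.abs(y_true[seasonality:] - y_true[:-seasonality]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae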

LightGBM Model

# Make the train/test splits
def make_train_test_splits(windows, labels, test_split=0.2):
    """
    Splits matching pairs of windows and labels into train and test splits.
    """
    split_size = int(len(windows) * (1-test_split)) # this will default to 80% train/20% test
    train_windows = windows[:split_size]
    train_labels = labels[:split_size]
    test_windows = windows[split_size:]
    test_labels = labels[split_size:]
    return train_windows, test_windows, train_labels, test_labels
df_forecast = df_h.copy()
df_forecast.shape
(3946, 92)
tracker = timeseries_models_tracker_df()
X = df_forecast.drop("%_Silica_Concentrate", axis=1)
y = df_forecast["%_Silica_Concentrate"]
train_windows, test_windows, train_labels, test_labels = make_train_test_splits(X,y,test_split=0.1)
len(train_windows), len(test_windows), len(train_labels), len(test_labels)
(3551, 395, 3551, 395)
warnings.filterwarnings("ignore")
def objective(trial, X, y):
    param_grid = {
        "random_state": 123,
        "verbosity": -1,
        "boosting_type": trial.suggest_categorical("boosting_type", ['gbdt']),#,"goss"]),
        "device_type": trial.suggest_categorical("device_type", ['cpu']),
        "n_estimators": trial.suggest_int("n_estimators", 25, 10000,step=100), #for large datasets this should be very high
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 10, 190), # Also change this for large datasets, should be small to avoid overfitting
        "max_depth": trial.suggest_int("max_depth", 2, 80),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 800, step=20), #modify this for large datasets, causes overfittin if too low
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0001, 1000, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0001, 1000, log=True),
        #"min_split_gain": trial.suggest_float("min_split_gain", 0, 3),
        "subsample": trial.suggest_float("subsample", 0.05, 1, step=0.05),
        "subsample_freq": trial.suggest_categorical("subsample_freq", [0,1]), #[0,1] bagging
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.3, 1, step=0.1),
    }
    
    # Note: min_child_samples corresponds to LightGBM's min_data_in_leaf

    # RMSE (Root Mean Squared Error)
    def rmse(y_val,y_pred):
        is_higher_better = False
        name = "rmse"
        value = mean_squared_error(y_val,y_pred, squared=False)
        return name, value, is_higher_better
    
    cv_scores = np.empty(5)
    
    tscv = TimeSeriesSplit(n_splits=5, max_train_size=None)
    
    for idx , (train_index, val_index) in enumerate(tscv.split(X)):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        

        model = LGBMRegressor(objective="regression", silent=True,**param_grid)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_val, y_val)],
            eval_metric=rmse,
            early_stopping_rounds=100,
            categorical_feature="auto",   # cat_idx, #specifiy categorial features.
            callbacks=[LightGBMPruningCallback(trial, "l2", report_interval=20)],  # Add a pruning callback
            verbose=0)

        preds = model.predict(X_val)
        # Use a new variable name so the custom rmse() eval function is not shadowed on later folds
        fold_rmse = mean_squared_error(y_val, preds, squared=False)

        cv_scores[idx] = fold_rmse

    
    return np.mean(cv_scores)
study = optuna.create_study(direction="minimize", study_name="LGBM Regressor",sampler=TPESampler(),
                            pruner=optuna.pruners.PercentilePruner(50, n_startup_trials=5, n_warmup_steps=50))

#study.enqueue_trial(try_this_first)
func = lambda trial: objective(trial, train_windows, train_labels)
study.optimize(func, n_trials=100)
[I 2023-04-14 20:06:45,615] A new study created in memory with name: LGBM Regressor
[I 2023-04-14 20:06:46,696] Trial 0 finished with value: 0.8109004958352836 and parameters: {'boosting_type': 'gbdt', 'device_type': 'cpu', 'n_estimators': 8325, 'learning_rate': 0.2136801933659405, 'num_leaves': 122, 'max_depth': 24, 'min_child_samples': 290, 'reg_alpha': 0.4089594656291566, 'reg_lambda': 233.41962530839834, 'subsample': 0.55, 'subsample_freq': 1, 'colsample_bytree': 0.5}. Best is trial 0 with value: 0.8109004958352836.
[I 2023-04-14 20:06:47,790] Trial 1 finished with value: 0.8149075483133782 and parameters: {'boosting_type': 'gbdt', 'device_type': 'cpu', 'n_estimators': 2625, 'learning_rate': 0.20374443271553472, 'num_leaves': 24, 'max_depth': 65, 'min_child_samples': 550, 'reg_alpha': 0.00011716269934726968, 'reg_lambda': 0.005714841791728027, 'subsample': 0.45, 'subsample_freq': 0, 'colsample_bytree': 0.8}. Best is trial 0 with value: 0.8109004958352836.
......
[I 2023-04-14 20:08:36,711] Trial 99 finished with value: 0.718496606821031 and parameters: {'boosting_type': 'gbdt', 'device_type': 'cpu', 'n_estimators': 3425, 'learning_rate': 0.25476699329382246, 'num_leaves': 78, 'max_depth': 44, 'min_child_samples': 30, 'reg_alpha': 0.006442917820858156, 'reg_lambda': 0.9818132921273436, 'subsample': 0.45, 'subsample_freq': 1, 'colsample_bytree': 0.8}. Best is trial 71 with value: 0.7022349235029833.
# Train again using the optimized hyperparameters
params = study.best_params
model = LGBMRegressor(objective="regression",random_state=123,**params)
model.fit(train_windows, train_labels)
LGBMRegressor(colsample_bytree=0.8, device_type='cpu',
              learning_rate=0.2514458173401127, max_depth=48,
              min_child_samples=90, n_estimators=2125, num_leaves=128,
              objective='regression', random_state=123,
              reg_alpha=0.0042343839479169615, reg_lambda=8.01369798587411,
              subsample=0.6500000000000001, subsample_freq=1)
preds = model.predict(test_windows)
timeseries_report_model(test_labels, preds, tracker, model_name="Experiment , LightGBM",
                        seasonality=1, naive=False)
MAE:  0.6045
RMSE : 0.8177
MASE : 1.1928
R2 : 0.5233
MAPE : 0.2752
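
Since MASE compares against a naive forecast, a useful reference point (my addition, not in the original notebook) is a persistence baseline that simply reuses the one-hour-lagged target, which is already present as a feature column:

# Persistence baseline: predict this hour's % silica with last hour's measured value
naive_preds = test_windows["%_Silica_Concentrate (-1h)"]
timeseries_report_model(test_labels, naive_preds, tracker, model_name="Naive baseline (t-1h)")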
plot_importance(model, max_num_features = 20, height=.9, importance_type="gain", figsize=(14,8));

(Figure: top 20 feature importances by gain)

plt.figure(figsize=(16,8),dpi=100)
plot_time_series(test_labels.index, test_labels, label="Real Values", start=100)
plot_time_series(test_labels.index, preds,label="LightGBM", start=100)

(Figure: real vs. predicted % silica concentrate on the test set)