PKBoost
An Adaptive Gradient Boosting Library
Gradient boosting that adjusts to concept drift in imbalanced data.
Built from scratch in Rust, PKBoost handles shifting data distributions in fraud detection at a 0.2% fraud rate, degrading by less than 2% under drift, compared with a 31.8% drop for XGBoost and a 42.5% drop for LightGBM. Without drift, it outperforms XGBoost by 10-18% on standard benchmark datasets. It uses information theory (Shannon entropy) and Newton-Raphson optimization to detect shifts in rare-event distributions and trigger an adaptive "metamorphosis" for real-time recovery.
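To give a feel for the entropy-based drift signal described above, here is a minimal standalone sketch, not PKBoost's internal implementation: the function names, window scheme, and tolerance are all illustrative. The idea is that a shift in the positive-class rate of a label stream shows up as a change in the Shannon entropy of its distribution.

```rust
// Illustrative only: not PKBoost's internal API.
// Shannon entropy of a Bernoulli distribution with positive rate p.
fn binary_entropy(p: f64) -> f64 {
    if p <= 0.0 || p >= 1.0 {
        0.0
    } else {
        -(p * p.log2() + (1.0 - p) * (1.0 - p).log2())
    }
}

/// Hypothetical drift check: compare the entropy of a recent label
/// window against a baseline positive rate.
fn drift_detected(baseline_pos_rate: f64, window: &[u8], tolerance: f64) -> bool {
    let pos = window.iter().filter(|&&y| y == 1).count() as f64;
    let rate = pos / window.len() as f64;
    (binary_entropy(rate) - binary_entropy(baseline_pos_rate)).abs() > tolerance
}

fn main() {
    // Baseline: 0.2% positives, as in the fraud setting above.
    let baseline = 0.002;
    // A window where positives jumped to 5%: entropy rises sharply.
    let mut window = vec![0u8; 95];
    window.extend(vec![1u8; 5]);
    println!("drift: {}", drift_detected(baseline, &window, 0.1));
}
```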
"Most boosting libraries overlook concept drift. PKBoost identifies it and evolves to persist."
Perfect for: Streaming fraud detection, real-time medical monitoring, anomaly detection in changing environments, or any scenario where data evolves over time and positive instances are rare.
🚀 Quick Start
To use PKBoost from Python, see PKBoost Python or install it via `pip install pkboost`.

Clone the repository and build:

```bash
git clone https://github.com/Pushp-Kharat1/pkboost.git
cd pkboost
cargo build --release
```

Run the benchmark:

```bash
# 1. Inspect the included sample data (already in data/)
ls data/  # should show creditcard_train.csv, creditcard_val.csv, etc.

# 2. Run the benchmark
cargo run --release --bin benchmark
```

💻 Basic Usage
```rust
use pkboost::*;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Load CSVs with headers: feature1,feature2,...,Class
    let (x_train, y_train) = load_csv("train.csv")?;
    let (x_val, y_val) = load_csv("val.csv")?;
    let (x_test, y_test) = load_csv("test.csv")?;

    // Auto-configure based on data characteristics
    let mut model = OptimizedPKBoostShannon::auto(&x_train, &y_train);

    // Train with early stopping on the validation set
    model.fit(
        &x_train,
        &y_train,
        Some((&x_val, &y_val)), // optional validation set
        true,                   // verbose output
    )?;

    // Predict probabilities (not hard class labels)
    let test_probs = model.predict_proba(&x_test)?;

    // Evaluate
    let pr_auc = calculate_pr_auc(&y_test, &test_probs);
    println!("PR-AUC: {:.4}", pr_auc);
    Ok(())
}
```

✨ Key Features
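PKBoost ships its own `load_csv`, which is not reproduced here. If you want to see the expected data shape, the following std-only sketch loads a `feature1,...,Class` file into feature rows and a label vector; the function name, return types, and last-column-is-label convention are assumptions for illustration.

```rust
use std::error::Error;
use std::fs;

/// Minimal sketch of a loader for "f1,f2,...,Class" files.
/// Assumes the label is the last column; returns (features, labels).
fn load_csv_simple(path: &str) -> Result<(Vec<Vec<f64>>, Vec<f64>), Box<dyn Error>> {
    let text = fs::read_to_string(path)?;
    let mut x = Vec::new();
    let mut y = Vec::new();
    for line in text.lines().skip(1) { // skip the header row
        if line.trim().is_empty() {
            continue;
        }
        let vals: Vec<f64> = line
            .split(',')
            .map(|v| v.trim().parse())
            .collect::<Result<_, _>>()?;
        let (label, feats) = vals.split_last().ok_or("empty row")?;
        x.push(feats.to_vec());
        y.push(*label);
    }
    Ok((x, y))
}

fn main() -> Result<(), Box<dyn Error>> {
    // Tiny demo file so the sketch runs standalone.
    fs::write("demo.csv", "f1,f2,Class\n1.0,2.0,0\n3.5,4.5,1\n")?;
    let (x, y) = load_csv_simple("demo.csv")?;
    println!("{} rows, {} features, labels {:?}", x.len(), x[0].len(), y);
    Ok(())
}
```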
- Extreme Imbalance Handling: Automatic class weighting and MI regularization boost recall on rare positives without reducing precision. Binary classification only.
- Adaptive Hyperparameters: `auto_tune_principled` profiles your dataset and selects parameters automatically; no manual tuning needed.
- Histogram-Based Trees: Optimized binning with medians for missing values; supports up to 32 bins per feature for fast splits.
- Parallelism & Efficiency: Rayon-based adaptive parallelism detects the available hardware and scales work thresholds dynamically, with efficient batching for large datasets.
- Adaptation Mechanisms: `AdversarialLivingBooster` monitors vulnerability scores to detect drift and trigger retraining, and prunes unused features via "metabolism" tracking.
- Metrics Built-In: PR-AUC, ROC-AUC, F1@0.5, and threshold optimization are available out-of-the-box.
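The threshold optimization mentioned above can be sketched as a simple F1 sweep: try a grid of candidate probability cutoffs on a labeled validation set and keep the one that maximizes F1. This standalone sketch is illustrative only; the function name and grid resolution are assumptions, not PKBoost's built-in API.

```rust
/// Illustrative sketch (not PKBoost's API): pick the probability
/// threshold that maximizes F1 on a labeled validation set.
fn best_f1_threshold(y_true: &[u8], probs: &[f64]) -> (f64, f64) {
    let mut best = (0.5, 0.0); // (threshold, F1)
    // Sweep a coarse grid of candidate thresholds.
    for i in 1..100 {
        let t = i as f64 / 100.0;
        let (mut tp, mut fp, mut fn_) = (0.0, 0.0, 0.0);
        for (&y, &p) in y_true.iter().zip(probs) {
            match (p >= t, y == 1) {
                (true, true) => tp += 1.0,
                (true, false) => fp += 1.0,
                (false, true) => fn_ += 1.0,
                _ => {}
            }
        }
        let f1 = if tp == 0.0 { 0.0 } else { 2.0 * tp / (2.0 * tp + fp + fn_) };
        if f1 > best.1 {
            best = (t, f1);
        }
    }
    best
}

fn main() {
    let y = [0u8, 0, 0, 1, 1];
    let p = [0.1, 0.2, 0.4, 0.35, 0.9];
    let (t, f1) = best_f1_threshold(&y, &p);
    println!("best threshold {:.2}, F1 {:.3}", t, f1);
}
```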