
Awesome ML Privacy Mitigation

A curated list of practical privacy-preserving techniques for machine learning

This repository aims to bridge the gap between theoretical privacy research and practical implementation in machine learning. Unlike other resources that only provide high-level overviews, we focus on actionable techniques with code examples, specific parameter recommendations, and realistic privacy-utility trade-offs.

Contents

Introduction

Machine learning systems increasingly handle sensitive data, making privacy protection essential. Building on the NIST Adversarial Machine Learning Taxonomy (2025), this repository provides implementation-focused guidance for ML practitioners.

Primary goals:

1. Data Collection Phase

1.1 Data Minimization

Description:

NIST AML Attack Mappings:

Why It Matters for ML:

Implementation Approach:
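
A minimal sketch of one common implementation route, feature selection in the spirit of [7]: score candidate features against the prediction task and retain only the most informative ones, so that less personal data has to be collected and stored in the first place. The dataset, model, and choice of k below are illustrative assumptions, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative dataset; substitute your own (potentially sensitive) data
X, y = load_breast_cancer(return_X_y=True)

# Keep only the k most informative features (k is a tunable assumption)
k = 10
selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_min = selector.fit_transform(X, y)

# Compare utility before and after minimization
full_acc = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()
min_acc = cross_val_score(LogisticRegression(max_iter=5000), X_min, y, cv=5).mean()
print(f"All {X.shape[1]} features: {full_acc:.3f} | top-{k} features: {min_acc:.3f}")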

Algorithms and Tools:

Utility/Privacy Trade-off:

Important Considerations:

References:

[1] The Data Minimization Principle in Machine Learning (Ganesh et al., 2024) / Blog - Empirical exploration of data minimization and its misalignment with privacy, along with potential solutions

[2] Data Minimization for GDPR Compliance in Machine Learning Models (Goldsteen et al., 2022) - Method to reduce personal data needed for ML predictions while preserving model accuracy through knowledge distillation

[3] From Principle to Practice: Vertical Data Minimization for Machine Learning (Staab et al., 2023) - Comprehensive framework for implementing data minimization in machine learning with data generalization techniques

[4] Data Shapley: Equitable Valuation of Data for Machine Learning (Ghorbani & Zou, 2019) - Introduces method to quantify the value of individual data points to model performance, enabling systematic data reduction

[5] Algorithmic Data Minimization for ML over IoT Data Streams (Kil et al., 2024) - Framework for minimizing data collection in IoT environments while balancing utility and privacy

[6] Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Pioneering work on membership inference attacks that can be used to audit privacy leakage in ML models

[7] Selecting critical features for data classification based on machine learning methods (Dewi et al., 2020) - Demonstrates that feature selection improves model accuracy and performance while reducing dimensionality

1.2 Synthetic Data Generation

Description:

NIST AML Attack Mappings:

Why It Matters for ML:

Implementation Approaches:

  1. Generative Adversarial Networks (GANs)
    • Mechanism: Two-network architecture where generator creates samples and discriminator evaluates authenticity
    • Best For: Complex, high-dimensional data including tabular, time-series, and images
    • Variants: CTGAN for tabular data [3], PATE-GAN for enhanced privacy guarantees [10]
    • Libraries:
  2. Variational Autoencoders (VAEs)
    • Mechanism: Encoder-decoder architecture with probabilistic latent space
    • Best For: Tabular data with mixed numerical and categorical variables
    • Variants: TVAE specifically designed for tabular data [4]
    • Libraries:
  3. Hybrid Approaches
    • Mechanism: Combines VAE’s encoding capabilities with GAN’s generation abilities
    • Best For: Applications requiring both high fidelity and enhanced privacy protection
    • Recent Advances: VAE-GAN models with improved membership inference resistance [5]
    • Libraries:
  4. Traditional Statistical Methods
    • Bayesian Networks: Model conditional dependencies between variables
    • Copula Methods: Capture complex correlation structures
    • SMOTE: Generate synthetic minority samples for imbalanced data
    • Libraries:

Critical Privacy Evaluation:

Important Considerations:

Implementation Example: see the sketch below (sample SDV tutorial notebooks and a privacy-auditing walkthrough using ML Privacy Meter are still to be added)
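
A minimal sketch using SDV's CTGANSynthesizer (assuming the SDV 1.x single-table API; the file path and epoch count are placeholders). Synthetic data generated this way should still be audited for leakage, e.g. with membership inference attacks [9].

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Load the real (sensitive) tabular dataset
real_data = pd.read_csv("patients.csv")  # placeholder path

# Describe the table so SDV knows the column types
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a CTGAN-based synthesizer and sample a synthetic table of the same size
synthesizer = CTGANSynthesizer(metadata, epochs=300)  # epoch count is illustrative
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))
synthetic_data.to_csv("patients_synthetic.csv", index=False)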

Best Practices:

References:

[1] Synthetic Data: Revisiting the Privacy-Utility Trade-off (Sarmin et al., 2024) - Analysis of privacy-utility trade-offs between synthetic data and traditional anonymization

[2] Machine Learning for Synthetic Data Generation: A Review (Zhao et al., 2023) - Comprehensive review of synthetic data generation techniques and their applications

[3] Modeling Tabular Data using Conditional GAN (Xu et al., 2019) - Introduces CTGAN, designed specifically for mixed-type tabular data generation

[4] Tabular and latent space synthetic data generation: a literature review (Garcia-Gasulla et al., 2023) - Review of data generation methods for tabular data

[5] Synthetic data for enhanced privacy: A VAE-GAN approach against membership inference attacks (Yan et al., 2024) - Novel hybrid approach combining VAE and GAN

[6] SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002) - Classic approach for generating synthetic samples for minority classes

[7] Empirical privacy metrics: the bad, the ugly… and the good, maybe? (Desfontaines, 2024) - Critical analysis of common empirical privacy metrics in synthetic data

[8] Challenges of Using Synthetic Data Generation Methods for Tabular Microdata (Winter & Tolan, 2023) - Empirical study of trade-offs in different synthetic data generation methods

[9] Privacy Auditing of Machine Learning using Membership Inference Attacks (Yaghini et al., 2021) - Framework for privacy auditing in ML models

[10] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (Jordon et al., 2019) - Integrates differential privacy into GANs using the PATE framework

[11] A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning (Domingo-Ferrer & Soria-Comas, 2022) - Analysis of privacy in ML including synthetic data approaches

[12] Protect and Extend - Using GANs for Synthetic Data Generation of Time-Series Medical Records (2024) - Application and evaluation of synthetic data in healthcare domain

Libraries:

2. Data Processing Phase

2.1 Local Differential Privacy (LDP)

NIST AML Attack Mappings:

Description:

Key Concepts:

Variants of Differential Privacy:

  1. Pure ε-Differential Privacy
    • Definition: The strictest form: the probability of any output changes by at most a factor of e^ε between neighboring datasets (see the noise-calibration sketch after this list)
    • Properties:
      • No probability of failure
      • Strict worst-case guarantees
      • Typically requires more noise than relaxed versions
    • Local Application: Randomized response, RAPPOR in high-privacy settings [4]
  2. Approximate (ε,δ)-Differential Privacy
    • Definition: Relaxes pure DP by allowing small probability δ of exceeding the privacy bound
    • Properties:
      • More practical for many applications
      • Allows δ probability of information leakage
      • Enables more efficient mechanisms
    • Local Application: Gaussian mechanism, discrete Laplace in local settings [5]
  3. Rényi Differential Privacy (RDP)
    • Definition: Based on Rényi divergence between output distributions
    • Properties:
      • Better handles composition of mechanisms
      • More precise accounting of privacy loss
      • Particularly useful for iterative algorithms
    • Local Application: Advanced LDP systems with multiple rounds of communication [6]
  4. Gaussian Differential Privacy (GDP)
    • Definition: Special form that connects DP to hypothesis testing
    • Properties:
      • Elegant handling of composition via central limit theorem
      • Natural framework for analyzing mechanisms with Gaussian noise
      • Tighter bounds than (ε,δ)-DP in many cases
    • Local Application: Modern private federated learning systems [7]
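
To make the first two variants concrete, the sketch below (all values illustrative) shows how noise is calibrated in each case: the Laplace mechanism with scale Δ₁/ε satisfies pure ε-DP, while the Gaussian mechanism with σ = Δ₂·√(2·ln(1.25/δ))/ε satisfies (ε, δ)-DP for ε < 1.

import numpy as np

def laplace_mechanism(value, sensitivity, epsilon):
    # Pure epsilon-DP: Laplace noise with scale sensitivity / epsilon
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta):
    # Approximate (epsilon, delta)-DP: classic calibration, valid for epsilon < 1
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + np.random.normal(loc=0.0, scale=sigma)

# Example: privatize a single value bounded in [0, 100] (sensitivity 100)
true_value = 42.0
print(laplace_mechanism(true_value, sensitivity=100, epsilon=1.0))
print(gaussian_mechanism(true_value, l2_sensitivity=100, epsilon=0.5, delta=1e-5))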

Implementation Approaches:

  1. Randomized Response (for binary/categorical data)
    • Mechanism: Random perturbation of the true value based on the privacy parameter (see the sketch after this list)
    • Use Case: Surveys with sensitive yes/no or categorical questions
    • Variants: Unary encoding, RAPPOR, Generalized Randomized Response [8]
    • Libraries:
  2. Laplace Mechanism (for numerical data)
  3. Gaussian Mechanism (for numerical data)
    • Mechanism: Adds noise calibrated to L2 sensitivity
    • Properties:
      • Achieves (ε,δ)-DP
      • Better for vector-valued functions (lower noise in high dimensions)
      • Allows leveraging L2 sensitivity
    • Use Case: ML model training, high-dimensional statistics [10]
    • Libraries:
  4. Advanced Techniques
    • Amplification by Shuffling: Improving privacy by anonymizing source of contributions
    • Sampled Gaussian Mechanism: Subsampling data before applying Gaussian noise
    • Discrete Gaussian: Better handling of integer-valued functions [11]
    • Libraries:
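
As a concrete example of the first approach, a minimal sketch of binary randomized response (parameters and survey framing are illustrative): each user reports their true bit with probability e^ε/(e^ε + 1) and the flipped bit otherwise, and the aggregator debiases the noisy responses to estimate the population proportion.

import numpy as np

def randomized_response(true_bit, epsilon):
    # Report the true answer with probability e^eps / (e^eps + 1), otherwise flip it
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return true_bit if np.random.rand() < p_truth else 1 - true_bit

def estimate_proportion(reports, epsilon):
    # Debias: E[report] = p*pi + (1 - p)*(1 - pi), so solve for pi
    p = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

# Example: 10,000 users, 30% of whom truly answer "yes"
true_bits = (np.random.rand(10_000) < 0.3).astype(int)
reports = [randomized_response(b, epsilon=1.0) for b in true_bits]
print(f"True rate: {true_bits.mean():.3f}, LDP estimate: {estimate_proportion(reports, 1.0):.3f}")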

Privacy Budget Considerations:

Use Cases by Variant:

A few real-world applications (more are available on Damien Desfontaines' blog):

Libraries and Tools:

Resources:

  1. A friendly introduction to differential privacy (Desfontaines) - Accessible explanation of differential privacy concepts and fundamentals

  2. Local Differential Privacy: a tutorial (Xiong et al., 2020) - Comprehensive overview of LDP theory and applications

  3. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response (Erlingsson et al., 2014) - Google’s LDP system for Chrome usage statistics

  4. The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy

  5. Approximate Differential Privacy (Programming Differential Privacy) - Detailed guide to approximate DP implementation

  6. Rényi Differential Privacy (Mironov, 2017) - Original paper introducing RDP

  7. Gaussian Differential Privacy (Dong et al., 2022) - Framework connecting DP to hypothesis testing

  8. Getting more useful results with differential privacy (Desfontaines) - Practical advice for improving utility in DP systems

  9. A reading list on differential privacy (Desfontaines) - Curated list of papers and resources for learning DP

  10. Rényi Differential Privacy of the Sampled Gaussian Mechanism (Mironov et al., 2019) - Analysis of privacy guarantees for subsampled data

  11. On the Rényi Differential Privacy of the Shuffle Model (Wang et al., 2021) - Analysis of shuffling for privacy amplification

  12. Differential Privacy: An Economic Method for Choosing Epsilon (Hsu et al., 2014) - Framework for epsilon selection based on economic principles

  13. Functional Rényi Differential Privacy for Generative Modeling (Jalko et al., 2023) - Extension of RDP to functional outputs

  14. Precision-based attacks and interval refining: how to break, then fix, differential privacy (Haney et al., 2022) - Analysis of vulnerabilities in DP implementations

  15. Differential Privacy: A Primer for a Non-technical Audience (Wood et al., 2018) - Accessible introduction for non-technical readers

  16. Using differential privacy to harness big data and preserve privacy (Brookings, 2020) - Overview of real-world applications

  17. Tumult Analytics tutorials - Practical guide to implementing DP in real-world scenarios

2.2 Secure Multi-Party Computation (SMPC)

NIST AML Attack Mappings:

Description: Enable multiple parties to jointly compute a function over their inputs while keeping those inputs private.
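
As a toy illustration of the core idea (not a production protocol; real deployments use frameworks such as those listed below), additive secret sharing lets parties learn the sum of their private inputs without any party seeing another's value:

import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    # Split a secret into n random additive shares that sum to it mod PRIME
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three parties with private salaries; each distributes one share to every party
salaries = [55_000, 72_000, 61_000]
all_shares = [share(s, 3) for s in salaries]

# Each party locally sums the shares it holds; combining the partial sums reveals only the total
partial_sums = [sum(all_shares[p][i] for p in range(3)) % PRIME for i in range(3)]
total = sum(partial_sums) % PRIME
print(f"Joint sum of salaries: {total}")  # 188000, without revealing individual inputs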

Libraries:

Papers:

3. Model Training Phase

3.1 Differentially Private Training

NIST AML Attack Mappings:

Description: Train ML models with mathematical privacy guarantees by adding carefully calibrated noise during optimization.

Code Example with FastDP by Amazon:

import torch.nn.functional as F
from torch.optim import SGD
from fastDP import PrivacyEngine

# 'model' is any torch.nn.Module; hyperparameter values below are illustrative
optimizer = SGD(model.parameters(), lr=0.05)
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50000,
    epochs=3,
    target_epsilon=2,
    clipping_fn='automatic',
    clipping_mode='MixOpt',
    origin_params=None,
    clipping_style='all-layer',
)
# attach the privacy engine to the optimizer
# (not needed for multi-GPU distributed learning, which fastDP handles separately)
privacy_engine.attach(optimizer)

#----- standard training pipeline
loss = F.cross_entropy(model(batch), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

Code Example with TensorFlow Privacy:

import tensorflow as tf
import tensorflow_privacy as tfp

# Create a differentially private SGD optimizer (DP-SGD)
dp_optimizer = tfp.DPKerasSGDOptimizer(
    l2_norm_clip=1.0,        # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,    # stddev of added noise, as a multiple of the clip norm
    num_microbatches=1,      # granularity for per-example gradient clipping
    learning_rate=0.1
)

# Compile model with DP optimizer
model.compile(
    optimizer=dp_optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.losses.Reduction.NONE
    ),
    metrics=['accuracy']
)

Parameter Selection Guide:
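
As a starting point, the sketch below derives a noise multiplier for a target (ε, δ) from the batch size, dataset size, and epoch count using Opacus's accountant utilities; the library choice and every value shown are assumptions to adapt to your setup.

from opacus.accountants.utils import get_noise_multiplier

# Illustrative training setup
batch_size, dataset_size, epochs = 256, 50_000, 3
sample_rate = batch_size / dataset_size

# Smallest noise multiplier that keeps training within (epsilon = 2, delta = 1e-5)
noise_multiplier = get_noise_multiplier(
    target_epsilon=2.0,
    target_delta=1e-5,
    sample_rate=sample_rate,
    epochs=epochs,
)
print(f"Suggested noise_multiplier: {noise_multiplier:.2f}")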

Libraries:

Privacy-Utility Trade-offs:

Papers:

3.2 Federated Learning

NIST AML Attack Mappings:

Description: Train models across multiple devices or servers without exchanging raw data.

Code Example with TensorFlow Federated:

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

# Define model and optimization
# (output_classes, preprocessed_example_dataset, client_data, client_ids,
#  num_clients_per_round and num_rounds are assumed to be defined elsewhere)
def create_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(output_classes, activation='softmax')
    ])

def model_fn():
    model = create_model()
    return tff.learning.from_keras_model(
        model,
        input_spec=preprocessed_example_dataset.element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )

# Build the federated training process (legacy tff.learning API; newer TFF
# versions provide tff.learning.algorithms.build_weighted_fed_avg instead)
iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.1),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(1.0)
)

# Train the model
state = iterative_process.initialize()
for round_num in range(num_rounds):
    # Select clients for this round (sampled without replacement)
    sample_clients = np.random.choice(client_ids, num_clients_per_round, replace=False)
    client_datasets = [client_data[client_id] for client_id in sample_clients]
    
    # Run one round of training
    state, metrics = iterative_process.next(state, client_datasets)
    print(f'Round {round_num}: {metrics}')

Libraries:

Privacy Enhancements:
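
One widely used enhancement is differentially private federated averaging: each client update is clipped to a fixed L2 norm and Gaussian noise is added at aggregation. A framework-agnostic NumPy sketch (clip norm, noise multiplier, and update shapes are illustrative):

import numpy as np

def dp_federated_average(client_updates, clip_norm=1.0, noise_multiplier=1.0):
    # Clip each client's update to bound any single client's influence
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / (norm + 1e-12)))

    # Average the clipped updates, then add Gaussian noise calibrated to the clip norm
    n = len(clipped)
    sigma = noise_multiplier * clip_norm / n
    return np.mean(clipped, axis=0) + np.random.normal(0.0, sigma, size=clipped[0].shape)

# Example: 10 clients, each contributing a flattened update vector
updates = [np.random.randn(100) * 0.1 for _ in range(10)]
aggregated_update = dp_federated_average(updates, clip_norm=1.0, noise_multiplier=1.1)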

Papers:

4. Model Deployment Phase

4.1 Private Inference

NIST AML Attack Mappings:

Description: Protect privacy during model inference, where both the model and user inputs need protection.

Code Example with Homomorphic Encryption (TenSEAL):

import tenseal as ts
import numpy as np

# Client-side code
# Create context for the CKKS homomorphic encryption scheme (handles real-valued inputs)
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60]
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

# Encrypt input data (CKKS vectors are one-dimensional)
x = np.array([0.1, 0.2, 0.3, 0.4])
encrypted_x = ts.ckks_vector(context, x)

# Send encrypted_x to server for inference

# Server-side code (computing inference on encrypted data)
def private_inference(encrypted_input, model_weights):
    # First layer computation - matrix multiplication
    weights1 = model_weights[0]
    bias1 = model_weights[1]
    layer1_out = encrypted_input.matmul(weights1) + bias1
    
    # Apply approximate activation function
    # (usually polynomial approximation of ReLU, sigmoid, etc.)
    activated = approximate_activation(layer1_out)
    
    # Additional layers would follow the same matmul + activation pattern
    encrypted_prediction = activated  # placeholder: output of the final layer

    # Return the prediction, still encrypted under the client's key
    return encrypted_prediction

# Client receives and decrypts the result
decrypted_result = encrypted_prediction.decrypt()
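
The approximate_activation used above is left as a placeholder; a common homomorphic-encryption-friendly choice is a low-degree polynomial such as the square function, since CKKS can multiply ciphertexts but cannot evaluate ReLU or sigmoid exactly. A minimal sketch:

def approximate_activation(enc_vec):
    # Square activation x -> x^2, a standard stand-in for ReLU under CKKS.
    # Higher-degree polynomial approximations improve fidelity at the cost of
    # multiplicative depth (and therefore larger encryption parameters).
    return enc_vec * enc_vec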

Libraries:

Performance Trade-offs:

Papers:

4.2 Model Anonymization and Protection

NIST AML Attack Mappings:

Description: Protect the model itself from attacks that aim to extract training data or reverse-engineer model functionality.

Code Example of Prediction Purification:

import numpy as np

# Prediction purification: add calibrated Laplace noise to model outputs
def purify_predictions(model_output, epsilon=1.0, sensitivity=1.0):
    # Calculate noise scale based on sensitivity and privacy parameter
    scale = sensitivity / epsilon
    
    # Add calibrated noise
    noise = np.random.laplace(0, scale, size=model_output.shape)
    purified_output = model_output + noise
    
    # Re-normalize if the output is a single probability distribution
    if np.all(model_output >= 0) and np.isclose(np.sum(model_output), 1.0):
        purified_output = np.clip(purified_output, 0, 1)
        purified_output = purified_output / np.sum(purified_output)
        
    return purified_output

# Use in inference pipeline
def private_inference(input_data):
    raw_predictions = model.predict(input_data)
    private_predictions = purify_predictions(raw_predictions, epsilon=2.0)
    return private_predictions

Techniques:

Libraries:

Papers:

5. Privacy Governance

5.1 Privacy Budget Management

NIST AML Attack Mappings:

Description: Track and allocate privacy loss across the ML pipeline.

Code Example:

# Using Opacus's RDP accountant for DP-SGD with budget management
from opacus.accountants import RDPAccountant

# Initialize privacy accountant and training parameters (values illustrative)
accountant = RDPAccountant()
noise_multiplier = 1.1
sample_rate = 256 / 50000          # batch_size / dataset_size
steps_per_epoch = 50000 // 256
delta = 1e-5

# Track training iterations
for epoch in range(epochs):
    # Account for every noisy gradient step taken this epoch
    for _ in range(steps_per_epoch):
        accountant.step(noise_multiplier=noise_multiplier, sample_rate=sample_rate)

    # Check current privacy spent
    epsilon = accountant.get_epsilon(delta=delta)

    # If budget exceeded, stop training
    if epsilon > privacy_budget:
        print(f"Privacy budget {privacy_budget} exceeded at epoch {epoch}")
        break

Libraries:

Papers:

5.2 Privacy Impact Evaluation

NIST AML Attack Mappings:

Description: Quantitatively measure privacy risks in ML systems.

Code Example:

# Illustrative black-box membership inference audit; the interface below is
# simplified and class names may differ across ML Privacy Meter releases
from privacy_meter.audit import MembershipInferenceAttack

# Configure the attack against the trained model
attack = MembershipInferenceAttack(
    target_model=model,
    target_train_data=x_train,
    target_test_data=x_test,
    attack_type='black_box'
)

# Run the attack
attack_results = attack.run()

# Analyze results
accuracy = attack_results.get_attack_accuracy()
auc = attack_results.get_auc_score()
print(f"Attack accuracy: {accuracy}, AUC: {auc}")

# Comparative evaluation
if auc > 0.6:
    print("Privacy protection INSUFFICIENT - model vulnerable to membership inference")
elif auc > 0.55:
    print("Privacy protection MARGINAL - consider additional mitigations")
else:
    print("Privacy protection ADEQUATE against membership inference")

Libraries:

Papers:

6. Evaluation & Metrics

6.1 Privacy Metrics

NIST AML Attack Mappings:

6.2 Utility Metrics

7. Libraries & Tools

7.1 Differential Privacy

7.2 Federated Learning

7.3 Secure Computation

7.4 Synthetic Data

7.5 Privacy Evaluation

8. Tutorials & Resources

8.1 Differential Privacy Tutorials

8.2 Federated Learning Tutorials

8.3 Secure Computation Tutorials

8.4 Synthetic Data Tutorials

8.5 Privacy Evaluation Tutorials

Contribute

Contributions welcome! Read the contribution guidelines first.

License

CC0