
Awesome ML Privacy Mitigation

A curated list of practical privacy-preserving techniques for machine learning

This repository aims to bridge the gap between theoretical privacy research and practical implementation in machine learning. Unlike other resources that only provide high-level overviews, we focus on actionable techniques with code examples, specific parameter recommendations, and realistic privacy-utility trade-offs.

Contents

Introduction

Machine learning systems increasingly handle sensitive data, making privacy protection essential. Building on the NIST Adversarial Machine Learning Taxonomy (2025), this repository provides implementation-focused guidance for ML practitioners.

Primary goals:

1. Data Collection Phase

1.1 Data Minimization

Description:

NIST AML Attack Mappings:

Why It Matters for ML:

Implementation Approach:
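
A minimal sketch of one common implementation route, feature selection in the spirit of [7]: score candidate features against the prediction task and retain only the most informative ones, so that less personal data has to be collected and stored in the first place. The dataset, model, and choice of k below are illustrative assumptions, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative dataset; substitute your own (potentially sensitive) data
X, y = load_breast_cancer(return_X_y=True)

# Keep only the k most informative features (k is a tunable assumption)
k = 10
selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_min = selector.fit_transform(X, y)

# Compare utility before and after minimization
full_acc = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()
min_acc = cross_val_score(LogisticRegression(max_iter=5000), X_min, y, cv=5).mean()
print(f"All {X.shape[1]} features: {full_acc:.3f} | top-{k} features: {min_acc:.3f}")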

Algorithms and Tools:

Utility/Privacy Trade-off:

Important Considerations:

References:

[1] The Data Minimization Principle in Machine Learning (Ganesh et al., 2024) / Blog - Empirical exploration of data minimization and its misalignment with privacy, along with potential solutions

[2] Data Minimization for GDPR Compliance in Machine Learning Models (Goldsteen et al., 2022) - Method to reduce personal data needed for ML predictions while preserving model accuracy through knowledge distillation

[3] From Principle to Practice: Vertical Data Minimization for Machine Learning (Staab et al., 2023) - Comprehensive framework for implementing data minimization in machine learning with data generalization techniques

[4] Data Shapley: Equitable Valuation of Data for Machine Learning (Ghorbani & Zou, 2019) - Introduces method to quantify the value of individual data points to model performance, enabling systematic data reduction

[5] Algorithmic Data Minimization for ML over IoT Data Streams (Kil et al., 2024) - Framework for minimizing data collection in IoT environments while balancing utility and privacy

[6] Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Pioneering work on membership inference attacks that can be used to audit privacy leakage in ML models

[7] Selecting critical features for data classification based on machine learning methods (Dewi et al., 2020) - Demonstrates that feature selection improves model accuracy and performance while reducing dimensionality

1.2 Synthetic Data Generation

Description:

NIST AML Attack Mappings:

Why It Matters for ML:

Implementation Approaches:

  1. Generative Adversarial Networks (GANs)
    • Mechanism: Two-network architecture where generator creates samples and discriminator evaluates authenticity
    • Best For: Complex, high-dimensional data including tabular, time-series, and images
    • Variants: CTGAN for tabular data [3], PATE-GAN for enhanced privacy guarantees [10]
    • Libraries:
  2. Variational Autoencoders (VAEs)
    • Mechanism: Encoder-decoder architecture with probabilistic latent space
    • Best For: Tabular data with mixed numerical and categorical variables
    • Variants: TVAE specifically designed for tabular data [4]
    • Libraries:
  3. Hybrid Approaches
    • Mechanism: Combines VAE’s encoding capabilities with GAN’s generation abilities
    • Best For: Applications requiring both high fidelity and enhanced privacy protection
    • Recent Advances: VAE-GAN models with improved membership inference resistance [5]
    • Libraries:
  4. Traditional Statistical Methods
    • Bayesian Networks: Model conditional dependencies between variables
    • Copula Methods: Capture complex correlation structures
    • SMOTE: Generate synthetic minority samples for imbalanced data
    • Libraries:

Critical Privacy Evaluation:

Important Considerations:

Implementation Example: see the sketch below (sample SDV tutorial notebooks and a privacy-auditing walkthrough using ML Privacy Meter are still to be added)
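
A minimal sketch using SDV's CTGANSynthesizer (assuming the SDV 1.x single-table API; the file path and epoch count are placeholders). Synthetic data generated this way should still be audited for leakage, e.g. with membership inference attacks [9].

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Load the real (sensitive) tabular dataset
real_data = pd.read_csv("patients.csv")  # placeholder path

# Describe the table so SDV knows the column types
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a CTGAN-based synthesizer and sample a synthetic table of the same size
synthesizer = CTGANSynthesizer(metadata, epochs=300)  # epoch count is illustrative
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))
synthetic_data.to_csv("patients_synthetic.csv", index=False)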

Best Practices:

References:

[1] Synthetic Data: Revisiting the Privacy-Utility Trade-off (Sarmin et al., 2024) - Analysis of privacy-utility trade-offs between synthetic data and traditional anonymization

[2] Machine Learning for Synthetic Data Generation: A Review (Zhao et al., 2023) - Comprehensive review of synthetic data generation techniques and their applications

[3] Modeling Tabular Data using Conditional GAN (Xu et al., 2019) - Introduces CTGAN, designed specifically for mixed-type tabular data generation

[4] Tabular and latent space synthetic data generation: a literature review (Garcia-Gasulla et al., 2023) - Review of data generation methods for tabular data

[5] Synthetic data for enhanced privacy: A VAE-GAN approach against membership inference attacks (Yan et al., 2024) - Novel hybrid approach combining VAE and GAN

[6] SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002) - Classic approach for generating synthetic samples for minority classes

[7] Empirical privacy metrics: the bad, the ugly… and the good, maybe? (Desfontaines, 2024) - Critical analysis of common empirical privacy metrics in synthetic data

[8] Challenges of Using Synthetic Data Generation Methods for Tabular Microdata (Winter & Tolan, 2023) - Empirical study of trade-offs in different synthetic data generation methods

[9] Privacy Auditing of Machine Learning using Membership Inference Attacks (Yaghini et al., 2021) - Framework for privacy auditing in ML models

[10] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (Jordon et al., 2019) - Integrates differential privacy into GANs using the PATE framework

[11] A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning (Domingo-Ferrer & Soria-Comas, 2022) - Analysis of privacy in ML including synthetic data approaches

[12] Protect and Extend - Using GANs for Synthetic Data Generation of Time-Series Medical Records (2024) - Application and evaluation of synthetic data in healthcare domain

Libraries:

2. Data Processing Phase

2.1 Local Differential Privacy (LDP)

NIST AML Attack Mappings:

Description:

Key Concepts:

Variants of Differential Privacy:

  1. Pure ε-Differential Privacy
    • Definition: The strictest form: the probability of any output changes by at most a factor of e^ε between neighboring datasets (see the noise-calibration sketch after this list)
    • Properties:
      • No probability of failure
      • Strict worst-case guarantees
      • Typically requires more noise than relaxed versions
    • Local Application: Randomized response, RAPPOR in high-privacy settings [4]
  2. Approximate (ε,δ)-Differential Privacy
    • Definition: Relaxes pure DP by allowing small probability δ of exceeding the privacy bound
    • Properties:
      • More practical for many applications
      • Allows δ probability of information leakage
      • Enables more efficient mechanisms
    • Local Application: Gaussian mechanism, discrete Laplace in local settings [5]
  3. Rényi Differential Privacy (RDP)
    • Definition: Based on Rényi divergence between output distributions
    • Properties:
      • Better handles composition of mechanisms
      • More precise accounting of privacy loss
      • Particularly useful for iterative algorithms
    • Local Application: Advanced LDP systems with multiple rounds of communication [6]
  4. Gaussian Differential Privacy (GDP)
    • Definition: Special form that connects DP to hypothesis testing
    • Properties:
      • Elegant handling of composition via central limit theorem
      • Natural framework for analyzing mechanisms with Gaussian noise
      • Tighter bounds than (ε,δ)-DP in many cases
    • Local Application: Modern private federated learning systems [7]
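
To make the first two variants concrete, the sketch below (all values illustrative) shows how noise is calibrated in each case: the Laplace mechanism with scale Δ₁/ε satisfies pure ε-DP, while the Gaussian mechanism with σ = Δ₂·√(2·ln(1.25/δ))/ε satisfies (ε, δ)-DP for ε < 1.

import numpy as np

def laplace_mechanism(value, sensitivity, epsilon):
    # Pure epsilon-DP: Laplace noise with scale sensitivity / epsilon
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta):
    # Approximate (epsilon, delta)-DP: classic calibration, valid for epsilon < 1
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + np.random.normal(loc=0.0, scale=sigma)

# Example: privatize a single value bounded in [0, 100] (sensitivity 100)
true_value = 42.0
print(laplace_mechanism(true_value, sensitivity=100, epsilon=1.0))
print(gaussian_mechanism(true_value, l2_sensitivity=100, epsilon=0.5, delta=1e-5))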

Implementation Approaches:

  1. Randomized Response (for binary/categorical data)
    • Mechanism: Random perturbation of the true value based on the privacy parameter (see the sketch after this list)
    • Use Case: Surveys with sensitive yes/no or categorical questions
    • Variants: Unary encoding, RAPPOR, Generalized Randomized Response [8]
    • Libraries:
  2. Laplace Mechanism (for numerical data)
  3. Gaussian Mechanism (for numerical data)
    • Mechanism: Adds noise calibrated to L2 sensitivity
    • Properties:
      • Achieves (ε,δ)-DP
      • Better for vector-valued functions (lower noise in high dimensions)
      • Allows leveraging L2 sensitivity
    • Use Case: ML model training, high-dimensional statistics [10]
    • Libraries:
  4. Advanced Techniques
    • Amplification by Shuffling: Improving privacy by anonymizing source of contributions
    • Sampled Gaussian Mechanism: Subsampling data before applying Gaussian noise
    • Discrete Gaussian: Better handling of integer-valued functions [11]
    • Libraries:
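
As a concrete example of the first approach, a minimal sketch of binary randomized response (parameters and survey framing are illustrative): each user reports their true bit with probability e^ε/(e^ε + 1) and the flipped bit otherwise, and the aggregator debiases the noisy responses to estimate the population proportion.

import numpy as np

def randomized_response(true_bit, epsilon):
    # Report the true answer with probability e^eps / (e^eps + 1), otherwise flip it
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return true_bit if np.random.rand() < p_truth else 1 - true_bit

def estimate_proportion(reports, epsilon):
    # Debias: E[report] = p*pi + (1 - p)*(1 - pi), so solve for pi
    p = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

# Example: 10,000 users, 30% of whom truly answer "yes"
true_bits = (np.random.rand(10_000) < 0.3).astype(int)
reports = [randomized_response(b, epsilon=1.0) for b in true_bits]
print(f"True rate: {true_bits.mean():.3f}, LDP estimate: {estimate_proportion(reports, 1.0):.3f}")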

Privacy Budget Considerations:

Use Cases by Variant:

A few real-world applications (more are available on Damien Desfontaines' blog):

Libraries and Tools:

Resources:

  1. A friendly introduction to differential privacy (Desfontaines) - Accessible explanation of differential privacy concepts and fundamentals

  2. Local Differential Privacy: a tutorial (Xiong et al., 2020) - Comprehensive overview of LDP theory and applications

  3. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response (Erlingsson et al., 2014) - Google’s LDP system for Chrome usage statistics

  4. The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy

  5. Approximate Differential Privacy (Programming Differential Privacy) - Detailed guide to approximate DP implementation

  6. Rényi Differential Privacy (Mironov, 2017) - Original paper introducing RDP

  7. Gaussian Differential Privacy (Dong et al., 2022) - Framework connecting DP to hypothesis testing

  8. Getting more useful results with differential privacy (Desfontaines) - Practical advice for improving utility in DP systems

  9. A reading list on differential privacy (Desfontaines) - Curated list of papers and resources for learning DP

  10. Rényi Differential Privacy of the Sampled Gaussian Mechanism (Mironov et al., 2019) - Analysis of privacy guarantees for subsampled data

  11. On the Rényi Differential Privacy of the Shuffle Model (Wang et al., 2021) - Analysis of shuffling for privacy amplification

  12. Differential Privacy: An Economic Method for Choosing Epsilon (Hsu et al., 2014) - Framework for epsilon selection based on economic principles

  13. Functional Rényi Differential Privacy for Generative Modeling (Jalko et al., 2023) - Extension of RDP to functional outputs

  14. Precision-based attacks and interval refining: how to break, then fix, differential privacy (Haney et al., 2022) - Analysis of vulnerabilities in DP implementations

  15. Differential Privacy: A Primer for a Non-technical Audience (Wood et al., 2018) - Accessible introduction for non-technical readers

  16. Using differential privacy to harness big data and preserve privacy (Brookings, 2020) - Overview of real-world applications

  17. Tumult Analytics tutorials - Practical guide to implementing DP in real-world scenarios

2.2 Secure Multi-Party Computation (SMPC)

NIST AML Attack Mappings:

Description: Enable multiple parties to jointly compute a function over their inputs while keeping those inputs private.
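
As a toy illustration of the core idea (not a production protocol; real deployments use frameworks such as those listed below), additive secret sharing lets parties learn the sum of their private inputs without any party seeing another's value:

import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    # Split a secret into n random additive shares that sum to it mod PRIME
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three parties with private salaries; each distributes one share to every party
salaries = [55_000, 72_000, 61_000]
all_shares = [share(s, 3) for s in salaries]

# Each party locally sums the shares it holds; combining the partial sums reveals only the total
partial_sums = [sum(all_shares[p][i] for p in range(3)) % PRIME for i in range(3)]
total = sum(partial_sums) % PRIME
print(f"Joint sum of salaries: {total}")  # 188000, without revealing individual inputs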

Libraries:

Papers:

3. Model Training Phase

3.1 Differentially Private Training

NIST AML Attack Mappings:

Description: Train ML models with mathematical privacy guarantees by adding carefully calibrated noise during optimization.

Code Example with FastDP by Amazon:

import torch.nn.functional as F
from torch.optim import SGD
from fastDP import PrivacyEngine

# 'model' is any torch.nn.Module; hyperparameter values below are illustrative
optimizer = SGD(model.parameters(), lr=0.05)
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50000,
    epochs=3,
    target_epsilon=2,
    clipping_fn='automatic',
    clipping_mode='MixOpt',
    origin_params=None,
    clipping_style='all-layer',
)
# attach the privacy engine to the optimizer
# (not needed for multi-GPU distributed learning, which fastDP handles separately)
privacy_engine.attach(optimizer)

#----- standard training pipeline
loss = F.cross_entropy(model(batch), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

Code Example with TensorFlow Privacy:

import tensorflow as tf
import tensorflow_privacy as tfp

# Create a differentially private SGD optimizer (DP-SGD)
dp_optimizer = tfp.DPKerasSGDOptimizer(
    l2_norm_clip=1.0,        # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,    # stddev of added noise, as a multiple of the clip norm
    num_microbatches=1,      # granularity for per-example gradient clipping
    learning_rate=0.1
)

# Compile model with DP optimizer
model.compile(
    optimizer=dp_optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.losses.Reduction.NONE
    ),
    metrics=['accuracy']
)

Parameter Selection Guide:
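
As a starting point, the sketch below derives a noise multiplier for a target (ε, δ) from the batch size, dataset size, and epoch count using Opacus's accountant utilities; the library choice and every value shown are assumptions to adapt to your setup.

from opacus.accountants.utils import get_noise_multiplier

# Illustrative training setup
batch_size, dataset_size, epochs = 256, 50_000, 3
sample_rate = batch_size / dataset_size

# Smallest noise multiplier that keeps training within (epsilon = 2, delta = 1e-5)
noise_multiplier = get_noise_multiplier(
    target_epsilon=2.0,
    target_delta=1e-5,
    sample_rate=sample_rate,
    epochs=epochs,
)
print(f"Suggested noise_multiplier: {noise_multiplier:.2f}")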

Libraries:

Privacy-Utility Trade-offs:

Papers:

3.2 Federated Learning

NIST AML Attack Mappings:

Description: Train models across multiple devices or servers without exchanging raw data.

Code Example with TensorFlow Federated:

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

# Define model and optimization
# (output_classes, preprocessed_example_dataset, client_data, client_ids,
#  num_clients_per_round and num_rounds are assumed to be defined elsewhere)
def create_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(output_classes, activation='softmax')
    ])

def model_fn():
    model = create_model()
    return tff.learning.from_keras_model(
        model,
        input_spec=preprocessed_example_dataset.element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )

# Build the federated training process (legacy tff.learning API; newer TFF
# versions provide tff.learning.algorithms.build_weighted_fed_avg instead)
iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.1),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(1.0)
)

# Train the model
state = iterative_process.initialize()
for round_num in range(num_rounds):
    # Select clients for this round (sampled without replacement)
    sample_clients = np.random.choice(client_ids, num_clients_per_round, replace=False)
    client_datasets = [client_data[client_id] for client_id in sample_clients]
    
    # Run one round of training
    state, metrics = iterative_process.next(state, client_datasets)
    print(f'Round {round_num}: {metrics}')

Libraries:

Privacy Enhancements:
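
One widely used enhancement is differentially private federated averaging: each client update is clipped to a fixed L2 norm and Gaussian noise is added at aggregation. A framework-agnostic NumPy sketch (clip norm, noise multiplier, and update shapes are illustrative):

import numpy as np

def dp_federated_average(client_updates, clip_norm=1.0, noise_multiplier=1.0):
    # Clip each client's update to bound any single client's influence
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / (norm + 1e-12)))

    # Average the clipped updates, then add Gaussian noise calibrated to the clip norm
    n = len(clipped)
    sigma = noise_multiplier * clip_norm / n
    return np.mean(clipped, axis=0) + np.random.normal(0.0, sigma, size=clipped[0].shape)

# Example: 10 clients, each contributing a flattened update vector
updates = [np.random.randn(100) * 0.1 for _ in range(10)]
aggregated_update = dp_federated_average(updates, clip_norm=1.0, noise_multiplier=1.1)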

Papers:

4. Model Deployment Phase

4.1 Private Inference

NIST AML Attack Mappings:

Description: Protect privacy during model inference, where both the model and user inputs need protection.

Code Example with Homomorphic Encryption (TenSEAL):

import tenseal as ts
import numpy as np

# Client-side code
# Create context for the CKKS homomorphic encryption scheme (handles real-valued inputs)
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60]
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

# Encrypt input data (CKKS vectors are one-dimensional)
x = np.array([0.1, 0.2, 0.3, 0.4])
encrypted_x = ts.ckks_vector(context, x)

# Send encrypted_x to server for inference

# Server-side code (computing inference on encrypted data)
def private_inference(encrypted_input, model_weights):
    # First layer computation - matrix multiplication
    weights1 = model_weights[0]
    bias1 = model_weights[1]
    layer1_out = encrypted_input.matmul(weights1) + bias1
    
    # Apply approximate activation function
    # (usually polynomial approximation of ReLU, sigmoid, etc.)
    activated = approximate_activation(layer1_out)
    
    # Additional layers would follow the same matmul + activation pattern
    encrypted_prediction = activated  # placeholder: output of the final layer

    # Return the prediction, still encrypted under the client's key
    return encrypted_prediction

# Client receives and decrypts the result
decrypted_result = encrypted_prediction.decrypt()
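
The approximate_activation used above is left as a placeholder; a common homomorphic-encryption-friendly choice is a low-degree polynomial such as the square function, since CKKS can multiply ciphertexts but cannot evaluate ReLU or sigmoid exactly. A minimal sketch:

def approximate_activation(enc_vec):
    # Square activation x -> x^2, a standard stand-in for ReLU under CKKS.
    # Higher-degree polynomial approximations improve fidelity at the cost of
    # multiplicative depth (and therefore larger encryption parameters).
    return enc_vec * enc_vec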

Libraries:

Performance Trade-offs:

Papers:

4.2 Model Anonymization and Protection

NIST AML Attack Mappings:

Description: Protect the model itself from attacks that aim to extract training data or reverse-engineer model functionality.

Code Example of Prediction Purification:

import numpy as np

# Prediction purification: add calibrated Laplace noise to model outputs
def purify_predictions(model_output, epsilon=1.0, sensitivity=1.0):
    # Calculate noise scale based on sensitivity and privacy parameter
    scale = sensitivity / epsilon
    
    # Add calibrated noise
    noise = np.random.laplace(0, scale, size=model_output.shape)
    purified_output = model_output + noise
    
    # Re-normalize if the output is a single probability distribution
    if np.all(model_output >= 0) and np.isclose(np.sum(model_output), 1.0):
        purified_output = np.clip(purified_output, 0, 1)
        purified_output = purified_output / np.sum(purified_output)
        
    return purified_output

# Use in inference pipeline
def private_inference(input_data):
    raw_predictions = model.predict(input_data)
    private_predictions = purify_predictions(raw_predictions, epsilon=2.0)
    return private_predictions

Techniques:

Libraries:

Papers:

5. Privacy Governance

5.1 Privacy Budget Management

NIST AML Attack Mappings:

Description: Track and allocate privacy loss across the ML pipeline.

Code Example:

# Using Opacus's RDP accountant for DP-SGD with budget management
from opacus.accountants import RDPAccountant

# Initialize privacy accountant and training parameters (values illustrative)
accountant = RDPAccountant()
noise_multiplier = 1.1
sample_rate = 256 / 50000          # batch_size / dataset_size
steps_per_epoch = 50000 // 256
delta = 1e-5

# Track training iterations
for epoch in range(epochs):
    # Account for every noisy gradient step taken this epoch
    for _ in range(steps_per_epoch):
        accountant.step(noise_multiplier=noise_multiplier, sample_rate=sample_rate)

    # Check current privacy spent
    epsilon = accountant.get_epsilon(delta=delta)

    # If budget exceeded, stop training
    if epsilon > privacy_budget:
        print(f"Privacy budget {privacy_budget} exceeded at epoch {epoch}")
        break

Libraries:

Papers:

5.2 Privacy Impact Evaluation

NIST AML Attack Mappings:

Description: Quantitatively measure privacy risks in ML systems.

Code Example:

# Illustrative black-box membership inference audit; the interface below is
# simplified and class names may differ across ML Privacy Meter releases
from privacy_meter.audit import MembershipInferenceAttack

# Configure the attack against the trained model
attack = MembershipInferenceAttack(
    target_model=model,
    target_train_data=x_train,
    target_test_data=x_test,
    attack_type='black_box'
)

# Run the attack
attack_results = attack.run()

# Analyze results
accuracy = attack_results.get_attack_accuracy()
auc = attack_results.get_auc_score()
print(f"Attack accuracy: {accuracy}, AUC: {auc}")

# Comparative evaluation
if auc > 0.6:
    print("Privacy protection INSUFFICIENT - model vulnerable to membership inference")
elif auc > 0.55:
    print("Privacy protection MARGINAL - consider additional mitigations")
else:
    print("Privacy protection ADEQUATE against membership inference")

Libraries:

Papers:

6. Evaluation & Metrics

6.1 Privacy Metrics

NIST AML Attack Mappings:

6.2 Utility Metrics

7. Libraries & Tools

7.1 Differential Privacy

7.2 Federated Learning

7.3 Secure Computation

7.4 Synthetic Data

7.5 Privacy Evaluation

8. Tutorials & Resources

8.1 Differential Privacy Tutorials

8.2 Federated Learning Tutorials

8.3 Secure Computation Tutorials

8.4 Synthetic Data Tutorials

8.5 Privacy Evaluation Tutorials

Contribute

Contributions welcome! Read the contribution guidelines first.

License

CC0