Awesome ML Privacy Mitigation 
A curated list of practical privacy-preserving techniques for machine learning
This repository aims to bridge the gap between theoretical privacy research and practical implementation in machine learning. Unlike other resources that only provide high-level overviews, we focus on actionable techniques with code examples, specific parameter recommendations, and realistic privacy-utility trade-offs.
Contents
- Introduction
- 1. Data Collection Phase
- 2. Data Processing Phase
- 3. Model Training Phase
- 4. Model Deployment Phase
- 5. Privacy Governance
- 6. Evaluation & Metrics
- 7. Libraries & Tools
- 8. Tutorials & Resources
- Contribute
Introduction
Machine learning systems increasingly handle sensitive data, making privacy protection essential. Building on the NIST Adversarial Machine Learning Taxonomy (2025), this repository provides implementation-focused guidance for ML practitioners.
Primary goals:
- ✅ Provide code examples for privacy-preserving techniques
- ✅ Document realistic privacy-utility trade-offs
- ✅ Help practitioners select appropriate techniques for their use case
- ✅ Maintain up-to-date links to libraries, tools, and research
1. Data Collection Phase
1.1 Data Minimization
Description:
- Collecting only the data necessary for the intended purpose
- Built on two core privacy pillars: purpose limitation and data relevance
- Different from anonymization - focuses on reducing data collection upfront rather than transforming collected data
- Mandated by regulations like GDPR and CCPA as a fundamental privacy principle [1]
NIST AML Attack Mappings:
- Primary Mitigation: [NISTAML.032] Data Reconstruction
- Additional Protection:
- [NISTAML.033] Membership Inference
- [NISTAML.034] Property Inference
Why It Matters for ML:
- Machine learning systems often collect excessive data “just in case,” creating unnecessary privacy risks
- Reduces the attack surface and potential harm from data breaches [2]
- Prevents feature creep that can lead to model overfitting and privacy vulnerabilities
- Simplifies compliance with privacy regulations and builds user trust
Implementation Approach:
- Pre-collection Phase
- Conduct a data necessity audit before collection
- Define explicit variables needed for model functionality based on domain expertise
- Document justification for each feature’s necessity relative to the model’s objective
- Avoid collecting indirect identifiers where possible
- Libraries:
- ML Privacy Meter - Privacy risk assessment
- Adversarial Robustness Toolbox - Feature importance analysis
- Feature Selection and Evaluation
- Apply feature importance ranking to identify non-essential features [3]
- Evaluate correlation between features to avoid redundant data collection
- Measure model performance impact when removing features
- Test different feature subsets to find minimal viable feature set
- Libraries:
- scikit-learn - Feature selection utilities
- SHAP - Feature importance analysis
- Data Shapley - Data valuation
- Ongoing Governance
- Implement data expiration policies to remove data that’s no longer needed
- Review feature requirements when model objectives change
- Conduct periodic audits to identify and eliminate feature creep
Algorithms and Tools:
- Feature Selection Methods
- Column-based filtering with domain expertise validation
- Feature importance analysis using permutation importance or Shapley values [4] (see the sketch after this list)
- Privacy Impact Assessment (PIA) frameworks for systematic evaluation
- Privacy Risk Assessment
- Evaluate uniqueness of feature combinations
- Analyze correlation between features and user identifiability [5]
- Measure how much each feature contributes to model performance vs. privacy risk
- Privacy Auditing
- Use Membership Inference Attacks (MIAs) to evaluate privacy leakage in models [6]
- Test if models trained with minimal data are more resilient to privacy attacks
- Iteratively adjust feature selection based on audit results
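A minimal sketch of the permutation-importance approach referenced above, assuming X is a pandas DataFrame of candidate features and y the target; the random forest model and the 0.01 importance threshold are illustrative choices, not prescriptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a baseline model on the full candidate feature set
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline_acc = model.score(X_val, y_val)

# Rank features by permutation importance on held-out data
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values()

# Candidate features whose removal barely affects performance (illustrative threshold)
to_drop = importances[importances < 0.01].index.tolist()
X_min = X.drop(columns=to_drop)

# Retrain on the minimized feature set and compare accuracy before adopting it
model_min = RandomForestClassifier(random_state=0).fit(X_min.loc[X_train.index], y_train)
minimized_acc = model_min.score(X_min.loc[X_val.index], y_val)
print(f"Dropped {len(to_drop)} features; accuracy {baseline_acc:.3f} -> {minimized_acc:.3f}")

Domain review of the dropped features should follow, since low model importance does not by itself prove a feature is safe or unnecessary to collect.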
Utility/Privacy Trade-off:
- Minimal impact on model utility if properly implemented with domain expertise
- Models can maintain accuracy with significantly reduced feature sets [7]
- Impact varies by domain and use case - requires empirical testing
Important Considerations:
- Data minimization is not a full fix since various features are inherently correlated
- Proper correlation analysis should be conducted to understand feature relationships
- Domain expertise is crucial for effective minimization without harming model utility
- Regular reassessment is needed as data relevance may change over time
References:
[1] The Data Minimization Principle in Machine Learning (Ganesh et al., 2024) / Blog - Empirical exploration of data minimization and its misalignment with privacy, along with potential solutions
[2] Data Minimization for GDPR Compliance in Machine Learning Models (Goldsteen et al., 2022) - Method to reduce personal data needed for ML predictions while preserving model accuracy through knowledge distillation
[3] From Principle to Practice: Vertical Data Minimization for Machine Learning (Staab et al., 2023) - Comprehensive framework for implementing data minimization in machine learning with data generalization techniques
[4] Data Shapley: Equitable Valuation of Data for Machine Learning (Ghorbani & Zou, 2019) - Introduces method to quantify the value of individual data points to model performance, enabling systematic data reduction
[5] Algorithmic Data Minimization for ML over IoT Data Streams (Kil et al., 2024) - Framework for minimizing data collection in IoT environments while balancing utility and privacy
[6] Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) - Pioneering work on membership inference attacks that can be used to audit privacy leakage in ML models
[7] Selecting critical features for data classification based on machine learning methods (Dewi et al., 2020) - Demonstrates that feature selection improves model accuracy and performance while reducing dimensionality
1.2 Synthetic Data Generation
Description:
- Creating artificial data that preserves statistical properties without containing real individual information
- Alternative to traditional anonymization for enabling data sharing and machine learning
- Generates new data points rather than transforming existing ones
- Increasingly adopted for privacy-sensitive applications [1]
NIST AML Attack Mappings:
- Primary Mitigation:
- [NISTAML.037] Training Data Attacks
- [NISTAML.038] Data Extraction
- Additional Protection:
- [NISTAML.033] Membership Inference
Why It Matters for ML:
- Provides training data for models without exposing real individuals’ information directly
- Helps address data scarcity and imbalance issues in specialized domains
- Enables experimentation and development while minimizing privacy risks
- Facilitates data sharing across organizations or with researchers [2]
Implementation Approaches:
- Generative Adversarial Networks (GANs)
- Mechanism: Two-network architecture where generator creates samples and discriminator evaluates authenticity
- Best For: Complex, high-dimensional data including tabular, time-series, and images
- Variants: CTGAN for tabular data, PATE-GAN for enhanced privacy guarantees [3]
- Libraries:
- CTGAN - Tabular data generation
- PATE-GAN - Privacy-preserving GAN
- Gretel Synthetics - GAN-based synthesis
- Variational Autoencoders (VAEs)
- Mechanism: Encoder-decoder architecture with probabilistic latent space
- Best For: Tabular data with mixed numerical and categorical variables
- Variants: TVAE specifically designed for tabular data [4]
- Libraries:
- SDV - TVAE implementation
- Ydata-Synthetic - VAE-based synthesis
- Gretel Synthetics - VAE support
- Hybrid Approaches
- Mechanism: Combines VAE’s encoding capabilities with GAN’s generation abilities
- Best For: Applications requiring both high fidelity and enhanced privacy protection
- Recent Advances: VAE-GAN models with improved membership inference resistance [5]
- Libraries:
- Gretel Synthetics - Hybrid models
- Ydata-Synthetic - Advanced synthesis
- Traditional Statistical Methods
- Bayesian Networks: Model conditional dependencies between variables
- Copula Methods: Capture complex correlation structures
- SMOTE: Generate synthetic minority samples for imbalanced data
- Libraries:
- SDV - Statistical methods
- imbalanced-learn - SMOTE implementation
- Copulas - Copula-based synthesis
Critical Privacy Evaluation:
- Common Evaluation Approaches [7]
- Measuring similarity between synthetic data and original data
- Testing for successful membership inference attacks
- Analyzing model performance when trained on synthetic versus real data
- Limitations of Current Metrics [8]
- Distance-based metrics may not capture actual privacy risks
- Simple attacker models don’t reflect sophisticated real-world attacks
- Averaged metrics can miss vulnerabilities affecting minority groups or outliers
- Results often vary significantly with different random initializations
- Beyond Empirical Metrics
- Complementing testing with formal privacy guarantees like differential privacy
- Adopting adversarial mindset when evaluating privacy claims
- Considering multiple attack vectors beyond basic membership inference [9]
Important Considerations:
- Privacy-Utility Trade-off
- Higher privacy protection typically reduces data utility and vice versa
- Optimal balance depends on specific use case and sensitivity of the data
- Quantitative measurement of both aspects is essential for decision-making [10]
- Technical Challenges
- Handling categorical variables effectively
- Preserving complex relationships between features
- Scaling to high-dimensional data
- Computational resources required for training [11]
- Deployment Guidance
- Validate both utility and privacy before use
- Consider complementary privacy techniques alongside synthetic data
- Be cautious of overstated privacy claims from vendors
- Match evaluation rigor to application sensitivity [12]
Implementation Example: generating synthetic tabular data with SDV (see the sketch below), followed by a separate privacy audit, for example with ML Privacy Meter.
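A minimal sketch assuming the SDV 1.x single-table API; real_data is a pandas DataFrame with direct identifiers already removed, and the Gaussian copula synthesizer is one of several available choices:

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import evaluate_quality

# Describe the table schema from the real data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a statistical synthesizer and draw synthetic rows
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))

# Utility check: statistical similarity between real and synthetic data
quality_report = evaluate_quality(real_data, synthetic_data, metadata)
print(quality_report.get_score())

Statistical quality alone does not demonstrate privacy; membership inference auditing should still be run separately, as discussed in the evaluation sections below.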
Best Practices:
- Data Preprocessing
- Remove direct identifiers before synthetic data generation
- Consider dimensionality reduction for very high-dimensional data
- Address class imbalance issues at preprocessing stage
- Model Selection and Configuration
- Choose generation method based on data type and privacy requirements
- Consider differential privacy mechanisms when possible
- Tune hyperparameters to balance utility and privacy
- Evaluation and Validation
- Test with multiple privacy metrics at different thresholds
- Evaluate utility for specific downstream tasks
- Pay special attention to outliers and minority groups
- Document privacy evaluation methodology alongside synthetic data
References:
[1] Synthetic Data: Revisiting the Privacy-Utility Trade-off (Sarmin et al., 2024) - Analysis of privacy-utility trade-offs between synthetic data and traditional anonymization
[2] Machine Learning for Synthetic Data Generation: A Review (Zhao et al., 2023) - Comprehensive review of synthetic data generation techniques and their applications
[3] Modeling Tabular Data using Conditional GAN (Xu et al., 2019) - Introduces CTGAN, designed specifically for mixed-type tabular data generation
[4] Tabular and latent space synthetic data generation: a literature review (Garcia-Gasulla et al., 2023) - Review of data generation methods for tabular data
[5] Synthetic data for enhanced privacy: A VAE-GAN approach against membership inference attacks (Yan et al., 2024) - Novel hybrid approach combining VAE and GAN
[6] SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002) - Classic approach for generating synthetic samples for minority classes
[7] Empirical privacy metrics: the bad, the ugly… and the good, maybe? (Desfontaines, 2024) - Critical analysis of common empirical privacy metrics in synthetic data
[8] Challenges of Using Synthetic Data Generation Methods for Tabular Microdata (Winter & Tolan, 2023) - Empirical study of trade-offs in different synthetic data generation methods
[9] Privacy Auditing of Machine Learning using Membership Inference Attacks (Yaghini et al., 2021) - Framework for privacy auditing in ML models
[10] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (Jordon et al., 2019) - Integrates differential privacy into GANs using the PATE framework
[11] A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning (Domingo-Ferrer & Soria-Comas, 2022) - Analysis of privacy in ML including synthetic data approaches
[12] Protect and Extend - Using GANs for Synthetic Data Generation of Time-Series Medical Records (2024) - Application and evaluation of synthetic data in healthcare domain
Libraries: see Synthetic Data (7.4).
2. Data Processing Phase
2.1 Local Differential Privacy (LDP)
NIST AML Attack Mappings:
- Primary Mitigation:
- [NISTAML.032] Data Reconstruction
- [NISTAML.033] Membership Inference
- Additional Protection:
- [NISTAML.034] Property Inference
Description:
- Adding calibrated noise to data on the user’s device before it leaves their control
- Provides strong privacy guarantees without requiring a trusted central aggregator
- Each user independently applies a randomization mechanism to their own data
- Allows organizations to collect sensitive data while maintaining formal privacy guarantees [1]
Key Concepts:
- Definition: Algorithm M satisfies ε-LDP if for all possible inputs x, x’ and all possible outputs y:
Pr[M(x) = y] ≤ e^ε × Pr[M(x') = y]
- Versus Central DP: LDP typically requires more noise than central DP for the same privacy level but eliminates the need for a trusted data collector [2]
- Privacy Budget Management:
- ε value controls privacy-utility trade-off
- Lower ε = stronger privacy but greater accuracy loss
- Composition: Multiple LDP queries consume privacy budget cumulatively [3]
Variants of Differential Privacy:
- Pure ε-Differential Privacy
- Definition: The strictest form, defined by the inequality above
- Properties:
- No probability of failure
- Strict worst-case guarantees
- Typically requires more noise than relaxed versions
- Local Application: Randomized response, RAPPOR in high-privacy settings [4]
- Approximate (ε,δ)-Differential Privacy
- Definition: Relaxes pure DP by allowing small probability δ of exceeding the privacy bound
- Properties:
- More practical for many applications
- Allows δ probability of information leakage
- Enables more efficient mechanisms
- Local Application: Gaussian mechanism, discrete Laplace in local settings [5]
- Rényi Differential Privacy (RDP)
- Definition: Based on Rényi divergence between output distributions
- Properties:
- Better handles composition of mechanisms
- More precise accounting of privacy loss
- Particularly useful for iterative algorithms
- Local Application: Advanced LDP systems with multiple rounds of communication [6]
- Gaussian Differential Privacy (GDP)
- Definition: Special form that connects DP to hypothesis testing
- Properties:
- Elegant handling of composition via central limit theorem
- Natural framework for analyzing mechanisms with Gaussian noise
- Tighter bounds than (ε,δ)-DP in many cases
- Local Application: Modern private federated learning systems [7]
Implementation Approaches (see the sketch after this list):
- Randomized Response (for binary/categorical data)
- Mechanism: Random perturbation of true value based on privacy parameter
- Use Case: Surveys with sensitive yes/no or categorical questions
- Variants: Unary encoding, RAPPOR, Generalized Randomized Response [8]
- Libraries:
- OpenDP - Supports randomized response and RAPPOR
- IBM Differential Privacy Library - Implements RAPPOR and variants
- Tumult Analytics - Includes RAPPOR implementation
- Laplace Mechanism (for numerical data)
- Mechanism: Adds noise calibrated to L1 sensitivity
- Properties:
- Achieves pure ε-DP
- Noise proportional to sensitivity/ε
- Simple to implement
- Use Case: Count queries, sums, averages with bounded sensitivity [9]
- Libraries:
- Google’s Differential Privacy Library - Core implementation
- OpenDP - Python bindings with Laplace mechanism
- IBM Differential Privacy Library - Laplace mechanism with utilities
- Gaussian Mechanism (for numerical data)
- Mechanism: Adds noise calibrated to L2 sensitivity
- Properties:
- Achieves (ε,δ)-DP
- Better for vector-valued functions (lower noise in high dimensions)
- Allows leveraging L2 sensitivity
- Use Case: ML model training, high-dimensional statistics [10]
- Libraries:
- TensorFlow Privacy - DP-SGD implementation
- Opacus - PyTorch-based DP training
- Microsoft SmartNoise - Core implementation
- Advanced Techniques
- Amplification by Shuffling: Improving privacy by anonymizing source of contributions
- Sampled Gaussian Mechanism: Subsampling data before applying Gaussian noise
- Discrete Gaussian: Better handling of integer-valued functions [11]
- Libraries:
- OpenDP - Supports composition and amplification
- Tumult Analytics - Advanced composition utilities
- IBM Differential Privacy Library - Composition tools
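A minimal, library-free sketch of two of the mechanisms above, assuming each user perturbs a single value locally before reporting it; epsilon = 1.0 and the value bounds are illustrative:

import numpy as np

epsilon = 1.0  # privacy parameter (illustrative)

# Randomized response for a sensitive yes/no answer (satisfies epsilon-LDP)
def randomized_response(true_bit, epsilon):
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)  # probability of reporting truthfully
    return true_bit if np.random.random() < p_truth else 1 - true_bit

# Unbiased estimate of the true proportion from the noisy reports
def estimate_proportion(reports, epsilon):
    p = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

# Laplace mechanism for a bounded numeric value (e.g., a quantity in [0, 100])
def laplace_perturb(value, epsilon, lower=0.0, upper=100.0):
    sensitivity = upper - lower            # L1 sensitivity of a single clipped value
    value = np.clip(value, lower, upper)
    return value + np.random.laplace(0, sensitivity / epsilon)

# Example: 10,000 users, true "yes" rate of 30%
true_bits = np.random.binomial(1, 0.3, size=10_000)
reports = np.array([randomized_response(b, epsilon) for b in true_bits])
print(f"True rate: {true_bits.mean():.3f}, LDP estimate: {estimate_proportion(reports, epsilon):.3f}")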
Privacy Budget Considerations (a short numeric illustration follows this list):
- Selecting Appropriate Parameters:
- ε value: Controls privacy-utility trade-off in all variants
- δ parameter: Should be smaller than 1/n (n = number of users) for (ε,δ)-DP
- α parameter: Order of Rényi divergence for RDP [12]
- Composition Advantages of Variants:
- Pure ε-DP: Simple linear composition (privacy loss adds up)
- (ε,δ)-DP: Better composition via advanced composition theorems
- RDP: Precise tracking of privacy loss under composition
- GDP: Natural composition via central limit theorem [13]
- Real-World Considerations:
- Theoretical guarantees can be undermined by implementation issues
- Floating-point vulnerabilities can affect all variants
- Consider robustness to side-channel attacks
- Balance between formal guarantees and practical utility [14]
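A small numeric illustration of basic composition and the delta rule of thumb above; the per-query budgets and population size are illustrative:

# Basic (linear) composition for pure epsilon-DP: budgets add up
per_query_epsilons = [0.1, 0.2, 0.2, 0.5]   # illustrative per-query budgets
total_epsilon = sum(per_query_epsilons)
print(f"Total privacy loss under basic composition: epsilon = {total_epsilon}")

# Rule of thumb for (epsilon, delta)-DP: delta should be much smaller than 1/n
n_users = 100_000
delta = 1e-6
assert delta < 1.0 / n_users, "delta should be smaller than 1/n"

Advanced composition, RDP, and GDP accounting give tighter bounds than this linear sum when many mechanisms are composed.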
Use Cases by Variant:
- Pure ε-DP:
- Simple counts and statistics
- One-time data collection
- Highly sensitive applications requiring strict guarantees
- (ε,δ)-DP with Gaussian Mechanism:
- Vector-valued queries (where L2 sensitivity ≪ L1 sensitivity)
- Applications where moderate relaxation of privacy is acceptable
- Machine learning with high-dimensional gradients
- RDP and Advanced Variants:
- Iterative algorithms with many composed mechanisms
- Private machine learning (especially SGD-based)
- Complex federated analytics systems [15]
Selected Real-World Applications (more available on Damien Desfontaines' blog):
- Apple: iOS/macOS telemetry and emoji suggestions
- Google: Chrome browser usage statistics via RAPPOR
- Microsoft: Windows telemetry data collection
- Meta: Ad delivery optimization without cross-site tracking
Libraries and Tools:
- PyDP (OpenMined): Python wrapper around Google’s C++ DP library
- Tumult Analytics: Open-source DP library with LDP support
- IBM Differential Privacy Library: Comprehensive DP toolkit
- Microsoft SmartNoise: Extensible DP framework
- TensorFlow Privacy: DP for machine learning
Resources:
- A friendly introduction to differential privacy (Desfontaines) - Accessible explanation of differential privacy concepts and fundamentals
- Local Differential Privacy: a tutorial (Xiong et al., 2020) - Comprehensive overview of LDP theory and applications
- RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response (Erlingsson et al., 2014) - Google's LDP system for Chrome usage statistics
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014) - Comprehensive textbook on differential privacy
- Approximate Differential Privacy (Programming Differential Privacy) - Detailed guide to approximate DP implementation
- Rényi Differential Privacy (Mironov, 2017) - Original paper introducing RDP
- Gaussian Differential Privacy (Dong et al., 2022) - Framework connecting DP to hypothesis testing
- Getting more useful results with differential privacy (Desfontaines) - Practical advice for improving utility in DP systems
- A reading list on differential privacy (Desfontaines) - Curated list of papers and resources for learning DP
- Rényi Differential Privacy of the Sampled Gaussian Mechanism (Mironov et al., 2019) - Analysis of privacy guarantees for subsampled data
- On the Rényi Differential Privacy of the Shuffle Model (Wang et al., 2021) - Analysis of shuffling for privacy amplification
- Differential Privacy: An Economic Method for Choosing Epsilon (Hsu et al., 2014) - Framework for epsilon selection based on economic principles
- Functional Rényi Differential Privacy for Generative Modeling (Jalko et al., 2023) - Extension of RDP to functional outputs
- Precision-based attacks and interval refining: how to break, then fix, differential privacy (Haney et al., 2022) - Analysis of vulnerabilities in DP implementations
- Differential Privacy: A Primer for a Non-technical Audience (Wood et al., 2018) - Accessible introduction for non-technical readers
- Using differential privacy to harness big data and preserve privacy (Brookings, 2020) - Overview of real-world applications
- Tumult Analytics tutorials - Practical guide to implementing DP in real-world scenarios
2.2 Secure Multi-Party Computation (SMPC)
NIST AML Attack Mappings:
- Primary Mitigation:
- [NISTAML.031] Model Extraction
- [NISTAML.032] Data Reconstruction
Description: Enable multiple parties to jointly compute a function over their inputs while keeping those inputs private.
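A minimal sketch of the idea using CrypTen's secret-shared tensors; this runs as a single-process simulation, whereas a real deployment would give each party its own process, and the income values are illustrative:

import crypten
import torch

crypten.init()  # set up the (simulated) multi-party environment

# Each party's private input, secret-shared so no single party sees the raw values
incomes_party_a = crypten.cryptensor(torch.tensor([52_000.0, 61_500.0, 48_200.0]))
incomes_party_b = crypten.cryptensor(torch.tensor([58_900.0, 47_300.0, 63_100.0]))

# Jointly compute an aggregate without revealing individual inputs
joint_mean_enc = crypten.cat([incomes_party_a, incomes_party_b]).mean()

# Only the agreed-upon result is revealed
print(joint_mean_enc.get_plain_text())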
Libraries: see Secure Computation (7.3).
Papers:
3. Model Training Phase
3.1 Differentially Private Training
NIST AML Attack Mappings:
- Primary Mitigation: [NISTAML.033] Membership Inference
- Additional Protection:
- [NISTAML.032] Data Reconstruction
- [NISTAML.034] Property Inference
Description: Train ML models with mathematical privacy guarantees by adding carefully calibrated noise during optimization.
Code Example with FastDP by Amazon:
import torch.nn.functional as F
from torch.optim import SGD
from fastDP import PrivacyEngine

optimizer = SGD(model.parameters(), lr=0.05)
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50000,
    epochs=3,
    target_epsilon=2,
    clipping_fn='automatic',
    clipping_mode='MixOpt',
    origin_params=None,
    clipping_style='all-layer',
)
# attaching to optimizers is not needed for multi-GPU distributed learning
privacy_engine.attach(optimizer)

# ----- standard training pipeline
loss = F.cross_entropy(model(batch), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
Code Example with TensorFlow Privacy:
import tensorflow as tf
import tensorflow_privacy as tfp

# Create a differentially private optimizer (a drop-in replacement for Keras SGD)
dp_optimizer = tfp.DPKerasSGDOptimizer(
    l2_norm_clip=1.0,
    noise_multiplier=1.1,
    num_microbatches=1,
    learning_rate=0.1
)
# Note: the sampling rate (batch_size / dataset_size) is used for privacy accounting,
# not passed to the optimizer itself

# Compile model with DP optimizer
# (per-example losses are required, so the loss reduction must be NONE)
model.compile(
    optimizer=dp_optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.losses.Reduction.NONE
    ),
    metrics=['accuracy']
)
Parameter Selection Guide (see the sketch after this list):
- Noise multiplier: 0.5-3.0 (higher = more privacy)
- Gradient clipping: 0.1-5.0 (domain dependent)
- Privacy budget: ε = 1-10 (lower = more privacy)
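One way to turn a target budget into concrete parameters is to solve for the noise multiplier with an accountant utility. A minimal sketch using Opacus's get_noise_multiplier helper; the target epsilon/delta, batch size, and dataset size shown are illustrative:

from opacus.accountants.utils import get_noise_multiplier

# Derive the noise multiplier needed to reach a target (epsilon, delta)
# for a given sampling rate and number of epochs (values are illustrative)
noise_multiplier = get_noise_multiplier(
    target_epsilon=3.0,
    target_delta=1e-5,
    sample_rate=256 / 50_000,   # batch_size / dataset_size
    epochs=3,
)
print(f"Use noise_multiplier ~ {noise_multiplier:.2f} together with gradient clipping (e.g., 1.0)")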
Libraries: see Differential Privacy (7.1).
Privacy-Utility Trade-offs:
- For ε = 1.0: ~5-15% accuracy drop
- For ε = 3.0: ~2-7% accuracy drop
- For ε = 8.0: ~1-3% accuracy drop
- (Depends heavily on dataset size and task complexity)
Papers:
- Deep Learning with Differential Privacy (Abadi et al., 2016)
- Differentially Private Model Publishing for Deep Learning (Yu et al., 2018)
3.2 Federated Learning
NIST AML Attack Mappings:
- Primary Mitigation:
- [NISTAML.038] Data Extraction
- [NISTAML.037] Training Data Attacks
Description: Train models across multiple devices or servers without exchanging raw data.
Code Example with TensorFlow Federated:
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

# Define model and optimization
def create_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(output_classes, activation='softmax')  # output_classes: number of labels
    ])

def model_fn():
    model = create_model()
    return tff.learning.from_keras_model(
        model,
        input_spec=preprocessed_example_dataset.element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )

# Build the federated training process (federated averaging)
iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.1),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(1.0)
)

# Train the model (client_ids, client_data, num_rounds, num_clients_per_round defined elsewhere)
state = iterative_process.initialize()
for round_num in range(num_rounds):
    # Select clients for this round
    sample_clients = np.random.choice(client_ids, num_clients_per_round, replace=False)
    client_datasets = [client_data[client_id] for client_id in sample_clients]

    # Run one round of training
    state, metrics = iterative_process.next(state, client_datasets)
    print(f'Round {round_num}: {metrics}')
Libraries: see Federated Learning (7.2).
Privacy Enhancements (see the sketch after this list):
- Secure Aggregation: Cryptographic protocol to protect individual updates
- Differential Privacy: Add noise to updates to prevent memorization
- Update Compression: Reduce information content of transmitted updates
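A minimal, framework-free sketch of the clipping-plus-noise idea applied to client updates before averaging; the clip norm and noise multiplier are illustrative, and a production system would combine this with an accountant and secure aggregation:

import numpy as np

def privatize_and_aggregate(client_updates, clip_norm=1.0, noise_multiplier=1.1):
    # Clip each client's update to bound its influence (L2 norm <= clip_norm)
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / (norm + 1e-12)))

    # Average the clipped updates and add Gaussian noise calibrated to the clip norm
    mean_update = np.mean(clipped, axis=0)
    noise_scale = noise_multiplier * clip_norm / len(client_updates)
    return mean_update + np.random.normal(0, noise_scale, size=mean_update.shape)

# Example: three clients, each contributing a flattened model update
updates = [np.random.randn(1000) * 0.1 for _ in range(3)]
noisy_global_update = privatize_and_aggregate(updates)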
Papers:
- Communication-Efficient Learning of Deep Networks from Decentralized Data (McMahan et al., 2017)
- Practical Secure Aggregation for Federated Learning on User-Held Data (Bonawitz et al., 2017)
- Federated Learning: Strategies for Improving Communication Efficiency (Konečný et al., 2016)
4. Model Deployment Phase
4.1 Private Inference
NIST AML Attack Mappings:
- Primary Mitigation:
- [NISTAML.031] Model Extraction
- [NISTAML.038] Data Extraction
Description: Protect privacy during model inference, where both the model and user inputs need protection.
Code Example with Homomorphic Encryption (TenSEAL):
import tenseal as ts
import numpy as np

# Client-side code
# Create context for the CKKS homomorphic encryption scheme (supports real-valued data)
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60]
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

# Encrypt input data
x = np.array([0.1, 0.2, 0.3, 0.4])
encrypted_x = ts.ckks_vector(context, x)
# Send encrypted_x to server for inference

# Server-side code (computing inference on encrypted data)
def private_inference(encrypted_input, model_weights):
    # First layer computation - encrypted vector times plaintext weight matrix
    weights1 = model_weights[0]
    bias1 = model_weights[1]
    layer1_out = encrypted_input.matmul(weights1) + bias1

    # Apply approximate activation function
    # (usually a polynomial approximation of ReLU, sigmoid, etc.)
    activated = approximate_activation(layer1_out)

    # Additional layers would follow the same pattern...
    encrypted_prediction = activated
    return encrypted_prediction

# Client receives the server's encrypted_prediction and decrypts the result
decrypted_result = encrypted_prediction.decrypt()
Libraries: see Secure Computation (7.3).
Performance Trade-offs:
- Homomorphic Encryption: 1000-100000x slowdown, strongest privacy
- Secure Multi-Party Computation: 10-1000x slowdown, balanced approach
- Trusted Execution Environments: 1.1-2x slowdown, weaker guarantees
Papers:
- CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy (Gilad-Bachrach et al., 2016)
- GAZELLE: A Low Latency Framework for Secure Neural Network Inference (Juvekar et al., 2018)
4.2 Model Anonymization and Protection
NIST AML Attack Mappings:
- Primary Mitigation: [NISTAML.031] Model Extraction
- Additional Protection:
- [NISTAML.023] Backdoor Poisoning (security-related)
Description: Protect the model itself from attacks that aim to extract training data or reverse-engineer model functionality.
Code Example of Prediction Purification:
import numpy as np

# Prediction purification with calibrated noise
def purify_predictions(model_output, epsilon=1.0, sensitivity=1.0):
    # Calculate noise scale based on sensitivity and privacy parameter
    scale = sensitivity / epsilon

    # Add calibrated Laplace noise
    noise = np.random.laplace(0, scale, size=model_output.shape)
    purified_output = model_output + noise

    # Re-normalize if the output is a probability distribution
    if np.all(model_output >= 0) and np.isclose(np.sum(model_output), 1.0):
        purified_output = np.clip(purified_output, 0, 1)
        purified_output = purified_output / np.sum(purified_output)

    return purified_output

# Use in inference pipeline
def private_inference(input_data):
    raw_predictions = model.predict(input_data)
    private_predictions = purify_predictions(raw_predictions, epsilon=2.0)
    return private_predictions
Techniques:
- Model Distillation: Training a student model on the outputs of a teacher model (see the sketch after this list)
- Prediction Purification: Adding noise to model outputs
- Adversarial Regularization: Adding regularization during training to reduce information leakage
- Model Watermarking: Adding imperceptible watermarks to detect model theft
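A minimal PyTorch-style sketch of the distillation idea; the temperature and the choice to train only on softened teacher outputs are illustrative, and the inputs would typically come from public or held-out data so the student never touches the sensitive training set:

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, inputs, temperature=2.0):
    # Teacher produces soft targets; its parameters (and its training data) stay fixed
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(inputs) / temperature, dim=1)

    # Student learns from the teacher's outputs only, not from the raw training data
    student_log_probs = F.log_softmax(student(inputs) / temperature, dim=1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction='batchmean') * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()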
Libraries: see Privacy Evaluation (7.5) for attack tooling used to validate these defenses.
Papers:
- Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks (Papernot et al., 2016)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
5. Privacy Governance
5.1 Privacy Budget Management
NIST AML Attack Mappings:
- Risk Management:
- [NISTAML.033] Membership Inference
- [NISTAML.032] Data Reconstruction
Description: Track and allocate privacy loss across the ML pipeline.
Code Example:
# Using the Opacus RDP accountant for DP-SGD with budget management
from opacus.accountants import RDPAccountant

# Initialize privacy accountant and the total budget
privacy_budget = 3.0
accountant = RDPAccountant()

# Track training iterations (epochs and steps_per_epoch come from the training setup)
for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        # Account for one noisy gradient step
        accountant.step(noise_multiplier=1.1,
                        sample_rate=256/50000)

    # Check current privacy spent
    epsilon = accountant.get_epsilon(delta=1e-5)

    # If budget exceeded, stop training
    if epsilon > privacy_budget:
        print(f"Privacy budget {privacy_budget} exceeded at epoch {epoch}")
        break
Libraries: see Differential Privacy (7.1).
Papers:
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014)
- Renyi Differential Privacy (Mironov, 2017)
5.2 Privacy Impact Evaluation
NIST AML Attack Mappings:
- Vulnerability Assessment:
- [NISTAML.033] Membership Inference
- [NISTAML.034] Property Inference
Description: Quantitatively measure privacy risks in ML systems.
Code Example:
from privacy_meter.audit import MembershipInferenceAttack

# Configure the attack
attack = MembershipInferenceAttack(
    target_model=model,
    target_train_data=x_train,
    target_test_data=x_test,
    attack_type='black_box'
)

# Run the attack
attack_results = attack.run()

# Analyze results
accuracy = attack_results.get_attack_accuracy()
auc = attack_results.get_auc_score()
print(f"Attack accuracy: {accuracy}, AUC: {auc}")

# Comparative evaluation
if auc > 0.6:
    print("Privacy protection INSUFFICIENT - model vulnerable to membership inference")
elif auc > 0.55:
    print("Privacy protection MARGINAL - consider additional mitigations")
else:
    print("Privacy protection ADEQUATE against membership inference")
Libraries:
- ML Privacy Meter
- Privacy-Preserving Machine Learning in TF
- IMIA (Indirect Membership Inference Attack)
Papers:
- Evaluating Differentially Private Machine Learning in Practice (Jayaraman & Evans, 2019)
- Machine Learning with Membership Privacy using Adversarial Regularization (Nasr et al., 2018)
6. Evaluation & Metrics
6.1 Privacy Metrics
NIST AML Attack Mappings:
- Comprehensive Coverage:
- [NISTAML.033] Membership Inference
- [NISTAML.032] Data Reconstruction
- [NISTAML.031] Model Extraction
- [NISTAML.034] Property Inference
- Differential Privacy (ε, δ): Smaller values indicate stronger privacy
- KL Divergence: Measures information gain from model about training data
- AUC of Membership Inference: How well attacks can identify training data (closer to 0.5 is better; see the sketch after this list)
- Maximum Information Leakage: Maximum information an adversary can extract
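A minimal sketch of the membership-inference AUC metric using a simple loss-threshold attack; loss_train and loss_test are assumed to be precomputed per-example losses of the target model on members and non-members:

import numpy as np
from sklearn.metrics import roc_auc_score

# Lower loss on an example suggests it was a training member,
# so the negative loss serves as the membership score
scores = np.concatenate([-loss_train, -loss_test])
labels = np.concatenate([np.ones_like(loss_train), np.zeros_like(loss_test)])

mia_auc = roc_auc_score(labels, scores)
print(f"Membership inference AUC: {mia_auc:.3f} (0.5 = no measurable leakage)")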
6.2 Utility Metrics
- Privacy-Utility Curves: Plot of accuracy vs. privacy parameter (see the sketch after this list)
- Performance Gap: Difference between private and non-private model metrics
- Privacy-Constrained Accuracy: Best accuracy achievable under privacy budget constraint
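A minimal sketch for plotting a privacy-utility curve; the epsilon values and accuracies are illustrative placeholders for the results of your own parameter sweep:

import matplotlib.pyplot as plt

# Results of training the same model under different privacy budgets (illustrative)
epsilons = [0.5, 1.0, 2.0, 4.0, 8.0]
accuracies = [0.71, 0.78, 0.83, 0.86, 0.88]
non_private_accuracy = 0.90

plt.plot(epsilons, accuracies, marker='o', label='DP model')
plt.axhline(non_private_accuracy, linestyle='--', label='non-private baseline')  # performance gap
plt.xlabel('epsilon (privacy budget)')
plt.ylabel('accuracy')
plt.title('Privacy-utility curve')
plt.legend()
plt.show()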
7. Libraries & Tools
7.1 Differential Privacy
- PyDP (Google’s Differential Privacy) - Python wrapper for Google’s Differential Privacy library
- Opacus - PyTorch-based library for differential privacy in deep learning
- TensorFlow Privacy - TensorFlow-based library for differential privacy
- Diffprivlib - IBM’s library for differential privacy
- Tumult Analytics - Open-source DP library with LDP support
- Microsoft SmartNoise - Extensible DP framework
7.2 Federated Learning
- TensorFlow Federated - Google’s framework for federated learning
- Flower - A friendly federated learning framework
- PySyft - Library for secure and private ML with federated learning
- FATE - Industrial-grade federated learning framework
- FedML - Research-oriented federated learning framework
- NVFlare - NVIDIA’s federated learning framework
7.3 Secure Computation
- TenSEAL - Library for homomorphic encryption with tensor operations
- Microsoft SEAL - Homomorphic encryption library
- CrypTen - Framework for privacy-preserving machine learning based on PyTorch
- MP-SPDZ - Secure multi-party computation framework
- TF Encrypted - Privacy-preserving machine learning in TensorFlow
7.4 Synthetic Data
- SDV - Synthetic data generation ecosystem of libraries
- Gretel Synthetics - Synthetic data generation with privacy guarantees
- CTGAN - GAN-based tabular data synthesis
- Ydata-Synthetic - Synthetic data generation for tabular and time-series data
7.5 Privacy Evaluation
- ML Privacy Meter - Tool for quantifying privacy risks in ML
- Adversarial Robustness Toolbox - For evaluating model robustness including privacy attacks
- TensorFlow Privacy Attacks - Implementation of privacy attacks in TensorFlow
8. Tutorials & Resources
8.1 Differential Privacy Tutorials
- Google’s Differential Privacy Tutorial
- Language: C++, Go, Java
- Highlights: Count-min sketch, quantiles, bounded mean and sum implementations
- OpenDP Tutorial Series
- Language: Python
- Highlights: Step-by-step tutorials on measurements, transformations, composition
- Opacus Tutorials
- Language: Python (PyTorch)
- Highlights: DP-SGD implementation, privacy accounting, CIFAR-10 training
- TensorFlow Privacy Tutorials
- Language: Python (TensorFlow)
- Highlights: DP-SGD, membership inference attacks, privacy accounting
- IBM Differential Privacy Library Tutorials
- Language: Python
- Highlights: DP with scikit-learn integration, classification, regression
8.2 Federated Learning Tutorials
- TensorFlow Federated Tutorials
- Language: Python (TensorFlow)
- Highlights: Image classification, custom aggregations, federated analytics
- Flower Federated Learning Tutorials
- Language: Python (framework-agnostic)
- Highlights: PyTorch, TensorFlow, scikit-learn integrations, simulation
- PySyft Tutorials
- Language: Python
- Highlights: Privacy-preserving federated learning, secure aggregation
- FedML Tutorials
- Language: Python
- Highlights: Cross-device FL, cross-silo FL, mobile device examples
- NVFlare Examples
- Language: Python
- Highlights: Medical imaging, federated analytics, custom aggregation
8.3 Secure Computation Tutorials
- Microsoft SEAL Examples
- Language: C++
- Highlights: Basic operations, encoding, encryption, performance
- TenSEAL Tutorials
- Language: Python
- Highlights: Encrypted neural networks, homomorphic operations on tensors
- CrypTen Tutorials
- Language: Python (PyTorch)
- Highlights: Secure multi-party computation for machine learning models
- TF Encrypted Examples
- Language: Python (TensorFlow)
- Highlights: Private predictions, secure training, encrypted computations
8.4 Synthetic Data Tutorials
- SDV Tutorials
- Language: Python
- Highlights: Tabular data generation, relational data synthesis, evaluation
- CTGAN Examples
- Language: Python
- Highlights: GAN-based tabular data synthesis, training and sampling
- Gretel Tutorials
- Language: Python
- Highlights: Synthetic data with privacy guarantees, quality evaluation
- Ydata-Synthetic Examples
- Language: Python
- Highlights: GAN models for tabular and time-series data
8.5 Privacy Evaluation Tutorials
- ML Privacy Meter Tutorial
- Language: Python (TensorFlow)
- Highlights: Membership inference attacks, measuring model privacy leaks
- Adversarial Robustness Toolbox Tutorials
- Language: Python
- Highlights: Membership inference, attribute inference, model inversion attacks
- TensorFlow Privacy Attacks
- Language: Python (TensorFlow)
- Highlights: Membership inference attack implementation and evaluation
Contribute
Contributions welcome! Read the contribution guidelines first.