PII Detection & Continuous Learning Flow¶
π― The Complete Picture¶
PII detection in l0l1 serves dual purposes for a comprehensive continuous learning system:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β User Submits Query β
β SELECT name, email FROM users WHERE ssn = '123-45-6789' β
βββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PII Detection Engine β
β β’ Detects: email addresses, SSN, phone numbers, names, etc. β
β β’ Uses: Presidio + custom SQL patterns β
β β’ Result: [EMAIL: email, SSN: 123-45-6789, PERSON: name] β
βββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββ΄ββββββββ
β β
βΌ βΌ
βββββββββββββββββββ ββββββββββββββββββββββββ
β User Safety β β Learning Pipeline β
β (Warning) β β (Sanitization) β
βββββββββββββββββββ ββββββββββββββββββββββββ
β β
βΌ βΌ
βββββββββββββββββββ ββββββββββββββββββββββββ
β Show Warning: β β Sanitize Query: β
β "PII Detected!" β β SELECT name, email β
β β β FROM users WHERE β
β Option to β β ssn = 'XXX-XX-XXXX' β
β anonymize β β β
βββββββββββββββββββ ββββββββββββββββββββββββ
β β
β βΌ
β ββββββββββββββββββββββββ
β β Vector Embedding β
β β Store in Learning β
β β Database β
β ββββββββββββββββββββββββ
β β
β βΌ
β ββββββββββββββββββββββββ
β β Future Similarity β
β β Search & Suggestionsβ
β ββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Execute Original Query β
β (User might legitimately need the PII data for their work) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π Continuous Learning Benefits¶
1. Safe Pattern Recognition¶
-- Original (with PII):
SELECT name, email FROM users WHERE ssn = '123-45-6789'
-- Stored for learning (sanitized):
SELECT name, email FROM users WHERE ssn = 'XXX-XX-XXXX'
-- Future suggestions based on pattern:
SELECT name, email FROM users WHERE ssn = ?
SELECT name, email FROM users WHERE user_id = ?
2. Query Optimization Learning¶
-- User's query (after PII sanitization):
SELECT u.name, u.email, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > 'YYYY-MM-DD'
GROUP BY u.id
-- AI learns this pattern works well and suggests:
-- - Add LIMIT clause for performance
-- - Use indexes on created_at
-- - Consider user_id in WHERE for better performance
3. Smart Autocompletion¶
-- User types: "SELECT name, email FROM users WHERE"
-- AI suggests (from sanitized learning data):
-- β
created_at > '2024-01-01' (safe)
-- β
status = 'active' (safe)
-- β
id IN (SELECT ...) (safe)
-- β ssn = '123-45-6789' (PII - not suggested)
π‘ Implementation Deep Dive¶
Learning Service Flow¶
async def record_successful_query(self, workspace_id, query, execution_time, result_count):
# Step 1: Check for PII
if not self.pii_detector.is_safe_for_learning(query):
# Step 2: Sanitize before learning
sanitized_query = self.pii_detector.sanitize_for_learning(query)
query = sanitized_query
# Step 3: Generate vector embedding of CLEAN query
embedding = await self.model.generate_embedding(query)
# Step 4: Store sanitized query + embedding + performance metadata
learning_record = {
"query": query, # This is the SANITIZED version
"embedding": embedding,
"execution_time": execution_time,
"result_count": result_count,
"success_count": 1
}
# Step 5: Future similarity searches happen on clean data
self.learning_data[query_hash] = learning_record
PII Detection Methods¶
class PIIDetector:
def is_safe_for_learning(self, query: str) -> bool:
"""Check if query contains PII - if yes, don't learn as-is"""
pii_findings = self.detect_pii(query)
return len(pii_findings) == 0
def sanitize_for_learning(self, query: str) -> str:
"""Replace PII with generic placeholders for learning"""
anonymized, _ = self.anonymize_sql(query)
return anonymized
def anonymize_sql(self, query: str) -> Tuple[str, List[Dict]]:
"""Replace actual PII values with safe placeholders"""
# '123-45-6789' β 'XXX-XX-XXXX'
# 'john@email.com' β 'user@example.com'
# '555-123-4567' β '555-0123'
π‘οΈ Privacy Benefits¶
For Users:¶
- Real-time warnings: "You're about to query PII data"
- Anonymization options: "Want to see a safe version?"
- Audit compliance: All PII access is logged and warned
For Organizations:¶
- Learning without exposure: AI improves without storing sensitive data
- Compliance: GDPR/CCPA safe learning pipeline
- Governance: Clear audit trail of PII access
For AI Performance:¶
- Pattern recognition: Learns SQL structures, not data values
- Better suggestions: Recommends safe, optimized patterns
- Continuous improvement: Gets smarter without privacy risks
π§ Configuration Options¶
# Fine-tune PII detection
L0L1_PII_ENTITIES=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "SSN", "CREDIT_CARD"]
L0L1_LEARNING_THRESHOLD=0.8 # Confidence threshold for learning
# Learning behavior
L0L1_ENABLE_LEARNING=true
L0L1_ENABLE_PII_DETECTION=true
L0L1_ANONYMIZE_BEFORE_LEARNING=true # Default: true
π Analytics Workbench Integration¶
In your SQL workbench, this creates a seamless experience:
// User types query with PII
const query = "SELECT * FROM users WHERE email = 'john@company.com'"
// Real-time analysis shows:
const analysis = await analyzeSQL(query)
// {
// pii_detected: [{ entity_type: "EMAIL_ADDRESS", text: "john@company.com" }],
// suggestions: [
// "SELECT * FROM users WHERE user_id = ?", // From learning data
// "SELECT * FROM users WHERE status = 'active'" // Safe patterns
// ]
// }
// User gets:
// 1. Warning about PII
// 2. Better query suggestions (learned from sanitized data)
// 3. Option to anonymize
// 4. Query still executes (they might need the data)
This approach ensures AI gets smarter while keeping data private - the best of both worlds! π―