- The PII (personally identifiable information) enrichment helps identify leaks in strings that are either entered by the user or produced by the system.
- Mishandling or leaking PII can lead to severe issues like privacy violations and identity theft.
- Laws such as GDPR and HIPAA impose stringent guidelines for protecting PII.
- Accidentally sending PII to large language models (LLMs) can spread this sensitive information widely, increasing the risks.
- To enable it, set the `enrichment` parameter to `pii`.
Requirements
- Network access to https://github.com/explosion/spacy-models/releases/download/ so that spaCy models can be downloaded as required
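One way to confirm this requirement before enabling the enrichment is a quick connectivity check from the host that will run it. The following is a minimal sketch using only the Python standard library; the helper name `can_reach` and the 5-second timeout are illustrative assumptions, not part of the Fiddler client.

```python
import urllib.error
import urllib.request

# URL taken from the requirement above.
SPACY_MODELS_URL = "https://github.com/explosion/spacy-models/releases/download/"


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP(S) response comes back from `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status (for example 404 for a bare
        # directory URL), which still shows that the host is reachable.
        return True
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False


# Example call (performs a real network request, so it is left commented out):
# print(can_reach(SPACY_MODELS_URL))
```

A `False` result usually points to a firewall, proxy, or DNS restriction that would also block the model download.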
List of PII entities
Entity Type | Description | Detection Method | Example |
---|---|---|---|
CREDIT_CARD | A credit card number is between 12 and 19 digits long. https://en.wikipedia.org/wiki/Payment_card_number | Pattern match and checksum | 4111111111111111 378282246310005 (American Express) |
CRYPTO | A crypto wallet number. Currently only Bitcoin addresses are supported | Pattern match, context and checksum | 1BoatSLRHtKNngkdXEeobR76b53LETtpyT |
DATE_TIME | Absolute or relative dates or periods or times smaller than a day. | Pattern match and context | 01/01/2024 |
EMAIL_ADDRESS | An email address identifies an email box to which email messages are delivered | Pattern match, context and RFC-822 validation | [email protected] |
IBAN_CODE | The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors. | Pattern match, context and checksum | DE89 3704 0044 0532 0130 00 |
IP_ADDRESS | An Internet Protocol (IP) address (either IPv4 or IPv6). | Pattern match, context and checksum | 1.2.3.4 127.0.0.12/16 1234:BEEF:3333:4444:5555:6666:7777:8888 |
LOCATION | Name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains) | Custom logic and context | PALO ALTO Japan |
PERSON | A full person name, which can include first names, middle names or initials, and last names. | Custom logic and context | Joanna Doe |
PHONE_NUMBER | A telephone number | Custom logic, pattern match and context | 5556667890 |
URL | A URL (Uniform Resource Locator), unique identifier used to locate a resource on the Internet | Pattern match, context and top level url validation | www.fiddler.ai |
US_SSN | A US Social Security Number (SSN) with 9 digits. | Pattern match and context | 1234-00-5678 |
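To illustrate what "pattern match and checksum" means for an entity such as CREDIT_CARD, the sketch below applies the standard Luhn checksum to a candidate digit string. It is an illustration only: the function name `luhn_checksum_ok` is hypothetical, and the recognizer actually used by the enrichment may apply additional patterns and context rules.

```python
def luhn_checksum_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep decimal digits only, so separators such as spaces or dashes are ignored.
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if not 12 <= len(digits) <= 19:  # length range from the table above
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_checksum_ok("4111111111111111"))  # example number from the table
print(luhn_checksum_ok("378282246310005"))   # American Express example from the table
```

The checksum step is what keeps an arbitrary 16-digit number, such as an order ID, from being flagged on pattern alone.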
import fiddler as fdl

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,  # an existing dataset info object, assumed to be defined earlier
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='Rag PII',
            enrichment='pii',
            columns=['question'],  # one or more columns
        ),
    ],
)
The above example generates the following new columns:
- FDL Rag PII (question) (bool): whether any PII was detected
- FDL Rag PII (question) Matches (str): which matches in the raw text were flagged as potential PII (e.g. 'Douglas MacArthur,Korean')
- FDL Rag PII (question) Entities (str): which entities these matches were tagged as (e.g. 'PERSON')
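Because the Matches and Entities columns are plain strings, downstream code typically needs to split them before use. The sketch below assumes that both columns hold comma-separated values, as the example above suggests. The helper names and the sample rows are hypothetical, and a match that itself contains a comma would need different handling.

```python
def split_values(cell):
    """Split a comma-separated Matches/Entities string into a list of trimmed values."""
    if not cell:
        return []
    return [part.strip() for part in cell.split(",") if part.strip()]


def flagged_rows(rows, prefix="FDL Rag PII (question)"):
    """Yield (matches, entities) for every row whose boolean PII flag is set."""
    for row in rows:
        if row.get(prefix):
            matches = split_values(row.get(f"{prefix} Matches"))
            entities = split_values(row.get(f"{prefix} Entities"))
            yield matches, entities


# Hypothetical rows shaped like the generated columns described above.
rows = [
    {
        "FDL Rag PII (question)": True,
        "FDL Rag PII (question) Matches": "Joanna Doe,Japan",
        "FDL Rag PII (question) Entities": "PERSON,LOCATION",
    },
    {
        "FDL Rag PII (question)": False,
        "FDL Rag PII (question) Matches": "",
        "FDL Rag PII (question) Entities": "",
    },
]

for matches, entities in flagged_rows(rows):
    print(matches, entities)
```

The same splitting logic applies to any other column passed in `columns`, with the column name in parentheses changed accordingly.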
Note
PII enrichment is integrated with Presidio