
8. HIPAA de-identification

From OpenEMR Project Wiki



Owner of this task

OpenEMR and EHR Support

ViCarePlus HealthCare IT Services & Support

6559, SpringPath Lane, San Jose, CA, USA



Meaningful Use Requirements (as per the older version)

FND.15 Provide the capability to remove the identifiers enumerated in Section 164.514(b)(2)(i) of the HIPAA Privacy Rule.

FND.16 Demonstrate and describe how the technology provides the capability to generate and assign a code or other means of record identification to allow information de-identified in accordance with the HIPAA Privacy Rule to be re-identified by the covered entity; such code or other means must not be derived from or related to the information and must not be otherwise capable of being translated so as to disclose the identity of the individual.

HIPAA De-identification

De-identification of Protected Health Information (PHI) is the process of removing from a patient's health information the data that uniquely identifies the patient (e.g. name, address, contact numbers), leaving only health information that cannot be traced back to the individual.

De-identified PHI is used for research, census and other activities.

According to HIPAA, de-identified data can be obtained by removing all 18 elements that could be used to identify the individual or the individual's relatives, employers, or household members. The identifiers that must be removed are the following:


1.Names;

2.All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:

1.The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and

2.The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.

3.All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;

4.Telephone numbers;

5.Fax numbers;

6.Electronic mail addresses;

7.Social security numbers;

8.Medical record numbers;

9.Health plan beneficiary numbers;

10.Account numbers;

11.Certificate/license numbers;

12.Vehicle identifiers and serial numbers, including license plate numbers;

13.Device identifiers and serial numbers;

14.Web Universal Resource Locators (URLs);

15.Internet Protocol (IP) address numbers;

16.Biometric identifiers, including finger and voice prints;

17.Full face photographic images and any comparable images; and

18.Any other unique identifying number, characteristic, or code (except as permitted by the re-identification rules)
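
As an illustration of items 2 and 3, the following minimal Python sketch generalizes a zip code and a birth date. The function names and the contents of `SPARSE_ZIP3` are assumptions made for this example; the real set of restricted three-digit prefixes has to be derived from current Census Bureau data.

```python
from datetime import date

# Placeholder set of 3-digit zip prefixes covering 20,000 or fewer people.
# Assumption for illustration: real values must come from Census Bureau data.
SPARSE_ZIP3 = {"036", "059", "102"}

def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits, or 000 for sparsely populated areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in SPARSE_ZIP3 else prefix

def generalize_birth_date(birth: date, today: date) -> str:
    """Keep only the birth year; fold ages over 89 into a single category."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
    return "90 or older" if age >= 90 else str(birth.year)
```

For example, `generalize_zip("95123")` keeps only `"951"`, while any zip code whose prefix is in the sparse set is reported as `"000"`.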


A covered entity (trusted group of members) may assign a code or other means of record identification to allow de-identified information to be re-identified, provided that the code or other means of record identification is not derived from or related to information about the individual.
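
One way to satisfy the "not derived from or related to information about the individual" condition is to draw the code from a random source instead of hashing or encoding patient fields. The sketch below is only an illustration; the function name `new_reidentification_code` and the 128-bit code length are assumptions, not part of the OpenEMR implementation.

```python
import secrets

def new_reidentification_code(existing_codes: set) -> str:
    """Return a random code that is not derived from any patient data."""
    while True:
        code = secrets.token_hex(16)  # 128 random bits as 32 hex characters
        # Retry in the (very unlikely) event of a collision with an issued code.
        if code not in existing_codes:
            existing_codes.add(code)
            return code
```

Because the code carries no patient information, it can only be translated back to a patient through the mapping that the covered entity keeps.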

Re-identification restores the information that uniquely identifies the person, e.g. name, address, contact numbers.

Re-identified PHI is used to carry out activities that benefit a patient or class of patients (e.g. providing relief funds or measures to affected patients) and that require access to the patient's identifying information (e.g. name, address, contact numbers).

Eg: The government may want to provide a relief measure for heart disease patients if the number of such patients in the country is above a certain limit. The government first obtains the de-identified data and checks whether an amendment needs to be passed. If the amendment is passed, the persons whom the relief measures should reach must be uniquely identified, so re-identification is performed.

PHI Data Classification

The data present in PHI can be divided into two types:

1.Structured data

2.Unstructured data

Structured data:

Structured data may be numerical (e.g., blood pressure readings, lab results) or single words or finite word combinations (e.g., name, address). Such information can easily be analyzed to decide whether it can be included in the de-identified data.

Unstructured data:

Data contained in an unstructured/free text format can also add to the research capabilities of EMR data, but unstructured data also has the potential risk of containing personal identifying information.

Unstructured data includes patient notes, progress notes, transfer notes, the disease history of the patient's relatives, history data, notification logs, and user text areas.

For unstructured data, lexical look-up tables, regular expressions, and simple heuristics should be used to locate the sensitive data (the 18 identifiers listed by HIPAA).

Lexical Analysis

What data to load into the lexical look-up table:

• Known names of patients and hospital staff, and the other elements that HIPAA specifies must not be included in de-identified data.

• Generic female and male first names, last names, last name prefixes, hospital names, locations and states, which can be obtained from external sources such as census lists. [This is ignored for the time being]

• Regular expressions designed to identify URLs, dates, person names etc. (Eg: when a name indicator or title such as “Mr.”, “Dr.” or “Ms.” is found, the text that follows can be identified as a name)
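
The regular expressions mentioned in the last point might look like the following Python sketch. The patterns, the `PATTERNS` table and the `find_identifiers` helper are assumptions for illustration only; they cover a few common shapes and would need to be broadened and tested against real notes.

```python
import re

# Illustrative patterns only; real notes need broader, well-tested expressions.
PATTERNS = {
    "url": re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    # A title such as "Mr." or "Dr." followed by a capitalized word.
    "name": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+"),
}

def find_identifiers(text: str) -> list:
    """Return (kind, matched text) pairs for every pattern hit, in text order."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    return [(kind, found) for _, kind, found in sorted(hits)]
```

Title-based name detection only catches names that follow a title, which is why the look-up table of known names is still needed alongside it.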

Procedure for lexical analysis:

Prerequisite: the lexical look-up table contains the values of all 18 unique identifiers from the openemr database, and regular expressions to identify URLs, dates and person names have been designed.


1. Input free text (unstructured data) for lexical analysis

2. Perform a word-by-word check of the input data against the data in the lexical look-up table. If a match is found, replace that word with “xxx” or “---”

3. Return the free text.
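
The three steps above can be sketched in Python as follows. The function name `scrub_free_text`, the in-memory `lookup` set standing in for the lexical look-up table, and the case-insensitive comparison are assumptions made for this example.

```python
import re

def scrub_free_text(text: str, lookup: set, mask: str = "xxx") -> str:
    """Replace every word found in the look-up table with the mask."""
    # Assumption: matching ignores case, so "JOHN" and "john" are both masked.
    lowered = {entry.lower() for entry in lookup}

    def replace_word(match):
        word = match.group()
        return mask if word.lower() in lowered else word

    # \w+ picks out one word at a time; punctuation and spacing stay untouched.
    return re.sub(r"\w+", replace_word, text)
```

A word-by-word check of this kind cannot catch multi-word entries or misspelled names, so it is best treated as one layer next to the regular expressions.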

Proposed Solution for De-identification

1.Design a de-identification input screen, which enables the user to enter the selection criteria for the requested de-identified data (Ex: request for de-identified data of patients with a particular disease).

2.Create a metadata table for de-identification, which records which columns of which tables are to be considered for de_identification and re_identification, and whether their values are to be loaded into the lexical look-up table.

3.Load the lexical look-up table with the values of the 18 unique identifiers from the openemr database.

(based on the details present in the metadata table, col name:load_to_lexical_look_up_table)

4.Input the selection criteria. (from the de-identification input screen)

5.Obtain the patient ids which match the selection criteria.

(for each patient id, check whether a unique re_identification code has already been generated. If not, generate a unique random re_identification code and store the re_identification code and patient id in re_identification_code_table)

6.Obtain the de-identified data for all patients who match the selection criteria (based on the details present in the metadata table, col name:include_in_de_identification) and store it in the de_identified_data table.

7.Output de_identified_data in text format (export de-identified data)
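
Steps 5 and 6 can be illustrated with the simplified Python sketch below. It works on in-memory dictionaries instead of database tables, and the `METADATA` rows, column names and function name are hypothetical; only the idea of a metadata-driven column filter and a stored random code comes from the steps above.

```python
import secrets

# Hypothetical metadata rows: (table, column, include_in_de_identification)
METADATA = [
    ("patient_data", "fname", False),
    ("patient_data", "sex", True),
    ("lists", "diagnosis", True),
]

def de_identify(patients: dict, selected_ids: list, code_table: dict) -> list:
    """Build de-identified rows keyed by a random re-identification code."""
    # Simplification: records are flat, so only the column name is used.
    allowed = [column for _, column, include in METADATA if include]
    rows = []
    for pid in selected_ids:
        # Reuse an existing code, otherwise generate and store a new one.
        if pid not in code_table:
            code_table[pid] = secrets.token_hex(8)
        row = {"re_identification_code": code_table[pid]}
        row.update({column: patients[pid][column] for column in allowed})
        rows.append(row)
    return rows
```

Here `code_table` plays the role of re_identification_code_table (patient id mapped to code), and the returned rows correspond to the contents of de_identified_data. A production version would also guard against code collisions.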

Proposed Solution for Re-identification

Prerequisite: the metadata table for de-identification is available. This table records which columns of which tables are to be considered for de_identification and re_identification, and whether their values are to be loaded into the lexical look-up table.


1.Input the list of re_identification codes (imported as a .txt file)

2.Obtain the patient id for each re_identification code. (from re_identification_code_table, which contains the patient id and re-identification code)

3.Obtain the identifying data for each patient id (based on the details present in the metadata table, col name:include_in_re_identification) and store it in the re_identified_data table.

4.Output re_identified_data in text format (export re-identified data)
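
The re-identification steps can be sketched in Python in the same simplified way. The column list `RE_ID_COLUMNS`, the in-memory dictionaries and the function name are hypothetical stand-ins for the metadata table, re_identification_code_table and the patient tables.

```python
# Hypothetical columns flagged include_in_re_identification in the metadata.
RE_ID_COLUMNS = ["fname", "lname", "phone_home"]

def re_identify(codes: list, code_table: dict, patients: dict) -> list:
    """Map each re-identification code back to the patient's identifying data."""
    # code_table maps patient id -> code, so invert it for the look-up.
    pid_by_code = {code: pid for pid, code in code_table.items()}
    rows = []
    for code in codes:
        pid = pid_by_code.get(code)
        if pid is None:
            continue  # unknown code: skipped here; a real tool might report it
        row = {"re_identification_code": code}
        row.update({column: patients[pid][column] for column in RE_ID_COLUMNS})
        rows.append(row)
    return rows
```

Only someone holding the code table can perform this mapping, which is what keeps the exported de-identified data safe on its own.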

De-identification Input and Output options

The de-identification input screen provides options for the user to enter the selection criteria for the data to be de-identified.

The input screen contains:

1.Time window - restricts PHI to a particular time interval

2.Check boxes to restrict which data is de-identified. (include history data, prescriptions, immunizations, encounters, issues, transactions, billing data, insurance data)

3.Restrict PHI to a particular diagnosis.

4.Restrict PHI to a particular drug prescribed.

5.Restrict PHI to a particular immunization.

6.Submit button, which triggers the de-identification process

(eg: obtain de-identified data for all cancer patients, including history data and prescriptions, for the time interval 10/10/2007 - 10/10/2009)


[Figure: De-identification input screen]

Output de-identified data: the de-identified data is exported to the end user in .xls format.

Re-identification Input and Output options

The input for the re-identification process is the list of re-identification codes, which is imported in text format.

The output of the re-identification process is the data (name, address, contact numbers, email id etc.) that uniquely identifies the patient, together with the re-identification code, exported in .xls format.


Completed and checked in to the Sourceforge CVS