How were the records de-identified?
The records were de-identified semi-automatically. One automatic pass and two manual passes were made over each record. The two manual passes followed the automatic pass and were carried out in parallel with each other. A third manual annotator resolved the disagreements between the two manual passes and finalized the identification of private health information. We then replaced the identified private health information in several ways. For names of doctors and patients, we drew random names from the US Census Bureau names dictionary. The surrogate names in the records therefore look like real names, but they do not belong to the actual patients. We made no effort to preserve co-reference; that is, for each occurrence of Dr. “John Smith”, we drew another name from the US Census Bureau dictionary, so the same person may receive a different surrogate name at each mention. For phone numbers, ID numbers, and ages, we randomly generated surrogates by replacing each digit with a random digit and each letter with a random letter. For dates, we generated random dates as surrogates. F