What is a robots.txt file and why should you care?

While details have not been made public, it is possible that a properly configured robots.txt file could have prevented the patient data exposure that led to the recent $4.8 million HIPAA settlement reached by New York-Presbyterian Hospital/Columbia University Medical Center (see: http://cynergistek.com/columbia-presbyterian-settle-hipaa-violations-for-4-8-million/).

From the information that was made public, it appears that a personal computer server operated by a CU physician was left accessible through an open internet connection. Web crawlers copied PHI from the site and made it available to an internet search engine. PHI regarding a deceased patient was discovered by the patient's next of kin during an internet search. This led to the investigation by the OCR and the eventual $4.8 million settlement and corresponding Resolution Agreement and Corrective Action Plan.

Web crawlers (also called web spiders or web robots) continuously browse the internet for the purpose of creating indexes for internet search engines such as Google and Bing. At times, these web crawlers can access an internet site even if the site requires human users to authenticate using a user name and password.

The Robots Exclusion Standard is a de facto standard that advises web crawlers whether they can access all or part of a website. These instructions are included in a text file called robots.txt, which is placed in the root of the website hierarchy. The instructions can request that web crawlers ignore specified directories or files when indexing a site. While the robots.txt file has been widely recognized as a standard for controlling web crawlers since 1994, compliance is voluntary on the part of the web crawlers, so there is no guarantee that the instructions will be followed. The absence of a robots.txt file implies that web crawlers are allowed to access and index the entire site, including all files and directories.
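
To make this concrete, here is a minimal sketch of what such instructions might look like and how a well-behaved crawler would interpret them. The directory names (/patients/ and /records/), the crawler name ExampleBot, and the example.org addresses are all hypothetical and chosen only for illustration; the helper function is_allowed is likewise an assumption, built on Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The directory names are illustrative
# assumptions, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /patients/
Disallow: /records/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch url under robots_txt."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot and example.org are placeholders for illustration.
    for url in (
        "https://example.org/index.html",
        "https://example.org/patients/chart.html",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
```

Note that this check only models what a compliant crawler would do. Nothing in the file prevents a crawler, or a person, from requesting the disallowed pages directly, which is why the file cannot stand in for access controls.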

In summary, having a properly configured robots.txt file that tells web crawlers not to access portions of your site containing PHI is a best practice, but it should not be relied upon as the only safeguard to protect the PHI from web crawlers. Conversely, having neither a robots.txt file nor other reasonable and appropriate safeguards is an open invitation to web crawlers and can tangle you in a web of trouble.

James E. O’Connor