Feb 24, 2020

The Vexing Job of Purging Data

Users don’t typically like to lose access to any of their data, no matter how old it is. Although they may seem like data hoarders, there are often good reasons for keeping what appears to be unnecessary data; for example, the possibility of a future audit. Anyone who has purged production data that was later needed is justifiably hesitant about performing mass deletes.

Today, regulations like the GDPR may be the biggest stick for persuading organizations to clean up their data. But there are other reasons for purging. Database volumes may have become unwieldy, slowing performance. The conventional wisdom is that storage is cheap and boundless and that processing is massively faster than it used to be. However, as organizations move their data to the cloud, the cost of storing large volumes of data once again becomes relevant.

Other triggers include a customer ending its relationship with a third party (for example, a cloud provider) that holds its data, or an application being sunset, with its data migrated and its legacy database no longer in use. Also possible, though less likely, is an overall data cleanup project. This could target data that has reached the end of its lifecycle, like CRM data, which decays approximately every four years.

This blog post discusses another interesting reason for deleting data, in this case assault victim data. As state and local governments continue to collect information about their citizens, sensitive data like this will grow in volume. Deleting it may be the best way to ensure victims’ safety.

Of course, purging data is more complicated than deleting rows or tables. One solution is to soft-delete records, that is, keep the data but set a flag to indicate it has been deleted. While that seems straightforward, as Oren Eini points out in his blog, it adds a level of complexity to every query and can possibly corrupt the database. To comply with the GDPR and other data privacy laws, purging the data is not enough; you also have to prove that you are compliant. This blog post by Grant Fritchey of Redgate gives some insight into how complex that process is when you consider issues like database backups and logs.
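To make the soft-delete trade-off concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. The `customers` table, the `is_deleted` column, and the helper function names are all hypothetical, chosen only for illustration. The point it tries to show is that once a flag exists, every read has to remember to filter on it, and the "deleted" row is still physically present until a real delete runs.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " is_deleted INTEGER NOT NULL DEFAULT 0)"  # the soft-delete flag
    )
    conn.executemany(
        "INSERT INTO customers (name) VALUES (?)",
        [("Ada",), ("Grace",), ("Linus",)],
    )
    return conn


def soft_delete(conn, customer_id):
    """Mark a row as deleted; the data itself stays in the table."""
    conn.execute("UPDATE customers SET is_deleted = 1 WHERE id = ?", (customer_id,))


def hard_delete(conn, customer_id):
    """Actually remove the row from the table."""
    conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))


def active_names(conn):
    """A correct read: it must filter on the flag."""
    rows = conn.execute("SELECT name FROM customers WHERE is_deleted = 0 ORDER BY id")
    return [row[0] for row in rows]


def all_names_unfiltered(conn):
    """A read that forgot the filter: soft-deleted rows leak back in."""
    rows = conn.execute("SELECT name FROM customers ORDER BY id")
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    soft_delete(conn, 2)
    print(active_names(conn))          # soft-deleted row is hidden
    print(all_names_unfiltered(conn))  # soft-deleted row is still there
    hard_delete(conn, 2)
    print(all_names_unfiltered(conn))  # row is gone from the table
```

Note that even the hard delete here only removes the row from the live table. Backups and logs, as mentioned above, are a separate problem that this sketch does not address.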

Finally, perhaps the biggest barrier to purging data is organizational inertia. To leadership, there is little return on investment for time spent cleaning up old data. Purging data is usually not even part of the scope of a project, if only because the need for it seems very far off at project start. This comment on Stack Overflow sums it up well: “There’s no way to cleanly determine what data is actually in use, so it just sits in the database. Data deletion and archiving needs to be a part of every large system design, but it rarely is. Most companies just live with it, buying bigger disks and tweaking their queries and indexes to maintain performance, until they change systems and then they go through a significant amount of effort to identify current data and then only migrate those records to their new system.”

Our team at Orbit can attest that what’s true for data is also true for reports. While our migration utility makes it easy to migrate reports from Oracle’s legacy reporting tool Discoverer, we often find that customers have an inventory of thousands of reports and don’t know which ones should be migrated and which can be deleted. Our suggested approach is to identify reports that are duplicates or have not been run recently. Once the base population of remaining reports is identified, our migration wizard makes it easy to convert their end user layer so they can be run using a new reporting platform. If your data source or application has functionality like this, then purging data may yet be within the realm of the possible.
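As a rough illustration of the "duplicates or not run recently" triage described above, here is a small Python sketch. It is not Orbit's migration utility or its actual logic. The `Report` fields, the `triage` function, the one-year cutoff, and the crude whitespace-and-case normalization used to spot duplicates are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Report:
    name: str
    definition: str            # hypothetical: the report's query or layout text
    last_run: Optional[date]   # None means no recorded run


def triage(reports: List[Report], today: date,
           max_age_days: int = 365) -> Tuple[List[str], List[str]]:
    """Split reports into (keep, deletion_candidates) by staleness and duplication."""
    cutoff = today - timedelta(days=max_age_days)
    seen_definitions = set()
    keep, candidates = [], []
    # Visit the most recently run reports first, so that among duplicates
    # the most recently used copy is the one that survives.
    ordered = sorted(reports, key=lambda r: r.last_run or date.min, reverse=True)
    for report in ordered:
        # Crude normalization (an assumption): ignore case and extra whitespace.
        key = " ".join(report.definition.lower().split())
        if report.last_run is None or report.last_run < cutoff:
            candidates.append(report.name)   # not run recently
        elif key in seen_definitions:
            candidates.append(report.name)   # duplicate of a kept report
        else:
            seen_definitions.add(key)
            keep.append(report.name)
    return keep, candidates


if __name__ == "__main__":
    inventory = [
        Report("sales_q1", "SELECT * FROM sales", date(2020, 1, 10)),
        Report("sales_q1_copy", "select *  from sales", date(2019, 12, 1)),
        Report("old_audit", "SELECT * FROM audit", date(2015, 3, 1)),
        Report("never_run", "SELECT 1", None),
    ]
    keep, candidates = triage(inventory, today=date(2020, 2, 24))
    print("keep:", keep)
    print("review for deletion:", candidates)
```

The output is best treated as a review list rather than an automatic delete list, since a report that has not run in a year may still be the one an auditor asks for.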
