4/23/2026 at 2:19:23 PM
S3 lifecycle policies and scheduled RDBMS jobs are the low-hanging fruit here. I used to work on a data platform team and built a cleaning service that used tags and object hierarchy trees to find and clean old PII data. Not an easy thing to do, as our data analytics bucket held over 7 PiB of data.
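For the low-hanging-fruit part, here is a rough sketch of a tag-filtered S3 lifecycle rule that expires objects a fixed number of days after creation. The tag key/value (`pii=true`), the 90-day window, and the bucket name in the comment are made-up placeholders, not what we actually ran.

```python
def pii_expiration_rule(tag_key="pii", tag_value="true", days=90):
    """Build an S3 lifecycle configuration that expires tagged objects.

    The tag name, tag value and retention window are illustrative
    assumptions; pick whatever matches your own tagging scheme.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-{tag_key}-{tag_value}-after-{days}d",
                "Status": "Enabled",
                # Only objects carrying this tag are affected by the rule.
                "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
                # Objects are expired this many days after creation.
                "Expiration": {"Days": days},
            }
        ]
    }


config = pii_expiration_rule()

# Applying it needs boto3 and AWS credentials (bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-analytics-bucket",
#       LifecycleConfiguration=config,
#   )
```

This only works once objects are tagged reliably, which is the hard part at that scale.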
Overall, the architecture was based on 3 components: detector, enforcer, cleaner. The detector sifted through the data lake to find PII datasets (LLM-based), the enforcer tracked down each dataset's ETL in our VCS to set appropriate tags/metadata (custom coding agent), and finally the cleaner used search to find and clean the data based on that metadata.
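To make the three stages concrete, here is a toy in-memory sketch of the flow. Everything in it is a stand-in: the `Dataset` class, the column-name keyword check (in place of the LLM detector), the direct tag assignment (in place of the coding agent editing ETL code), and the `retention_days` tag are all hypothetical, not the real service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Dataset:
    path: str
    columns: list[str]
    last_modified: datetime
    tags: dict[str, str] = field(default_factory=dict)


# Stand-in for the LLM-based detector: a naive column-name heuristic.
PII_HINTS = ("email", "phone", "ssn", "address")


def detect_pii(datasets):
    """Detector: return datasets whose column names look like PII."""
    return [
        d for d in datasets
        if any(hint in col.lower() for col in d.columns for hint in PII_HINTS)
    ]


def enforce_tags(pii_datasets, retention_days=90):
    """Enforcer: attach tags/metadata (here set directly on the dataset)."""
    for d in pii_datasets:
        d.tags["pii"] = "true"
        d.tags["retention_days"] = str(retention_days)


def clean(datasets, now):
    """Cleaner: drop tagged datasets that are older than their retention."""
    kept, removed = [], []
    for d in datasets:
        if d.tags.get("pii") == "true":
            cutoff = now - timedelta(days=int(d.tags["retention_days"]))
            if d.last_modified < cutoff:
                removed.append(d.path)
                continue
        kept.append(d)
    return kept, removed


now = datetime(2026, 4, 23, tzinfo=timezone.utc)
lake = [
    Dataset("events/users_2025", ["user_id", "email"], now - timedelta(days=200)),
    Dataset("events/users_2026", ["user_id", "Email"], now - timedelta(days=10)),
    Dataset("metrics/latency", ["p50", "p99"], now - timedelta(days=400)),
]

enforce_tags(detect_pii(lake))
kept, removed = clean(lake, now)
print("removed:", removed)
print("kept:", [d.path for d in kept])
```

The point of the split is that the cleaner never decides what counts as PII; it only acts on metadata the earlier stages wrote.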
by sinansaka