
@optimism posted a nice primer on k-anonymity today, with the basic idea that you can take a few pieces of information about a person, each of which doesn't reveal their identity, and then combine them to reveal the identity of the person.

I had a funny experience with this that highlights how a lack of understanding led to a potential privacy violation at a credit reporting bureau, though it was never reported as such because we never raised it as an issue. (Spoiler alert: I was directly involved.)

You see, I once worked on a research project where we had names and addresses sourced from public records of housing transactions. Other pieces of information were available too, like number of bedrooms, bathrooms, mortgage loan amount, etc.

We wanted to attach individuals' credit scores to the data, for research purposes. We therefore contacted one of the major credit bureaus to see if this was possible. They were happy to sell us the data.

Here's how it would work: we'd send them the main data file, they'd use names and addresses to merge on the credit scores, then they'd return us the data with the names and addresses anonymized.

However, my research partner and I are savvy. We know about k-anonymity. We knew that if they sent us back the main data file with only names and addresses masked, we'd still be able to use all the other bits of information, like #bedrooms, #bathrooms, etc, to re-identify each individual and recover their actual names and addresses, now along with their credit scores.
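To make the linkage concrete, here is a minimal Python sketch of it. Everything in it is made up for illustration: the records, the field names, and the `reidentify` helper are assumptions, not our actual data or anything the bureau used.

```python
from collections import defaultdict

# Attributes assumed to come back untouched in the returned file.
QUASI_IDS = ("bedrooms", "bathrooms", "loan")

# Made-up file we send: identifiers plus housing attributes.
sent = [
    {"name": "Alice A", "address": "1 Elm St", "bedrooms": 3, "bathrooms": 2, "loan": 250_000},
    {"name": "Bob B", "address": "2 Oak St", "bedrooms": 4, "bathrooms": 3, "loan": 410_000},
    {"name": "Carol C", "address": "3 Pine St", "bedrooms": 2, "bathrooms": 1, "loan": 180_000},
]

# Made-up file we get back: name and address masked, credit score attached.
returned = [
    {"name": None, "address": None, "bedrooms": 4, "bathrooms": 3, "loan": 410_000, "score": 702},
    {"name": None, "address": None, "bedrooms": 2, "bathrooms": 1, "loan": 180_000, "score": 590},
    {"name": None, "address": None, "bedrooms": 3, "bathrooms": 2, "loan": 250_000, "score": 655},
]


def quasi_key(row):
    """Tuple of the quasi-identifier values for one row."""
    return tuple(row[field] for field in QUASI_IDS)


def reidentify(sent_rows, returned_rows):
    """Map name -> score for returned rows that match exactly one sent row."""
    by_key = defaultdict(list)
    for row in sent_rows:
        by_key[quasi_key(row)].append(row)

    recovered = {}
    for row in returned_rows:
        candidates = by_key.get(quasi_key(row), [])
        if len(candidates) == 1:  # unique combination, so the masking does nothing
            recovered[candidates[0]["name"]] = row["score"]
    return recovered


recovered = reidentify(sent, returned)
print(recovered)
```

The only thing doing the work is that each combination of bedrooms, bathrooms, and loan amount is unique in the file we sent. With real housing records and a precise loan amount, that is plausibly true for most rows.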

We tried to convey our concerns to the salesperson. We were worried that they were savvy too, and that they'd slightly perturb the other variables so that we wouldn't be able to do this re-identifying procedure. (It's not that we wanted to re-identify the people, but we were worried that our data would be perturbed).
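For what it's worth, the kind of perturbation we were worried about could look something like the sketch below. The `perturb` helper, the 500-dollar cap, and the records are all hypothetical; I have no idea what a bureau would actually do.

```python
import random

QUASI_IDS = ("bedrooms", "loan")

# Made-up records we send out.
sent = [
    {"name": "Alice A", "bedrooms": 3, "loan": 250_000},
    {"name": "Bob B", "bedrooms": 4, "loan": 410_000},
]


def perturb(row, rng, max_shift=500):
    """Return a copy with the loan shifted by a nonzero amount of at most max_shift."""
    shift = rng.choice((-1, 1)) * rng.randint(1, max_shift)
    return {**row, "loan": row["loan"] + shift}


rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# What a careful bureau might return: names dropped, loan amounts nudged.
returned = [
    perturb({k: v for k, v in row.items() if k != "name"}, rng)
    for row in sent
]

# An exact join on the quasi-identifiers now finds nothing.
sent_keys = {tuple(row[f] for f in QUASI_IDS) for row in sent}
exact_matches = [
    row for row in returned
    if tuple(row[f] for f in QUASI_IDS) in sent_keys
]
print(len(exact_matches))
```

Noise this small only defeats an exact join; a match with some tolerance could probably still link the rows. Noise large enough to stop that would also distort the variables we cared about, which is exactly why we were nervous.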

However, it turns out that they weren't savvy. The salesperson couldn't really understand what we were asking. We decided it wasn't worth continuing to press the point, and that we should just send them the data and assume they wouldn't perturb the other variables.

We were right. When they sent us back the data, they hadn't touched any variable except name and address. Thus, if we wanted to, we could have de-anonymized the whole dataset and gotten access to a bunch of people's credit scores, directly sent to us from the credit reporting agency.
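A quick way to see how exposed a file like that is: count how many rows share each combination of the untouched attributes. The smallest count is the k in k-anonymity, and k = 1 means at least one person is unique. A minimal sketch, with a hypothetical `k_anonymity` helper and made-up rows:

```python
from collections import Counter


def k_anonymity(rows, quasi_ids):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[f] for f in quasi_ids) for row in rows)
    return min(groups.values())


# Made-up returned file: names and addresses already stripped.
returned = [
    {"bedrooms": 3, "bathrooms": 2, "loan": 250_000, "score": 655},
    {"bedrooms": 3, "bathrooms": 2, "loan": 250_000, "score": 710},
    {"bedrooms": 4, "bathrooms": 3, "loan": 410_000, "score": 702},
]

k = k_anonymity(returned, ("bedrooms", "bathrooms", "loan"))
print(k)
```

Here the first two rows hide behind each other, but the third is alone in its group, so anyone holding the original file can pin it to one person.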

Anyway, just a fun story of how lacking data security protocols are, even at the top credit bureaus in the country, thanks to a general lack of understanding of privacy and math.

@delete in 48 hours

I think that for the past 50-60 years, we've seen the results of tech outpacing legal/political understanding and the protections proposed. Things like the EU cookie fiasco that makes every website visit a pain in the behind while doing nothing to improve (and worse, stagnating) solutions for real cookie management. And that is just a visible example.

What we're left with is archaic solutions from the typewriter & phone era that happen to be the SOP you are to follow.

I think that the truth is more pale than missing understanding: even if your counterparty understood the privacy issue, doing something about it could land them in a world of trouble. Procedure, procedure. And you'd better follow it.

Many of those procedures are ossified. They're not ready for 2005 tech, in 2026. This is going to hurt even more now that anyone can correlate data with maybe 3 attempts at instructing GPT to write them an algo they could never think up themselves, with 90% accuracy on the result. After half a day of prompting, at most.

90% is no good on the defensive side. But it's good enough on the offensive side. Infinitely better than having nothing. And this change in asymmetry, which isn't even dependent on AI, is not accounted for. Big data isn't accounted for. Heck, "small" 100GB databases with PL/SQL on Oracle 5 aren't truly accounted for. Check out (state) procedures around SSN handling - to name something common and vulnerable - and be afraid, be very afraid.

Sometimes I wonder: is there hope? I can defend myself, though only to a point. I have yet to succeed in defending someone else.

reply
196 sats \ 1 reply \ @Scoresby 25 May
is there hope?

I feel this all the time. There is so much data over which I have so little control (SSN, birthdate, address, phone #) -- anything I do with my kids requires massive data hemorrhages. And I don't see how I can avoid it.

Here's an example: signing my kid up for a sports competition and everything on the form is required. Do they actually need my address? No, but I can't sign him up without inputting it. Try calling: nobody knows how to change it because the form is a third party thing. Great. Also: they want a picture of him for his "profile" -- I used a cartoon image I generated. Of course then at the competition his is the only picture that isn't a real life photo. It's all small stuff, but it all adds up to total exposure.

I don't know how to live life without being laid bare like this. And I get the sense that it hasn't even begun to be used against us yet.

reply
106 sats \ 0 replies \ @optimism 15h

Kids are the real problem. Not because it's hard, but because they're the most targeted group that, unlike you or me, will be tracked cradle to grave. Anyone born before the turn of the millennium at least got some years without tracking.

reply
106 sats \ 1 reply \ @unboiled 25 May
I can defend myself

can_defend: true
Talk about an identifying subset. Irony hurts sometimes.

reply

There was a second part to that sentence.

can_defend: null

reply
4 sats \ 4 replies \ @ek 21h

Appreciate posting this in ~security. Was ~privacy too expensive?

reply

yes that was the only reason

reply
4 sats \ 2 replies \ @ek 15h

I think @davidw knew that ~privacy would be very popular on SN, so he bought it outright for 3m sats and set high fees, hoping for passive income. Smart move, but not very social.

Maybe we should fight back by gathering a group of stackers to have ~cheap_privacy, haha

reply
4 sats \ 1 reply \ @optimism 15h

Speculative... Per Simple's own research, higher fee -> higher signal

reply
73 sats \ 0 replies \ @ek 15h

reply

I wouldn’t attribute it to a lack of understanding, but more to indifference, or simply to it not being high on their priority list.

Here’s how I see them approaching it:

  1. Their goal is to follow privacy laws, not to go above and beyond for user privacy.
  2. They only anonymize name and address for two practical reasons: to avoid legal exposure, and to keep you coming back for more data enrichment (recurring revenue matters).
  3. Often the analysts getting the enriched data aren’t even supposed to do the kind of re-identification you described; it’s more about controlling who can access PII under compliance rules.

That said, I’m sure the analysts and management at the bureaus understand the math. Some salespeople probably do too, but they might be trained to sidestep those questions.

This comes from my experience partnering with credit bureaus and data companies from inside big tech on multiple projects. Really interesting story though. Thanks for sharing it.

reply