Sovereign AI


What Models Learn From You

When people think about AI learning from data, they typically imagine a model trained on a large, general dataset — all the text on the internet, millions of images, billions of product reviews. That framing is accurate for foundation models. But it misses a more intimate and more consequential category: the AI systems that learn specifically from you. Your streaming service's recommendation model has been shaped by years of your play history. Your social media feed is produced by a system that has learned precisely what content makes you stop scrolling. Your email client's spam filter has adapted to your labeling habits. These systems are not general — they are personalized models trained on your behavioral data, optimized to predict and influence your specific behavior.

How Recommender Systems Build a Model of You

A recommender system is an AI system that predicts what content, product, or action a specific user will prefer. There are two main technical approaches, and modern systems typically combine both.

Collaborative filtering works by finding other users whose behavioral patterns resemble yours and recommending what those users liked. It requires no understanding of the content itself, only patterns of user behavior. If thousands of users who watched the same five films as you also watched a sixth film, collaborative filtering will recommend that film. This is why your recommendations can feel eerily accurate even in areas you have never explicitly searched for: the system has found a community of users who behave like you.

Content-based filtering analyzes the properties of items you have engaged with and recommends similar items. A music service that knows you play songs with a fast tempo, minor key, and high energy will recommend other songs with those acoustic features. Combined with collaborative signals, this allows recommendations that reflect both your own demonstrated tastes and your peer group's behavior.

In modern systems, both approaches rely on a representation of you: a user embedding, a vector in a high-dimensional space that encodes your behavioral patterns. Every action you take can update this vector. The more data the system has about you, the more precisely the embedding captures your preferences. At scale, these embeddings can encode attributes you never disclosed and patterns you may not be consciously aware of: political leaning, mental health state, relationship status, major life events, and susceptibility to specific types of persuasion.
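The collaborative filtering idea described above can be sketched in a few lines. This is a minimal illustration, not how any particular streaming service works: the play histories are invented, and the choice of Jaccard overlap as the similarity measure and the `recommend` helper are assumptions made for the example.

```python
from collections import Counter

# Hypothetical play histories (invented for illustration): user -> films played.
histories = {
    "you":   {"A", "B", "C", "D", "E"},
    "user1": {"A", "B", "C", "D", "E", "F"},
    "user2": {"A", "B", "C", "D", "E", "F"},
    "user3": {"A", "B", "G"},
    "user4": {"X", "Y", "Z"},
}

def jaccard(a, b):
    """Overlap between two item sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, histories, top_n=3):
    """Score each unseen item by the summed similarity of the users who played it."""
    seen = histories[target]
    scores = Counter()
    for user, items in histories.items():
        if user == target:
            continue
        similarity = jaccard(seen, items)
        for item in items - seen:
            scores[item] += similarity
    # Drop items that only dissimilar users played (score of zero).
    return [item for item, score in scores.most_common(top_n) if score > 0]

recommendations = recommend("you", histories)
print(recommendations)  # ['F', 'G']
```

Note that the code never inspects what the films are about. Film F ranks first only because the two users whose histories most resemble yours both played it.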

You Are a Vector

In the mathematical sense, a recommender system does not know you — it knows a representation of you: a list of numbers derived from your behavior. But that list of numbers is often a more accurate model of your preferences and vulnerabilities than your own self-description. Systems trained on behavioral data can predict things about you that you have not disclosed and may not consciously recognize.
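To make "a list of numbers derived from your behavior" concrete, here is a small sketch of a user vector that shifts with every play and is then used to rank new items by similarity. Everything in it is an assumption chosen for illustration: the three acoustic features, the song names, the moving-average update rule, and the `rate` parameter. Production systems learn far larger embeddings with trained models.

```python
import math

# Hypothetical item features, each scaled to 0..1: (tempo, minor_key, energy).
SONGS = {
    "fast_minor_1": (0.9, 1.0, 0.8),
    "fast_minor_2": (0.8, 1.0, 0.9),
    "slow_major":   (0.2, 0.0, 0.3),
}

def update_embedding(embedding, item_vector, rate=0.3):
    """Nudge the user vector toward the item just played (exponential moving average)."""
    return tuple((1 - rate) * e + rate * x for e, x in zip(embedding, item_vector))

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

user = (0.5, 0.5, 0.5)  # neutral starting point before any behavior is observed
for played in ["fast_minor_1", "fast_minor_1", "fast_minor_2"]:
    user = update_embedding(user, SONGS[played])

# Rank unseen songs by how closely their features match the user vector.
candidates = {
    "new_fast_minor": (0.85, 1.0, 0.85),
    "new_slow_major": (0.1, 0.0, 0.2),
}
ranked = sorted(candidates, key=lambda name: cosine(user, candidates[name]), reverse=True)
print(ranked[0])  # new_fast_minor
```

After only three plays the vector has drifted toward fast, minor-key, high-energy music, without the user ever stating that preference.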

What Behavioral Classifiers Infer

Beyond recommenders, behavioral classifiers are trained to predict specific sensitive attributes from behavioral signals. The research literature and press reporting document several cases that illustrate the scope of what is possible.

Political affiliation: studies have shown that political leaning can be predicted with high accuracy from Facebook Likes, music preferences, and vocabulary choices in social media posts, even without any explicitly political content.

Mental health: patterns of social media activity (posting frequency, linguistic features, network changes, time-of-day patterns) have been used in research settings to predict depression, anxiety, and suicidal ideation with accuracy that researchers describe as clinically useful. Instagram and Snapchat have reportedly deployed experimental internal classifiers of this type.

Pregnancy and major life events: Target was widely reported in the early 2010s to have developed a model that predicted pregnancy from purchasing patterns (buying unscented lotion, specific vitamins, and cotton balls in specific combinations) and sent pregnancy-related coupons before customers had announced their pregnancies, reportedly in some cases to people who had not yet told their own families.

Financial distress and creditworthiness: alternative credit scoring models use behavioral signals (time of loan application, spelling errors in application forms, phone battery level, app usage patterns) as predictors of default risk. These inferences typically happen without disclosure to the individual.

The common thread: AI systems can learn to predict sensitive personal attributes from behavioral patterns, and the predictions are often accurate. The person whose data is being used is typically unaware that these inferences are being made, let alone that they may be influencing pricing, content, access to services, or advertising.
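The mechanics of a behavioral classifier can be shown with a toy model. The sketch below fits a logistic regression to synthetic data: three purchase flags and a hidden label that is set by an invented rule. The data, the rule, and the training settings are all assumptions for illustration. It does not describe any real retailer's model.

```python
import math
from itertools import product

# Synthetic toy data, invented for illustration only: every combination of three
# purchase flags (unscented_lotion, vitamins, cotton_balls). The hidden label is
# 1 when at least two flags are present. This rule is an assumption.
X = [list(flags) for flags in product([0, 1], repeat=3)]
y = [1 if sum(flags) >= 2 else 0 for flags in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability that the hidden attribute is present."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))

def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with full-batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        errors = [predict_proba(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / n
        bias -= lr * sum(errors) / n
    return weights, bias

weights, bias = train(X, y)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in X]
print(predictions == y)  # True
```

The model is never told the rule. It recovers it from examples alone, and none of the individual purchases is sensitive on its own. The inference emerges only from the combination.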

Match each AI system type to the most accurate description of what it learns from your data.

Terms

Collaborative filtering system
Content-based recommender
Behavioral classifier
Engagement optimizer
User embedding model

Definitions

Analyzes the features of content you have engaged with and builds a model of your preferences across those feature dimensions
Learns which content types, posting formats, and notification timings produce the behavioral states that maximize your time-on-platform
Encodes your entire behavioral history into a fixed-length vector that represents your preferences and predicted future behavior as a point in mathematical space
Learns to predict specific sensitive attributes — mental health status, political leaning, pregnancy — from patterns in your behavioral data
Finds users with similar behavioral patterns to yours and learns which items that peer group engaged with, building a profile of your cluster membership


Large Language Models and Personal Data

Large language models (LLMs) raise a different but related set of concerns.

The first concern is pretraining. LLMs are trained on massive text corpora that may include personal information from the web: public social media posts, forum discussions, news articles naming individuals, and leaked datasets that ended up in scraped collections. Research has shown that LLMs can memorize and reproduce specific personal information from training data, including email addresses, phone numbers, and verbatim quotes, even when that information was not intended to be public or was meant to have limited reach.

The second concern is what happens when you interact with an LLM assistant. Depending on the service, your conversation data may be used to train future model versions unless you opt out. The things you tell an AI assistant (your health concerns, your relationship problems, your professional projects, your political views) are potentially processed by a company's systems and may be retained and used as training signal. The intimate nature of AI assistant conversations makes this data particularly sensitive.

Finally, LLMs can be fine-tuned on a person's writing to produce convincing impersonations, or used to generate synthetic versions of someone's communication style. Your writing, if it appears in any public corpus, is a training signal that can be used to model you without your consent.

Conversations Are Not Confidential by Default

Many AI assistant services retain your conversation data by default and may use it for model improvement. Before sharing sensitive personal information with an AI assistant, check the service's data retention settings and opt-out options. Do not assume that a private conversation with an AI is treated like a privileged communication with a doctor or lawyer.

A streaming service has watched you for three years. You have never explicitly rated anything or searched for content — you simply play or skip what appears in your feed. The service can now predict with 85% accuracy whether you will play a new film within the first ten seconds. Which technical mechanism explains this?

A company trains a model on social media text to predict whether a user is experiencing depression. The model achieves 76% accuracy on a test set. The company plans to sell this as a targeting signal to pharmaceutical advertisers. What is the primary ethical problem with this application?

Audit What a Platform Knows About You

Most major platforms allow you to download your data. Choose one: Google Takeout, Instagram Data Download, Spotify Data Request, or Twitter/X data archive.

  1. Request your data download (it may take a few hours or days to prepare).
  2. When it arrives, explore the categories. What data types are present? Look specifically for: search history, ad interest categories, location history, inferred interests, engagement data, device and browser information.
  3. Find the ad interests or inferred categories section. Read through every label the platform has assigned you. How accurate are they? How surprising are any of them?
  4. Calculate: how many total data points (records, events, or entries) does your downloaded archive contain? Express it as a number.
  5. Write a one-paragraph reflection: what does this platform's model of you reveal about you? Is there anything in the inferred categories that you never explicitly told the platform?
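For step 4, counting by hand is impractical for large archives. The sketch below is one rough way to automate it, assuming the archive has been unzipped and stores its data as JSON and CSV files. Export formats differ between platforms and change over time, so treat the result as an estimate. The function names and the definition of a "record" (one list item in a JSON file, or one non-header row in a CSV file) are assumptions made for this example.

```python
import csv
import json
from pathlib import Path

def count_json_records(value):
    """Count list items in a parsed JSON value; dicts are searched for lists.

    Items inside a record are not counted again, so a list of 3 objects is 3.
    """
    if isinstance(value, list):
        return sum(count_json_records(v) if isinstance(v, list) else 1 for v in value)
    if isinstance(value, dict):
        return sum(count_json_records(v) for v in value.values())
    return 0

def count_csv_rows(path):
    """Count rows in a CSV file, assuming the first row is a header."""
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        rows = sum(1 for _ in csv.reader(handle))
    return max(rows - 1, 0)

def count_archive(root):
    """Return {relative file path: record count} for JSON and CSV files under root."""
    root = Path(root)
    totals = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        try:
            if suffix == ".json":
                with path.open(encoding="utf-8") as handle:
                    count = count_json_records(json.load(handle))
            elif suffix == ".csv":
                count = count_csv_rows(path)
            else:
                continue  # HTML, images, and other formats are skipped
        except (json.JSONDecodeError, UnicodeDecodeError, csv.Error):
            continue  # unreadable files are skipped rather than guessed at
        totals[str(path.relative_to(root))] = count
    return totals
```

To use it, call `count_archive("path/to/unzipped/archive")` with the folder where you extracted your download (the path here is a placeholder), then take `sum(totals.values())` for the overall figure. Sorting the dictionary by count shows which categories of data the platform holds most of.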