P1 opens the problem → P2 explains Dream → P3 proves why it matters
Recommended handoff: “Now that we know the two problems, Ming Zhao will explain how Dream solves them.” / “Now that the method is clear, Corin Jackson will show whether it actually works.”
👤 Jason Kong
Part 1 · Opening
What Happens When Malware Keeps Changing?
Why Android malware classifiers go wrong — and why it matters.
Speaker tip: Start by asking the audience: 'Has anyone here used an Android phone? Your phone is a target.' Pause for effect. Then transition: 'Let's see why this matters for security.'
📱
71% Mobile Market
Android holds 71% of the global mobile OS market. Millions of users = millions of targets.
🦠
3M+ New Samples / Month
Over 3 million NEW malware samples discovered every month. Attackers never stop.
😰
Classifiers Degrade Fast
Train on old data. Get near-random predictions within a couple of years. That's concept drift.
Key Fact
Without updates, Android malware classifiers can degrade to RANDOM performance within ~2 years
This is called CONCEPT DRIFT — the world changed, the model did not
Speaker tip: Emphasize: 'The model STILL gives an answer. It still says "this is malware" or "benign". But it's basically guessing. That's the danger.'
👤 Jason Kong
Part 1 · Two Types of Drift
Two Types of Concept Drift
Intra-class Drift 🔄
A known malware family changes — new variants appear.
Example
Xavier malware: 5 different versions in just 8 months. Same family, new tricks.
Mostly affects binary detection: is it malware or not?
Inter-class Drift 🚀
A completely NEW malware family appears. The model has never seen this class.
Example
In 2023 alone, 10 new Android banking malware families were discovered, each with different goals.
Affects multi-class classification: which family is this?
The paper's focus
Dream focuses mainly on INTER-CLASS drift — new malware families appearing. This is the harder problem: the model needs to recognize a family it has NEVER seen before.
Paper figure — Real figure from the paper to anchor the drift discussion in a concrete mitigation pipeline.
👤 Jason Kong
Part 1 · Current Approach
How Do People Try to Fix It?
The standard two-step approach used by most research.
1
DRIFT DETECTION — Find weird samples
Periodically scan new incoming samples. Use statistics or ML to find samples that look "far" from the training data.
2
DRIFT ADAPTATION — Update the model
Send weird samples to malware experts. Experts label them. Add to training set. Retrain. Simple and straightforward.
Paper figure — Active learning framework for concept drift mitigation (from the Dream paper on arXiv).
Pipeline sketch: what traditional systems actually do
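The two-step loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the centroid-distance detector, the `threshold` value, and the `ask_expert` callback are all assumptions chosen to keep the example small.

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def detect_drift(train_features, new_features, threshold):
    """Step 1: flag new samples that lie far from the training centroid."""
    center = centroid(train_features)
    return [i for i, x in enumerate(new_features)
            if math.dist(x, center) > threshold]

def adapt(train_set, new_features, flagged, ask_expert):
    """Step 2: experts label the flagged samples; only the label is kept."""
    for i in flagged:
        label = ask_expert(new_features[i])  # the expert's reasoning is discarded here
        train_set.append((new_features[i], label))
    return train_set  # a real system would now retrain the classifier

# Toy data (hypothetical): two training samples and two incoming samples.
train = [([0.0, 0.0], "benign"), ([1.0, 1.0], "family_a")]
incoming = [[0.5, 0.4], [9.0, 9.0]]

flagged = detect_drift([x for x, _ in train], incoming, threshold=3.0)
train = adapt(train, incoming, flagged, ask_expert=lambda x: "family_new")
print(flagged, len(train))
```

Note that `detect_drift` never consults the classifier, and `adapt` keeps nothing but a label. Those are the two weaknesses the next slide names.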
Problem: the detector can be misaligned, and the expert's reasoning disappears after labeling.
Paper figure — Design insights of Dream (from the Dream paper on arXiv).
Speaker tip:Draw on the board: [New Samples] → [Detector] → [Experts Label] → [Retrain]. Ask audience: 'What could go wrong with Step 1?' Then: 'What could go wrong with Step 2?'
👤 Jason Kong
Part 1 · Two Problems
Two Big Problems with This Approach
Problem 1: Blind Detector 🔍
Most drift detectors train their OWN model. They ignore what the TARGET classifier actually uses to decide.
Analogy
Like a doctor diagnosing a patient without ever looking at the patient's medical history. The diagnosis may have nothing to do with the patient's actual condition.
Problem 2: Expert Knowledge Wasted 💔
Experts do deep analysis — static analysis, dynamic analysis, behavioral reasoning. But the model only gets a LABEL. All that WHY is lost.
Result
Need TONS of labeled samples to make retraining work. Very expensive.
Dream's Goal
Address BOTH problems: make detection model-sensitive AND make adaptation use expert knowledge fully — not just a label.
Traditional methods usually do this
Detect with a separate detector → send samples to experts → keep only labels → retrain the classifier. Each step is reasonable, but the whole loop wastes information.
Dream changes the whole loop
Detection is tied to the classifier, and adaptation keeps concept-level expert knowledge. So Dream is not just a better detector — it is a better end-to-end updating framework.
👤 Jason Kong
Part 1 · Transition
Before We Move On, Remember These Two Things
This is the handoff slide from the problem to Dream itself.
Takeaway 1
Concept drift means malware has changed, but the classifier has not kept up. The model still answers, but the answer may no longer be trustworthy.
Takeaway 2
Dream mainly targets inter-class drift: the hard case where a completely new malware family appears and the classifier has never seen it before.
Handoff
Now that the problem is clear, Ming Zhao will explain Dream itself.
Suggested line: “So the next question is: how can we detect drift in a way that really matches the classifier, and then adapt using richer expert knowledge? That is where Dream comes in.”
👤 Ming Zhao
Part 2 · Dream Overview
Meet Dream — A Two-Pronged Solution
Dream fixes BOTH problems of the old approach.
🧠
Knows the Classifier
Dream's detector learns what the classifier actually uses. Not something independent. It's model-sensitive.
🔒
No Training Data Needed
Old detectors need training data at test time. Dream doesn't. That means: faster, safer, more private deployment.
💬
Explains the Problem
When drift is found, Dream shows WHICH concept caused it. Experts fix the ROOT CAUSE, not just label samples.
Paper figure — A real overview figure showing how Dream connects detection and adaptation.
🎯
Advantage 1
Classifier-aware detection: Dream checks what the real target model cares about, instead of using a detached anomaly view.
⚡
Advantage 2
No train-data lookup at test time: this makes deployment faster, lighter, and easier in real environments.
🧩
Advantage 3
Concept-guided adaptation: the expert does not only say what the sample is, but also why — so each labeled sample becomes more valuable.
Speaker tip: Think of Dream as a doctor that KNOWS your medical history. It doesn't just guess. It knows what you already know, so it can tell when something genuinely changed.
👤 Ming Zhao
Part 2 · Concepts
What Are 'Concepts' in This Paper?
A concept = a type of malicious BEHAVIOR, not just a family label.
b0: Privacy info stealing (SMS, contacts...)
b1: Abusing SMS / Calls
b2: Remote Control
b3: Bank / Financial Stealing
b4: Ransomware
b5: Abusing Accessibility
b6: Privilege Escalation
b7: Stealthy Download
b8: Aggressive Advertising
b9: Premium Service Abuse
Why This Matters
A malware family can have MULTIPLE behaviors. Two samples in the same family may behave very differently. That's why concepts are more informative than just family labels.
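One way to picture a concept annotation is a 10-slot binary vector, one slot per behavior b0 to b9. The sketch below is illustrative only: the short identifier names and the two example samples are assumptions, not data from the paper.

```python
# Short names for the ten behaviors listed on this slide (names are hypothetical).
CONCEPTS = [
    "privacy_stealing",       # b0
    "sms_call_abuse",         # b1
    "remote_control",         # b2
    "financial_stealing",     # b3
    "ransomware",             # b4
    "accessibility_abuse",    # b5
    "privilege_escalation",   # b6
    "stealthy_download",      # b7
    "aggressive_ads",         # b8
    "premium_service_abuse",  # b9
]

def concept_vector(behaviors):
    """Encode a set of behavior names as a 0/1 vector ordered b0..b9."""
    unknown = set(behaviors) - set(CONCEPTS)
    if unknown:
        raise ValueError(f"unknown concepts: {sorted(unknown)}")
    return [1 if name in behaviors else 0 for name in CONCEPTS]

# Two made-up samples that share one family label but behave differently.
sample_a = concept_vector({"privacy_stealing", "remote_control"})
sample_b = concept_vector({"financial_stealing", "accessibility_abuse"})
print(sample_a)
print(sample_b)
```

A family label would treat both samples as identical. The concept vectors keep the behavioral difference visible.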
Paper figure — A real concept-level explanation view, showing that Dream reasons in behavior space rather than only family labels.
👤 Ming Zhao
Part 2 · Detection Trick
The Core Detection Trick 🪄
How Dream detects drift WITHOUT any training data at test time.
① x
→
② x̂ rebuild
→
③ M(x)
→
④ M(x̂)
→
⑤ Compare!
Paper figure — Model-sensitive concept learning used by Dream (from the Dream paper on arXiv).
Why it matters: Dream compares the classifier with itself, instead of comparing the sample with the training set.
Predictions AGREE ✓ → Low drift score
Good. The classifier still "gets it" even when looking at a rebuilt version. The concepts used are still reliable.
Predictions DISAGREE ✗ → High drift score
Bad. The rebuilt sample looks different to the classifier. Something genuinely changed in the malware world. → ALERT!
Speaker tip: Draw on board: x → [AutoEncoder] → x̂ → [Classifier M] → Compare M(x) with M(x̂). Ask: what does it mean if they differ? Exactly: the concepts the detector learned no longer match what the classifier expects.
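The five steps can be illustrated with a toy drift score. This is a sketch under stated assumptions: the slides do not give Dream's exact scoring function, so total variation distance between the two prediction vectors stands in for "Compare", and `toy_reconstruct` and `toy_classify` are hypothetical stand-ins for the autoencoder and the classifier M.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def total_variation(p, q):
    """Half the L1 distance between two probability vectors (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def drift_score(x, reconstruct, classify):
    """Compare the classifier with itself: M(x) versus M(x_hat)."""
    x_hat = reconstruct(x)
    return total_variation(classify(x), classify(x_hat))

def toy_classify(x):
    # Hypothetical two-class model: each feature is the score of one class.
    return softmax([x[0], x[1]])

def toy_reconstruct(x):
    # Hypothetical autoencoder that can only reproduce values in [0, 1],
    # i.e. the range it saw during training.
    return [min(max(v, 0.0), 1.0) for v in x]

familiar = drift_score([0.2, 0.8], toy_reconstruct, toy_classify)
drifted = drift_score([0.2, 5.0], toy_reconstruct, toy_classify)
print(familiar, drifted)
```

The familiar sample is rebuilt exactly, so its score is 0. The out-of-range sample is rebuilt imperfectly, the classifier's output shifts, and the score rises to roughly 0.3. No training data is touched at scoring time.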
👤 Ming Zhao
Part 2 · Mini Game
Mini Game — Drift or Not?
Ask the room to think for 3 seconds, then click to reveal Dream's logic.
Case A: The classifier gives almost the same prediction on x and x̂. What should Dream think?
Case B: x and x̂ look close in latent space, but the classifier changes its mind a lot. What is the strongest signal?
Question to the audience
You can ask: if the classifier gives almost the same prediction on x and x̂, does that suggest low drift risk or high drift risk? Then click to reveal the answer.
👤 Ming Zhao
Part 2 · Concept Learning
How Dream Learns Concepts (Technical)
Two Training Signals
1
Supervised: Align with Classifier
Link latent directions to the classifier's activation patterns. So Dream knows what the classifier actually relies on.
2
Contrastive: Same = Close
Samples with the same concept cluster together. Different concepts are pushed apart. Clean separation.
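The "same = close, different = apart" idea can be illustrated with a classic margin-based pairwise contrastive loss. This is a generic stand-in, not the paper's separation term: the exact form Dream uses is not shown on the slide, and the `margin` value is an assumption.

```python
import math

def contrastive_pair_loss(z1, z2, same_concept, margin=1.0):
    """Pairwise contrastive loss on two latent vectors.

    Same concept: penalize any distance (pull together).
    Different concepts: penalize only if closer than the margin (push apart).
    """
    d = math.dist(z1, z2)
    if same_concept:
        return d ** 2
    return max(0.0, margin - d) ** 2

pull = contrastive_pair_loss([0.0, 0.0], [0.3, 0.4], same_concept=True)
push = contrastive_pair_loss([0.0, 0.0], [0.3, 0.4], same_concept=False)
far = contrastive_pair_loss([0.0, 0.0], [3.0, 4.0], same_concept=False)
print(pull, push, far)
```

Pairs from different concepts that are already farther apart than the margin contribute nothing, which is what lets the clusters settle.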
The Full Objective
L = λ₀L_rec + λ₁L_sep + λ₂L_pre + λ₃L_rel
Paper figure — Real paper figure supporting the objective and concept-learning discussion.
L_rec: reconstruction quality
L_sep: concept separation
L_pre: concept presence (b0–b9)
L_rel: classifier agreement on rebuild
Key: L_rel is what makes Dream MODEL-SENSITIVE — it optimizes for classifier agreement, not an independent distance.
Most important term: L_rel keeps the learned concept space tied to the real classifier.
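The objective above is a plain weighted sum, which a short sketch makes concrete. The loss values and the λ weights below are made-up numbers for illustration; the slides do not state the values used in the paper.

```python
TERM_NAMES = ("rec", "sep", "pre", "rel")

def total_loss(terms, weights):
    """Weighted sum L = λ0*L_rec + λ1*L_sep + λ2*L_pre + λ3*L_rel."""
    missing = [n for n in TERM_NAMES if n not in terms or n not in weights]
    if missing:
        raise KeyError(f"missing terms or weights: {missing}")
    return sum(weights[n] * terms[n] for n in TERM_NAMES)

# Hypothetical per-batch loss values and weights.
terms = {"rec": 0.40, "sep": 0.25, "pre": 0.10, "rel": 0.05}
weights = {"rec": 1.0, "sep": 0.5, "pre": 0.5, "rel": 2.0}

loss = total_loss(terms, weights)

# Setting the L_rel weight to zero removes the classifier-agreement term.
loss_without_rel = total_loss(terms, {**weights, "rel": 0.0})
print(loss, loss_without_rel)
```

Dropping the `rel` weight to zero leaves an ordinary concept autoencoder objective. The `rel` term is the only part that refers to the target classifier.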
👤 Ming Zhao
Part 2 · Adaptation
Making Adaptation Work Better
Old Way
Expert studies malware.
Expert says: 'this is family X'.
Model gets only a LABEL. All reasoning is lost.
Cost
Need 80–100 labeled samples to make retraining work. Very expensive.
Dream's Way ✨
Dream shows: WHICH concept drifted?
Expert gives: label PLUS concept revision.
Classifier AND detector both update — joint update.
Savings
Same accuracy with 76.6% fewer labeled samples. Far more efficient.
Paper figure — Human-in-the-loop solutions in Dream (from the Dream paper on arXiv).
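Update loop: how expert feedback is reused
for sample in alerted_samples:
    # The expert supplies the answer (family) and the explanation (concepts).
    family_label = expert.label(sample)
    concept_fix = expert.revise_concepts(sample)
    # Joint update: the label goes to the classifier, the concept revision to the detector.
    update_classifier(sample, family_label)
    update_detector(sample, concept_fix)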
Difference from old methods: the expert gives both the answer and the explanation.
👤 Ming Zhao
Part 2 · Intuition
What This Slide Really Means
Dream is not trying to output only a black-box anomaly score.
Not Just “Something Is Weird”
A normal detector may only say: this sample looks unusual. That is useful, but still vague. The analyst still has to do most of the reasoning alone.
Black-box problem
The system gives a score, but not much explanation about which behavior changed and why the classifier may fail.
Dream Tries to Speak the Expert's Language
Dream tries to point to the concept level: maybe the problem is remote control behavior, privacy stealing, or stealth download. That makes the conversation between expert and model much more natural.
Simple summary
Dream tries to make the human analyst and the model speak the same behavior language.
Suggested line: “So this page is really building intuition. Dream does not only say that a sample is suspicious. It also tries to say which behavior may be causing the issue. That is why the expert and the system can work together more naturally.”
👤 Corin Jackson
Part 3 · Setup
How We Tested Dream
Datasets
Drebin — 3,317 samples, 8 families, 2010–2012
Malradar — 2,589 samples, 8 families, 2015–2021
Extended — 4,410 samples, 180 families, 2015–2020
Malradar has 10 behavioral concepts (b0–b9) labeled for each sample.
Classifiers: Drebin, Mamadroid, Damd. All are real, widely used Android malware classifiers.
Paper figure — Public figure from the paper that helps anchor the experiment setting in the full drift-mitigation pipeline.
Why three datasets matter
They cover different years, family scales, and feature representations. So Dream is not tested in only one narrow setting.
Why three classifiers matter
Dream is evaluated on feature-based, API-based, and sequence-based models. That makes the evidence stronger for deployment.
Why hold-out by family matters
This setting directly simulates the real pain point: a new family arrives, and the old classifier must react.
Paper figure — A real public figure that keeps the setup section visually grounded while discussing analyst feedback and dataset usage.
Testing Method: Hold-out by Family
For each family: remove it from training → use ONLY for testing. This simulates a BRAND NEW family appearing. Drebin and Malradar each have 8 families, so each gives 8 hold-out runs (one retrained classifier per held-out family) = 8 test scenarios.
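The hold-out-by-family protocol is simple to express in code. The sketch below is a minimal illustration with invented sample names; it shows only the split, not the training or scoring steps.

```python
def holdout_by_family(samples):
    """Yield (held_out_family, train, test) once per family.

    `samples` is a list of (sample_id, family) pairs. The held-out family
    never appears in `train`, so it plays the role of a brand-new family.
    """
    families = sorted({family for _, family in samples})
    for held_out in families:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test

# Hypothetical toy dataset with three families.
samples = [
    ("app1", "fam_a"), ("app2", "fam_a"),
    ("app3", "fam_b"), ("app4", "fam_c"),
]

scenarios = list(holdout_by_family(samples))
for held_out, train, test in scenarios:
    print(held_out, len(train), len(test))
```

A dataset with 8 families produces 8 such scenarios, each with its own classifier trained on the remaining 7 families.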
👤 Corin Jackson
Part 3 · Key Numbers
The Numbers That Matter Most
76.6%
Less Labeling
To reach 90% accuracy: Dream = 19 samples. Best old method = 84. Same result.
11–14%
AUC Boost
vs Transcendent: +11.5%, vs CADE: +12.0%, vs Probability: +13.6%. Consistent.
0.57ms
Per Sample
Detection speed. 3x faster than CADE (1.89ms). 10x faster than Transcendent (5.75ms). Real-time ready.
+18.6%
Intra-class AUC
Also beats HCC on intra-class drift (new variants within same family). Dream works for both types.
Paper curve — Dream ROC on Drebin.
Paper curve — Dream ROC on Mamadroid.
Paper curve — Dream ROC on Damd.
Paper figures — Actual ROC curves from the public arXiv version, showing Dream across three classifier settings.
Teacher-friendly reading
This is not just 'slightly better'. Dream changes both cost and accuracy at the same time, which is much harder to achieve.
Why 76.6% matters
In practice, expert labeling is the expensive part. Reducing that cost is often more valuable than adding one more point of accuracy.
Why 0.57ms matters
Fast online detection means Dream can be inserted into a real pipeline without becoming the new bottleneck.
Paper curve — One concrete ROC result you can point to while explaining detection quality.
Paper curve — Another real result curve that helps show Dream is not winning in only one classifier setting.
Result snapshot: numbers you can point to while speaking
Presentation tip: this code-style box makes the result slide feel more alive, even before you upload the real screenshots.
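As a quick sanity check on the headline numbers, the arithmetic behind them can be reproduced directly from the figures on this slide. The variable names are illustrative. Note that the single 19-vs-84 example works out slightly differently from the 76.6% headline; the slide does not say how that figure was aggregated, so check it against the paper before quoting both together.

```python
def reduction(baseline, ours):
    """Fraction saved when going from `baseline` to `ours`."""
    return 1.0 - ours / baseline

# Figures taken from this slide.
label_saving = reduction(baseline=84, ours=19)   # labeled samples to reach 90% accuracy
speedup_vs_cade = 1.89 / 0.57                    # ms per sample
speedup_vs_transcendent = 5.75 / 0.57            # ms per sample

print(f"label saving: {label_saving:.1%}")
print(f"vs CADE: {speedup_vs_cade:.1f}x")
print(f"vs Transcendent: {speedup_vs_transcendent:.1f}x")
```

This prints a label saving of 77.4% and speedups of 3.3x and 10.1x, which supports the rounded "3x" and "10x" claims.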
👤 Corin Jackson
Part 3 · Comparison
Dream vs Existing Methods
Property | Transcendent | CADE | HCC | Dream ✓
Model-Sensitive Detection | ✗ | ✗ | ✗ | ✓
Data Autonomy (no train data at test) | ✗ | ✗ | ✗ | ✓
Explanatory Adaptation | Partial | ✗ | ✗ | ✓
Works for Both Drift Types | ✗ | Partial | Partial | ✓
Fast Detection | 5.75ms | 1.89ms | n/a | 0.57ms
Why Dream is stronger than traditional methods
Traditional methods usually optimize one stage at a time: either better anomaly detection or better retraining. Dream improves the connection between stages, so the whole workflow becomes more efficient.
Practical takeaway
Dream selects better samples, needs fewer labels, gives more explanation, and runs faster online. That combination is the real advantage — not just one higher metric.
Paper curve — Real ROC evidence for the third classifier family, backing up the comparison slide.
Paper figure — Concept-based drift explanation heatmap on Drebin (from the Dream paper on arXiv).
What traditional methods miss
They often rank samples by generic anomaly, not classifier impact.
They usually throw away concept-level expert reasoning after labeling.
They may work locally, but the whole update loop stays inefficient.
What Dream adds
Classifier-aware sample selection.
Concept-level explanatory adaptation.
A tighter connection between detection, explanation, and retraining.
👤 Corin Jackson
Part 3 · One-line Summary
Dream Is Better at Choosing Samples — and Better at Using Samples
Chooses better
Dream's detection is closer to what the classifier actually cares about. So the selected drift samples are more meaningful and more useful for updating the model.
Uses better
Dream does not use only labels during adaptation. It also uses concept-level expert feedback, so each labeled sample carries much richer information.
Simple takeaway
Better detection because it is classifier-aware. Better adaptation because it uses behavior information, not just labels.
So compared with traditional methods, Dream is stronger in four ways: sample selection, explanation, labeling efficiency, and online speed.
👤 Corin Jackson
Part 3 · Summary
Three Things to Remember
Paper figure — Real paper diagram summarizing how Dream connects detection, concepts, and adaptation.
🔍
Smart Detection
Classifier + autoencoder in one system. If the classifier changes its mind on a rebuilt sample — that's drift. No training data needed.
💬
Experts Do More
Not just label — explain WHY. Concept-level feedback goes into the model. Every sample is far more powerful.
🚀
Big Real Savings
76.6% less labeling work. 3x faster. Works for new families AND new variants. All in one system.
If the teacher asks 'Why should I care?'
Because Dream improves not just a score, but the whole maintenance loop of a malware classifier.
If the teacher asks 'What is the core novelty?'
The novelty is the bridge: concept-aware detection tied to the classifier, plus concept-level feedback reused during adaptation.
If the teacher asks 'What is the real-world value?'
It lowers human labeling cost while keeping the detection loop fast enough for deployment.
Paper
CCS '25 · DOI: 10.1145/3719027.3744792
Yiling He, Junchi Lei, Zhan Qin, Kui Ren, Chun Chen
Zhejiang University · State Key Laboratory of Blockchain and Data Security
🎤
Questions & Discussion
What would you like to ask about Dream?
🔍
Detection
How does the concept reliability loss actually work in practice?
💬
Adaptation
How do experts actually provide concept revisions? Is there a UI?
🚀
Real World
Could Dream be used for other domains beyond malware?