Dream — CCS 2025

🤖 + 🛡️

Combating Concept Drift

CCS 2025 · Android Malware Classification · Interactive Presentation

Zhejiang University · State Key Laboratory of Blockchain and Data Security

← → / Space: change slide · CN / EN: one-click language switch · ⛶ Fullscreen is recommended for the defense

Scan to open the live demo

Let everyone open this interactive presentation on their own phone. It will make the figures, mini-game, and language switching much easier to follow.

kndhjk.github.io/dream-ccs25-presentation

👤

Jason Kong

Slides 1–5
Problem, drift types, old pipeline, two problems, transition

👤

Ming Zhao

Slides 6–11
Dream overview, concepts, detection, concept learning, adaptation, intuition

👤

Corin Jackson

Slides 12–16
Setup, results, comparison, one-line takeaway, conclusion

👤 Jason Kong

Part 1 · Opening

What Happens When Malware Keeps Changing?

Why Android malware classifiers go wrong — and why it matters.

Speaker tip: Start by asking the audience: 'Has anyone here used an Android phone? Your phone is a target.' Pause for effect. Then transition: 'Let's see why this matters for security.'

📱

71% Mobile Market

Android holds 71% of the global mobile OS market. Millions of users = millions of targets.

🦠

3M+ New Samples / Month

Over 3 million NEW malware samples discovered every month. Attackers never stop.

😰

Classifiers Degrade Fast

Train on last year's data. Get near-random predictions this year. That's concept drift.

Speaker tip: Emphasize: 'The model STILL gives an answer. It still says "this is malware" or "benign". But it's basically guessing. That's the danger.'

👤 Jason Kong

Part 1 · Two Types of Drift

Two Types of Concept Drift

Intra-class Drift 🔄

A known malware family changes — new variants appear.

Example

Xavier malware: 5 different versions in just 8 months. Same family, new tricks.

Mostly affects binary detection: is it malware or not?

Inter-class Drift 🚀

A completely NEW malware family appears. The model has never seen this class.

Example

In 2023 alone, 10 new Android banking malware families were discovered, each with different goals.

Affects multi-class classification: which family is this?

The paper's focus

Dream focuses mainly on INTER-CLASS drift — new malware families appearing. This is the harder problem: the model needs to recognize a family it has NEVER seen before.

Paper figure — Real figure from the paper to anchor the drift discussion in a concrete mitigation pipeline.

👤 Jason Kong

Part 1 · Current Approach

How Do People Try to Fix It?

The standard two-step approach used by most research.

1

DRIFT DETECTION — Find weird samples

Periodically scan new incoming samples. Use statistics or ML to find samples that look "far" from the training data.

2

DRIFT ADAPTATION — Update the model

Send weird samples to malware experts. Experts label them. Add to training set. Retrain. Simple and straightforward.

Paper figure — Active learning framework for concept drift mitigation (from the Dream paper on arXiv).

Pipeline sketch: what traditional systems actually do

new_samples
  → drift_detector()
  → suspicious_subset
  → expert_label()
  → retrain_classifier()

Problem: the detector can be misaligned, and the expert's reasoning disappears after labeling.
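To make the two-step loop above concrete, here is a minimal Python sketch of one detect, label, retrain round. It is an illustration only: the centroid-distance detector, the `budget` parameter, and all function names are assumptions made for this example, not the method of any specific system cited in the talk.

```python
import math

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def select_suspicious(new_samples, train_rows, budget):
    """Return the `budget` samples that lie farthest from the training centroid."""
    c = centroid(train_rows)
    ranked = sorted(new_samples, key=lambda x: math.dist(x, c), reverse=True)
    return ranked[:budget]

def traditional_round(new_samples, train_rows, train_labels, expert_label, budget=2):
    """One detect -> label -> retrain round in which only the label is kept."""
    suspicious = select_suspicious(new_samples, train_rows, budget)
    for x in suspicious:
        train_rows.append(x)
        train_labels.append(expert_label(x))  # the expert's reasoning is discarded here
    return train_rows, train_labels  # a real system would now retrain the classifier

train_rows = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
train_labels = ["benign", "benign", "benign"]
incoming = [[0.1, 0.1], [5.0, 5.0], [0.3, 0.2], [4.0, 6.0]]

rows, labels = traditional_round(incoming, train_rows, train_labels, lambda x: "new_family")
print(len(rows), labels[-2:])
```

Note that the detector never consults the classifier, and the only thing stored per sample is a label. These are the two weaknesses the next slide names.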

Paper figure — Design insights of Dream (from the Dream paper on arXiv).

Speaker tip: Draw on the board: [New Samples] → [Detector] → [Experts Label] → [Retrain]. Ask audience: 'What could go wrong with Step 1?' Then: 'What could go wrong with Step 2?'

👤 Jason Kong

Part 1 · Two Problems

Two Big Problems with This Approach

Problem 1: Blind Detector 🔍

Most drift detectors train their OWN model. They ignore what the TARGET classifier actually uses to decide.

Analogy

Like a doctor diagnosing a patient without ever looking at the patient's medical history. The diagnosis may sound confident, but it is not grounded in what matters.

Problem 2: Expert Knowledge Wasted 💔

Experts do deep analysis — static analysis, dynamic analysis, behavioral reasoning. But the model only gets a LABEL. All that WHY is lost.

Result

Need TONS of labeled samples to make retraining work. Very expensive.

Dream's Goal

Address BOTH problems: make detection model-sensitive AND make adaptation use expert knowledge fully — not just a label.

Traditional methods usually do this

Detect with a separate detector → send samples to experts → keep only labels → retrain the classifier. Each step is reasonable, but the whole loop wastes information.

Dream changes the whole loop

Detection is tied to the classifier, and adaptation keeps concept-level expert knowledge. So Dream is not just a better detector — it is a better end-to-end updating framework.

👤 Jason Kong

Part 1 · Transition

Before We Move On, Remember These Two Things

This is the handoff slide from the problem to Dream itself.

Takeaway 1

Concept drift means malware has changed, but the classifier has not kept up. The model still answers, but the answer may no longer be trustworthy.

Takeaway 2

Dream mainly targets inter-class drift: the hard case where a completely new malware family appears and the classifier has never seen it before.

👤 Ming Zhao

Part 2 · Dream Overview

Meet Dream — A Two-Pronged Solution

Dream fixes BOTH problems of the old approach.

🧠

Knows the Classifier

Dream's detector learns what the classifier actually uses. Not something independent. It's model-sensitive.

🔒

No Training Data Needed

Old detectors need training data at test time. Dream doesn't. That means: faster, safer, more private deployment.

💬

Explains the Problem

When drift is found, Dream shows WHICH concept caused it. Experts fix the ROOT CAUSE, not just label samples.

Paper figure — A real overview figure showing how Dream connects detection and adaptation.

🎯

Advantage 1

Classifier-aware detection: Dream checks what the real target model cares about, instead of using a detached anomaly view.

⚡

Advantage 2

No train-data lookup at test time: this makes deployment faster, lighter, and easier in real environments.

🧩

Advantage 3

Concept-guided adaptation: the expert does not only say what the sample is, but also why — so each labeled sample becomes more valuable.

Speaker tip: Think of Dream as a doctor that KNOWS your medical history. It doesn't just guess — it knows what you already know, so it can tell when something genuinely changed.

👤 Ming Zhao

Part 2 · Concepts

What Are 'Concepts' in This Paper?

A concept = a type of malicious BEHAVIOR, not just a family label.

b0: Privacy info stealing (SMS, contacts...)

b1: Abusing SMS / Calls

b2: Remote Control

b3: Bank / Financial Stealing

b4: Ransomware

b5: Abusing Accessibility

b6: Privilege Escalation

b7: Stealthy Download

b8: Aggressive Advertising

b9: Premium Service Abuse

Why This Matters

A malware family can have MULTIPLE behaviors. Two samples in the same family may behave very differently. That's why concepts are more informative than just family labels.

Paper figure: concept-based drift explanation heatmap

Paper figure — A real concept-level explanation view, showing that Dream reasons in behavior space rather than only family labels.

👤 Ming Zhao

Part 2 · Detection Trick

The Core Detection Trick 🪄

How Dream detects drift WITHOUT any training data at test time.

① x

→

② x̂
rebuild

→

③ M(x)

→

④ M(x̂)

→

⑤
Compare!

Paper figure — Model-sensitive concept learning used by Dream (from the Dream paper on arXiv).

Pseudocode: Dream drift scoring

x_hat = autoencoder(x)
y1 = classifier(x)
y2 = classifier(x_hat)
drift_score = distance(y1, y2)
if drift_score > tau:
    alert_expert(x)

Why it matters: Dream compares the classifier with itself, instead of comparing the sample with the training set.
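The scoring idea can be tried end to end with toy stand-ins. In the sketch below, everything concrete is an assumption for illustration: the linear `classifier` weights, the `autoencoder` that simply drops the third feature, the total variation distance, and the threshold `TAU`. The paper's actual models and distance are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classifier(x):
    """Toy 2-class linear classifier returning class probabilities (hypothetical weights)."""
    weights = [[1.0, -1.0, 2.0], [-1.0, 1.0, -2.0]]
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])

def autoencoder(x):
    """Toy rebuild: keeps the first two features and loses the third.

    It stands in for a model that can only rebuild patterns seen during training.
    """
    return [x[0], x[1], 0.0]

def drift_score(x):
    """Total variation distance between classifier outputs on x and on its rebuild."""
    y1 = classifier(x)
    y2 = classifier(autoencoder(x))
    return 0.5 * sum(abs(a - b) for a, b in zip(y1, y2))

TAU = 0.3  # hypothetical alert threshold

familiar = [1.0, 0.0, 0.0]  # rebuilt exactly, so the predictions agree
shifted = [1.0, 0.0, -1.5]  # the third feature carries behavior the rebuild loses

for name, x in [("familiar", familiar), ("shifted", shifted)]:
    score = drift_score(x)
    print(name, round(score, 3), "ALERT" if score > TAU else "ok")
```

No training sample is touched at scoring time: the only inputs are the new sample, the rebuild, and the classifier itself.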

Predictions AGREE ✓ → Low drift score

Good. The classifier still "gets it" even when looking at a rebuilt version. The concepts used are still reliable.

Predictions DISAGREE ✗ → High drift score

Bad. The rebuilt sample looks different to the classifier. Something genuinely changed in the malware world. → ALERT!

Speaker tip: Draw on board: x → [AutoEncoder] → x̂ → [Classifier M] → Compare M(x) with M(x̂). Ask: what does it mean if they differ? Exactly — the concepts used by the detector don't match what the classifier expects.

👤 Ming Zhao

Part 2 · Mini Game

Mini Game — Drift or Not?

Ask the room to think for 3 seconds, then click to reveal Dream's logic.

Case A: The classifier gives almost the same prediction on x and x̂. What should Dream think?

Case B: x and x̂ look close in latent space, but the classifier changes its mind a lot. What is the strongest signal?

Question to the audience

You can ask: if the classifier gives almost the same prediction on x and x̂, does that suggest low drift risk or high drift risk? Then click to reveal the answer.

👤 Ming Zhao

Part 2 · Concept Learning

How Dream Learns Concepts (Technical)

Two Training Signals

1

Supervised: Align with Classifier

Link latent directions to the classifier's activation patterns. So Dream knows what the classifier actually RELIES ON.

2

Contrastive: Same = Close

Samples with the same concept cluster together. Samples with different concepts are pushed apart. Clean separation.
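The "same = close" signal can be illustrated with a standard margin-based contrastive loss on one pair of latent vectors. This is a generic textbook form chosen for the example; the function name, the `margin` value, and the pairwise setup are assumptions, and the paper's exact separation term may differ.

```python
import math

def contrastive_loss(z_a, z_b, same_concept, margin=1.0):
    """Margin-based contrastive loss for one pair of latent vectors."""
    d = math.dist(z_a, z_b)
    if same_concept:
        return d ** 2                 # pull matching pairs together
    return max(0.0, margin - d) ** 2  # push mismatched pairs at least `margin` apart

near_same = contrastive_loss([0.0, 0.0], [0.1, 0.0], same_concept=True)
near_diff = contrastive_loss([0.0, 0.0], [0.1, 0.0], same_concept=False)
far_diff = contrastive_loss([0.0, 0.0], [2.0, 0.0], same_concept=False)
print(near_same, near_diff, far_diff)
```

A close pair is cheap when the concepts match and expensive when they do not, while a mismatched pair that is already farther apart than the margin costs nothing.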

The Full Objective

L = λ₀L_rec + λ₁L_sep + λ₂L_pre + λ₃L_rel

Paper figure — Real paper figure supporting the objective and concept-learning discussion.

L_rec: reconstruction quality
L_sep: concept separation
L_pre: concept presence (b0–b9)
L_rel: classifier agreement on rebuild

Key: L_rel is what makes Dream MODEL-SENSITIVE — it optimizes for classifier agreement, not an independent distance.

Loss design: four signals in one objective

L_total = λ0 * L_rec   # rebuild input
        + λ1 * L_sep   # separate concepts
        + λ2 * L_pre   # predict concept presence
        + λ3 * L_rel   # preserve classifier decision

Most important term: L_rel keeps the learned concept space tied to the real classifier.
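As a rough numerical illustration of the weighted objective, the sketch below combines four loss values with weights λ0 to λ3. The concrete forms are assumptions made for this example: mean squared error for L_rec, binary cross-entropy for L_pre, and placeholder numbers for L_sep and L_rel. The weights are hypothetical as well.

```python
import math

def mse(x, x_hat):
    """Mean squared error, used here as an assumed form of L_rec."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce(targets, probs, eps=1e-12):
    """Mean binary cross-entropy, used here as an assumed form of L_pre."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)

def total_loss(l_rec, l_sep, l_pre, l_rel, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum L = λ0*L_rec + λ1*L_sep + λ2*L_pre + λ3*L_rel."""
    lam0, lam1, lam2, lam3 = weights
    return lam0 * l_rec + lam1 * l_sep + lam2 * l_pre + lam3 * l_rel

l_rec = mse([1.0, 0.0, 1.0], [0.9, 0.1, 0.8])
l_pre = bce([1, 0], [0.8, 0.3])
loss = total_loss(l_rec, l_sep=0.10, l_pre=l_pre, l_rel=0.05)
heavier = total_loss(l_rec, l_sep=0.10, l_pre=l_pre, l_rel=0.05,
                     weights=(1.0, 1.0, 1.0, 10.0))
print(round(loss, 4), round(heavier, 4))
```

Raising λ3 makes any disagreement between the classifier's output on the input and on its rebuild dominate the total, which is one way to read "L_rel keeps the concept space tied to the classifier".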

👤 Ming Zhao

Part 2 · Adaptation

Making Adaptation Work Better

Old Way

Expert studies malware.

Expert says: 'this is family X'.

Model gets only a LABEL. All reasoning is lost.

Cost

Need 80–100 labeled samples to make retraining work. Very expensive.

Dream's Way ✨

Dream shows: WHICH concept drifted?

Expert gives: label PLUS concept revision.

Classifier AND detector both update — joint update.

Savings

Same accuracy with 76.6% fewer labeled samples. Far more efficient.

Paper figure — Human-in-the-loop solutions in Dream (from the Dream paper on arXiv).

Update loop: how expert feedback is reused

for sample in alerted_samples:
    family_label = expert.label(sample)
    concept_fix  = expert.revise_concepts(sample)
    update_classifier(sample, family_label)
    update_detector(sample, concept_fix)

Difference from old methods: the expert gives both the answer and the explanation.
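A small runnable sketch of the same loop shows how one expert interaction can feed two updates. The `ExpertFeedback` and `UpdateBuffers` types, the `collect_feedback` helper, and the stand-in expert are hypothetical names invented for this example; the actual update procedure in Dream is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class ExpertFeedback:
    family_label: str  # the "what"
    concepts: list     # the "why": multi-hot vector over b0..b9

@dataclass
class UpdateBuffers:
    classifier_batch: list = field(default_factory=list)
    detector_batch: list = field(default_factory=list)

def collect_feedback(alerted_samples, ask_expert):
    """Store each expert answer twice: label for the classifier, concepts for the detector."""
    buffers = UpdateBuffers()
    for sample in alerted_samples:
        feedback = ask_expert(sample)
        buffers.classifier_batch.append((sample, feedback.family_label))
        buffers.detector_batch.append((sample, feedback.concepts))
    return buffers

def fake_expert(sample):
    """Stand-in analyst: always reports remote control (b2) for a new family."""
    return ExpertFeedback("new_family", [0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

buffers = collect_feedback(["apk_1", "apk_2"], fake_expert)
print(len(buffers.classifier_batch), len(buffers.detector_batch))
```

In the label-only workflow the second buffer would not exist, so the concept vector would be lost after each labeling session.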

👤 Ming Zhao

Part 2 · Intuition

What This Slide Really Means

Dream is not trying to output only a black-box anomaly score.

Not Just “Something Is Weird”

A normal detector may only say: this sample looks unusual. That is useful, but still vague. The analyst still has to do most of the reasoning alone.

Black-box problem

The system gives a score, but not much explanation about which behavior changed and why the classifier may fail.

Dream Tries to Speak the Expert's Language

Dream tries to point to the concept level: maybe the problem is remote control behavior, privacy stealing, or stealth download. That makes the conversation between expert and model much more natural.

Simple summary

Dream tries to make the human analyst and the model speak the same behavior language.

Suggested line: “So this page is really building intuition. Dream does not only say that a sample is suspicious. It also tries to say which behavior may be causing the issue. That is why the expert and the system can work together more naturally.”

👤 Corin Jackson

Part 3 · Setup

How We Tested Dream

Datasets

Drebin — 3,317 samples, 8 families, 2010–2012
Malradar — 2,589 samples, 8 families, 2015–2021
Extended — 4,410 samples, 180 families, 2015–2020

Malradar has 10 behavioral concepts (b0–b9) labeled for each sample.

Classifiers

Drebin — binary feature vectors, MLP (100 + 30 hidden)
Mamadroid — API call pairs → Markov + MLP (1000 + 200 hidden)
Damd — raw opcode sequences, CNN (2 conv layers, 64 filters)

All are real, widely-used Android malware classifiers.

Paper figure — Public figure from the paper that helps anchor the experiment setting in the full drift-mitigation pipeline.

Why three datasets matter: They cover different years, family scales, and feature representations. So Dream is not tested in only one narrow setting.
Why three classifiers matter: Dream is evaluated on feature-based, API-based, and sequence-based models. That makes the evidence stronger for deployment.
Why hold-out by family matters: This setting directly simulates the real pain point: a new family arrives, and the old classifier must react.

Public paper figure for Dream human-in-the-loop setting

Paper figure — A real public figure that keeps the setup section visually grounded while discussing analyst feedback and dataset usage.

Testing Method: Hold-out by Family

For each family: remove it from training → use ONLY for testing. This simulates a BRAND NEW family appearing. One classifier is trained per held-out family, so a dataset with 8 families gives 8 classifiers = 8 test scenarios.
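The hold-out-by-family protocol can be sketched as a generator that yields one split per family. The function name, the `(features, family)` tuple layout, and the toy data are assumptions for illustration.

```python
def hold_out_by_family(samples):
    """Yield (held_out_family, train, test) splits, one per family.

    `samples` is a list of (features, family) pairs.
    """
    families = sorted({family for _, family in samples})
    for held_out in families:
        train = [s for s in samples if s[1] != held_out]  # family never seen in training
        test = [s for s in samples if s[1] == held_out]   # plays the "brand new" family
        yield held_out, train, test

data = [([0, 1], "famA"), ([1, 1], "famB"), ([1, 0], "famA"), ([0, 0], "famC")]
splits = list(hold_out_by_family(data))
for family, train, test in splits:
    print(family, len(train), len(test))
```

With 8 families this loop runs 8 times, giving the 8 scenarios per dataset described above.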

👤 Corin Jackson

Part 3 · Key Numbers

The Numbers That Matter Most

76.6%

Less Labeling

To reach 90% accuracy: Dream needs 19 labeled samples. The best prior method needs 84. Same result.

11–14%

AUC Boost

vs Transcendent: +11.5%, vs CADE: +12.0%, vs Probability: +13.6%. Consistent.

0.57ms

Per Sample

Detection speed. 3x faster than CADE (1.89ms). 10x faster than Transcendent (5.75ms). Real-time ready.

+18.6%

Intra-class AUC

Also beats HCC on intra-class drift (new variants within same family). Dream works for both types.

Paper curve — Dream ROC on Drebin.

Paper curve — Dream ROC on Mamadroid.

Paper curve — Dream ROC on Damd.

Paper figures — Actual ROC curves from the public arXiv version, showing Dream across three classifier settings.
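For readers less familiar with AUC, it can be read as the probability that a truly drifting sample receives a higher drift score than a non-drifting one. The sketch below computes it by pairwise comparison on made-up scores; the function name and the numbers are illustrative assumptions, not results from the paper.

```python
def auc(drift_scores, is_drift):
    """AUC as the probability that a drifting sample outscores a non-drifting one."""
    pos = [s for s, d in zip(drift_scores, is_drift) if d]
    neg = [s for s, d in zip(drift_scores, is_drift) if not d]
    if not pos or not neg:
        raise ValueError("need at least one drifting and one non-drifting sample")
    # Each (drifting, non-drifting) pair counts 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [True, True, False, True, False]
print(round(auc(scores, labels), 3))
```

A detector that ranks every drifting sample above every non-drifting one scores 1.0, and random ranking scores about 0.5.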

Teacher-friendly reading: This is not just 'slightly better'. Dream changes both cost and accuracy at the same time, which is much harder to achieve.
Why 76.6% matters: In practice, expert labeling is the expensive part. Reducing that cost is often more valuable than adding one more point of accuracy.
Why 0.57ms matters: Fast online detection means Dream can be inserted into a real pipeline without becoming the new bottleneck.
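The "3x" and "10x" speed claims follow directly from the per-sample latencies quoted on this slide. The short check below redoes that arithmetic; the throughput figure is a derived, single-stream estimate added for illustration and is not a number reported in the talk.

```python
LATENCY_MS = {"Dream": 0.57, "CADE": 1.89, "Transcendent": 5.75}

def speedup(baseline, system="Dream"):
    """How many times faster `system` is than `baseline`, per sample."""
    return LATENCY_MS[baseline] / LATENCY_MS[system]

def samples_per_second(system):
    """Single-stream throughput implied by the per-sample latency."""
    return 1000.0 / LATENCY_MS[system]

for name in ("CADE", "Transcendent"):
    print(f"Dream vs {name}: {speedup(name):.1f}x")
print(f"Dream throughput: {samples_per_second('Dream'):.0f} samples/s")
```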

Paper curve — One concrete ROC result you can point to while explaining detection quality.

Paper curve — Another real result curve that helps show Dream is not winning in only one classifier setting.

Result snapshot: numbers you can point to while speaking

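Dream = {
  "labels_for_90pct_acc": 19,   # labeled samples Dream needs
  "best_baseline_labels": 84,   # best prior method, same 90% target
  "auc_gain": "+11% ~ +14%",    # vs Transcendent, CADE, Probability
  "latency_ms": 0.57            # detection time per sample
}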

👤 Corin Jackson

Part 3 · Comparison

Dream vs Existing Methods

Property	Transcendent	CADE	HCC	Dream ✓
Model-Sensitive Detection	✗	✗	✗	✓
Data Autonomy (no train data at test)	✗	✗	✗	✓
Explanatory Adaptation	Partial	✗	✗	✓
Works for Both Drift Types	✗	Partial	Partial	✓
Detection Time per Sample	5.75ms	1.89ms	—	0.57ms

Why Dream is stronger than traditional methods

Traditional methods usually optimize one stage at a time: either better anomaly detection or better retraining. Dream improves the connection between stages, so the whole workflow becomes more efficient.

Practical takeaway

Dream selects better samples, needs fewer labels, gives more explanation, and runs faster online. That combination is the real advantage — not just one higher metric.

Paper curve — Real ROC evidence for the third classifier family, backing up the comparison slide.

Paper figure — Concept-based drift explanation heatmap on Drebin (from the Dream paper on arXiv).

What traditional methods miss

They often rank samples by generic anomaly, not classifier impact.
They usually throw away concept-level expert reasoning after labeling.
They may work locally, but the whole update loop stays inefficient.

What Dream adds

Classifier-aware sample selection.
Concept-level explanatory adaptation.
A tighter connection between detection, explanation, and retraining.

👤 Corin Jackson

Part 3 · One-line Summary

Dream Is Better at Choosing Samples — and Better at Using Samples

Chooses better

Dream's detection is closer to what the classifier actually cares about. So the selected drift samples are more meaningful and more useful for updating the model.

Uses better

Dream does not use only labels during adaptation. It also uses concept-level expert feedback, so each labeled sample carries much richer information.

👤 Corin Jackson

Part 3 · Summary

Three Things to Remember

Paper figure — Real paper diagram summarizing how Dream connects detection, concepts, and adaptation.

🔍

Smart Detection

Classifier + autoencoder in one system. If the classifier changes its mind on a rebuilt sample, that's drift. No training data needed at test time.

💬

Experts Do More

Not just label — explain WHY. Concept-level feedback goes into the model. Every sample is far more powerful.

🚀

Big Real Savings

76.6% less labeling work. About 3x faster than CADE and 10x faster than Transcendent. Works for new families AND new variants. All in one system.

If the teacher asks 'Why should I care?': Because Dream improves not just a score, but the whole maintenance loop of a malware classifier.
If the teacher asks 'What is the core novelty?': The novelty is the bridge: concept-aware detection tied to the classifier, plus concept-level feedback reused during adaptation.
If the teacher asks 'What is the real-world value?': It lowers human labeling cost while keeping the detection loop fast enough for deployment.

🎤

Questions & Discussion

What would you like to ask about Dream?

🔍

Detection

How does the concept reliability loss actually work in practice?

💬

Adaptation

How do experts actually provide concept revisions? Is there a UI?

🚀

Real World

Could Dream be used for other domains beyond malware?