Skip to main content
← Back to the Research Program
Safety & TrustHigh School researchExample brief

Can a Small Model Be Prompted to Refuse Unsafe Requests?

The research question

Does adding safety instructions to the prompt reliably make a small model refuse unsafe requests?

Abstract

I tested a small model on borderline requests with and without a safety instruction in the prompt. The instruction helped, but refusals were not fully reliable.

Background

Prompt-based safety is cheap and common. I wanted to measure how reliable it actually is on a small model.

What I did

I wrote 20 requests that should be refused and ran them with a plain prompt and with an added safety instruction, scoring the refusals.

What I found

The safety instruction raised the refusal rate substantially, but the model still complied with some requests it should have refused.

What's next

I would compare prompt-based safety with other methods and test how easily the instruction can be bypassed.

Takeaway

Prompt-based safety helps but is not enough on its own — reliable safety needs more than instructions.