# SharpAI: Are Local Small Models Approaching the Usable Zone of Cloud Models?
## Background
SharpAI is a team focused on local-first AI; its product, Aegis-AI, is a home security system that runs AI locally on consumer hardware. The team released HomeSec-Bench, a benchmark for evaluating LLM performance in realistic home security scenarios.
The core question behind the benchmark: can a 9B-parameter local model come close to cloud-tier models in a specific vertical domain (home security)?
## HomeSec-Bench Overview
### Test Design
- Test Count: 96 LLM tests + 35 VLM tests
- Test Suites: 15 dimensions
- Test Images: All AI-generated (no real user footage)
- Model Compatibility: Any OpenAI-compatible endpoint
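Because the benchmark targets any OpenAI-compatible endpoint, the same request body can be sent to a local llama-server or a cloud API, with only the base URL and credentials differing. The sketch below illustrates this; the URLs and model name are illustrative assumptions, not values published by SharpAI.

```python
# Minimal sketch: one chat-completions payload works against both a local
# llama.cpp server and a cloud endpoint; only the base URL changes.
# Model name and system prompt here are hypothetical examples.

def build_request(model: str, user_msg: str) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a home security assistant."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.0,  # deterministic output helps benchmark scoring
    }

# Local llama.cpp server (OpenAI-compatible route):
#   POST http://localhost:8080/v1/chat/completions
# Cloud:
#   POST https://api.openai.com/v1/chat/completions
payload = build_request("qwen3.5-9b-q4_k_m", "Classify: person at front door, 3 AM")
```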
### Test Dimensions
| # | Suite Name | Tests | Evaluates |
|---|---|---|---|
| 1 | Context Preprocessing | 6 | Conversation deduplication, system message preservation |
| 2 | Topic Classification | 4 | Routing queries to correct domain |
| 3 | Knowledge Distillation | 5 | Extracting durable facts from conversations |
| 4 | Event Deduplication | 8 | Identifying same person across cameras |
| 5 | Tool Use | 16 | Selecting correct tools with correct parameters |
| 6 | Chat & JSON Compliance | 11 | Persona, JSON output, multilingual |
| 7 | Security Classification | 12 | Normal → Monitor → Suspicious → Critical |
| 8 | Narrative Synthesis | 4 | Generating daily reports from event logs |
| 9 | Prompt Injection Resistance | 4 | Role confusion, prompt extraction, escalation |
| 10 | Multi-Turn Reasoning | 4 | Reference resolution, temporal carry-over |
| 11 | Error Recovery | 4 | Handling impossible queries, API errors |
| 12 | Privacy & Compliance | 3 | PII redaction, illegal surveillance rejection |
| 13 | Alert Routing | 5 | Channel routing, quiet hours parsing |
| 14 | Knowledge Injection | 5 | Using injected knowledge for personalized responses |
| 15 | VLM-to-Alert Triage | 5 | End-to-end: VLM output → urgency → alert dispatch |
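To make the suites concrete, here is a hypothetical sketch of what a single Security Classification test case and its grader could look like. HomeSec-Bench's actual schema is not published in this post; the field names and grading rule below are assumptions.

```python
# Hypothetical test-case shape for suite 7 (Security Classification).
# The four severity labels come from the table above; everything else
# is an illustrative assumption.

SEVERITY = ["normal", "monitor", "suspicious", "critical"]

case = {
    "suite": "Security Classification",
    "prompt": "Unknown person loitering by the back window for 10 minutes at 2 AM.",
    "expected": "suspicious",
}

def grade(model_output: str, expected: str) -> bool:
    """Pass iff the model emits exactly one valid severity label matching expected."""
    label = model_output.strip().lower()
    return label in SEVERITY and label == expected
```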
## Benchmark Results
### Full Leaderboard
| Rank | Model | Type | Passed | Failed | Pass Rate | Total Time |
|---|---|---|---|---|---|---|
| 🥇 1 | GPT-5.4 | ☁️ Cloud | 94 | 2 | 97.9% | 2m 22s |
| 🥈 2 | GPT-5.4-mini | ☁️ Cloud | 92 | 4 | 95.8% | 1m 17s |
| 🥉 3 | Qwen3.5-9B (Q4_K_M) | 🏠 Local | 90 | 6 | 93.8% | 5m 23s |
| 3 | Qwen3.5-27B (Q4_K_M) | 🏠 Local | 90 | 6 | 93.8% | 15m 8s |
| 5 | Qwen3.5-122B-MoE (IQ1_M) | 🏠 Local | 89 | 7 | 92.7% | 8m 26s |
| 5 | GPT-5.4-nano | ☁️ Cloud | 89 | 7 | 92.7% | 1m 34s |
| 7 | Qwen3.5-35B-MoE (Q4_K_L) | 🏠 Local | 88 | 8 | 91.7% | 3m 30s |
| 8 | GPT-5-mini (2025) | ☁️ Cloud | 60 | 36 | 62.5%* | 7m 38s |
\* Most GPT-5-mini failures were caused by the API rejecting non-default temperature values; this is an API limitation, not a reflection of model capability.
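A benchmark harness can hedge against this kind of parameter rejection by retrying without the offending field. The sketch below assumes the endpoint raises an error mentioning `temperature`; the `send` callable and error text are stand-ins, not the real API behavior.

```python
# Fallback sketch: if an endpoint rejects a non-default temperature,
# retry the same request with the parameter stripped.

def call_with_fallback(send, payload: dict) -> dict:
    """Send payload; on a temperature rejection, retry without that field."""
    try:
        return send(payload)
    except ValueError as err:
        if "temperature" in str(err) and "temperature" in payload:
            retry = {k: v for k, v in payload.items() if k != "temperature"}
            return send(retry)
        raise

# Stand-in endpoint mimicking the reported rejection behaviour.
def strict_endpoint(payload: dict) -> dict:
    if payload.get("temperature") not in (None, 1.0):
        raise ValueError("unsupported value for temperature")
    return {"status": "ok"}

result = call_with_fallback(strict_endpoint, {"model": "gpt-5-mini", "temperature": 0.0})
```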
## Performance Comparison
### Time to First Token (TTFT)
| Model | Type | TTFT (avg) | TTFT (p95) |
|---|---|---|---|
| Qwen3.5-35B-MoE | 🏠 Local | 435ms | 673ms |
| GPT-5.4-nano | ☁️ Cloud | 508ms | 990ms |
| GPT-5.4-mini | ☁️ Cloud | 553ms | 805ms |
| GPT-5.4 | ☁️ Cloud | 601ms | 1052ms |
| Qwen3.5-9B | 🏠 Local | 765ms | 1437ms |
| Qwen3.5-122B-MoE | 🏠 Local | 1627ms | 2331ms |
| Qwen3.5-27B | 🏠 Local | 2156ms | 3642ms |
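The avg/p95 columns can be reproduced from per-request TTFT samples. Here is a minimal sketch using the common nearest-rank definition of the 95th percentile; the sample values are made up for illustration, not taken from the benchmark run.

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    s = sorted(samples)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]

# Illustrative TTFT samples in milliseconds (not real benchmark data).
ttft_ms = [410, 430, 440, 455, 470, 480, 500, 520, 610, 680]
avg = sum(ttft_ms) / len(ttft_ms)   # average TTFT
tail = p95(ttft_ms)                 # p95 TTFT
```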
### Decode Speed (tokens/second)
| Model | Type | Decode Speed |
|---|---|---|
| GPT-5.4-mini | ☁️ Cloud | 234.5 tok/s |
| GPT-5.4-nano | ☁️ Cloud | 136.4 tok/s |
| GPT-5.4 | ☁️ Cloud | 73.4 tok/s |
| Qwen3.5-35B-MoE | 🏠 Local | 41.9 tok/s |
| Qwen3.5-9B | 🏠 Local | 25.0 tok/s |
| Qwen3.5-122B-MoE | 🏠 Local | 18.0 tok/s |
| Qwen3.5-27B | 🏠 Local | 10.0 tok/s |
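Decode speed is simply generated tokens divided by decode wall time (excluding prefill). A one-line sketch, with made-up numbers chosen to land near the Qwen3.5-9B row:

```python
def decode_tok_per_s(n_generated: int, decode_seconds: float) -> float:
    """Tokens per second over the decode phase only (prefill excluded)."""
    return n_generated / decode_seconds

# e.g. 250 tokens generated in 10 s of decode time -> 25 tok/s
speed = decode_tok_per_s(250, 10.0)
```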
### Memory Usage
| Model | GPU Memory |
|---|---|
| Qwen3.5-122B-MoE | 40.8 GB |
| Qwen3.5-35B-MoE | 27.2 GB |
| Qwen3.5-27B | 24.9 GB |
| Qwen3.5-9B | 13.8 GB |
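A back-of-envelope check helps interpret these figures: Q4_K_M stores weights at roughly 4.8 bits per weight (an approximation; the exact rate varies per tensor), so the weights of a 9B model alone are around 5.4 GB. The table reports whole-process GPU memory, which also includes the KV cache and runtime buffers, hence the higher totals.

```python
# Weights-only footprint estimate; ~4.8 bits/weight for Q4_K_M is an
# approximation, and measured totals above also include KV cache and
# runtime buffers, so they exceed this estimate.

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (decimal) for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9

q4_9b = weights_gb(9e9, 4.8)   # roughly 5.4 GB of weights
```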
## Key Findings Analysis
### 1. Official Claims
SharpAI's official highlights:
- Qwen3.5-9B achieves 93.8% pass rate on MacBook Pro M5
- Gap with GPT-5.4 is only 4.1 percentage points
- Actually beats GPT-5.4-nano by 1 percentage point
- Zero API costs, complete data privacy
- Requires only 13.8 GB unified memory
### 2. Observable Facts
- Test Environment: MacBook Pro M5 (M5 Pro chip, 18 cores, 64GB unified memory), macOS 15.3
- Local Inference Engine: llama.cpp (llama-server)
- Test Suite: HomeSec-Bench v1, created by SharpAI
- Test Data: All AI-generated (35 images)
- Cloud Comparison: OpenAI API only (no Claude, Gemini, etc.)
### 3. Areas Requiring Judgment
| Factor | Note |
|---|---|
| Self-Built Benchmark | HomeSec-Bench designed by SharpAI, test set may be biased toward their product use cases |
| Domain Specificity | Home security is a vertical domain; results may differ in other scenarios (e.g., coding assistance, creative writing) |
| Synthetic Test Data | All images are AI-generated, lacking complexity of real home environments |
| Quantization Impact | Different quantization methods used (Q4_K_M, IQ1_M); impact on accuracy not analyzed separately |
| Limited Comparison | Only OpenAI cloud models compared; Claude, Gemini, and other competitors not included |
| Hardware Limitation | Apple Silicon only; performance on NVIDIA GPUs may differ |
## My Assessment
### Local Small Models' Progress Is Real
The Qwen3.5 series performs impressively on vertical-domain tasks. A 9B model reaching a 93.8% pass rate under tight resource constraints points to:
- Mature Quantization: Q4_K_M quantization significantly reduces memory requirements while maintaining performance
- Efficient MoE Architecture: Qwen3.5-35B-MoE's average TTFT even beats every tested OpenAI cloud model
- Vertical Domain Friendly: Gap with cloud models can be acceptable in specific domains
### But Maintain a Rational Perspective
- Benchmark Credibility: a self-built benchmark can be tuned, even unintentionally, toward its creator's models and use cases; third-party benchmarks (e.g., MMLU, HellaSwag) may point to different conclusions
- Generalization Gap: Cloud models still have advantages in generalization, especially on unseen tasks
- Ecosystem Completeness: Cloud models typically come with better tooling and services; local models require additional maintenance
### Use Case Recommendations
| Scenario | Recommendation |
|---|---|
| Privacy-sensitive | ✅ Local model |
| Cost-sensitive | ✅ Local model |
| Need latest model capabilities | ⚠️ Cloud model |
| Need strong generalization | ⚠️ Cloud model |
| Specific vertical domain | ✅ Local model viable |
## Conclusion
SharpAI's HomeSec-Bench suggests that in specific vertical domains (such as home security), local 9B models are indeed approaching the usable zone of cloud-tier models. With a 93.8% pass rate, a 13.8 GB memory footprint, and zero API costs, Qwen3.5-9B opens new possibilities for local AI applications.
However, this conclusion should be understood in context:
- The benchmark was created by the product team itself, so bias is possible
- The test scenario is highly vertical (home security); generalization to other tasks is unproven
- Cloud models retain advantages in general tasks and access to the latest capabilities
Local small models are approaching the usable zone of cloud models, at least in specific vertical domains.