
SharpAI: Are Local Small Models Approaching the Usable Zone of Cloud Models?

Background

SharpAI is a team focused on local-first AI; their product, Aegis-AI, is a local AI home security system that runs on consumer hardware. They released HomeSec-Bench, a benchmark that evaluates LLM performance in realistic home security scenarios.

The benchmark's core question: can a 9B-parameter local model come close to cloud-tier models in a specific vertical domain (home security)?

HomeSec-Bench Overview

Test Design

  • Test Count: 96 LLM tests + 35 VLM tests
  • Test Suites: 15 dimensions
  • Test Images: All AI-generated (no real user footage)
  • Model Compatibility: Any OpenAI-compatible endpoint
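Since the harness targets any OpenAI-compatible endpoint, its requests follow the standard `/v1/chat/completions` shape. A minimal sketch of building such a payload is below; the base URL, model id, and prompt text are illustrative assumptions, not values taken from the benchmark.

```python
import json

# Sketch of the request body a HomeSec-Bench-style harness would send to any
# OpenAI-compatible /v1/chat/completions endpoint (cloud or local).
BASE_URL = "http://localhost:8080/v1"  # assumed: a local llama-server


def build_chat_request(model: str, system: str, user: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.0,  # deterministic runs make pass/fail scoring stable
    }


payload = build_chat_request(
    "qwen3.5-9b-q4_k_m",  # hypothetical model id for illustration
    "You are a home security assistant.",
    "Classify this event: unknown person at the front door at 3am.",
)
print(json.dumps(payload, indent=2))
```

Because every model under test speaks this same wire format, swapping a cloud model for a local one is just a change of base URL and model id.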

Test Dimensions

| # | Suite Name | Tests | Evaluates |
|---|---|---|---|
| 1 | Context Preprocessing | 6 | Conversation deduplication, system message preservation |
| 2 | Topic Classification | 4 | Routing queries to the correct domain |
| 3 | Knowledge Distillation | 5 | Extracting durable facts from conversations |
| 4 | Event Deduplication | 8 | Identifying the same person across cameras |
| 5 | Tool Use | 16 | Selecting correct tools with correct parameters |
| 6 | Chat & JSON Compliance | 11 | Persona, JSON output, multilingual |
| 7 | Security Classification | 12 | Normal → Monitor → Suspicious → Critical |
| 8 | Narrative Synthesis | 4 | Generating daily reports from event logs |
| 9 | Prompt Injection Resistance | 4 | Role confusion, prompt extraction, escalation |
| 10 | Multi-Turn Reasoning | 4 | Reference resolution, temporal carry-over |
| 11 | Error Recovery | 4 | Handling impossible queries, API errors |
| 12 | Privacy & Compliance | 3 | PII redaction, illegal surveillance rejection |
| 13 | Alert Routing | 5 | Channel routing, quiet-hours parsing |
| 14 | Knowledge Injection | 5 | Using injected knowledge for personalized responses |
| 15 | VLM-to-Alert Triage | 5 | End-to-end: VLM output → urgency → alert dispatch |
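Suite 15's end-to-end flow (VLM output → urgency → alert dispatch) can be pictured as two small mapping steps: a caption from the vision model is mapped onto one of Suite 7's four security levels, then the level is routed to a delivery channel. The sketch below is hypothetical; the keyword rules and channel names are assumptions for illustration, not SharpAI's actual pipeline.

```python
# Hypothetical two-step triage: VLM caption -> security level -> alert channel.
# Keyword rules and channel names are illustrative assumptions.
LEVELS = ["Normal", "Monitor", "Suspicious", "Critical"]


def classify(caption: str) -> str:
    """Map a VLM caption to one of the four security levels."""
    text = caption.lower()
    if any(k in text for k in ("weapon", "forced entry", "broken window")):
        return "Critical"
    if any(k in text for k in ("unknown person", "loitering")):
        return "Suspicious"
    if any(k in text for k in ("package", "unfamiliar vehicle")):
        return "Monitor"
    return "Normal"


def route(level: str) -> str:
    """Assumed policy: higher urgency gets a more intrusive channel."""
    return {
        "Critical": "phone_call",
        "Suspicious": "push_notification",
        "Monitor": "app_inbox",
        "Normal": "log_only",
    }[level]


level = classify("Unknown person loitering near the back gate")
print(level, "->", route(level))  # Suspicious -> push_notification
```

A real system would use the LLM itself for the classification step; the point of the suite is that the model must produce a level and a routing decision that agree end to end.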

Benchmark Results

Full Leaderboard

| Rank | Model | Type | Passed | Failed | Pass Rate | Total Time |
|---|---|---|---|---|---|---|
| 🥇 1 | GPT-5.4 | ☁️ Cloud | 94 | 2 | 97.9% | 2m 22s |
| 🥈 2 | GPT-5.4-mini | ☁️ Cloud | 92 | 4 | 95.8% | 1m 17s |
| 🥉 3 | Qwen3.5-9B (Q4_K_M) | 🏠 Local | 90 | 6 | 93.8% | 5m 23s |
| 3 | Qwen3.5-27B (Q4_K_M) | 🏠 Local | 90 | 6 | 93.8% | 15m 8s |
| 5 | Qwen3.5-122B-MoE (IQ1_M) | 🏠 Local | 89 | 7 | 92.7% | 8m 26s |
| 5 | GPT-5.4-nano | ☁️ Cloud | 89 | 7 | 92.7% | 1m 34s |
| 7 | Qwen3.5-35B-MoE (Q4_K_L) | 🏠 Local | 88 | 8 | 91.7% | 3m 30s |
| 8 | GPT-5-mini (2025) | ☁️ Cloud | 60 | 36 | 62.5%\* | 7m 38s |

\* Most GPT-5-mini failures were caused by the API rejecting non-default `temperature` values; this is an API limitation, not a reflection of model capability.

Performance Comparison

Time to First Token (TTFT)

| Model | Type | TTFT (avg) | TTFT (p95) |
|---|---|---|---|
| Qwen3.5-35B-MoE | 🏠 Local | 435 ms | 673 ms |
| GPT-5.4-nano | ☁️ Cloud | 508 ms | 990 ms |
| GPT-5.4-mini | ☁️ Cloud | 553 ms | 805 ms |
| GPT-5.4 | ☁️ Cloud | 601 ms | 1052 ms |
| Qwen3.5-9B | 🏠 Local | 765 ms | 1437 ms |
| Qwen3.5-122B-MoE | 🏠 Local | 1627 ms | 2331 ms |
| Qwen3.5-27B | 🏠 Local | 2156 ms | 3642 ms |

Decode Speed (tokens/second)

| Model | Type | Decode Speed |
|---|---|---|
| GPT-5.4-mini | ☁️ Cloud | 234.5 tok/s |
| GPT-5.4-nano | ☁️ Cloud | 136.4 tok/s |
| GPT-5.4 | ☁️ Cloud | 73.4 tok/s |
| Qwen3.5-35B-MoE | 🏠 Local | 41.9 tok/s |
| Qwen3.5-9B | 🏠 Local | 25.0 tok/s |
| Qwen3.5-122B-MoE | 🏠 Local | 18.0 tok/s |
| Qwen3.5-27B | 🏠 Local | 10.0 tok/s |

Memory Usage

| Model | GPU Memory |
|---|---|
| Qwen3.5-35B-MoE | 27.2 GB |
| Qwen3.5-9B | 13.8 GB |
| Qwen3.5-122B-MoE | 40.8 GB |
| Qwen3.5-27B | 24.9 GB |

Key Findings Analysis

1. Official Claims

SharpAI's official highlights:

  • Qwen3.5-9B achieves 93.8% pass rate on MacBook Pro M5
  • Gap with GPT-5.4 is only 4.1 percentage points
  • Actually beats GPT-5.4-nano by 1 percentage point
  • Zero API costs, complete data privacy
  • Requires only 13.8 GB unified memory

2. Observable Facts

  • Test Environment: MacBook Pro M5 (M5 Pro chip, 18 cores, 64GB unified memory), macOS 15.3
  • Local Inference Engine: llama.cpp (llama-server)
  • Test Suite: HomeSec-Bench v1, created by SharpAI
  • Test Data: All AI-generated (35 images)
  • Cloud Comparison: OpenAI API only (no Claude, Gemini, etc.)

3. Areas Requiring Judgment

| Factor | Note |
|---|---|
| Self-Built Benchmark | HomeSec-Bench was designed by SharpAI; the test set may be biased toward their product's use cases |
| Domain Specificity | Home security is a vertical domain; results may differ in other scenarios (e.g., coding assistance, creative writing) |
| Synthetic Test Data | All images are AI-generated, lacking the complexity of real home environments |
| Quantization Impact | Different quantization methods were used (Q4_K_M, IQ1_M); their impact on accuracy was not analyzed separately |
| Limited Comparison | Only OpenAI cloud models were compared; Claude, Gemini, and other competitors were not included |
| Hardware Limitation | Apple Silicon only; performance on NVIDIA GPUs may differ |

My Assessment

Local Small Models' Progress Is Real

The Qwen3.5 series performs impressively on vertical-domain tasks. A 9B model reaching a 93.8% pass rate under tight resource constraints points to three things:

  1. Mature Quantization: Q4_K_M quantization cuts memory requirements sharply while largely preserving accuracy
  2. Efficient MoE Architecture: Qwen3.5-35B-MoE's average TTFT beats every OpenAI cloud model in this test
  3. Vertical-Domain Friendliness: in a narrow domain, the gap with cloud models can shrink to an acceptable level
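The quantization point can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameters × bits-per-weight ÷ 8, and the rest of the footprint (KV cache, compute buffers) comes on top. The ~4.85 bits/weight figure for Q4_K_M used below is an approximate assumption, not a measured value from this benchmark.

```python
# Rough estimate of quantized weight memory: params * bits-per-weight / 8.
# The ~4.85 bits/weight for Q4_K_M is an approximate assumption; the gap to
# the reported 13.8 GB total is KV cache, buffers, and other runtime overhead.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


q4_weights = weight_gb(9, 4.85)   # quantized 9B weights
fp16_weights = weight_gb(9, 16)   # unquantized FP16 baseline, for comparison
print(f"Q4_K_M: {q4_weights:.1f} GB, FP16: {fp16_weights:.1f} GB")
```

The same arithmetic explains why an FP16 9B model would not fit comfortably alongside the OS and KV cache on smaller unified-memory machines, while the Q4_K_M build does.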

But Maintain a Rational Perspective

  1. Benchmark Credibility: a self-built benchmark can be tuned, even unintentionally, toward the builder's own use cases; third-party benchmarks (such as MMLU or HellaSwag) might lead to different conclusions
  2. Generalization Gap: cloud models still hold an advantage in generalization, especially on unseen tasks
  3. Ecosystem Completeness: cloud models typically ship with better tooling and managed services, while local models require extra setup and maintenance

Use Case Recommendations

| Scenario | Recommendation |
|---|---|
| Privacy-sensitive | ✅ Local model |
| Cost-sensitive | ✅ Local model |
| Need latest model capabilities | ⚠️ Cloud model |
| Need strong generalization | ⚠️ Cloud model |
| Specific vertical domain | ✅ Local model viable |

Conclusion

SharpAI's HomeSec-Bench suggests that in specific vertical domains (such as home security), local 9B models are indeed approaching the usable zone of cloud-tier models. With a 93.8% pass rate, a 13.8 GB memory footprint, and zero API costs, Qwen3.5-9B opens new possibilities for local AI applications.

However, this conclusion should be understood in context:

  • The benchmark was created by the product team, so potential bias exists
  • The test scenario is highly vertical (home security); generalization to other tasks is untested
  • Cloud models still lead on general tasks and frontier capabilities

Local small models are approaching the usable zone of cloud modelsโ€”at least in specific domains.

Original Sources