# Home Assistant Performance Fix & Infrastructure Migration
## Problem Summary
Home Assistant at `ha.hideawaygaming.com.au` (Home Assistant 2026.5.3 on HAOS) periodically becomes unresponsive. Because critical infrastructure services (AdGuard DNS, Tailscale VPN, Guacamole RDP) all run as HA add-ons inside the same VM, any HA freeze causes house-wide network and access failures.
### Root Causes Identified
| Issue | Impact |
|-------|--------|
| Memory at 87% (4.45 GB) | VM swaps under load → unresponsive |
| 2,330 entities, 775 unavailable (33%) | Wasted memory and CPU tracking stale entities |
| ~1,007 state changes/hour (16.8/min) | Recorder DB I/O bottleneck |
| browser_mod: 228 entities (200 stale) | Biggest source of entity bloat |
| iCloud3: 1,000+ state changes/4hr | Aggressive polling floods state machine |
| Frigate occupancy flapping ~97x/hr | Detection zones too sensitive |
| 3 time sensors × 60 changes/hr = 180/hr | Pointless recorder writes |
| Guacamole using 25% CPU / 9% RAM | Heavy add-on consuming HA resources |
| AdGuard (network DNS) inside HA | Single point of failure |
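
The rates and ratios in the table follow directly from the raw counts. A small sketch that re-derives them (counts copied from the table; `pct` and `per_min` are hypothetical helper names, and `awk` does the floating-point division):

```bash
#!/usr/bin/env bash
# Re-derive the rates and ratios quoted in the Root Causes table.
set -euo pipefail

# pct PART TOTAL -> PART as a percentage of TOTAL, rounded to a whole number
pct() { awk -v p="$1" -v t="$2" 'BEGIN { printf "%.0f\n", 100 * p / t }'; }

# per_min PER_HOUR -> the same rate per minute, one decimal place
per_min() { awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 60 }'; }

echo "Unavailable entities: $(pct 775 2330)% of 2,330"
echo "State changes/min:    $(per_min 1007)"
# Each of the three minute-resolution time sensors changes state once a minute.
echo "Time sensor changes/hr: $(( 3 * 60 ))"
```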
---
## Fix Plan (Priority Order)
### Phase 1: Immediate — Recorder Exclude (10 minutes)
Apply `recorder_exclude.yaml` to stop recording high-churn, low-value entities.
**Steps:**
1. SSH into HAOS or use the File Editor add-on
2. Open `/config/configuration.yaml`
3. If you already have a `recorder:` section, merge the excludes from `recorder_exclude.yaml` into it
4. If you don't have one, copy the entire contents of `recorder_exclude.yaml` into `configuration.yaml`
5. Restart HA: Settings → System → Restart

**Expected impact:** recorded state changes should fall from ~1,007/hr to roughly 200-300/hr, with a corresponding reduction in disk I/O and memory usage.
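
Steps 3-5 can be scripted from the Terminal & SSH add-on. A minimal sketch, assuming the default `/config/configuration.yaml` path and a single-file config (a `recorder:` defined through packages or split files won't be detected). The function names are hypothetical; `ha core check` and `ha core restart` are the Supervisor CLI commands available in that add-on:

```bash
#!/usr/bin/env bash
# Sketch for the Terminal & SSH add-on; assumes a single-file /config/configuration.yaml.
set -euo pipefail

# has_recorder_section FILE
#   Succeeds if FILE has a top-level (unindented) `recorder:` key.
has_recorder_section() {
  grep -Eq '^recorder:' "$1"
}

# next_step FILE -> prints which step above applies: 3 (merge) or 4 (copy in)
next_step() {
  if has_recorder_section "$1"; then echo 3; else echo 4; fi
}

# Only restart if the configuration validates.
check_and_restart() {
  ha core check && ha core restart
}

# Example:
#   next_step /config/configuration.yaml
#   check_and_restart
```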
### Phase 2: Immediate — Entity Cleanup (20 minutes)
**browser_mod stale sessions:**
1. Go to Developer Tools → Actions (called Services before HA 2024.8)
2. For each stale browser_mod entity, call the browser_mod action that unregisters it
3. Alternatively: Settings → Devices & Services → browser_mod → Remove stale device entries
4. Target: reduce from 228 to ~20-30 active entities

**Plex media player cleanup:**
1. Settings → Devices & Services → Plex
2. Click through each device — delete any showing as "Unavailable"
3. Target: reduce from 59 to ~5-10 active clients

**Pioneer VSX-832 duplicates:**
1. Settings → Devices & Services → Onkyo
2. You should see multiple "Pioneer VSX-832" devices
3. Keep only the working one (likely the one showing state "off" or "on")
4. Delete the rest (showing "unavailable")
5. Target: reduce from 7 to 1-2 entities

**F1 Sensor (off-season):**
1. Settings → Devices & Services → F1 Sensor (76 entities, 42 currently unavailable)
2. Consider disabling the integration during the off-season
3. Or leave it enabled; the recorder exclude will stop it from writing history
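
To confirm the cleanup is working, unavailable entities can be counted through HA's REST API before and after. A sketch assuming `curl` and `jq` are installed, `HA_URL` points at the instance and `HA_TOKEN` holds a long-lived access token created from your user profile; the helper names are hypothetical. Grouping is by entity domain, because the entity ID alone doesn't say which integration owns it:

```bash
#!/usr/bin/env bash
# Count unavailable entities via the HA REST API (sketch; needs curl and jq).
# Assumes HA_URL (e.g. https://ha.hideawaygaming.com.au) and HA_TOKEN are set.
set -euo pipefail

# fetch_states -> raw JSON array from GET /api/states
fetch_states() {
  curl -fsS -H "Authorization: Bearer ${HA_TOKEN}" "${HA_URL}/api/states"
}

# count_unavailable: reads /api/states JSON on stdin, prints the total
count_unavailable() {
  jq '[.[] | select(.state == "unavailable")] | length'
}

# unavailable_by_domain: reads /api/states JSON on stdin, prints "count domain"
# lines, largest first (domain = the part of the entity ID before the dot)
unavailable_by_domain() {
  jq -r '.[] | select(.state == "unavailable") | .entity_id | split(".") | .[0]' \
    | sort | uniq -c | sort -rn
}

# Example:
#   fetch_states | count_unavailable        # compare with 775 before cleanup
#   fetch_states | unavailable_by_domain
```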
### Phase 3: Tune Noisy Integrations (15 minutes)
**iCloud3 — reduce polling frequency:**
1. Open the iCloud3 configuration (via the HA integration options or the config file)
2. Increase `inzone_interval` from the default to 30-60 minutes
3. Increase the general polling interval
4. This alone should cut up to ~1,000 state changes per 4 hours (~250/hr)

**Frigate — fix driveway zone flapping:**
1. In your Frigate config, for the driveway camera zones:
   - Increase `min_area` in the `car` object filter (it is currently triggering on shadows/reflections)
   - Raise the zone's `inertia` (the number of consecutive frames an object must be detected in the zone before it counts as present) to damp rapid on/off toggling
   - Consider disabling `cat` and `dog` detection on the driveway if not needed
2. The `driveway_car_occupancy` and `driveway_pavement_car_occupancy` sensors are each toggling ~97x/hour
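
Whether the Frigate change worked can be measured with HA's history API by counting how often an occupancy sensor changed state over the last hour. A sketch assuming `curl`, `jq` and GNU `date`, with `HA_URL` and `HA_TOKEN` (a long-lived access token) set; the helper names are hypothetical, and the `binary_sensor.` prefix on the entity ID is an assumption about how the Frigate integration named these sensors:

```bash
#!/usr/bin/env bash
# Count state changes of one entity over the last hour via the HA history API.
# Sketch; needs curl, jq and GNU date, with HA_URL and HA_TOKEN set.
set -euo pipefail

# transitions: reads /api/history/period JSON on stdin and prints how many times
# the first entity changed state (entries returned minus the starting state).
transitions() {
  jq '.[0] | if length > 0 then length - 1 else 0 end'
}

# flaps_last_hour ENTITY_ID
flaps_last_hour() {
  local start
  start="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S+00:00)"
  curl -fsS -G -H "Authorization: Bearer ${HA_TOKEN}" \
    --data-urlencode "filter_entity_id=$1" \
    --data-urlencode "minimal_response" \
    "${HA_URL}/api/history/period/${start}" | transitions
}

# Example (entity ID assumed; check the real one under Settings → Entities):
#   flaps_last_hour binary_sensor.driveway_car_occupancy
```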

**Time sensors — remove duplicates:**
1. If `sensor.time`, `sensor.time_2`, and `sensor.date_time` come from `platform: time_date` entries under `sensor:` in `configuration.yaml`, remove the duplicates
2. Keep only one if needed for automations, or rely on HA's built-in `now()` in templates instead
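
To find where the duplicates are defined, the YAML can be searched for `time_date` platform entries. A sketch: it only sees YAML (sensors added through the UI won't show up), it assumes the key is written as `platform: time_date`, and the function name is hypothetical:

```bash
#!/usr/bin/env bash
# List YAML `platform: time_date` entries so duplicate definitions stand out.
set -euo pipefail

# find_time_date FILE... -> prints "file:line:text" per match; no output if none
find_time_date() {
  grep -HnE '^[[:space:]]*(-[[:space:]]*)?platform:[[:space:]]*time_date' "$@" || true
}

# Example (add package or split files if you use them):
#   find_time_date /config/configuration.yaml
```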

**UpdatePowerUsageFast automation:**
1. Settings → Automations & Scenes → UpdatePowerUsageFast
2. Change the time pattern trigger from every 1 minute to every 5 minutes
3. Cuts runs from 60 to 12 per hour (48 fewer)
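
Assuming the trigger is a `time_pattern`, setting `minutes: "/5"` makes it fire every five minutes. The run-count arithmetic for this change, as a quick sketch (`runs_per_hour` is a hypothetical helper):

```bash
#!/usr/bin/env bash
# How many times a time_pattern trigger fires per hour at a given interval.
set -euo pipefail

# runs_per_hour EVERY_N_MINUTES (N should divide 60 evenly)
runs_per_hour() {
  echo $(( 60 / $1 ))
}

before="$(runs_per_hour 1)"   # current: every minute
after="$(runs_per_hour 5)"    # proposed: every 5 minutes
echo "before=${before}/hr after=${after}/hr saved=$(( before - after ))/hr"
# prints: before=60/hr after=12/hr saved=48/hr
```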
### Phase 4: Increase VM Memory (5 minutes)
On the Proxmox host:
1. Shut down the HAOS VM (or hot-plug if supported)
2. Increase RAM from current allocation to **8 GB**
3. HAL-HOST has 134 GB total with 78% used (~29 GB free), so there is headroom
4. Start the VM
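
The same resize can be done from the Proxmox shell with `qm`. A sketch, assuming a shutdown-resize-start cycle rather than hot-plug, and that you look up the HAOS VM's ID with `qm list` (the `100` below is a placeholder). `qm set --memory` takes the size in MiB, so 8 GB is `8192`. The function name and the `QM` dry-run switch are hypothetical conveniences:

```bash
#!/usr/bin/env bash
# Sketch: resize the HAOS VM's RAM on the Proxmox host (Phase 4 steps 1-4).
set -euo pipefail

# resize_vm_memory VMID MEM_MIB
#   Set QM=echo to print the commands instead of running them (dry run).
resize_vm_memory() {
  local vmid="$1" mem_mib="$2"
  local qm="${QM:-qm}"
  "$qm" shutdown "$vmid"                  # clean ACPI shutdown of the guest
  "$qm" set "$vmid" --memory "$mem_mib"   # memory is given in MiB
  "$qm" start "$vmid"
}

# Example (100 is a placeholder VMID; check `qm list`):
#   QM=echo resize_vm_memory 100 $(( 8 * 1024 ))   # dry run first
#   resize_vm_memory 100 $(( 8 * 1024 ))
```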
### Phase 5: Migrate AdGuard to LXC (30 minutes)
**This is the most important architectural change.** Network DNS must not depend on HA stability.
See `setup-adguard-lxc.sh` — run on the Proxmox host.
```bash
# Copy to Proxmox host
scp setup-adguard-lxc.sh root@10.0.0.x:/root/
ssh root@10.0.0.x
# Then, on the Proxmox host (CT ID defaults to 120; pass another ID to override):
chmod +x /root/setup-adguard-lxc.sh
/root/setup-adguard-lxc.sh 120
```
**Post-setup migration:**
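1. Access the new AdGuard at `http://10.0.0.53:80` (a fresh AdGuard Home install serves its first-run setup wizard on port 3000, so try `http://10.0.0.53:3000` if port 80 does not answer)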
2. Complete the setup wizard
3. Export config from HA's AdGuard add-on web UI and import to new instance
4. Migrate filter lists, client settings, parental controls, DNS rewrites
5. Test: `nslookup google.com 10.0.0.53`
6. Update OPNsense DHCP: change DNS from `10.0.0.55` to `10.0.0.53`
7. Wait 24 hours, confirm stability
8. Stop HA AdGuard add-on
9. Optionally re-add HA AdGuard integration pointing to `10.0.0.53` for dashboard stats
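
Before switching DHCP over in step 6, it's worth checking that the new instance answers the same names as the add-on. A sketch assuming `dig` is available where it runs; it only checks whether each resolver returns an IPv4 A record, not that the addresses match (CDN-hosted names legitimately differ). The helper names and the `DIG` override are hypothetical:

```bash
#!/usr/bin/env bash
# Compare the old (HA add-on) and new (LXC) AdGuard resolvers name by name.
set -euo pipefail

OLD_DNS="${OLD_DNS:-10.0.0.55}"   # AdGuard add-on inside HA
NEW_DNS="${NEW_DNS:-10.0.0.53}"   # AdGuard LXC
DIG="${DIG:-dig}"                 # override to stub dig out in tests

# resolves SERVER NAME -> succeeds if SERVER returns at least one IPv4 A record.
# Only address lines are kept, since dig +short can print error text on stdout.
resolves() {
  local answer
  answer="$("$DIG" +short +time=2 +tries=1 A "$2" "@$1" 2>/dev/null \
    | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || true)"
  [ -n "$answer" ]
}

# compare_resolvers NAME... -> prints a line per name where only one side answers;
# returns non-zero if any name differs
compare_resolvers() {
  local name old new mismatches=0
  for name in "$@"; do
    if resolves "$OLD_DNS" "$name"; then old=ok; else old=fail; fi
    if resolves "$NEW_DNS" "$name"; then new=ok; else new=fail; fi
    if [ "$old" != "$new" ]; then
      echo "MISMATCH $name old=$old new=$new"
      mismatches=$(( mismatches + 1 ))
    fi
  done
  [ "$mismatches" -eq 0 ]
}

# Example:
#   compare_resolvers google.com ha.hideawaygaming.com.au
```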

**NPM reverse proxy (optional):**
- Add proxy host in NPM (10.0.0.54):
- Domain: `adguard.hideawaygaming.com.au`
- Forward: `http://10.0.0.53:80`
- SSL via Let's Encrypt
### Phase 6: Migrate Guacamole to LXC (30 minutes)
See `setup-guacamole-lxc.sh` — run on the Proxmox host.
```bash
scp setup-guacamole-lxc.sh root@10.0.0.x:/root/
ssh root@10.0.0.x
# Then, on the Proxmox host (CT ID 121):
chmod +x /root/setup-guacamole-lxc.sh
/root/setup-guacamole-lxc.sh 121
```
**Post-setup migration:**
1. Access new Guacamole at `http://10.0.0.52:8080/guacamole/`
2. Log in with `guacadmin` / `guacadmin`, then **change the password immediately**
3. Re-create your RDP connections (hostname, port 3389, credentials)
4. Re-create any user accounts
5. Set up NPM reverse proxy with WebSocket support
6. Test all RDP connections
7. Stop HA Guacamole add-on
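
A quick smoke test for steps 1 and 6, meant to be run from inside the Guacamole LXC so it exercises the same network path the Guacamole server uses. A sketch assuming bash with `/dev/tcp` support, coreutils `timeout` and `curl`; the RDP host names are your own, and the helper names are hypothetical. An open port 3389 only shows the host is reachable, not that the credentials work:

```bash
#!/usr/bin/env bash
# Smoke-test the Guacamole LXC and its RDP targets; run it inside the LXC.
set -euo pipefail

GUAC_URL="${GUAC_URL:-http://10.0.0.52:8080/guacamole/}"

# port_open HOST PORT -> succeeds if a TCP connection opens within 3 seconds
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# guac_up -> succeeds if the Guacamole web app answers with HTTP 200
guac_up() {
  [ "$(curl -s --max-time 5 -o /dev/null -w '%{http_code}' "$GUAC_URL")" = "200" ]
}

# check_rdp_hosts HOST... -> one "ok"/"FAIL" line per host; non-zero if any fail.
# RDP_PORT defaults to 3389.
check_rdp_hosts() {
  local host port="${RDP_PORT:-3389}" failed=0
  for host in "$@"; do
    if port_open "$host" "$port"; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example (host names are placeholders for your own RDP machines):
#   guac_up && echo "Guacamole UI reachable"
#   check_rdp_hosts desktop-1 desktop-2
```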

**NPM reverse proxy:**
- Domain: `guac.hideawaygaming.com.au`
- Forward: `http://10.0.0.52:8080`
- Custom location: `/guacamole/`
- **Enable WebSocket support** (critical for RDP streaming)
---
## Network Architecture After Migration
```
                      Internet
                          │
                    ┌─────┴─────┐
                    │ OPNsense  │  10.0.0.254
                    │  Gateway  │
                    └─────┬─────┘
          ┌───────────────┼───────────────┐
          │               │               │
     ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
     │ AdGuard │     │   NPM   │     │  HAOS   │
     │  (LXC)  │     │  (LXC)  │     │  (VM)   │
     │  .0.53  │     │  .0.54  │     │  .0.55  │
     │ DNS 53  │     │ HTTP/S  │     │ HA only │
     └─────────┘     └─────────┘     └────┬────┘
                          ┌───────────────┤
                          │               │
                     ┌────┴────┐     ┌────┴────┐
                     │  Guac   │     │Tailscale│
                     │  (LXC)  │     │(remains │
                     │  .0.52  │     │ in HA)  │
                     │ RDP GW  │     └─────────┘
                     └─────────┘
```
Tailscale stays in HA since it's lightweight and tightly integrated with HA's remote access. AdGuard and Guacamole are now independent, so HA can restart without taking down DNS or RDP access.

---
## Expected Results
| Metric | Before | After |
|--------|--------|-------|
| HA Memory | 87% (4.45 GB) | ~50-60% (with 8 GB allocated) |
| Entities | 2,330 (775 unavailable) | ~1,800 (fewer stale) |
| State changes/hr | ~1,007 | ~300-400 |
| Recorder writes/hr | ~1,007 | ~200-300 (excludes applied) |
| DNS failure on HA crash | Yes | No (independent LXC) |
| RDP failure on HA crash | Yes | No (independent LXC) |
| Guacamole CPU in HA | 25% | 0% (moved out) |
| Guacamole RAM in HA | 9% | 0% (moved out) |
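
The HA Memory row assumes the working set stays at about 4.45 GB once the VM has 8 GB. The arithmetic as a sketch (it ignores any extra page cache the guest may use once it has more RAM; `usage_pct` is a hypothetical helper):

```bash
#!/usr/bin/env bash
# Same ~4.45 GB working set, before and after the resize.
set -euo pipefail

# usage_pct USED_GB ALLOC_GB -> USED as a percentage of ALLOC, one decimal place
usage_pct() { awk -v u="$1" -v a="$2" 'BEGIN { printf "%.1f\n", 100 * u / a }'; }

# 4.45 GB at 87% implies the current allocation (~5.1 GB).
echo "current allocation ~ $(awk 'BEGIN { printf "%.1f\n", 4.45 / 0.87 }') GB"
# With 8 GB allocated the same working set is ~55.6%.
echo "usage at 8 GB: $(usage_pct 4.45 8)%"
```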
---
## Files in This Repository
| File | Purpose |
|------|---------|
| `recorder_exclude.yaml` | Recorder exclude config — merge into `configuration.yaml` |
| `setup-adguard-lxc.sh` | Proxmox script to create AdGuard Home LXC |
| `setup-guacamole-lxc.sh` | Proxmox script to create Guacamole LXC |
| `README.md` | This file |