How PingPuffin's Monitoring System Works
Version: 1.0
Last updated: November 21, 2024
Table of Contents
- Overview
- Check Intervals and Frequency
- Error Detection
- Recovery and Status Changes
- Manual Updates
- Protection Against Monitor Failures
- Notification System
- Automatic Dashboard Updates
- Data Collection and Storage
- Technical Specifications
- Privacy and Security
- System Reliability
- Common Scenarios and Examples
Overview
PingPuffin monitors HTTP and HTTPS endpoints 24/7 so you know as soon as possible when your websites become unavailable. Our system is built with a focus on reliability, precision, and transparency.
Key Features
- ✅ Automatic checks every 5 minutes via cron job
- ✅ 2-check verification to avoid false alarms
- ✅ 24/7/365 monitoring without breaks
- ✅ Instant recovery when your site is back
- ✅ Manual updates for quick verification
- ✅ Protection against internal errors in the monitor system
Why Transparency Matters
We believe in openness about how our monitoring works. This document explains exactly how we detect downtime, how we avoid false alarms, and how we ensure you get notified as quickly as possible when there's a real problem.
Check Intervals and Frequency
Automatic Checks
All active monitors are checked automatically every 5 minutes.
Cron schedule:
*/5 * * * *
This means:
- Checks run at 00:00, 00:05, 00:10, 00:15, ... (every fifth minute of the hour)
- No breaks, no weekends, no holidays
- All monitors checked in parallel for efficiency
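As a rough illustration of the schedule above, the following sketch computes when the next automatic check falls for a given moment. The function name and constant are made up for this example and are not part of PingPuffin's code.

```python
from datetime import datetime, timedelta

CHECK_INTERVAL_MINUTES = 5  # taken from the */5 cron schedule


def next_check_time(now: datetime) -> datetime:
    """Return the next time matching */5 * * * * that is strictly after `now`."""
    # Drop seconds and microseconds, then step forward to the next 5-minute mark.
    floored = now.replace(second=0, microsecond=0)
    minutes_past = floored.minute % CHECK_INTERVAL_MINUTES
    return floored + timedelta(minutes=CHECK_INTERVAL_MINUTES - minutes_past)
```

For example, a problem that starts at 00:02:30 is first seen by the check at 00:05.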
Manual Updates
You can always trigger an instant check via the "Update now" button in your dashboard:
- Result shown immediately
- Bypasses 2-check threshold for quick feedback
- Useful after deployments or configuration changes
Coverage
- Availability: 24/7/365
- Parallel checks: All monitors checked simultaneously
- Timeout: 30 seconds by default (configurable)
- Maximum redirects: Up to 5 redirects followed per check
Error Detection
2-Check Verification System
To avoid false alarms, we require 2 consecutive failures before marking a site as down.
How It Works
First Failure (00:00):
- Failure counter set to 1
- Status remains unchanged (e.g., UP)
- No notification sent
- System logs failure for internal monitoring
Second Failure (00:05):
- Failure counter updated to 2
- Status changes to DOWN
- Incident created automatically
- Notification sent to all configured channels
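Total time: about 5 minutes from the first failed check to DOWN status, or roughly 5-10 minutes from the moment the site actually goes down, depending on where the outage falls between checks.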
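The counter logic described above can be sketched as a small state machine. This is a simplified illustration, not PingPuffin's actual implementation: the names MonitorState and apply_automatic_check are hypothetical, and the WARNING and REDIRECT statuses are left out.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 2  # consecutive failures required before marking DOWN


@dataclass
class MonitorState:
    status: str = "UP"
    failure_count: int = 0


def apply_automatic_check(state: MonitorState, check_ok: bool) -> bool:
    """Update `state` after one automatic check.

    Returns True when the status changed, i.e. when a notification is due.
    """
    previous_status = state.status
    if check_ok:
        # A single success resets the counter and restores UP.
        state.failure_count = 0
        state.status = "UP"
    else:
        state.failure_count += 1
        if state.failure_count >= FAILURE_THRESHOLD:
            state.status = "DOWN"
    return state.status != previous_status
```

With this sketch, the first failure leaves the status at UP with a counter of 1, and only the second consecutive failure flips the status to DOWN and signals a notification.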
Why 2 Checks?
Transient network problems (DNS blips, brief timeouts, temporary server errors) occur even on stable sites. By requiring 2 failures:
- ✅ We eliminate false alarms from brief problems
- ✅ We confirm there's a real problem
- ✅ We improve user trust in notifications
What Counts as a Failure?
The following situations are marked as failures:
HTTP Error Codes
- 4xx Client Errors: 400, 403, 404, 405, etc. (unless explicitly allowed)
- 5xx Server Errors: 500, 502, 503, 504, etc.
Network Errors
- Timeout: No response within timeout period (default: 30 seconds)
- Connection Refused: Server actively rejects connections
- DNS Failure: Cannot resolve domain name
- Network Unreachable: Host not available on network
SSL/TLS Errors
- Invalid Certificate: Certificate is invalid
- Expired Certificate: Certificate has expired
- Untrusted Certificate: Certificate not from trusted CA
- Hostname Mismatch: Certificate hostname doesn't match URL
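To make the network and SSL/TLS categories above more concrete, here is a sketch that maps Python's standard exceptions to those categories. The category strings and the function name are assumptions for illustration; PingPuffin's own checker may classify errors differently.

```python
import socket
import ssl


def classify_error(exc: BaseException) -> str:
    """Map a low-level exception to one of the failure categories listed above."""
    # Order matters: the SSL, DNS, timeout and refusal errors are all OSError subclasses.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "ssl_certificate_error"  # invalid, expired, untrusted, or hostname mismatch
    if isinstance(exc, ssl.SSLError):
        return "ssl_error"
    if isinstance(exc, socket.gaierror):
        return "dns_failure"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(exc, OSError):
        return "network_error"  # e.g. host or network unreachable
    return "unknown"
```

Whatever the category, each of these results counts as one failure toward the 2-check threshold.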
Handling Different Error Types
Important: ALL error types count toward the failure counter.
Example:
Check 1: HTTP 500 → Failure counter: 1, Status: UP (waiting for confirmation)
Check 2: Timeout → Failure counter: 2, Status: DOWN (confirmed failure)
Rationale:
- If a server switches between different error types, it indicates instability
- It's not less serious if the error type changes
- Any failure means the site is not functioning correctly
Recovery and Status Changes
Instant Recovery
When your site comes back, we react immediately – no 2-check threshold for recovery.
Recovery flow:
Status: DOWN
Check 1: Site responds with HTTP 200 → Failure counter reset, Status: UP
Result: Instant recovery, notification sent
Why instant recovery?
- ✅ Users want quick feedback when their site is back
- ✅ No reason to wait for confirmation that something works
- ✅ Best practice in monitoring
- ✅ Reduces worry and waiting time
Status States
🟢 UP (Online)
Meaning:
- Site responds with expected status code (typically 200-399)
- Response time within acceptable range
- No errors detected
Notifications:
- Sent on recovery from DOWN status
🔴 DOWN (Offline)
Meaning:
- Site failed 2+ consecutive checks
- Incident created and tracked
- All configured notification channels alerted
Duration:
- Recorded from first DOWN check to recovery
- Shown in incident log with precise duration
🟡 WARNING (Warning)
Meaning:
- Site responds but with warnings
- Examples: Slow response time, Cloudflare challenge detected
- Monitoring continues normally
Notifications:
- Can be configured per user
🔵 REDIRECT (Redirect)
Meaning:
- Permanent redirect (301) detected to another URL
- Site is functional but URL has changed
- You can choose to update URL or continue monitoring original
Notifications:
- Can be configured per user
Cloudflare-Protected Sites
Automatic Handling:
PingPuffin automatically handles sites behind Cloudflare based on a fundamental principle in uptime monitoring:
The Principle: If the server responds with an HTTP status code (e.g., 403), it means the server is online. Cloudflare's protection blocks our monitoring requests, but this doesn't mean your site is down.
- HTTP 403 Forbidden: If your site returns 403, but the server actually responds (no timeout), this is automatically detected as Cloudflare protection and marked as "Problematic" (warning status) instead of "Down". This is because 403 typically indicates Cloudflare bot protection, and the server is actually online (it's responding).
- HTTP 503 Service Unavailable: 503 is only treated as "Problematic" if Cloudflare is actually detected (headers, body patterns, or short response < 1200 bytes). If no Cloudflare detection, 503 is treated normally (may be real downtime).
Why "Problematic" instead of "Down"?
If the server responds with an HTTP status code, it means the server is online. Uptime monitoring is about availability - if the server responds, it's available, even if there's protection active. Therefore, it's marked as "Problematic" to indicate that there's active protection, not real downtime.
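The 403 and 503 rules above could look roughly like the sketch below. The 1200-byte limit comes from this document; the specific markers checked (a cf-ray header, a "cloudflare" Server header, the word "cloudflare" in the body) are assumptions made for this example, as are the function names and return values.

```python
from typing import Mapping

SHORT_BODY_LIMIT = 1200  # bytes, from the 503 rule above


def looks_like_cloudflare(headers: Mapping[str, str], body: bytes) -> bool:
    """Heuristic detection; the header and body markers are assumptions."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    if "cf-ray" in lowered or lowered.get("server") == "cloudflare":
        return True
    if b"cloudflare" in body.lower():
        return True
    return len(body) < SHORT_BODY_LIMIT


def classify_protected_response(status: int, headers: Mapping[str, str], body: bytes) -> str:
    """Return "PROBLEMATIC", "FAILURE", or "UNHANDLED" for other status codes."""
    if status == 403:
        # The server answered, so it is treated as online but protected.
        return "PROBLEMATIC"
    if status == 503:
        return "PROBLEMATIC" if looks_like_cloudflare(headers, body) else "FAILURE"
    return "UNHANDLED"
```

A 503 with a long body and no Cloudflare markers is returned as "FAILURE" and would go through the normal 2-check logic.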
Optional: Configure Cloudflare for Better Monitoring
If you want to avoid "Problematic" status for your Cloudflare-protected site, you can:
- WAF Rules: Create a rule that allows requests from PingPuffin's User-Agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
- Browser Integrity Check: Consider disabling this for monitoring IPs
- Rate Limiting: Adjust rate limiting so monitoring requests aren't blocked
PingPuffin's Monitoring IP: 188.245.198.146 (for whitelisting if needed)
Manual Updates
User-Initiated Checks
You can always trigger an instant check via the "Update now" button in the dashboard.
Behavior:
- Bypass 2-check threshold: Result shown and applied immediately
- Instant status update: If status changes, it updates immediately
- Notification: Sent if status changes
- Failure counter updated: Counter updated based on result
Use Cases
- ✅ Quick verification after deployment of fixes
- ✅ Testing new monitor configuration
- ✅ Instant status check without waiting for cron
- ✅ Debugging connection problems
Example:
00:00 - Automatic check fails → Failure counter: 1, Status: UP
00:02 - You click "Update now" → Check fails → Status: DOWN instantly
Result: Manual check skips 2-check threshold for quick feedback
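For comparison with the automatic flow, a manual check can be sketched as below. It is an illustration only: the function name is hypothetical, and incrementing the counter by one on failure is an assumption about what "counter updated based on result" means.

```python
def apply_manual_check(status: str, failure_count: int, check_ok: bool):
    """Return (new_status, new_failure_count, notify) for a manual check."""
    if check_ok:
        new_status, new_count = "UP", 0
    else:
        # No threshold: a single failed manual check is applied right away.
        new_status, new_count = "DOWN", failure_count + 1
    return new_status, new_count, new_status != status
```

Applied to the example above, apply_manual_check("UP", 1, False) yields ("DOWN", 2, True): the status flips at 00:02 and a notification is sent.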
Protection Against Monitor Failures
Internal vs. External Errors
It's critical to distinguish between errors in your site and errors in our monitor system.
Site Errors (Monitored)
These errors from your site count as failures:
- ✅ HTTP 500 from target site → Counts as failure
- ✅ Timeout connecting to target site → Counts as failure
- ✅ DNS error for target domain → Counts as failure
Monitor System Errors (Protected)
These errors in PingPuffin's code do NOT mark your site as down:
- ❌ PHP exception in PingPuffin code → Does NOT mark site as down
- ❌ Database connection error → Does NOT mark site as down
- ❌ Internal logic error → Does NOT mark site as down
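One way to picture this separation is a wrapper that treats any unexpected exception as a monitor-side problem. The sketch is written in Python for illustration and is not PingPuffin's code. It assumes a hypothetical check callable that already converts target-site errors (HTTP 500, timeout, DNS) into a False return value, so that anything raised must come from the monitor itself.

```python
def run_check_safely(check, on_result, on_internal_error) -> str:
    """Run one check and keep monitor-side errors away from the site status."""
    try:
        site_ok = check()
    except Exception as exc:
        # A bug or outage on the monitor side: report it, leave the status untouched.
        on_internal_error(exc)
        return "internal_error"
    # Only a completed check is allowed to influence the failure counter.
    on_result(site_ok)
    return "ok" if site_ok else "site_failure"
```

Only the "site_failure" path ever touches the failure counter; the "internal_error" path goes to logging and administrator alerting instead.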
Administrator Alerting
When the monitor system fails:
Logging:
- Critical errors logged with full stack trace
- Timestamp and monitor ID recorded
- All details saved for debugging
Email Alarm:
- Email sent to system administrator
- Contains error message, stack trace, and context
- Rate-limited: Maximum 1 email per hour per unique error
- Prevents email flooding during system problems
Database:
- Status remains unchanged (no false downs)
- No incident created
- Users not affected
Example:
Monitor checker runs
→ Internal error detected in monitor system
→ Error logged with full context
→ Email sent to system administrator
→ Database NOT updated
→ Your site status remains unchanged
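The "maximum 1 email per hour per unique error" rule can be sketched as a small rate limiter. Using a hash of the error message as the key for "unique error" is an assumption, and the class name is hypothetical.

```python
import hashlib
import time

ALERT_INTERVAL_SECONDS = 3600  # at most one email per hour per unique error


class AlertRateLimiter:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._last_sent = {}  # error fingerprint -> timestamp of the last email

    def should_send(self, error_message: str) -> bool:
        """Return True if an alert for this error may be emailed now."""
        key = hashlib.sha256(error_message.encode("utf-8")).hexdigest()
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < ALERT_INTERVAL_SECONDS:
            return False
        self._last_sent[key] = now
        return True
```

The clock is injectable so the behavior can be tested without waiting an hour; a different error message gets its own one-hour window.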
Notification System
When Notifications Are Sent
Automatic Checks
- Status DOWN: After 2 consecutive failures are confirmed (about 5 minutes after the first failed check, roughly 5-10 minutes after the outage begins)
- Status UP: Instantly when site comes back from DOWN
- Other changes: Configurable per user (redirect, warning, etc.)
Manual Checks
- Instant notification: If status changes on manual update
- No delay: Bypasses 2-check threshold
Notification Channels
Email
- Direct email notifications
- Contains: Site name, status, error message, timestamp
- Contains link to dashboard for more details
Slack
- Message to configured channel or DM
- Formatted with colors based on status (red=DOWN, green=UP)
- Includes direct link to monitor
Webhook
- POST request to custom endpoint
- JSON payload with all details
- Status code, response time, error message included
- Useful for integration with other systems
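To give an idea of what such a webhook body might contain, the sketch below builds a JSON payload from the details listed above. The field names are hypothetical and chosen only for this example; check the actual payload you receive before writing an integration against it.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_webhook_payload(
    monitor_name: str,
    url: str,
    status: str,
    status_code: Optional[int],
    response_time_ms: Optional[int],
    error_message: Optional[str],
    checked_at: datetime,
) -> str:
    """Serialize one status change as a JSON string (field names are hypothetical)."""
    payload = {
        "monitor": monitor_name,
        "url": url,
        "status": status,
        "status_code": status_code,  # None when no HTTP response was received
        "response_time_ms": response_time_ms,
        "error_message": error_message,
        "checked_at": checked_at.astimezone(timezone.utc).isoformat(),
    }
    return json.dumps(payload)
```

A receiving system would parse this body from the POST request and branch on the status field.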
Notification Snoozing
You can temporarily disable notifications (snooze) for 24 hours:
During Snooze:
- ✅ Monitoring continues normally
- ✅ Status updates in dashboard
- ❌ No notifications sent to any channel
- ⏰ Automatically un-snoozed after 24 hours
Use Cases:
- Planned maintenance
- Known issue during rollout
- Temporary shutdown
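The snooze window can be expressed as a simple time comparison. This is a sketch under the assumption that the system stores the moment the snooze was activated; the names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

SNOOZE_DURATION = timedelta(hours=24)


def is_snoozed(snoozed_at: Optional[datetime], now: datetime) -> bool:
    """Return True while notifications should be suppressed."""
    if snoozed_at is None:
        return False
    # The snooze expires on its own once 24 hours have passed.
    return now - snoozed_at < SNOOZE_DURATION
```

Checks and status updates would run regardless of this value; only the notification step consults it.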
Automatic Dashboard Updates
Auto-Refresh Mechanism
Your dashboard updates automatically every 30 seconds without reloading the page.
Technical:
- Dashboard updates automatically every 30 seconds
- Fetches latest data from database via secure API calls
- Does NOT trigger new checks (read-only)
- Lightweight calls for quick updates
What Gets Updated?
The dashboard always shows the most recently stored check data:
- 🎨 Status indicator: Colored badge (green/red/yellow/blue)
- ⏰ Last checked: Precise timestamp of last check
- ⚡ Response time: Response time in milliseconds
- 🔢 Failure counter: Number of consecutive failures
- 📊 Incident info: Active incidents and duration
Important: Auto-refresh never interferes with the 2-check logic, because it only reads stored check results and does not trigger new checks.
Data Collection and Storage
Check Records
Every single check is saved in the database with the following information:
- Unique ID for check
- Reference to monitor
- Timestamp (when check was performed)
- HTTP status code or error type
- Response time in milliseconds
- Success/failure status
- Error message (if relevant)
- Redirect information (if relevant)
- SSL error details (if relevant)
Usage:
- Uptime percentage calculations
- Historical graphs and reports
- Error analysis and debugging
- Performance tracking over time
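As an illustration of how check records feed an uptime percentage, the sketch below models a record with a subset of the fields listed above and computes uptime as the share of successful checks. Both the field names and the check-count formula are assumptions; PingPuffin may weight by time instead.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class CheckRecord:
    monitor_id: int
    checked_at: datetime
    status_code: Optional[int]       # None when no HTTP response was received
    response_time_ms: Optional[int]
    success: bool
    error_message: Optional[str] = None


def uptime_percentage(records: Sequence[CheckRecord]) -> Optional[float]:
    """Share of successful checks in percent, or None when there is no data."""
    if not records:
        return None
    successes = sum(1 for record in records if record.success)
    return successes / len(records) * 100
```

With three successful checks and one failed check, this gives 75.0.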
Incident Tracking
When status changes to DOWN, an incident is created with the following data:
- Unique ID for incident
- Reference to monitor
- Start timestamp
- End timestamp (when resolved)
- Total duration
Features:
- Automatic creation on DOWN status
- Continuous duration calculation
- Automatic resolution on recovery
- Complete history maintained
- Exportable to CSV
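A minimal sketch of an incident record with a continuously calculated duration might look like this. The class and method names are hypothetical; only the fields themselves come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    monitor_id: int
    started_at: datetime
    ended_at: Optional[datetime] = None  # stays None while the incident is open

    def resolve(self, at: datetime) -> None:
        """Close the incident when the site recovers."""
        self.ended_at = at

    def duration(self, now: datetime) -> timedelta:
        """Duration so far for open incidents, final duration once resolved."""
        return (self.ended_at or now) - self.started_at
```

An incident opened at 00:00 and resolved at 00:05 reports a duration of 5 minutes no matter when it is queried afterwards.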
Activity Log
All significant events are logged:
- ✅ Status changes (UP → DOWN, DOWN → UP, etc.)
- ✅ Manual checks performed by users
- ✅ Configuration changes
- ✅ URL updates
- ✅ Metadata updates
Functionality:
- Exportable to CSV format
- Searchable and filterable
- Shows details for each event
- Timestamps on all entries
Technical Specifications
HTTP Request Parameters
When PingPuffin checks your site, the following request is sent:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Accept: */*
Connection: close
[Optional: Authorization header for Basic Auth]
Note: We use a standard Chrome user agent to avoid being blocked by sites that filter custom user agents.
Configuration:
- Timeout: Can be configured per monitor (default: 30 seconds)
- Follow Redirects: Yes, up to 5 redirects maximum
- SSL Verification: Enabled (validates certificates)
- Connection Reuse: Disabled (fresh connection for each check)
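If you want to reproduce a comparable request yourself, for example to test how your firewall treats it, the following sketch uses Python's standard urllib with the headers and limits listed above. It approximates the documented settings and is not PingPuffin's own checker; the function and class names are made up for this example.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
DEFAULT_TIMEOUT_SECONDS = 30
MAX_REDIRECTS = 5


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    # urllib follows up to 10 redirects by default; lower the limit to 5.
    max_redirections = MAX_REDIRECTS


def build_request(url: str, method: str = "GET") -> urllib.request.Request:
    """Build a request carrying the headers documented above."""
    headers = {"User-Agent": USER_AGENT, "Accept": "*/*", "Connection": "close"}
    return urllib.request.Request(url, method=method, headers=headers)


def perform_check(url: str, timeout: float = DEFAULT_TIMEOUT_SECONDS) -> int:
    """Send the request and return the final HTTP status code."""
    opener = urllib.request.build_opener(LimitedRedirectHandler())
    # urllib verifies HTTPS certificates by default and raises HTTPError on 4xx/5xx.
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.status
```

Calling perform_check("https://example.com/") sends one request; timeouts, DNS errors, and certificate errors surface as exceptions.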
Standard Expected Status Codes
Monitors expect these codes by default (can be customized):
2xx Success:
- 200 OK
- 201 Created
- 204 No Content
3xx Redirects:
- 301 Moved Permanently
- 302 Found
- 307 Temporary Redirect
- 308 Permanent Redirect
Custom:
- You can configure which status codes are acceptable for your specific monitor
- Example: Accept 401 for password-protected pages
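The default list plus per-monitor customization can be sketched as a simple membership test. The constant and function names are illustrative, and how custom codes are stored per monitor is not specified in this document.

```python
DEFAULT_EXPECTED_CODES = frozenset({200, 201, 204, 301, 302, 307, 308})


def is_expected_status(status_code: int, extra_allowed=()) -> bool:
    """Return True if the status code counts as a successful check."""
    return status_code in DEFAULT_EXPECTED_CODES or status_code in extra_allowed
```

For a password-protected page, is_expected_status(401, extra_allowed=(401,)) returns True, while the same code without the extra allowance counts as a failure.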
Response Time Measurement
What's Measured:
- DNS lookup time
- Connection time (TCP handshake)
- SSL handshake time (if HTTPS)
- Time to first byte (TTFB)
What's NOT Measured:
- Body download time (we only read headers)
- JavaScript execution time
- Asset loading time
Storage:
- Measured in milliseconds
- Saved at each check
- Used for performance tracking
- Shown in dashboard
Advanced Monitoring Settings
For advanced users, we offer:
HTTP Method
- GET: Standard method
- POST: For endpoints that require POST
Request Body
- Send JSON or form data with POST requests
- Useful for API endpoints that require specific data
Basic Authentication
- Username and password for protected endpoints
- Passwords encrypted with AES-256-CBC
- Never stored in plain text
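For reference, HTTP Basic Authentication sends the username and password as a Base64-encoded "username:password" pair in the Authorization header. The sketch below shows how such a header value is formed; it covers only the header construction, not the AES-256-CBC storage, and the function name is illustrative.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of the Authorization header for HTTP Basic Auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"
```

Because Base64 is an encoding and not encryption, Basic Auth should only be used with HTTPS endpoints.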
Future Features
- Custom headers
- Request params
- Advanced authentication (OAuth, Bearer tokens)
Server Information
Public IP Address
PingPuffin's monitoring server uses the following public IP address to perform checks:
IP Address: 188.245.198.146
If you need to whitelist PingPuffin's IP address in your firewall or server configuration, you can use this IP address.
Note: The IP address may change during server migrations or infrastructure updates. We recommend using the User-Agent header for identification instead of IP-based rules, if possible.
User-Agent Identification:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Privacy and Security
Data Encryption
Sensitive Credentials:
- All passwords (Basic Auth) encrypted on storage
- Algorithm: AES-256-CBC (industry standard)
- Key: Stored securely in environment variables
- IV: Unique initialization vector per encryption
- Decryption only happens in memory during checks
- Never exposed in logs or API responses
Data Access
Access Control:
- Check results only visible to monitor owner
- No data shared with third parties
- Activity log only exportable by owner
- Secure API endpoints with authentication
Database Security:
- Prepared statements (prevents SQL injection)
- Session-based authentication
- CSRF protection on all forms
HTTPS Enforcement
SSL/TLS:
- All monitor checks support HTTPS
- SSL certificate validation enabled
- Warns on certificate problems
- Detects expired certificates
Dashboard:
- Always accessed via HTTPS
- Secure cookies (HttpOnly, Secure flags)
- HSTS headers recommended
System Reliability
Monitor System Health
Our monitor system monitors itself:
Error Detection:
- Automatic detection of internal errors
- Full logging of all exceptions
- Stack traces for debugging
Administrator Alerts:
- Critical errors emailed to system administrator
- Rate-limited to avoid spam
- Details included for quick resolution
Automatic Recovery:
- Cron continues on errors in individual monitors
- No cascade failures across monitors
- Database transactions ensure data integrity
Uptime Goals
We aim for the following reliability:
Monitor System Uptime:
- Goal: 99.9% (less than 9 hours downtime per year)
- Monitored: Via cron log and system metrics
Check Execution Rate:
- Goal: 99.5% success rate for check execution
- Monitored: Error rate in logs
Cron Reliability:
- Monitored: Each cron execution logged
- Alerting: On missing execution
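The percentages above translate into concrete budgets with a little arithmetic. The sketch assumes a 365-day year and the 5-minute check interval; the names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24        # 8760 hours
CHECKS_PER_DAY = 24 * 60 // 5    # 288 checks per monitor per day


def allowed_downtime_hours(uptime_goal_percent: float) -> float:
    """Hours of downtime per year that still meet the uptime goal."""
    return HOURS_PER_YEAR * (100 - uptime_goal_percent) / 100


def allowed_failed_executions_per_day(success_goal_percent: float) -> float:
    """Check executions per monitor per day that may fail within the goal."""
    return CHECKS_PER_DAY * (100 - success_goal_percent) / 100
```

A 99.9% goal allows about 8.76 hours of downtime per year, which matches the "less than 9 hours" figure above, and a 99.5% execution goal allows about 1.44 failed executions per monitor per day.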
Common Scenarios and Examples
Scenario 1: Transient Network Blip
Situation: Brief network problem, site is actually up.
00:00 - Check fails (timeout) → Failure counter: 1, Status: UP
00:05 - Check succeeds → Failure counter: 0, Status: UP
Result:
✅ No notification sent
✅ No status change
✅ No false alarm
Why it works: The 2-check threshold filters out brief problems, because a single failure followed by a success never changes the status.
Scenario 2: Real Downtime
Situation: Server is really down (e.g., hosting problem).
00:00 - Check fails (HTTP 500) → Failure counter: 1, Status: UP
→ System logs first failure for internal monitoring
00:05 - Check fails (timeout) → Failure counter: 2, Status: DOWN
→ Incident created automatically
→ Notification sent via email/Slack/webhook
00:10 - Check fails (timeout) → Failure counter: 3, Status: DOWN
→ Incident duration updated continuously
Result:
✅ DOWN status confirmed at 00:05
✅ Notification sent ~5 minutes after first failure
✅ Different error types (500 + timeout) both count
Why it works: Two consecutive failures confirm real problem.
Scenario 3: Quick Recovery
Situation: Site down, comes back quickly.
00:00 - Status is DOWN (from previous failure)
00:05 - Check succeeds (HTTP 200) → Failure counter: 0, Status: UP
→ Incident marked as resolved automatically
→ Recovery notification sent
Result:
✅ Instant recovery on first successful check
✅ Incident duration calculated (00:00 to 00:05 = 5 min)
✅ You're informed quickly about recovery
Why it works: No 2-check threshold for recovery.
Scenario 4: Manual Refresh During First Failure
Situation: Automatic check failed once, user wants to verify.
00:00 - Automatic check fails → Failure counter: 1, Status: UP
→ Status remains UP (waiting for confirmation)
00:02 - You click "Update now" → Check fails → Status: DOWN immediately
→ Manual check bypasses 2-check threshold for quick feedback
→ Notification sent
Result:
✅ Manual check gives instant feedback
✅ Status updates without waiting for next automatic check
✅ Useful for debugging and verification
Why it works: Manual checks are designed for instant feedback.
Scenario 5: Monitor System Error
Situation: Internal error in PingPuffin's own code.
00:00 - Monitor system runs automatic check
→ Internal error detected
→ Error caught automatically
Logging:
→ Error logged with full context for internal monitoring
→ Full technical information saved for debugging
Administrator Alarm:
→ Email sent to system administrator (max 1 per hour)
→ Contains error details and context
Database:
→ No update to your site status
→ Your site status remains unchanged
→ Failure counter not affected
Result:
✅ Monitor error does NOT affect your site status
✅ Administrator alerted to fix problem
✅ No false DOWN status
Why it works: Distinction between monitor errors and site errors.
Scenario 6: Different Error Types Consecutively
Situation: Server unstable, different errors each time.
00:00 - Check fails (HTTP 500) → Failure counter: 1, Status: UP
00:05 - Check fails (Connection timeout) → Failure counter: 2, Status: DOWN
00:10 - Check fails (HTTP 503) → Failure counter: 3, Status: DOWN
Result:
✅ Different error types ALL count
✅ Status DOWN after 2 failures (regardless of type)
✅ Indicates unstable server (maybe worse than one consistent error)
Why it works: Any error means site not functioning correctly.
Frequently Asked Questions
How quickly do I get notified of downtime?
Automatic check: About 5 minutes after the first failed check, which is roughly 5-10 minutes after the outage begins (requires 2 failures).
Manual check: Instantly if you update manually.
Can I get false alarms?
Very rarely. The 2-check system eliminates most brief problems. If you get an alarm, there's almost always a real problem.
What if my server is temporarily slow?
If the response time exceeds the timeout (default 30 seconds), the check counts as a failure. You can increase the timeout value for your monitor.
How is planned maintenance handled?
Use the "Snooze" function to disable notifications for 24 hours. Monitoring continues, but you get no alarms.
Can I see history for all checks?
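Yes. Every check is saved as a check record and used for historical graphs and reports, and the activity log shows status changes and manual checks. The activity log can be exported to CSV.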
What happens if PingPuffin itself goes down?
Our monitors run on reliable infrastructure. On critical system errors, the administrator is alerted and your site is NOT marked as down.
Contact & Support
Have questions about how monitoring works?
📧 Email: support@pingpuffin.com
🐛 Bug reports: Via email
📚 Documentation: See documentation section for more information
Changelog
v1.0 (November 21, 2024)
- First version of documentation
- 2-check verification system implemented
- Monitor failure protection added
- Rate-limited administrator alerts
This document is updated continuously. Check "Last updated" at the top to see if there are new versions.