🔧 Replace failed disk and recreate the OSD

Follow this sequence to safely remove a failed OSD, replace the physical disk, and recreate the OSD in your Ceph cluster. Commands assume Proxmox VE with Ceph managed through pveceph, following the same conventions as your existing setup.


1) Identify the failed OSD and device

  • Check cluster health and OSD state:
    ceph -s
    ceph osd tree
    ceph osd stat
    
  • Map OSD ID to device and host:
    ceph osd find <osd-id>
    # On the host (e.g., pve-1), list disks:
    lsblk -o NAME,SIZE,MODEL,TYPE,MOUNTPOINT
    

Note the hostname and the device path (e.g., /dev/sdb) associated with the failed OSD.
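
If the mapping isn't obvious, these commands can help tie the OSD ID to its backing device (a sketch; run ceph-volume on the OSD's host):

# Show the OSD's recorded hostname and backing device(s)
ceph osd metadata <osd-id>
# On that host, list which devices back which OSDs
ceph-volume lvm list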


2) Mark the OSD out and stop its service

  • Mark OSD out so data is re-replicated elsewhere:
    ceph osd out <osd-id>
    
  • Stop the OSD daemon on the node hosting it:
    systemctl stop ceph-osd@<osd-id>
    

Wait for the cluster to begin recovering: ceph -s should show recovering/backfilling PGs.
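
To keep an eye on progress, you can refresh the status periodically or stream cluster events (a sketch):

# Refresh the cluster summary every 10 seconds
watch -n 10 ceph -s
# Or stream health/recovery events continuously (Ctrl-C to stop)
ceph -w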


3) Remove the OSD from the cluster

  • Remove CRUSH and auth entries (a one-step alternative is shown below):
    ceph osd crush remove osd.<osd-id>
    ceph auth del osd.<osd-id>
    ceph osd rm <osd-id>
    
  • Optionally verify the OSD is gone:
    ceph osd tree
    ceph -s
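
  • Alternatively, recent Ceph releases can combine the three removal commands above into one (a sketch; confirm it is available on your version):
    ceph osd purge <osd-id> --yes-i-really-mean-it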
    

4) Replace the physical disk

  • Power down the node if needed and replace the failed disk.
  • After boot, verify the new disk appears (e.g., /dev/sdb):
    lsblk -o NAME,SIZE,MODEL,TYPE
    dmesg | tail
    

5) Wipe and prepare the new disk

If the new disk contains old metadata or partitions, wipe it fully:

# Zap partitions and signatures (CAUTION: destructive)
sgdisk --zap-all /dev/sdb
wipefs -a /dev/sdb
partprobe /dev/sdb
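
If the disk previously held a Ceph OSD, ceph-volume can zap it as well, including leftover LVM volumes (a sketch; --destroy also removes the VG/LV):

ceph-volume lvm zap /dev/sdb --destroy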

6) Create the new OSD on the replacement disk

You’ve been using Proxmox’s pveceph tooling. Create the OSD with the same method:

# On the node hosting the new disk
pveceph osd create /dev/sdb

This handles preparing the disk (via ceph-volume), allocating the OSD ID, creating the keyring, and registering the OSD in CRUSH.
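
To confirm what was created on the node, ceph-volume can list the new OSD and its logical volume (a sketch; the device argument is an optional filter):

ceph-volume lvm list /dev/sdb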


7) Verify OSD daemon and CRUSH placement

  • Check service state:
    systemctl status ceph-osd@<new-id>
    
  • Verify it appears in the cluster:
    ceph osd tree
    ceph osd stat
    ceph -s
    

The new OSD should show as “up/in”. If it’s “up/out”, run ceph osd in <new-id>.


8) Reweight and allow rebalancing

  • Optionally reweight by utilization to balance data (a dry-run variant is shown below):
    ceph osd reweight-by-utilization
    
  • Monitor backfill/recovery:
    ceph -s
    ceph health detail
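
  • To preview what reweight-by-utilization would change without applying it (a sketch):
    ceph osd test-reweight-by-utilization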
    

9) Post-replacement checks

  • Disk health (SMART):
    smartctl -a /dev/sdb
    
  • Pool capacity and PGs:
    ceph df
    ceph pg stat
    
  • Logs on the node:
    journalctl -u ceph-osd@<new-id> -f
    

Notes and tips

  • If pveceph osd create fails, check for lingering partitions or LVM on the device; ensure it’s fully wiped before retrying.
  • Keep the Ceph network healthy; OSD replacement triggers heavy traffic during rebalancing.
  • For consistent performance, match SSD models and sizes across nodes when possible.
  • If you used dedicated DB/WAL devices previously, recreate the OSD with the same layout (pveceph supports passing devices for advanced setups; see the sketch below).
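
For example, a sketch of recreating an OSD with a separate DB device (the option name and the NVMe path /dev/nvme0n1 are assumptions; check man pveceph on your version):

pveceph osd create /dev/sdb --db_dev /dev/nvme0n1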

Direct answer: mark the failed OSD out, stop and remove it from Ceph, replace the disk, wipe it, recreate the OSD with pveceph osd create /dev/sdb, verify it’s up/in, and monitor rebalancing until health is OK.