# K3s Cluster Expansion Guide

## 📋 Table of Contents

- [Current Cluster Status](#current-cluster-status)
- [Prerequisites](#prerequisites)
- [Architecture Options](#architecture-options)
- [2-Node Expansion (1 Master + 2 Workers)](#2-node-expansion-1-master--2-workers)
- [4-Node Expansion (3 Masters + 4 Workers)](#4-node-expansion-3-masters--4-workers)
- [6-Node Expansion (3 Masters + 6 Workers)](#6-node-expansion-3-masters--6-workers)
- [Joining Nodes](#joining-nodes)
- [High Availability Configuration](#high-availability-configuration)
- [Storage Configuration](#storage-configuration)
- [Verification and Testing](#verification-and-testing)
- [Troubleshooting](#troubleshooting)

---

## 📊 Current Cluster Status

```
Master node:  vmus9
IP address:   134.195.210.237
k3s version:  v1.34.3+k3s1
Node token:   K109d35a131f48b4d40b162398a828b766d60735f29dd7b4a37b030c1d1c0e26b23::server:72e04c3a9e3e762cbdefffc96f348a2d
```
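
If you ever need to re-derive these values, all of them are available on the master itself (the token path below is the standard k3s server location, and is the same file used in the join steps later in this guide):

```bash
# On the master node (vmus9)
k3s --version                                    # k3s version
sudo cat /var/lib/rancher/k3s/server/node-token  # join token
hostname -I                                      # node IP addresses
```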

**Important**: Store the node token securely; it is the credential every other node needs to join the cluster!

---

## ✅ Prerequisites

### All new nodes must satisfy:

#### 1. Hardware requirements
```
Minimum:
- CPU: 2 cores
- RAM: 2 GB (4 GB+ recommended)
- Disk: 20 GB (50 GB+ recommended for Longhorn storage)

Recommended:
- CPU: 4 cores
- RAM: 8 GB
- Disk: 100 GB SSD
```

#### 2. Operating system
```bash
# Supported distributions
- Ubuntu 20.04/22.04/24.04
- Debian 10/11/12
- CentOS 7/8
- RHEL 7/8

# Check your release
cat /etc/os-release
```

#### 3. Network requirements
```bash
# All nodes must be able to reach each other.
# Ports that must be open:

Master nodes:
- 6443:        Kubernetes API server
- 10250:       Kubelet metrics
- 2379-2380:   etcd (HA mode only)

Worker nodes:
- 10250:       Kubelet metrics
- 30000-32767: NodePort services

All nodes:
- 8472:  Flannel VXLAN (UDP)
- 51820: Flannel WireGuard (UDP)
```

#### 4. System preparation
Run on every new node:

```bash
# 1. Update the system
sudo apt update && sudo apt upgrade -y

# 2. Disable swap (required by Kubernetes)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# 3. Set the hostname (unique per node)
sudo hostnamectl set-hostname worker-node-1

# 4. Enable time synchronization
sudo apt install -y chrony
sudo systemctl enable --now chrony

# 5. Install required tools
sudo apt install -y curl wget git

# 6. Open firewall ports (if a firewall is enabled)
# Ubuntu/Debian
sudo ufw allow 6443/tcp
sudo ufw allow 10250/tcp
sudo ufw allow 8472/udp
sudo ufw allow 51820/udp
```

---

## 🏗️ Architecture Options

### Option 1: 2-Node Expansion (1 Master + 2 Workers)

**Use case**: development/test environments, small applications

```
┌─────────────────────────────────────────────────┐
│             Load balancer (optional)            │
│             *.u9.net3w.com (Traefik)            │
└─────────────────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
┌───────▼──────┐ ┌────▼─────┐ ┌────▼─────┐
│    Master    │ │ Worker-1 │ │ Worker-2 │
│    vmus9     │ │          │ │          │
│ Control Plane│ │ App load │ │ App load │
│    + etcd    │ │          │ │          │
│ 134.195.x.x  │ │New node 1│ │New node 2│
└──────────────┘ └──────────┘ └──────────┘
```

**Characteristics**:
- ✅ Simple to operate
- ✅ Low cost
- ❌ Master is a single point of failure
- ❌ Not suitable for production

**Suggested resource allocation**:
- Master: 4C8G (control plane plus some application workload)
- Worker-1: 4C8G (application workload)
- Worker-2: 4C8G (application workload)

---

### Option 2: 4-Node Expansion (3 Masters + 4 Workers)

**Use case**: production environments, medium-scale applications

```
┌──────────────────────────────────────────────────┐
│         External load balancer (required)        │
│          HAProxy/Nginx/cloud provider LB         │
│                  *.u9.net3w.com                  │
└──────────────────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┬─────────────┐
        │             │             │             │
┌───────▼──────┐ ┌────▼─────┐ ┌────▼─────┐ ┌─────▼────┐
│   Master-1   │ │ Master-2 │ │ Master-3 │ │ Worker-1 │
│    vmus9     │ │          │ │          │ │          │
│ Control Plane│ │ Control  │ │ Control  │ │ App load │
│    + etcd    │ │  + etcd  │ │  + etcd  │ │          │
└──────────────┘ └──────────┘ └──────────┘ └──────────┘
                                           ┌──────────┐
                                           │ Worker-2 │
                                           │ App load │
                                           └──────────┘
                                           ┌──────────┐
                                           │ Worker-3 │
                                           │ App load │
                                           └──────────┘
                                           ┌──────────┐
                                           │ Worker-4 │
                                           │ App load │
                                           └──────────┘
```

**Characteristics**:
- ✅ High availability (HA)
- ✅ Redundant master nodes
- ✅ Production ready
- ✅ Handles medium-scale workloads
- ⚠️ Requires an external load balancer

**Suggested resource allocation**:
- Master-1/2/3: 4C8G (control plane only)
- Worker-1/2/3/4: 8C16G (application workload)

**etcd cluster**: the 3 master nodes form the etcd cluster. With quorum = floor(3/2) + 1 = 2, it tolerates the loss of 1 master node.

---

### Option 3: 6-Node Expansion (3 Masters + 6 Workers)

**Use case**: large-scale production environments, high-load applications

```
┌──────────────────────────────────────────────────┐
│         External load balancer (required)        │
│          HAProxy/Nginx/cloud provider LB         │
│                  *.u9.net3w.com                  │
└──────────────────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┬─────────────┐
        │             │             │             │
┌───────▼──────┐ ┌────▼─────┐ ┌────▼─────┐        │
│   Master-1   │ │ Master-2 │ │ Master-3 │        │
│    vmus9     │ │          │ │          │        │
│ Control Plane│ │ Control  │ │ Control  │        │
│    + etcd    │ │  + etcd  │ │  + etcd  │        │
└──────────────┘ └──────────┘ └──────────┘        │
                                                  │
        ┌─────────────┬─────────────┬─────────────┘
        │             │             │
┌───────▼──────┐ ┌────▼─────┐ ┌────▼─────┐
│   Worker-1   │ │ Worker-2 │ │ Worker-3 │
│   Web tier   │ │ Web tier │ │ Web tier │
└──────────────┘ └──────────┘ └──────────┘
┌──────────────┐ ┌──────────┐ ┌──────────┐
│   Worker-4   │ │ Worker-5 │ │ Worker-6 │
│Database tier │ │  Cache   │ │ Storage  │
└──────────────┘ └──────────┘ └──────────┘
```

**Characteristics**:
- ✅ High availability and high performance
- ✅ Workloads can be split by tier
- ✅ Supports large-scale applications
- ✅ Best Longhorn storage performance
- ⚠️ Higher operational complexity
- ⚠️ Higher cost

**Suggested resource allocation**:
- Master-1/2/3: 4C8G (dedicated control plane)
- Worker-1/2/3: 8C16G (web tier)
- Worker-4: 8C32G (database tier, extra memory)
- Worker-5: 8C16G (cache tier)
- Worker-6: 4C8G + 200GB SSD (storage tier)

**Node labeling strategy** (a scheduling sketch follows the block):
```bash
# Web tier
kubectl label nodes worker-1 node-role=web
kubectl label nodes worker-2 node-role=web
kubectl label nodes worker-3 node-role=web

# Database tier
kubectl label nodes worker-4 node-role=database

# Cache tier
kubectl label nodes worker-5 node-role=cache

# Storage tier
kubectl label nodes worker-6 node-role=storage
```
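
Labels only take effect once workloads select them. A minimal sketch of how a Deployment pins itself to the database tier via `nodeSelector` (the name `pg-demo` and the image are illustrative, not part of this cluster):

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pg-demo                 # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pg-demo
  template:
    metadata:
      labels:
        app: pg-demo
    spec:
      nodeSelector:
        node-role: database     # matches the label applied to worker-4 above
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          value: changeme
EOF
```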

---

## 🚀 Joining Nodes

### Scenario A: Join worker nodes (for the 2-node plan)

#### Run on each new node:

```bash
# 1. Point at the master node
export MASTER_IP="134.195.210.237"
export NODE_TOKEN="K109d35a131f48b4d40b162398a828b766d60735f29dd7b4a37b030c1d1c0e26b23::server:72e04c3a9e3e762cbdefffc96f348a2d"

# 2. Install the k3s agent (worker role)
curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_IP}:6443 \
    K3S_TOKEN=${NODE_TOKEN} \
    sh -

# 3. Verify the service
sudo systemctl status k3s-agent

# 4. Confirm the node joined
# (run on the master node)
kubectl get nodes
```

#### Label the worker node:

```bash
# Run on the master node
kubectl label nodes <worker-node-name> node-role.kubernetes.io/worker=worker
kubectl label nodes <worker-node-name> workload=application
```

---

### Scenario B: Join master nodes (for the 4/6-node HA plans)

#### Prerequisite: an external load balancer

##### 1. Configure the external load balancer

**Option 1: HAProxy**

Install HAProxy on a separate server:

```bash
# Install HAProxy
sudo apt install -y haproxy

# Configure HAProxy
sudo tee /etc/haproxy/haproxy.cfg > /dev/null <<EOF
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode tcp
    option tcplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend k3s-api
    bind *:6443
    mode tcp
    default_backend k3s-masters

backend k3s-masters
    mode tcp
    balance roundrobin
    option tcp-check
    server master-1 134.195.210.237:6443 check fall 3 rise 2
    server master-2 <MASTER-2-IP>:6443 check fall 3 rise 2
    server master-3 <MASTER-3-IP>:6443 check fall 3 rise 2
EOF

# Restart HAProxy
sudo systemctl restart haproxy
sudo systemctl enable haproxy
```

**Option 2: Nginx**

```bash
# Install Nginx
sudo apt install -y nginx

# Configure an Nginx stream proxy.
# Note: this overwrites nginx.conf, so the file must be self-contained:
# nginx requires an events block, and on Debian/Ubuntu the stream module
# is built as a dynamic module that must be loaded explicitly.
sudo tee /etc/nginx/nginx.conf > /dev/null <<EOF
load_module /usr/lib/nginx/modules/ngx_stream_module.so;

events {}

stream {
    upstream k3s_servers {
        server 134.195.210.237:6443 max_fails=3 fail_timeout=5s;
        server <MASTER-2-IP>:6443 max_fails=3 fail_timeout=5s;
        server <MASTER-3-IP>:6443 max_fails=3 fail_timeout=5s;
    }

    server {
        listen 6443;
        proxy_pass k3s_servers;
    }
}
EOF

# Restart Nginx
sudo systemctl restart nginx
```
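
Whichever option you choose, a quick sanity check is to hit the API port through the balancer. Any HTTP status proves the TCP path works; anonymous requests typically get a 401/403, which is fine here. `<LB_IP>` is your balancer's address:

```bash
# Expect an HTTP status code (often 401/403 for anonymous callers), not a timeout
curl -ks -o /dev/null -w '%{http_code}\n' https://<LB_IP>:6443/healthz
```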

##### 2. Enable HA on the first master node (the current node)

```bash
# Run on the current master node
export LB_IP="<load-balancer-IP>"

# Re-run the installer in HA mode (--cluster-init switches the datastore
# from SQLite to embedded etcd; existing cluster data is migrated)
curl -sfL https://get.k3s.io | sh -s - server \
    --cluster-init \
    --tls-san=${LB_IP} \
    --write-kubeconfig-mode 644

# Fetch the (new) token
sudo cat /var/lib/rancher/k3s/server/node-token
```

##### 3. Join the second master node

```bash
# Run on the new master node
export MASTER_IP="134.195.210.237"  # the first master
export LB_IP="<load-balancer-IP>"
export NODE_TOKEN="<new token>"

curl -sfL https://get.k3s.io | sh -s - server \
    --server https://${MASTER_IP}:6443 \
    --token ${NODE_TOKEN} \
    --tls-san=${LB_IP} \
    --write-kubeconfig-mode 644
```

##### 4. Join the third master node

```bash
# Run on the third master node (same as above)
export MASTER_IP="134.195.210.237"
export LB_IP="<load-balancer-IP>"
export NODE_TOKEN="<token>"

curl -sfL https://get.k3s.io | sh -s - server \
    --server https://${MASTER_IP}:6443 \
    --token ${NODE_TOKEN} \
    --tls-san=${LB_IP} \
    --write-kubeconfig-mode 644
```

##### 5. Verify the HA cluster

```bash
# Check that all master nodes are present
kubectl get nodes

# Check etcd health through the API server
# (k3s runs etcd embedded in the k3s process, not as pods)
kubectl get --raw /readyz/etcd

# Take a test snapshot to confirm etcd is writable
sudo k3s etcd-snapshot save --etcd-s3=false
```
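
To inspect the actual etcd membership, one approach (an assumption: k3s ships no etcdctl binary, so you install etcdctl yourself, e.g. from an etcd release tarball, and point it at the k3s-managed certificates) is:

```bash
# Assumes etcdctl is installed separately; the certificate paths below are
# the standard k3s embedded-etcd layout under /var/lib/rancher/k3s
sudo ETCDCTL_API=3 etcdctl member list -w table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key
```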

---

### Scenario C: Mixed join (masters first, then workers)

**Recommended order**:
1. Configure the external load balancer
2. Convert the first node to HA mode
3. Join the 2nd and 3rd master nodes
4. Verify the master cluster is healthy
5. Join the worker nodes one at a time

---

## 💾 Storage Configuration

### Longhorn on multiple nodes

With 3+ nodes, Longhorn can provide distributed storage with data redundancy.

#### 1. Install dependencies on every node

```bash
# Run on each node
sudo apt install -y open-iscsi nfs-common

# Start iscsid
sudo systemctl enable --now iscsid
```

#### 2. Configure the Longhorn replica count

```bash
# Run on the master node
kubectl edit settings.longhorn.io default-replica-count -n longhorn-system

# Set:
# value: "3"   # 3 replicas (requires at least 3 nodes)
# value: "2"   # 2 replicas (requires at least 2 nodes)
```
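
The default replica count only affects volumes provisioned through the default StorageClass. For per-workload control, a sketch of a dedicated StorageClass (the name `longhorn-2r` is illustrative; `numberOfReplicas` and `staleReplicaTimeout` are standard Longhorn StorageClass parameters):

```bash
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2r            # illustrative name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"        # per-volume replica count
  staleReplicaTimeout: "30"    # minutes before a failed replica is cleaned up
EOF
```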

#### 3. Label nodes for storage

```bash
# Mark which nodes should hold Longhorn disks
kubectl label nodes worker-1 node.longhorn.io/create-default-disk=true
kubectl label nodes worker-2 node.longhorn.io/create-default-disk=true
kubectl label nodes worker-3 node.longhorn.io/create-default-disk=true

# Exclude certain nodes (e.g. compute-only nodes)
kubectl label nodes worker-4 node.longhorn.io/create-default-disk=false
```

#### 4. Prepare the storage path

```bash
# Create the data directory on each storage node
sudo mkdir -p /var/lib/longhorn
sudo chmod 700 /var/lib/longhorn
```

#### 5. Access the Longhorn UI

```bash
# Create the Ingress (if it does not exist yet)
kubectl apply -f k3s/my-blog/longhorn-ingress.yaml

# Browse to: https://longhorn.u9.net3w.com
```

---

## ✅ Verification and Testing

### 1. Check node status

```bash
# List all nodes
kubectl get nodes -o wide

# Inspect a node in detail
kubectl describe node <node-name>

# Node resource usage
kubectl top nodes
```

### 2. Test pod scheduling

```bash
# Create a test Deployment
kubectl create deployment nginx-test --image=nginx --replicas=6

# Check how the pods are distributed
kubectl get pods -o wide

# Clean up
kubectl delete deployment nginx-test
```
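
By default the scheduler only spreads replicas loosely. If you want the test to prove an even distribution, a sketch using a topology spread constraint (the name `spread-test` is illustrative) enforces at most a one-pod skew between nodes:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: spread-test
  template:
    metadata:
      labels:
        app: spread-test
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                           # at most 1 pod difference between nodes
        topologyKey: kubernetes.io/hostname  # spread across individual nodes
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: spread-test
      containers:
      - name: nginx
        image: nginx
EOF

kubectl get pods -o wide          # expect one or two pods per node
kubectl delete deployment spread-test
```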

### 3. Test storage

```bash
# Create a test PVC
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF

# Check the PVC status
kubectl get pvc test-pvc

# Clean up
kubectl delete pvc test-pvc
```

### 4. Test high availability (HA clusters only)

```bash
# Simulate a master failure:
# run on one master node
sudo systemctl stop k3s

# From another node, check the cluster still responds
kubectl get nodes

# Bring the node back
sudo systemctl start k3s
```

### 5. Test network connectivity

```bash
# On the master node, create a test pod
kubectl run test-pod --image=busybox --restart=Never -- sleep 3600

# Open a shell in the pod
kubectl exec -it test-pod -- sh

# Inside the pod
ping 8.8.8.8
nslookup kubernetes.default

# Clean up
kubectl delete pod test-pod
```

---

## 🔧 Troubleshooting

### Problem 1: Node cannot join the cluster

**Symptom**: the `k3s-agent` service fails to start

**Diagnosis**:

```bash
# 1. Check the service status
sudo systemctl status k3s-agent

# 2. Follow the logs
sudo journalctl -u k3s-agent -f

# 3. Check network connectivity
ping <MASTER_IP>
telnet <MASTER_IP> 6443

# 4. Verify the token is correct
echo $NODE_TOKEN

# 5. Check the firewall
sudo ufw status
```

**Fix**:
```bash
# Reinstall the agent
sudo /usr/local/bin/k3s-agent-uninstall.sh
curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_IP}:6443 \
    K3S_TOKEN=${NODE_TOKEN} sh -
```

---

### Problem 2: Node stuck in NotReady

**Symptom**: `kubectl get nodes` shows the node as NotReady

**Diagnosis**:

```bash
# 1. Inspect the node
kubectl describe node <node-name>

# 2. Check the kubelet logs
# (on the affected node; use -u k3s on master nodes)
sudo journalctl -u k3s-agent -n 100

# 3. Check the network plugin
# (k3s embeds flannel in the k3s process, so there are no flannel pods;
#  check the VXLAN interface instead)
ip addr show flannel.1
```

**Fix**:
```bash
# Restart the k3s service
sudo systemctl restart k3s-agent

# For network issues, check the CNI config
sudo ls -la /etc/cni/net.d/
```

---

### Problem 3: Pods are not scheduled to the new node

**Symptom**: pods stay Pending or only land on the old nodes

**Diagnosis**:

```bash
# 1. Check node taints
kubectl describe node <node-name> | grep Taints

# 2. Check node labels
kubectl get nodes --show-labels

# 3. Check the pod's scheduling constraints
kubectl describe pod <pod-name>
```

**Fix**:
```bash
# Remove the taint
kubectl taint nodes <node-name> node.kubernetes.io/not-ready:NoSchedule-

# Add the worker label
kubectl label nodes <node-name> node-role.kubernetes.io/worker=worker
```

---

### Problem 4: Longhorn storage unusable

**Symptom**: PVCs stay Pending

**Diagnosis**:

```bash
# 1. Check the Longhorn components
kubectl get pods -n longhorn-system

# 2. Check that all nodes are Ready
kubectl get nodes -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'

# 3. Check the iscsid service
sudo systemctl status iscsid
```

**Fix**:
```bash
# Install the dependency on the new node
sudo apt install -y open-iscsi
sudo systemctl enable --now iscsid

# Restart the Longhorn driver deployer
kubectl rollout restart deployment longhorn-driver-deployer -n longhorn-system
```

---

### Problem 5: Unhealthy etcd cluster (HA mode)

**Symptom**: master nodes stop working properly

**Diagnosis**:

```bash
# 1. List etcd snapshots (also confirms the embedded etcd datastore is in use)
sudo k3s etcd-snapshot ls

# 2. Check the etcd logs
sudo journalctl -u k3s -n 100 | grep etcd

# 3. Check the etcd port
sudo netstat -tlnp | grep 2379
```

**Fix**:
```bash
# Restore from a snapshot (use with care: this resets cluster membership)
sudo k3s server \
    --cluster-reset \
    --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot-name>
```

---

## 📚 Quick Reference

### Common commands

```bash
# Cluster information
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -A

# Node resources
kubectl top nodes
kubectl describe node <node-name>

# Node management
kubectl cordon <node-name>        # mark unschedulable
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data  # evict pods
kubectl uncordon <node-name>      # allow scheduling again

# Remove a node from the cluster
kubectl delete node <node-name>

# Uninstall k3s on a node
# Worker node
sudo /usr/local/bin/k3s-agent-uninstall.sh
# Master node
sudo /usr/local/bin/k3s-uninstall.sh
```

### Node label examples

```bash
# Role labels
kubectl label nodes <node> node-role.kubernetes.io/worker=worker
kubectl label nodes <node> node-role.kubernetes.io/master=master

# Workload labels
kubectl label nodes <node> workload=database
kubectl label nodes <node> workload=web
kubectl label nodes <node> workload=cache

# Topology labels
kubectl label nodes <node> topology.kubernetes.io/zone=zone-a
kubectl label nodes <node> topology.kubernetes.io/region=us-east
```
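
Labels attract workloads but do not keep other pods away. To reserve a node exclusively (say, the database node), pair the label with a taint; only pods carrying a matching toleration will schedule there. A sketch, reusing the `workload=database` key/value from above:

```bash
# Taint the node: pods without a matching toleration are kept off
kubectl taint nodes <node> workload=database:NoSchedule

# Pods that belong there declare the toleration (pod spec fragment):
#   tolerations:
#   - key: "workload"
#     operator: "Equal"
#     value: "database"
#     effect: "NoSchedule"
#   nodeSelector:
#     workload: database
```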

---

## 🎯 Best Practices

### 1. Node naming convention
```
master-1, master-2, master-3
worker-1, worker-2, worker-3, ...
```

### 2. Expand gradually
- Join one node first as a test
- Verify it is healthy before joining the rest
- Avoid joining several nodes at once

### 3. Monitoring and alerting
```bash
# Deploy Prometheus + Grafana (kube-prometheus quickstart;
# raw.githubusercontent.com URLs cannot be applied as directories)
git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
kubectl apply --server-side -f manifests/setup
kubectl apply -f manifests/
```

### 4. Back up regularly
```bash
# Back up etcd
sudo k3s etcd-snapshot save --name backup-$(date +%Y%m%d-%H%M%S)

# List backups
sudo k3s etcd-snapshot ls
```
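
Manual snapshots are easy to forget. k3s can also snapshot on a schedule; a sketch via the server config file (`etcd-snapshot-schedule-cron` and `etcd-snapshot-retention` are standard k3s server options; the cron expression and retention count here are examples):

```bash
# Append to /etc/rancher/k3s/config.yaml on each server node
sudo tee -a /etc/rancher/k3s/config.yaml > /dev/null <<EOF
etcd-snapshot-schedule-cron: "0 */12 * * *"   # every 12 hours (example)
etcd-snapshot-retention: 10                   # keep the 10 most recent (example)
EOF

# Restart k3s to pick up the change
sudo systemctl restart k3s
```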

### 5. Resource reservation
```bash
# Cap resource requests in kube-system with a quota
# (note: a ResourceQuota limits usage; it does not reserve capacity)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: system-quota
  namespace: kube-system
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
EOF
```
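
For true reservation (carving node capacity out of what the scheduler may allocate), the kubelet's `system-reserved` setting is the closer fit; a sketch passing it through the k3s installer (`--kubelet-arg` is a standard k3s flag; the sizes are examples):

```bash
# Re-run the installer with a kubelet reservation (worker example)
curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_IP}:6443 \
    K3S_TOKEN=${NODE_TOKEN} \
    sh -s - --kubelet-arg="system-reserved=cpu=500m,memory=1Gi"
```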

---

## 📞 Getting Help

- k3s documentation: https://docs.k3s.io
- Longhorn documentation: https://longhorn.io/docs
- Kubernetes documentation: https://kubernetes.io/docs

---

**Document version**: v1.0
**Last updated**: 2026-01-21
**Applies to**: k3s v1.34.3+k3s1