
241015 main from release (v3.2.1) #251


Merged
merged 31 commits into main from release on Oct 15, 2024
Commits
2cbcfe0
policy: add a decapod app for policies
Oct 4, 2023
0923ba4
Merge pull request #218 from openinfradev/main
ktkfree Nov 13, 2023
4b43b37
Merge pull request #226 from openinfradev/main
ktkfree Nov 17, 2023
7128398
Merge pull request #178 from openinfradev/policy-serving
intelliguy Nov 24, 2023
49558f0
fluentbit: do not store all logs as default
Dec 4, 2023
e008a7d
Merge pull request #230 from openinfradev/fluentbit
bluejayA Dec 5, 2023
50d7082
Merge pull request #231 from openinfradev/main
ktkfree Jan 15, 2024
a8653f6
feature. add alert ruler for tks_policy
ktkfree Apr 19, 2024
10f40d2
Merge pull request #233 from openinfradev/policy_ruler
intelliguy Apr 23, 2024
9d2964c
feature. remove thanos ruler from all stack_templates
ktkfree Apr 24, 2024
b298550
Merge pull request #235 from openinfradev/remove_thanos_ruller
intelliguy Apr 24, 2024
b7816bd
feature. change service type LoadBalancer for thanos-ruler
ktkfree Apr 25, 2024
5ab7d8e
Merge pull request #236 from openinfradev/change_servicetype_ruler
intelliguy Apr 25, 2024
f321e70
feature. add policy to byoh-reference
ktkfree May 3, 2024
15e62a7
Merge pull request #237 from openinfradev/byoh_fix
intelliguy May 3, 2024
eb5b524
Merge pull request #238 from openinfradev/develop
ktkfree May 17, 2024
7bd3a5f
fluentbit: add collecting targets for policy-serving
May 20, 2024
aaf00ca
Merge pull request #239 from openinfradev/policy-serving
ktkfree May 21, 2024
345bc1b
Merge pull request #240 from openinfradev/develop
ktkfree May 21, 2024
447a84d
Merge pull request #241 from openinfradev/release
ktkfree Jun 4, 2024
0273123
user-logging: add loki for non-platform-logs as loki-user
Jun 24, 2024
8031cc4
Merge pull request #242 from openinfradev/user-logging
intelliguy Jun 25, 2024
75991bd
trivial. remove service type LoadBalancer from thanos-ruler
ktkfree Jul 16, 2024
ca18275
Merge pull request #245 from openinfradev/minor_fix
intelliguy Jul 17, 2024
0b5c26e
feature. add byok-reference
ktkfree Jul 22, 2024
2f45c14
Merge pull request #246 from openinfradev/byok
zugwan Jul 24, 2024
228e573
bugfix. add s3 bucket 'tks-loki-user'
ktkfree Sep 4, 2024
532e8e2
Merge pull request #248 from openinfradev/update_eks_version
intelliguy Sep 4, 2024
6b49313
feature. update kubernetes version to 1.29.8
ktkfree Sep 6, 2024
43b9811
Merge pull request #249 from openinfradev/update_eks_version
zugwan Sep 9, 2024
6ac3350
Merge pull request #250 from openinfradev/develop
ktkfree Oct 2, 2024
100 changes: 43 additions & 57 deletions aws-msa-reference/lma/site-values.yaml
@@ -16,6 +16,8 @@ global:

lokiHost: loki-loki-distributed-gateway
lokiPort: 80
lokiuserHost: loki-user-loki-distributed-gateway
lokiuserPort: 80
s3Service: "minio.lma.svc:9000" # depends on $lmaNameSpace (ex. minio.taco-system.svc)

lmaNameSpace: lma
@@ -148,19 +150,23 @@ charts:
- name: taco-loki
host: $(lokiHost)
port: $(lokiPort)
lokiuser:
- name: taco-loki-user
host: $(lokiuserHost)
port: $(lokiuserPort)
targetLogs:
- tag: kube.*
bufferChunkSize: 2M
bufferMaxSize: 5M
do_not_store_as_default: false
index: container
loki_name: taco-loki
loki_name: taco-loki-user
memBufLimit: 20MB
multi_index:
- index: platform
loki_name: taco-loki
key: $kubernetes['namespace_name']
value: kube-system|$(lmaNameSpace)|taco-system|argo
value: kube-system|$(lmaNameSpace)|taco-system|gatekeeper-system|argo
parser: docker
path: /var/log/containers/*.log
type: kubernates
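With this change the default fluent-bit sink for container logs becomes taco-loki-user, while the multi_index entry keeps platform namespaces on taco-loki. A rough sketch of the resulting routing, assuming multi_index acts as a namespace-based override of the default output (illustrative only, not the chart's rendered fluent-bit configuration):

# illustrative routing summary, not generated output
default:
  output: taco-loki-user                     # loki-user-loki-distributed-gateway:80
  index: container                           # all workload logs
platform_override:
  output: taco-loki                          # loki-loki-distributed-gateway:80
  index: platform
  when: kubernetes namespace_name matches kube-system|$(lmaNameSpace)|taco-system|gatekeeper-system|argo

gatekeeper-system is added to the platform pattern so that logs from the OPA Gatekeeper components introduced by the policy app stay with the other platform logs.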
@@ -274,6 +280,8 @@ charts:
# - --deduplication.replica-label="replica"
storegateway.persistence.size: 8Gi
ruler.nodeSelector: $(nodeSelector)
ruler.service.type: LoadBalancer
ruler.service.annotations: $(awsNlbAnnotation)
ruler.alertmanagers:
- http://alertmanager-operated:9093
ruler.persistence.size: 8Gi
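The ruler is now exposed through a LoadBalancer Service using $(awsNlbAnnotation), which is defined elsewhere in the site values. As a minimal sketch, the resulting Service typically carries an NLB annotation along these lines (the exact key and value depend on how awsNlbAnnotation is defined; the Service name is illustrative):

# sketch only; the real annotation comes from $(awsNlbAnnotation)
apiVersion: v1
kind: Service
metadata:
  name: lma-thanos-ruler                                        # illustrative name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"    # common AWS NLB annotation
spec:
  type: LoadBalancer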
@@ -283,61 +291,7 @@
rules:
- alert: "PrometheusDown"
expr: absent(up{prometheus="lma/lma-prometheus"})
- alert: node-cpu-high-load
annotations:
message: The idle-process CPU share on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has been 0% for 3 minutes. (current usage {{$value}})
description: The worker node CPU is overloaded. This can be caused by a temporary increase in service traffic, a software error in a workload, a server hardware fan failure, or other reasons.
Checkpoint: If no temporary increase in service traffic was observed, review the configuration of the pods on the alerting node that consume the most CPU. For example, a limit in the pod spec can prevent excessive CPU usage.
summary: Cpu resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (avg by (taco_cluster, instance) (rate(node_cpu_seconds_total{mode="idle"}[60s]))) < 0 #0.1 # really 0?
for: 3m
labels:
severity: warning
- alert: node-memory-high-utilization
annotations:
message: Memory usage on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has exceeded 80% for 3 minutes. (current usage {{$value}})
description: Worker node memory usage has exceeded 80%. This can be caused by a temporary increase in service load, software errors, or other reasons.
Checkpoint: If no temporary increase in service traffic was observed, inspect the pods with high memory usage on the alerting node.
summary: Memory resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) < 0.2
for: 3m
labels:
severity: warning
- alert: node-disk-full
annotations:
message: Based on the trend of the last 6 hours, the root volume of node ({{ $labels.instance }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours
description: At the current usage trend, the disk is expected to fill up within 24 hours.
Checkpoint: Optimizing disk usage (deletion or backup) is recommended. If there is nothing to delete, please plan a capacity expansion.
summary: Disk resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 30m
labels:
severity: critical
- alert: pvc-full
annotations:
message: Based on the trend of the last 6 hours, the volume ({{ $labels.persistentvolumeclaim }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours
description: At the current usage trend, the disk is expected to fill up within 24 hours. (cluster {{ $labels.taco_cluster }}, PVC {{ $labels.persistentvolumeclaim }})
Checkpoint: Optimizing disk usage (deletion or backup) is recommended. If there is nothing to delete, please plan a capacity expansion.
summary: Disk resources of the volume(pvc) {{ $labels.persistentvolumeclaim }} are running low.
discriminative: $labels.taco_cluster, $labels.persistentvolumeclaim
expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 24*3600) < 0 # kubelet_volume_stats_capacity_bytes
for: 30m
labels:
severity: critical
- alert: pod-restart-frequently
annotations:
message: Pod ({{ $labels.pod }}) in cluster ({{ $labels.taco_cluster }}) has restarted more than 5 times in the last 30 minutes ({{ $value }} times)
description: A specific pod is restarting frequently and needs to be checked. (cluster {{ $labels.taco_cluster }}, pod {{ $labels.pod }})
Checkpoint: The pod spec needs to be reviewed. Please check the pod's logs and status.
discriminative: $labels.taco_cluster, $labels.pod, $labels.namespace
expr: increase(kube_pod_container_status_restarts_total{namespace!="kube-system"}[60m:]) > 2 # how many restarts should the threshold be?
for: 30m
labels:
severity: critical


- name: thanos-config
override:
objectStorage:
@@ -393,10 +347,42 @@ charts:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: loki-user
override:
global.dnsService: kube-dns
# global.clusterDomain: $(clusterName) # kept commented out because the cluster domain is still cluster.local regardless of the cluster name
gateway.service.type: LoadBalancer
gateway.service.annotations: $(awsNlbAnnotation)
ingester.persistence.storageClass: $(storageClassName)
distributor.persistence.storageClass: $(storageClassName)
queryFrontend.persistence.storageClass: $(storageClassName)
ruler.persistence.storageClass: $(storageClassName)
indexGateway.persistence.storageClass: $(storageClassName)
# select target node's label
ingester.nodeSelector: $(nodeSelector)
distributor.nodeSelector: $(nodeSelector)
querier.nodeSelector: $(nodeSelector)
queryFrontend.nodeSelector: $(nodeSelector)
queryScheduler.nodeSelector: $(nodeSelector)
tableManager.nodeSelector: $(nodeSelector)
gateway.nodeSelector: $(nodeSelector)
compactor.nodeSelector: $(nodeSelector)
ruler.nodeSelector: $(nodeSelector)
indexGateway.nodeSelector: $(nodeSelector)
memcachedChunks.nodeSelector: $(nodeSelector)
memcachedFrontend.nodeSelector: $(nodeSelector)
memcachedIndexQueries.nodeSelector: $(nodeSelector)
memcachedIndexWrites.nodeSelector: $(nodeSelector)
loki:
storageConfig:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio
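The loki-user release mirrors the existing loki release so that workload logs get their own stack. A short mapping sketch of how the two releases are expected to line up with the fluent-bit outputs above (the service names follow the <release>-loki-distributed-gateway pattern used in this file; the mapping itself is an assumption, not rendered output):

# platform logs : fluent-bit output taco-loki      -> loki-loki-distributed-gateway:80      (release: loki)
# workload logs : fluent-bit output taco-loki-user -> loki-user-loki-distributed-gateway:80 (release: loki-user)
# both releases point storageConfig.aws.s3 at the same MinIO endpoint, $(s3Service)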

- name: lma-bucket
override:
s3.enabled: true
s3.buckets:
- name: $(clusterName)-tks-thanos
- name: $(clusterName)-tks-loki
- name: $(clusterName)-tks-loki-user
tks.iamRoles: $(tksIamRoles)
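For reference, with a hypothetical clusterName of cluster-abc the bucket list would expand roughly as follows (cluster-abc is an illustrative value only):

s3.buckets:
  - name: cluster-abc-tks-thanos       # metrics object storage
  - name: cluster-abc-tks-loki         # platform logs
  - name: cluster-abc-tks-loki-user    # workload logs, added in this PR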
5 changes: 5 additions & 0 deletions aws-msa-reference/policy/kustomization.yaml
@@ -0,0 +1,5 @@
resources:
- ../base

transformers:
- site-values.yaml
26 changes: 26 additions & 0 deletions aws-msa-reference/policy/site-values.yaml
@@ -0,0 +1,26 @@
apiVersion: openinfradev.github.com/v1
kind: HelmValuesTransformer
metadata:
name: site

global:
nodeSelector:
taco-lma: enabled
clusterName: cluster.local
storageClassName: taco-storage
repository: https://openinfradev.github.io/helm-repo/

charts:
- name: opa-gatekeeper
override:
postUpgrade.nodeSelector: $(nodeSelector)
postInstall.nodeSelector: $(nodeSelector)
preUninstall.nodeSelector: $(nodeSelector)
controllerManager.nodeSelector: $(nodeSelector)
audit.nodeSelector: $(nodeSelector)
crds.nodeSelector: $(nodeSelector)

enableDeleteOperations: true

- name: policy-resources
override: {}
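The override keys use the dotted-path form consumed by HelmValuesTransformer; assuming they expand into nested Helm values in the usual way, the opa-gatekeeper values would end up roughly like this (a sketch, not the transformer's actual output):

controllerManager:
  nodeSelector:
    taco-lma: enabled
audit:
  nodeSelector:
    taco-lma: enabled
crds:
  nodeSelector:
    taco-lma: enabled
# postUpgrade, postInstall and preUninstall follow the same pattern
enableDeleteOperations: true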
8 changes: 7 additions & 1 deletion aws-msa-reference/tks-cluster/site-values.yaml
@@ -27,7 +27,7 @@ charts:
sshKeyName: $(sshKeyName)
cluster:
name: $(clusterName)
kubernetesVersion: v1.26.10
kubernetesVersion: v1.29.8
eksEnabled: false
multitenancyId:
kind: AWSClusterRoleIdentity
@@ -54,6 +54,8 @@ charts:
kubeadmControlPlane:
replicas: $(tksCpNode)
controlPlaneMachineType: $(tksCpNodeType)
ami:
id: ami-02e4e8f09921cfe97
machinePool:
- name: taco
machineType: $(tksInfraNodeType)
@@ -69,6 +71,8 @@
taco-ingress-gateway: enabled
roleAdditionalPolicies:
- "arn:aws:iam::aws:policy/AmazonS3FullAccess"
ami:
id: ami-02e4e8f09921cfe97
machineDeployment:
- name: normal
numberOfAZ: 3 # ap-northeast-2
@@ -80,6 +84,8 @@ charts:
rootVolume:
size: 50
type: gp2
ami:
id: ami-02e4e8f09921cfe97

- name: ingress-nginx
override:
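The kubernetesVersion bump to v1.29.8 and the pinned AMI are presumably rendered by the tks-cluster (Cluster API / CAPA) chart into the control-plane and machine templates. A rough sketch of where these values land; the resource names and API versions below are illustrative assumptions, not the chart's actual output:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: cluster-abc-control-plane          # illustrative name
spec:
  version: v1.29.8                         # from cluster.kubernetesVersion
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: cluster-abc-taco                   # illustrative name
spec:
  template:
    spec:
      ami:
        id: ami-02e4e8f09921cfe97          # pinned here for the control plane, machinePool and machineDeployment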
101 changes: 43 additions & 58 deletions aws-reference/lma/site-values.yaml
@@ -16,6 +16,8 @@ global:

lokiHost: loki-loki-distributed-gateway
lokiPort: 80
lokiuserHost: loki-user-loki-distributed-gateway
lokiuserPort: 80
s3Service: "minio.lma.svc:9000" # depends on $lmaNameSpace (ex. minio.taco-system.svc)

lmaNameSpace: lma
@@ -148,19 +150,23 @@ charts:
- name: taco-loki
host: $(lokiHost)
port: $(lokiPort)
lokiuser:
- name: taco-loki-user
host: $(lokiuserHost)
port: $(lokiuserPort)
targetLogs:
- tag: kube.*
bufferChunkSize: 2M
bufferMaxSize: 5M
do_not_store_as_default: false
index: container
loki_name: taco-loki
loki_name: taco-loki-user
memBufLimit: 20MB
multi_index:
- index: platform
loki_name: taco-loki
key: $kubernetes['namespace_name']
value: kube-system|$(lmaNameSpace)|taco-system|argo
value: kube-system|$(lmaNameSpace)|taco-system|gatekeeper-system|argo
parser: docker
path: /var/log/containers/*.log
type: kubernates
@@ -244,7 +250,6 @@ charts:
consoleIngress.nodeSelector: $(nodeSelector)
postJob.nodeSelector: $(nodeSelector)


- name: thanos
override:
global.storageClass: $(storageClassName)
@@ -274,6 +279,8 @@
# - --deduplication.replica-label="replica"
storegateway.persistence.size: 8Gi
ruler.nodeSelector: $(nodeSelector)
ruler.service.type: LoadBalancer
ruler.service.annotations: $(awsNlbAnnotation)
ruler.alertmanagers:
- http://alertmanager-operated:9093
ruler.persistence.size: 8Gi
@@ -283,61 +290,7 @@
rules:
- alert: "PrometheusDown"
expr: absent(up{prometheus="lma/lma-prometheus"})
- alert: node-cpu-high-load
annotations:
message: The idle-process CPU share on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has been 0% for 3 minutes. (current usage {{$value}})
description: The worker node CPU is overloaded. This can be caused by a temporary increase in service traffic, a software error in a workload, a server hardware fan failure, or other reasons.
Checkpoint: If no temporary increase in service traffic was observed, review the configuration of the pods on the alerting node that consume the most CPU. For example, a limit in the pod spec can prevent excessive CPU usage.
summary: Cpu resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (avg by (taco_cluster, instance) (rate(node_cpu_seconds_total{mode="idle"}[60s]))) < 0 #0.1 # really 0?
for: 3m
labels:
severity: warning
- alert: node-memory-high-utilization
annotations:
message: Memory usage on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has exceeded 80% for 3 minutes. (current usage {{$value}})
description: Worker node memory usage has exceeded 80%. This can be caused by a temporary increase in service load, software errors, or other reasons.
Checkpoint: If no temporary increase in service traffic was observed, inspect the pods with high memory usage on the alerting node.
summary: Memory resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) < 0.2
for: 3m
labels:
severity: warning
- alert: node-disk-full
annotations:
message: Based on the trend of the last 6 hours, the root volume of node ({{ $labels.instance }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours
description: At the current usage trend, the disk is expected to fill up within 24 hours.
Checkpoint: Optimizing disk usage (deletion or backup) is recommended. If there is nothing to delete, please plan a capacity expansion.
summary: Disk resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 30m
labels:
severity: critical
- alert: pvc-full
annotations:
message: Based on the trend of the last 6 hours, the volume ({{ $labels.persistentvolumeclaim }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours
description: At the current usage trend, the disk is expected to fill up within 24 hours. (cluster {{ $labels.taco_cluster }}, PVC {{ $labels.persistentvolumeclaim }})
Checkpoint: Optimizing disk usage (deletion or backup) is recommended. If there is nothing to delete, please plan a capacity expansion.
summary: Disk resources of the volume(pvc) {{ $labels.persistentvolumeclaim }} are running low.
discriminative: $labels.taco_cluster, $labels.persistentvolumeclaim
expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 24*3600) < 0 # kubelet_volume_stats_capacity_bytes
for: 30m
labels:
severity: critical
- alert: pod-restart-frequently
annotations:
message: Pod ({{ $labels.pod }}) in cluster ({{ $labels.taco_cluster }}) has restarted more than 5 times in the last 30 minutes ({{ $value }} times)
description: A specific pod is restarting frequently and needs to be checked. (cluster {{ $labels.taco_cluster }}, pod {{ $labels.pod }})
Checkpoint: The pod spec needs to be reviewed. Please check the pod's logs and status.
discriminative: $labels.taco_cluster, $labels.pod, $labels.namespace
expr: increase(kube_pod_container_status_restarts_total{namespace!="kube-system"}[60m:]) > 2 # how many restarts should the threshold be?
for: 30m
labels:
severity: critical


- name: thanos-config
override:
objectStorage:
@@ -393,10 +346,42 @@ charts:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: loki-user
override:
global.dnsService: kube-dns
# global.clusterDomain: $(clusterName) # kept commented out because the cluster domain is still cluster.local regardless of the cluster name
gateway.service.type: LoadBalancer
gateway.service.annotations: $(awsNlbAnnotation)
ingester.persistence.storageClass: $(storageClassName)
distributor.persistence.storageClass: $(storageClassName)
queryFrontend.persistence.storageClass: $(storageClassName)
ruler.persistence.storageClass: $(storageClassName)
indexGateway.persistence.storageClass: $(storageClassName)
# select target node's label
ingester.nodeSelector: $(nodeSelector)
distributor.nodeSelector: $(nodeSelector)
querier.nodeSelector: $(nodeSelector)
queryFrontend.nodeSelector: $(nodeSelector)
queryScheduler.nodeSelector: $(nodeSelector)
tableManager.nodeSelector: $(nodeSelector)
gateway.nodeSelector: $(nodeSelector)
compactor.nodeSelector: $(nodeSelector)
ruler.nodeSelector: $(nodeSelector)
indexGateway.nodeSelector: $(nodeSelector)
memcachedChunks.nodeSelector: $(nodeSelector)
memcachedFrontend.nodeSelector: $(nodeSelector)
memcachedIndexQueries.nodeSelector: $(nodeSelector)
memcachedIndexWrites.nodeSelector: $(nodeSelector)
loki:
storageConfig:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: lma-bucket
override:
s3.enabled: true
s3.buckets:
- name: $(clusterName)-tks-thanos
- name: $(clusterName)-tks-loki
- name: $(clusterName)-tks-loki-user
tks.iamRoles: $(tksIamRoles)
5 changes: 5 additions & 0 deletions aws-reference/policy/kustomization.yaml
@@ -0,0 +1,5 @@
resources:
- ../base

transformers:
- site-values.yaml