This repository has been archived by the owner on Mar 24, 2022. It is now read-only.
forked from pivotal-cf/docs-concourse-olm
-
Notifications
You must be signed in to change notification settings - Fork 0
/
troubleshooting.html.md.erb
79 lines (68 loc) · 3.69 KB
/
troubleshooting.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
title: Monitoring and Troubleshooting Your Installation
owner: Concourse
---
This topic should include:
- Information on common errors and what to do about them
- What logs are available and how to access them.
- Fly commands that may be useful for troubleshooting
Monitoring concourse for metrics: http://concourse.ci/metrics.html
The logs for a specific job inside concourse release will be located in the vms on which that job was running at the location /var/vcap/sys/log/<concourse-job-name>/*.log. To enable syslog forwarding or to get logs for other components of concourse refer to: https://bosh.io/docs/job-logs.html.
Fly commands useful for troubleshooting:
* concourse (env) troubleshooting:
containers: Listing active containers
- check which container/task got placed on which worker
workers: Listing registered workers
- useful to verify that the number of containers aren't exceeding maximum allowable container on a worker.
prune-worker: Reap a non-running worker
- stop concourse from tracking an out of commission worker
volumes: Listing active volumes
- check disk usage across workers.
* pipeline troubleshooting: (do we even need this?)
pipelines: Listing configured pipelines
builds: Showing build history
- useful for getting build ids of one-off tasks run using execute
validate-pipeline: Validate a pipeline's configuration
- check pipeline for validity without calling set-pipeline
check-resource: Trigger discovery of new versions
- useful when developing a new resource
watch: View logs of in-progress builds
intercept: Accessing a running or recent build's steps
- poke around in a task container to see what's up
execute: Submitting Local Tasks
- spinning up a task quickly to test before putting it in a job
Common bosh Issues: https://bosh.io/docs/tips.html
Common concourse Issues:
Worker Out of Disk Space
Symptoms:
- cannot create volume: permissions denied (approximate error wording)
Solution:
- increase persistent disk for worker or increase number of worker vms
Container Limit Reached (unlikely)
Symptoms:
- cannot create container: limit of 250 containers reached
Solution:
- check fly containers
- increase number of worker vms
- decrease gc_interval if set to custom value (a large interval could mean that expired containers are kept around long enough that they build up)
Job doesnt get started (Build stuck in Pending state)
Solution:
- restart atc job.
-> This can be done either by loggin in as a root user on concourse web vms (the ones on which atc job is located.) Then running `monit restart atc`
-> Or by running `bosh restart web --force`, which will be slower than previous option.
Updating Concourse in Concourse Job errors out
Symptom
- when bosh deploying a concourse update from a job running on that concourse typically the job will error with "worker for container not found." This is expected as the bosh director will recreate the worker vm
Solution:
- run it again; it should be a no-op the second time after a successful update
- or just live with an orange build; it was probably a successful redeploy
Gotcha:
- this is only possible when running with a bosh director as the director is performing the update. If your concourse is deployed using bosh create-env you will need to perform the update elsewhere, typically from your workstation
Bosh Can't Finish Worker Upgrade While Tasks Running
Symptom:
- if you have a long running task bosh won't be able to finalize the upgrade by restarting the worker job until all of the worker's jobs (tasks?) have stopped
Solution:
- just wait
- try to avoid long running tasks
- e.g.: pool-resource if misused
- cancel running tasks/jobs