The `DAG_airflow_db_cleaner.py` DAG periodically cleans some records of the Airflow metadata database. The goal is to keep Airflow fast and avoid performance issues.
- In your metadata database, create a new user and give it drop and read permissions on the `log`, `task_instance`, `job`, and `dag_run` tables in the `airflow` database.
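
  A minimal sketch of one way to do this, assuming a MySQL metadata database and the `mysql-connector-python` package; the admin credentials, user name, and password below are placeholders, and "drop and read" is interpreted here as the `DELETE` and `SELECT` privileges:

  ```python
  import mysql.connector

  # Connect as an administrative user (placeholder credentials).
  admin = mysql.connector.connect(user="root", password="root_password", host="localhost")
  cur = admin.cursor()

  # Create the cleaning user and grant it read/delete access to the four tables.
  cur.execute("CREATE USER 'db_cleaner'@'localhost' IDENTIFIED BY 'my_db_password'")
  for table in ("log", "task_instance", "job", "dag_run"):
      cur.execute(f"GRANT SELECT, DELETE ON airflow.{table} TO 'db_cleaner'@'localhost'")

  cur.close()
  admin.close()
  ```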
- In a secure directory, create a file containing your new user's password.
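
  For example, a small sketch of creating such a file with owner-only permissions (the path and password are placeholders):

  ```python
  import os

  # Create the password file readable only by its owner (mode 0600).
  fd = os.open("/my/safe/directory/user.passfile", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
  with os.fdopen(fd, "w") as f:
      f.write("my_db_password")
  ```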
- Alter the `DAG_airflow_db_cleaner.py` file and change the first occurrence of the `passFile` variable so that it points to your password file. It should look like this:

  ```python
  # Path to the password file
  passFile = open("/my/safe/directory/user.passfile", "r")
  ```
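
  The DAG then builds the `passwd` variable from this file; as a rough, purely illustrative sketch of what that likely looks like (the DAG already defines it, so there is nothing to change here):

  ```python
  # Read the password from the file and strip the trailing newline.
  passwd = passFile.read().strip()
  passFile.close()
  ```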
- Next, change the `session` variable, which passes your connection information to the `conn` function. The first parameter is the `user`; the next is a password variable called `passwd`, already defined from your password file, so you don't need to change it; then come the `host`, the connection port, and finally the Airflow database. It should look like this:

  ```python
  session = conn('user', passwd, 'localhost', '3306', 'airflow')
  ```
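
  The `conn` helper is defined inside the DAG file; as a rough illustration only, a function with this signature could be built on SQLAlchemy along these lines (the driver URL assumes a MySQL metadata database):

  ```python
  from sqlalchemy import create_engine
  from sqlalchemy.orm import sessionmaker

  def conn(user, passwd, host, port, database):
      # Build an engine for the metadata database and return an ORM session.
      engine = create_engine(f"mysql://{user}:{passwd}@{host}:{port}/{database}")
      return sessionmaker(bind=engine)()
  ```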
- Finally, when defining your DAG, `DAG_airflow_db_cleaner = DAG( .... )`, you can adjust the `start_date` parameter.
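
  For illustration, assuming a classic DAG definition (the exact arguments in the file may differ), it could look like this:

  ```python
  from datetime import datetime, timedelta
  from airflow import DAG

  DAG_airflow_db_cleaner = DAG(
      dag_id="DAG_airflow_db_cleaner",
      start_date=datetime(2021, 1, 1),        # adjust to your needs
      schedule_interval=timedelta(days=15),   # runs every 15 days by default
      catchup=False,
  )
  ```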
If everything is OK, you are done!
By default, the DAG runs every 15 days and performs the following cleanups:

- Table `log`: all logs stored before the current date will be deleted.
- Table `task_instance`: all records older than 15 days will be deleted, except the most recent one.
- Table `job`: records in the `running` state whose `latest_heartbeat` and `start_date` are more than 15 days old will be deleted. These records can be generated by Airflow issues, and they never change their state.
- Table `dag_run`: all dag runs in the `running` state with a `start_date` more than 15 days old will be deleted. As in the `job` case, this avoids endless dag runs.
Feel free to change these "rules" inside the `clean(...)` function of each table model, add new models, etc.
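
As an illustration of the pattern (not the actual code), a `clean(...)` rule for the `log` table written against a SQLAlchemy session and a mapped `Log` model (here assumed to expose a `dttm` timestamp column) might look like this:

```python
from datetime import datetime

def clean(session, Log):
    """Delete every row of the log table dated before the current date."""
    session.query(Log).filter(Log.dttm < datetime.now()).delete(synchronize_session=False)
    session.commit()
```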