Vacuum Analyze with Talend Open Studio


It all started with a requirement to VACUUM ANALYZE our production Greenplum appliances. The problem was the timing: we only had two hours to complete the activity. This meant we needed to start with the tables that had never been vacuumed, then follow up with the remaining tables ordered by their last vacuum date, and stop the process precisely after two hours. The following query returns the VACUUM ANALYZE statements in the correct order (it works on all PostgreSQL clones); we just need to execute them:

/* VACUUM ANALYZE script: lists the never-analyzed tables first, followed by
   the tables that need an ANALYZE the most */
SELECT 'VACUUM ANALYZE '||quote_ident(tbl.table_schema)||'.'||quote_ident(tbl.table_name) query
    /*opercreate.*, opervacuum.*, tbl.table_catalog, tbl.table_schema, tbl.table_name, tbl.table_type*/
FROM pg_stat_last_operation opercreate
	LEFT JOIN (SELECT classid, objid, MAX(statime) AS statime
	           FROM pg_stat_last_operation
	           WHERE staactionname IN ('VACUUM','ANALYZE')
	           GROUP BY classid, objid) opervacuum
	    ON (opercreate.classid = opervacuum.classid
	    AND opercreate.objid   = opervacuum.objid)
	LEFT JOIN information_schema.tables tbl
	    ON (quote_ident(tbl.table_schema) || '.' || quote_ident(tbl.table_name))::regclass = opercreate.objid
WHERE opercreate.staactionname = 'CREATE' AND opercreate.stasubtype = 'TABLE'
--AND table_schema LIKE '"+ context.SCHEMAMASK +"'
--AND table_name   LIKE '"+ context.TABLEMASK +"'
  AND table_schema IS NOT NULL
  AND table_schema NOT IN ('pg_catalog','information_schema','gp_toolkit','hawq_toolkit')
  AND table_schema NOT LIKE 'pg\\_%'
ORDER BY coalesce(opervacuum.statime, '19000101'::timestamptz) ASC

It is possible to narrow down the tables and schemas using the commented-out filters. So we have a SQL script that gives us the SQL statements we have to execute, and this is where we hit a problem. We can execute these statements with Talend, even on multiple threads, but we want some automation that kills the process after a given time.
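Outside of Talend, the same "run the generated statements in order, stop at the deadline" idea can be sketched in a few lines of plain Java. This is a minimal illustration, not the actual job: the class and method names are hypothetical, and the executor callback stands in for a real JDBC `Statement.execute()` call against Greenplum.

```java
import java.util.List;
import java.util.function.Consumer;

public class VacuumRunner {

    // true while we are still inside the allowed time window
    static boolean withinBudget(long startMillis, long nowMillis, long budgetMinutes) {
        return (nowMillis - startMillis) < budgetMinutes * 60L * 1000L;
    }

    // Executes the generated statements in order until the budget runs out;
    // returns how many statements were actually executed.
    static int runUntilDeadline(List<String> sqls, long budgetMinutes,
                                Consumer<String> executor) {
        long start = System.currentTimeMillis();
        int executed = 0;
        for (String sql : sqls) {
            if (!withinBudget(start, System.currentTimeMillis(), budgetMinutes)) {
                break; // deadline reached: skip the remaining tables
            }
            executor.accept(sql); // in the real job: run the VACUUM ANALYZE over JDBC
            executed++;
        }
        return executed;
    }
}
```

Note that, just like in the Talend job below, a statement that is already running when the deadline passes is allowed to finish; we only stop handing out new work.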

My first solution looked like this:


Overview of the first version.

Looking at the results we can see that something is not right. The execution time should be two hours, and it was; the execution should have succeeded, yet it failed. Even though the fail message was generated on purpose, the job still shows up as a failure.


Failure even if the job was successful

This approach works: you can use it to automatically kill long-running ETL sessions and get notified about them. But since we want to run this in production, where we get a notification for every failed job, we need a solution that doesn't generate failures in the logs; if there is a failure in the logs, there must be a real failure. I shared this approach anyway because it is a generic solution: it's simple and can be used anywhere.
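The tSleep + tDie pair is essentially a watchdog: sleep for the time budget, then fail the job. The same generic pattern can be sketched in plain Java (hypothetical class name; in a real job the timeout action would be something like System.exit(1), the equivalent of tDie):

```java
public class Watchdog {

    // Starts a daemon thread that fires onTimeout after afterMillis,
    // unless the thread is interrupted first (i.e. the work finished in time).
    public static Thread start(long afterMillis, Runnable onTimeout) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(afterMillis);
                onTimeout.run(); // time is up: kill/fail, like tDie
            } catch (InterruptedException e) {
                // main work finished first: nothing to kill
            }
        });
        t.setDaemon(true); // don't keep the JVM alive just for the watchdog
        t.start();
        return t;
    }
}
```

The drawback is exactly the one described above: the watchdog ends the run as a failure even when the work it cut short was perfectly fine.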

For the second version, I realised we had to get rid of the tDie component; this approach also works with Talend Open Studio.


Overview of the second version

So where is the magic, you may ask? Well, it is hidden inside the tJavaRow component:

long startTime = 0;
if ((String) globalMap.get("startTime") == null) {
	// first row: remember when processing started
	startTime = System.currentTimeMillis();
	globalMap.put("startTime", Long.toString(startTime));
}
startTime = Long.valueOf((String) globalMap.get("startTime"));
if ((System.currentTimeMillis() - startTime) >= ((long) context.stop_after_N_minutes) * 60 /*sec*/ * 1000 /*millisec*/) {
	continue; /* time is up: skip this row */
}
output_row.sql = input_row.sql;

After the given time we simply skip every remaining row in the loop, so the second tFlowMeter won't receive any new data.


Execution details

I configured it to stop after one minute; however, a few queries were still running, and after those completed the job quickly iterated over the remaining rows without passing them on. For comparison with the original version, this is how it appears in the logs:


Talend's project-level logging

I configure project-level logging for every project; it is one of the best features Talend offers, and it's available in Open Studio as well.

You may ask what the actual executor job looks like:


Executor job

Since this job was created in Talend Enterprise, we have the option to use threads. The vacuums can be parallelized using the Iterate link's advanced settings together with a context variable. Still, connection creation takes considerable time, so a shared connection would be tempting. Unfortunately we need a separate connection for each thread, since the PostgreSQL JDBC driver serializes parallel requests on a single connection. We can make the connection name unique among the threads:
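One simple way to get a unique shared-connection name per thread is to derive it from the thread's id. This is just a sketch of the idea (the class name and the "conn_" prefix are assumptions, not taken from the original job):

```java
public class ConnNames {

    // Build a connection name from an explicit thread id (easy to test).
    static String forThread(long threadId) {
        return "conn_" + threadId;
    }

    // Convenience: name for the current worker thread, e.g. usable as the
    // shared connection name in a tPostgresqlConnection component.
    static String forCurrentThread() {
        return forThread(Thread.currentThread().getId());
    }
}
```

Because every thread id is distinct within a run, each Talend thread ends up registering and reusing its own connection instead of serializing on a shared one.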


The tJava contains only a System.out.println. Since this executor will be called thousands of times every week, project-level logging will be turned on, but we still have the console logs to notify us about errors, which shouldn't happen.
