Adding an auto-increment field to the couchdb PostgreSQL table

We have a data warehouse and some ETL pipelines pulling data from a PostgreSQL database and loading it into the data warehouse.
We would like to be able to perform incremental data loads into our data warehouse, but this is difficult without a timestamp field or an auto-incrementing integer key.
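To make the incremental-load idea concrete, here is a minimal Python sketch of a watermark-based extract, assuming a hypothetical integer `seq` column on the table (the names `seq` and `doc` are illustrative, not part of the actual schema):

```python
# Minimal sketch of an incremental load driven by an integer key.
# The "seq" column is a hypothetical auto-increment field; in a real
# pipeline the rows would come from a query along the lines of:
#   SELECT seq, doc FROM couchdb WHERE seq > %(last_seq)s ORDER BY seq

def extract_incremental(rows, last_seq):
    """Return rows added since the previous run, plus the new watermark."""
    new_rows = [r for r in rows if r["seq"] > last_seq]
    new_watermark = max((r["seq"] for r in new_rows), default=last_seq)
    return new_rows, new_watermark

# First run loads everything; after new data arrives, only the delta.
table = [
    {"seq": 1, "doc": {"_id": "a"}},
    {"seq": 2, "doc": {"_id": "b"}},
]
batch, watermark = extract_incremental(table, last_seq=0)   # full load
table.append({"seq": 3, "doc": {"_id": "c"}})               # new insert
delta, watermark = extract_incremental(table, last_seq=watermark)
```

Without such a key (or a timestamp), every run has to re-scan the whole table, which is exactly the problem described above.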

Actually, in our test and live instances the couchdb table has only one JSON column, doc.

Could we safely add an auto-incrementing column to this table without breaking the couch2pg doc insertions into it?

Could we please get your advice on this?


Hi @bamatic!

Could we safely add an auto-incrementing column to this table without breaking the couch2pg doc insertions into it?

I think that would work fine, although it hasn't been tested at all. It would be easy enough to try it out on a clone of your data and see if it works as expected. It's possible you'll get duplicated data (i.e., multiple versions of the same doc, or even multiple auto-generated IDs for the same version of the same doc), so be sure to filter these out in your ETL layer using the _id and _rev fields.
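To make that filtering concrete, here is a minimal Python sketch of deduplicating rows by `_id` and `_rev` in the ETL layer. It assumes each row is a plain dict carrying CouchDB's `_id` and `_rev` fields, and relies on the fact that a `_rev` string starts with an integer that increases with each edit:

```python
# Hedged sketch: dedupe rows in the ETL layer using _id and _rev.
# Assumes rows are dicts with CouchDB's _id and _rev fields.

def rev_number(doc):
    """CouchDB revs look like '3-abc...'; the leading integer orders edits."""
    return int(doc["_rev"].split("-", 1)[0])

def dedupe(docs):
    """Keep one row per _id: the highest revision wins, and exact
    duplicate (_id, _rev) pairs collapse to a single row."""
    latest = {}
    for doc in docs:
        doc_id = doc["_id"]
        if doc_id not in latest or rev_number(doc) > rev_number(latest[doc_id]):
            latest[doc_id] = doc
    return list(latest.values())

rows = [
    {"_id": "p1", "_rev": "1-aaa", "name": "old"},
    {"_id": "p1", "_rev": "2-bbb", "name": "new"},
    {"_id": "p1", "_rev": "2-bbb", "name": "new"},   # exact duplicate row
    {"_id": "p2", "_rev": "1-ccc", "name": "other"},
]
clean = dedupe(rows)
```

This keeps only the latest version of each doc; if you need the full edit history instead, you would keep every distinct (_id, _rev) pair rather than only the maximum revision.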

Also note that making schema changes like that may make upgrading couch2pg difficult in the future, as the table won't match the expected schema; however, couch2pg is currently in maintenance mode and no further releases are planned.

Could we please get your advice on this?

At Medic we've also hit the same limitation with incremental updates. To get around this, we're in the process of developing an ETL pipeline to completely replace couch2pg, using incremental updates to improve update performance with large datasets. While this is still pre-beta, you can read more about the design, and there will be discussion and maybe even a demo of the new pipeline in the upcoming CHT Round-up call.

Are you interested in being an early adopter for this pipeline? If so we’d be very interested in working with you to get this ready for production use.


If I understand the idea, it is to replace cht-couch2pg with this new pipeline: Logstash to extract the data from CouchDB and pass it to the PostgREST API, which inserts the data into the PostgreSQL database, I guess still in JSON format.
Then dbt to incrementally transform the JSON data into some "analytics-ready" tables in PostgreSQL, instead of refreshing the materialized views that currently extract the data from the PostgreSQL JSON column every 12 hours against the whole dataset rather than incrementally.

Finally, Apache Superset for dashboarding, instead of Klipfolio.
I've saved the date for the CHT Round-up call; I will be reading more on all this and discussing this opportunity with our team.
We'll let you know if we decide to test the new pipeline. Personally, I think it's really cool.
Thanks for all this work!

Yes, you understand the concept perfectly!

The first stage, which does the couchdb → logstash → postgrest → postgresql bit, we're calling cht-sync, and it's the most stable. If all you need is the raw JSON with sequential IDs to hook into your existing ETL, then we could work together to get that going without worrying about dbt or Superset, which are still in development.