Reporting of Data in Postgres Populated from couch2pg (Json to tabular format)

Hi
We managed to replicate the data from couchDb to Postgres using couch2pg, the format of the data is different from data exported from medic portal report, thus difficult to analyze at the moment. We already have data from the case investigation form, which is organized into columns with questions and rows with the data values. We expect data from couch2pg to have both the contact registration data from the contact registration form and case investigation data from the case investigation form.

The couchdb table has one column, doc. Trying to figure out how to convert the data into CSV with clear columns similar to data exported from medic portal report. I have tried reading the file into Python but getting reading errors, will appreciate any insights

couch2p Migration Errors_3

couch2p Migration Errors

Great progress @oyierphil!

The data in the couchdb table is organized in rows of JSON documents. This includes all documents in the medic database in CouchDB. Fields within the documents can be queried using postgres’ JSON operators detailed here.

Views are convenient for accessing form data, where a view for each form type can be defined as

CREATE VIEW formview_myform AS 
(
	SELECT
		doc#>>'{fields,subfield1}' AS field1,
		doc#>>'{contact,subfield2,subsubfield1}' AS field2,
		<other fields>
		to_timestamp((NULLIF(doc->>'reported_date', '')::bigint / 1000)::double precision) AS reported_date
	FROM 
		couchdb	
	WHERE 
		doc->>'form' = 'myform'
);

Data from the view can be queried as

SELECT * FROM formview_myform

Contact data can be found in a default view called contact_metadata, See the view definition here.

To export CSV data from postgres, use this query:

COPY (

	SELECT * FROM formview_myform
	
) TO '/tmp/myform.csv' WITH (FORMAT CSV, HEADER);

We need to push data from CouchDb to PostgreSQL (already done using Couch2pg API).
The current utility of the data is as follows:

  1. Tabular format for use by project team members (Currently downloaded as csv, but json format)
  2. Materialized views for Superset dashboard (Tabular format)

The view runs, but returns null values in some fields, especially from the CIF form.
We did a join operation on couchdb table and raw_contacts view to get the missing fields, but still null result is returned.
I share updated code as below, which runs but doesn’t display all the fields, , which has our test results

SELECT couchdb.doc ->> ‘_id’::text AS uuid,
couchdb.doc ->> ‘_rev’::text AS rev,
couchdb.doc ->> ‘form’::text AS form,
couchdb.doc ->> ‘from’::text AS “from”,
couchdb.doc ->> ‘type’::text AS type1,
couchdb.doc #>> ‘{fields,lab,specimen_id}’::text[] AS specimen_id,
couchdb.doc #>> ‘{fields,lab,testing_lab}’::text[] AS testing_lab,
couchdb.doc #>> ‘{fields,lab,r_specimen_id}’::text[] AS r_specimen_id,
couchdb.doc #>> ‘{fields,lab,date_spec_coll}’::text[] AS date_spec_coll,
couchdb.doc #>> ‘{fields,lab,monotonic_counter}’::text[] AS monotonic_counter,
couchdb.doc #>> ‘{meta,instanceID}’::text[] AS instanceid,
couchdb.doc #>> ‘{inputs,meta,location,lat}’::text[] AS lat,
couchdb.doc #>> ‘{inputs,meta,location,long}’::text[] AS long,
couchdb.doc #>> ‘{inputs,meta,location,error}’::text[] AS error,
couchdb.doc #>> ‘{inputs,meta,location,message}’::text[] AS message,
couchdb.doc #>> ‘{inputs,meta,depracatedID}’::text[] AS depracatedid,
couchdb.doc #>> ‘{user,name}’::text[] AS name1,
couchdb.doc ->> ‘source’::text AS source,
couchdb.doc #>> ‘{contact,_id}’::text[] AS _id,
couchdb.doc #>> ‘{contact,sex}’::text[] AS sex1,
couchdb.doc #>> ‘{contact,name}’::text[] AS name2,
couchdb.doc #>> ‘{contact,role}’::text[] AS role,
couchdb.doc #>> ‘{contact,type}’::text[] AS type2,
couchdb.doc #>> ‘{Vaccine,covdose}’::text[] AS covdose,
couchdb.doc #>> ‘{Vaccine,covax}’::text[] AS covax,
couchdb.doc #>> ‘{Vaccine,vaxboost}’::text[] AS vaxboost,
couchdb.doc #>> ‘{Vaccine,vaxplace}’::text[] AS vaxplace,
couchdb.doc #>> ‘{testing,cov_dose}’::text[] AS cov_dose,
couchdb.doc #>> ‘{testing,tb_test}’::text[] AS tb_test,
couchdb.doc #>> ‘{screening,bi}’::text[] AS bi,
couchdb.doc #>> ‘{screening,tb}’::text[] AS tb,
couchdb.doc #>> ‘{screening,cov}’::text[] AS cov,
couchdb.doc #>> ‘{screening,biscreen}’::text[] AS biscreen,
couchdb.doc #>> ‘{screening,bi_suspect}’::text[] AS bi_suspect,
couchdb.doc #>> ‘{sampling_place,test_place}’::text[] AS test_place,
raw_contacts.doc #>> ‘{meta,created_by_place_uuid}’::text[] AS created_by_place_uuid,
raw_contacts.doc #>> ‘{meta,created_by_person_uuid}’::text[] AS created_by_person_uuid,
raw_contacts.doc ->> ‘type’::text AS type3,
raw_contacts.doc ->> ‘contact’::text AS contact,
raw_contacts.doc ->> ‘gender’::text AS gender,
raw_contacts.doc #>> ‘{parent,_id}’::text[] AS parent_uuid,
raw_contacts.doc #>> ‘{address,nhif}’::text[] AS nhif,
raw_contacts.doc #>> ‘{address,ward}’::text[] AS ward,
raw_contacts.doc #>> ‘{address,phone}’::text[] AS phone,
raw_contacts.doc #>> ‘{address,county}’::text[] AS county,
raw_contacts.doc #>> ‘{address,village}’::text[] AS village,
raw_contacts.doc #>> ‘{address,kin_name}’::text[] AS kin_name,
raw_contacts.doc #>> ‘{address,p_address}’::text[] AS p_address,
raw_contacts.doc #>> ‘{address,residence}’::text[] AS residence,
raw_contacts.doc #>> ‘{address,subcounty}’::text[] AS subcounty,
raw_contacts.doc #>> ‘{address,nationality}’::text[] AS nationality,
raw_contacts.doc #>> ‘{address,alternate_phone}’::text[] AS alternate_phone,
raw_contacts.doc #>> ‘{address,kin_phone_number}’::text[] AS kin_phone_number,
raw_contacts.doc #>> ‘{address,kin_relationship}’::text[] AS kin_relationship,
raw_contacts.doc #>> ‘{address,address_delimiter}’::text[] AS address_delimiter,
raw_contacts.doc #>> ‘{address,link_health_facility}’::text[] AS link_health_facility,
raw_contacts.doc #>> ‘{address,n_next_of_kin_details}’::text[] AS next_of_kin_details,
raw_contacts.doc #>> ‘{medical,m_exam}’::text[] AS m_exam,
raw_contacts.doc #>> ‘{medical,info_tb}’::text[] AS info_tb,
raw_contacts.doc #>> ‘{medical,tbaware}’::text[] AS tbaware,
raw_contacts.doc #>> ‘{medical,info_cov}’::text[] AS info_cov,
raw_contacts.doc #>> ‘{medical,covidaware}’::text[] AS covidaware,
raw_contacts.doc #>> ‘{per_info,sex}’::text[] AS sex2,
raw_contacts.doc #>> ‘{per_info,name}’::text[] AS name3,
raw_contacts.doc #>> ‘{per_info,f_name}’::text[] AS f_name,
raw_contacts.doc #>> ‘{per_info,o_name}’::text[] AS o_name,
raw_contacts.doc #>> ‘{per_info,s_name}’::text[] AS s_name,
raw_contacts.doc #>> ‘{per_info,case_id}’::text[] AS case_id,
raw_contacts.doc #>> ‘{per_info,dob_known}’::text[] AS dob_known,
raw_contacts.doc #>> ‘{per_info,education}’::text[] AS education,
raw_contacts.doc #>> ‘{per_info,dob_approx}’::text[] AS dob_approx,
raw_contacts.doc #>> ‘{per_info,national_id}’::text[] AS national_id,
raw_contacts.doc #>> ‘{per_info,age_in_years}’::text[] AS age_in_years,
raw_contacts.doc #>> ‘{per_info,dob_calendar}’::text[] AS dob_calendar,
raw_contacts.doc #>> ‘{per_info,id_delimiter}’::text[] AS id_delimiter,
raw_contacts.doc #>> ‘{per_info,n_id_numbers}’::text[] AS n_id_numbers,
raw_contacts.doc #>> ‘{per_info,patient_name}’::text[] AS patient_name,
raw_contacts.doc #>> ‘{per_info,p_national_id}’::text[] AS p_nattional_id,
raw_contacts.doc #>> ‘{per_info,marital_status}’::text[] AS marital_status,
raw_contacts.doc #>> ‘{per_info,n_demographics}’::text[] AS n_demographics,
raw_contacts.doc #>> ‘{per_info,ephemeral_years}’::text[] AS ephemeral_years,
raw_contacts.doc #>> ‘{per_info,type_of_identifier}’::text[] AS type_of_identifier,
raw_contacts.doc #>> ‘{per_info,demographics_delimiter}’::text[] AS demographics_delimiter,
raw_contacts.doc ->> ‘place_id’::text AS place_id,
raw_contacts.doc ->> ‘id_number’::text AS id_number,
raw_contacts.doc #>> ‘{consenting,icf}’::text[] AS icf,
raw_contacts.doc #>> ‘{consenting,consent1}’::text[] AS consent1,
raw_contacts.doc #>> ‘{consenting,consent2}’::text[] AS consent2,
raw_contacts.doc #>> ‘{eligibility,c1}’::text[] AS c1,
raw_contacts.doc #>> ‘{eligibility,c5}’::text[] AS c5,
raw_contacts.doc #>> ‘{eligibility,elig}’::text[] AS elig,
raw_contacts.doc #>> ‘{eligibility,p_elig}’::text[] AS p_elig,
raw_contacts.doc #>> ‘{eligibility,eligible}’::text[] AS eligible,
raw_contacts.doc ->> ‘patient_age’::text AS patient_age,
raw_contacts.doc ->> ‘contact_type’::text AS contact_type,
raw_contacts.doc ->> ‘introduction’::text AS introduction,
raw_contacts.doc ->> ‘phone_number’::text AS phone_number,
raw_contacts.doc ->> ‘date_of_birth’::text AS date_of_birth,
raw_contacts.doc ->> ‘reported_date’::text AS reported_date,
raw_contacts.doc ->> ‘relationship_delimiter’::text AS relationship_delimiter,
raw_contacts.doc ->> ‘notes’::text AS notes
FROM couchdb
LEFT JOIN raw_contacts ON (couchdb.doc ->> ‘_id’::text) = (raw_contacts.doc ->> ‘_id’::text);

We created a view to join the two tables and got the error below, indicating issues with custom view refresh:

@jkuester, @diana, please review and advise, we thought we would find all the report data on the couchdb table after using the API, which isn’t the case now

Hi @oyierphil

Checking out the error that you are getting:
Create a unique index with no WHERE clause on one or more columns of the materialized view.

According to Postgres docs, refreshing a view CONCURRENTLY is only allowed for views that have at least one unique index which uses only column names and includes all rows.
Source: PostgreSQL: Documentation: 9.6: REFRESH MATERIALIZED VIEW

Can you try adding a unique index on the view? The uuid (and in combination with the rev field, if you believe you will have duplicates) field would be the best choice.

1 Like

@oyierphil I wanted to highlight some things from the SQL statement you shared.

couchdb vs. raw_contacts

  • couchdb is a table that includes all records from the medic database
  • raw_contacts is a view that get’s it’s data from the couchdb table, but only includes contact docs

This means that raw_contacts is really just a subset of couchdb. Every document in raw_contacts will also be in couchdb.

JOIN-ing Reports with Contacts

The LEFT JOIN you have…

…is joining records from the couchdb table with records having the same _id in the raw_contacts view. Since couchdb includes all the records in raw_contacts, your query is actually only JOINing data when the two records are the same record.

I think what you are trying to do is JOIN data from the report with the person that report was about. To do that, you will probably need to JOIN the report’s doc #>> '{fields,patient_id}' field with the contact’s doc->> '_id' field.

Recommended View Architecture - Reports

Normally we will create one PostgreSQL view for each form we have defined in a project. So if you have a form called new_case and another form called case_investigation, you would have two separate views for those:

  • formview_new_case
  • formview_case_investigation

The WHERE clause for each one of those views (as illustrated in the post above) would filter the specific form, like this:

FROM
	couchdb 

WHERE
	doc->>'form' = 'case_investigation'

Recommended View Architecture - Contacts

Likewise for Contacts, it’s common to have a different view for the different contact_types you have defined in your project. So if you have defined contact_types of “chw” and “supervisor”, you might have two different views called:

  • contactview_chw
  • contactview_supervisor
1 Like

@michael and @diana, thank you for the feedback
I also thought that the couchdb table will have all the reports, thus we would have used this as a data source for Superset, any idea why it doesnt have the CIF data, like in our instance? We only get data for the suspect_registration and NOT CIF, we now have about 195,000 documents

We used the couch2pg API to pull the data, and it looked like the process completed, using the code below:
export POSTGRESQL_URL=postgres://username:pwd@vmIP:5455/testdb
export COUCHDB_URL=https://username:pwd@subdomain:5432/couch-rep
export COUCH2PG_DOC_LIMIT=1000
export COUCH2PG_RETRY_COUNT=5
export COUCH2PG_SLEEP_MINS=120
export COUCH2PG_CHANGES_LIMIT=1000
node .

@oyierphil A couple of things to try.

Search for a known case investigation report

  1. Go to the reports tab and find a case investigation report
  2. Copy the ID from the URL
  3. Search for that ID in PostgreSQL

You can use this SQL to get that record:

SELECT
	jsonb_pretty(doc)
FROM
	couchdb
WHERE
	doc->>'_id' = '<the ID from the URL>'

Get a general overview of quantity of reports in PostgreSQL

If you run this SQL, do you see what you expect?

SELECT
	doc->>'form' AS form,
	count(*)
	
FROM
	COUCHDB
	
WHERE
	doc->>'type' = 'data_record'
	
GROUP BY 
	form

ORDER BY 
	count(*) DESC;

@michael, sorry didn’t get back… I did the query and got form as case_investigation and count as 38 on our test instance.
Trying to run the same for the contact form, suspected_case to see the number of records, our idea now is to create two materialized views for each form, then pick fields of interest into one view which we will use as data source for Superset

Great, thanks for the update @oyierphil . Feel free to respond here if you want us to review anything else, or if you want some more advice on architecting views in PostgreSQL.

@michael, we managed to create two materialized views, from where we created one view pulling fields of interest to Superset, which has worked.
Currently figuring out aggregating data collected daily using date of sample collection into weekly or monthly summaries, which will allow us plot a bar graph for trend analysis per week or per month

2 Likes