5.3 Using ODBC to retrieve data from DHIS2 into R

In this example, we will use a system-wide ODBC connector which will be used to retrieve data from the DHIS2 database. There are some disadvantages with this approach, as ODBC is slower than other methods and it does raise some security concerns by providing a system-wide connector to all users. However, it is a convenient method to provide a connection to multiple users. The use of the R package RODBC will be used in this case. Other alternatives would be the use of the RPostgreSQL package, which can interface directly through the PostgreSQL driver described in the next section.

Assuming you have already installed R from the procedure in the previous section. Invoke the following command to add the required libraries for this example.

apt-get install r-cran-rodbc r-cran-lattice odbc-postgresql

Next, we need to configure the ODBC connection. Edit the file to suit your local situation using the following template as a guide. Lets create and edit a file called odbc.ini

[dhis2]
Description         = DHIS2 Database
Driver              = /usr/lib/odbc/psqlodbcw.so
Trace               = No
TraceFile           = /tmp/sql.log
Database            = dhis2
Servername          = 127.0.0.1
UserName            = postgres
Password            = SomethingSecure
Port                = 5432
Protocol            = 9.0
ReadOnly            = Yes
RowVersioning       = No
ShowSystemTables    = No
ShowOidColumn       = No
FakeOidIndex        = No
ConnSettings        =
Debug = 0

Finally, we need to install the ODBC connection with odbcinst -i -d -f odbc.ini

From the R prompt, execute the following commands to connect to the DHIS2 database.

> library(RODBC)
> channel<-odbcConnect("dhis2")#Note that the name must match the ODBC connector name
> sqlTest<-c("SELECT dataelementid, name FROM dataelement LIMIT 10;")
> sqlQuery(channel,sqlTest)
                                                                        name
1   OPD First Attendances Under 5
2   OPD First Attendances Over 5
3   Deaths Anaemia Under 5 Years
4   Deaths Clinical Case of Malaria Under 5 Years
5   Inpatient discharges under 5
6   Inpatient Under 5 Admissions
7   Number ITNs
8   OPD 1st Attendance Clinical Case of Malaria Under 5
9  IP Discharge Clinical Case of Malaria Under 5 Years
10 Deaths of malaria case provided with anti-malarial treatment 1 to 5 Years
>

It seems R is able to retrieve data from the DHIS2 database.

As an illustrative example, lets say we have been asked to calculate the relative percentage of OPD male and female under 5 attendances for the last twelve months.First, lets create an SQL query which will provide us the basic information which will be required.

OPD<-sqlQuery(channel,"SELECT p.startdate, de.name as de, sum(dv.value::double precision)
 FROM datavalue dv
 INNER JOIN period p on dv.periodid = p.periodid
 INNER JOIN dataelement de on dv.dataelementid = de.dataelementid
 WHERE p.startdate >= '2011-01-01'
 and p.enddate <= '2011-12-31'
 and de.name ~*('Attendance OPD')
 GROUP BY p.startdate, de.name;")

We have stored the result of the SQL query in an R data frame called “OPD”. Lets take a look at what the data looks like.

> head(OPD)
   startdate                                 de    sum
1 2011-12-01   Attendance OPD <12 months female  42557
2 2011-02-01   Attendance OPD <12 months female 127485
3 2011-01-01   Attendance OPD 12-59 months male 200734
4 2011-04-01   Attendance OPD 12-59 months male 222649
5 2011-06-01   Attendance OPD 12-59 months male 168896
6 2011-03-01 Attendance OPD 12-59 months female 268141
> unique(OPD$de)
[1] Attendance OPD <12 months female   Attendance OPD 12-59 months male  
[3] Attendance OPD 12-59 months female Attendance OPD >5 years male      
[5] Attendance OPD <12 months male     Attendance OPD >5 years female    
6 Levels: Attendance OPD 12-59 months female ... Attendance OPD >5 years male
> 
          

We can see that we need to aggregate the two age groups (< 12 months and 12-59 months) into a single variable, based on the gender. Lets reshape the data into a crosstabulated table to make this easier to visualize and calculate the summaries.

>OPD.ct<-cast(OPD,startdate ~ de) 
>colnames(OPD.ct)
[1] "startdate"                          "Attendance OPD 12-59 months female"
[3] "Attendance OPD 12-59 months male"   "Attendance OPD <12 months female"  
[5] "Attendance OPD <12 months male"     "Attendance OPD >5 years female"    
[7] "Attendance OPD >5 years male" 

We have reshaped the data so that the data elements are individual columns. It looks like we need to aggregate the second and fourth columns together to get the under 5 female attendance, and then the third and fifth columns to get the male under 5 attendance.After this, lets subset the data into a new data frame just to get the required information and display the results.

> OPD.ct$OPDUnder5Female<-OPD.ct[,2]+OPD.ct[,4]#Females
> OPD.ct$OPDUnder5Male<-OPD.ct[,3]+OPD.ct[,5]#males
> OPD.ct.summary<-OPD.ct[,c(1,8,9)]#new summary data frame
>OPD.ct.summary$FemalePercent<-
OPD.ct.summary$OPDUnder5Female/
(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100#Females
>OPD.ct.summary$MalePercent<-
OPD.ct.summary$OPDUnder5Male/
(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100#Males 

Of course, this could be accomplished much more elegantly, but for the purpose of the illustration, this code is rather verbose.Finally, lets display the required information.

> OPD.ct.summary[,c(1,4,5)]
    startdate FemalePercent MalePercent
1  2011-01-01      51.13360    48.86640
2  2011-02-01      51.49154    48.50846
3  2011-03-01      51.55651    48.44349
4  2011-04-01      51.19867    48.80133
5  2011-05-01      51.29902    48.70098
6  2011-06-01      51.66519    48.33481
7  2011-07-01      51.68762    48.31238
8  2011-08-01      51.49467    48.50533
9  2011-09-01      51.20394    48.79606
10 2011-10-01      51.34465    48.65535
11 2011-11-01      51.42526    48.57474
12 2011-12-01      50.68933    49.31067

We can see that the male and female attendances are very similar for each month of the year, with seemingly higher male attendance relative to female attendance in the month of December.

In this example, we showed how to retrieve data from the DHIS2 database and manipulate in with some simple R commands. The basic pattern for using DHIS2 and R together, will be the retrieval of data from the DHIS2 database with an SQL query into an R data frame, followed by whatever routines (statistical analysis, plotting, etc) which may be required.