By Deborah Cristina Vera-Cruz, @DeborahVeraCruz
I have just finished an amazing 6-week volunteer program with the Innovation Unit at NYHQ. Working with the team and contributing to the great work that is being done, bringing innovative solutions to those most in need, has been an enriching and rewarding experience.
The purpose of the volunteer assignment was to support ongoing Big Data research activities and to propose future developments that maximize the potential of RapidPro, and specifically of U-Report, by using different data sources. U-Report is an SMS-based platform for youth engagement, born in UNICEF’s Innovation Labs, that is gaining momentum and acceptance worldwide.
Nevertheless, to explore and exploit its full potential, a sandbox infrastructure is needed that provides easy access to U-Report data and allows it to be combined with complementary data, such as classic statistics, maps and new digital trails like Twitter.
Big Data analytics has opened the door to transform a vast quantity of digital trails into information and knowledge about the underlying communities generating those trails. The importance of this type of analysis for humanitarian response is being acknowledged by the efforts of the UN system on the so-called Data Revolution.
First steps – RapidPro class diagram
As part of the proposed work, understanding the database model was essential. Therefore, it was necessary to comprehend how the RapidPro system is built and how it is operated in order to make suggestions regarding data storage and management.
Based on the Python packages with class definitions, available in the project's GitHub repository, the database model was translated into a diagram in which all the entities, attributes and relationships were mapped.
The motivation for studying RapidPro’s database model was to know how the user (Contact) data for U-Report is stored in the system. One of the main findings was that, due to RapidPro’s flexibility, which enables each application to create the fields that are necessary for the specific context, each country potentially has different fields to represent the same data.
U-Report User Information
After Manuel and I studied the database model, namely the entities and relationships that represent the U-Reporter in the system, we realized that the variables describing the user are stored in a key-value format and are only accessible, country by country, through the API provided by RapidPro.
Therefore, we developed a simple program in Python (based on the RapidPro Python client documentation) that opens a txt file, reads the name of each country and its API token, connects to each country's database, and obtains the user identification fields.
With the program, and based on the data available from a sample of 6 countries (Mali, Burundi, Cameroon, Zimbabwe, Sierra Leone and Central African Republic), we were able to identify two categories of information, Identification (age, gender, occupation) and Location (different administrative levels), and to create a set of best-practice recommendations for variable naming.
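The program itself is not reproduced in this post, so the following is only a minimal sketch of the idea under some assumptions: the txt file is assumed to hold one "country,token" pair per line, and `fetch_fields` is a hypothetical stand-in for whatever RapidPro client call returns the contact field keys for a given token.

```python
from typing import Callable, Dict, Iterable, List


def parse_token_file(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'country,token' lines (assumed format) into {country: token}."""
    tokens: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        country, token = line.split(",", 1)
        tokens[country.strip()] = token.strip()
    return tokens


def collect_field_keys(
    tokens: Dict[str, str],
    fetch_fields: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return the sorted contact field keys for each country.

    fetch_fields is hypothetical: it takes an API token and returns the
    list of contact field keys defined in that country's workspace.
    """
    return {
        country: sorted(fetch_fields(token))
        for country, token in tokens.items()
    }
```

In practice, `parse_token_file` would be called on an open file object, and `fetch_fields` would wrap the real API client; passing it in as a function keeps the file handling testable without network access.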
To analyze U-Report as a whole, it is necessary to aggregate data about U-Reporters from all the countries where the program is running. Unless the identification variables (which collect the data) are created with the same name everywhere, it is extremely complicated to obtain the aggregated data quickly and efficiently, especially considering the rapidly growing number of countries and the different platforms that U-Report uses.
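To illustrate why inconsistent names make aggregation harder, here is a small sketch of the kind of mapping that becomes necessary when each country names its fields differently. The alias table is purely hypothetical and does not reflect the actual field names found in any country.

```python
from typing import Dict, Optional

# Hypothetical aliases: canonical name -> names a country might have used.
CANONICAL_ALIASES = {
    "age": {"age"},
    "gender": {"gender", "sex"},
    "occupation": {"occupation", "job"},
    "district": {"district", "county"},
}


def canonical_name(field_key: str) -> Optional[str]:
    """Map a country-specific field key to its canonical name, if known."""
    key = field_key.strip().lower()
    for canonical, aliases in CANONICAL_ALIASES.items():
        if key in aliases:
            return canonical
    return None


def harmonize(contact_fields: Dict[str, str]) -> Dict[str, str]:
    """Keep only recognized fields, renamed to their canonical names."""
    harmonized = {}
    for key, value in contact_fields.items():
        name = canonical_name(key)
        if name is not None:
            harmonized[name] = value
    return harmonized
```

With shared naming conventions, a table like `CANONICAL_ALIASES` would not have to be maintained by hand for every new country.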
Opportunities for next steps – Making use of Big Data sources
U-Report has been launched in 15 countries, and the number is growing. Most of these countries use the SMS version; however, some are starting to use Twitter as well, taking advantage of the growing Internet-connected community and the benefits that come with it, such as lower cost.
With Twitter data there is a huge opportunity to identify, store and extract useful information from the social network of U-Reporters as well as the dynamics of geo-located tweets.
With tweets coming from different locations at different times of the day, it is possible to dynamically map the relationship between a user and a location and to label that location as home, school, work, etc. Mapping the movement of users can also reveal trends when the aggregated data is analyzed.
Through the social network of U-Reporters, by analyzing their groups of friends and the subjects they tweet about, we can identify the ones who are potentially most influential for a campaign, and we can also identify potential new U-Reporters based on the friends that existing users have in common.
Because this is potentially a large amount of data (all relationships among U-Reporters and other Twitter users), we suggest using a non-relational database such as the open source Elasticsearch, which not only allows retrieving data from different sources such as Twitter streams, but also searching, analyzing, and visualizing it in real time.
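As a rough illustration of this idea, the sketch below labels a user's most frequent location per time bucket. The hour thresholds, the bucket names and the representation of a place as a plain string (for example a rounded coordinate or a grid cell ID) are all assumptions made for the example, not part of any existing U-Report tooling.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def time_bucket(ts: datetime) -> str:
    """Assign a timestamp to a coarse bucket (thresholds are assumptions)."""
    if ts.weekday() < 5 and 9 <= ts.hour < 17:
        return "daytime"  # likely school or work
    if ts.hour >= 21 or ts.hour < 6:
        return "home"  # late evening and night
    return "other"


def label_places(tweets: Iterable[Tuple[datetime, str]]) -> Dict[str, str]:
    """Return the most frequent place seen in each time bucket.

    Each tweet is a (local timestamp, place id) pair for a single user.
    """
    counts = defaultdict(Counter)
    for ts, place in tweets:
        counts[time_bucket(ts)][place] += 1
    return {
        bucket: places.most_common(1)[0][0]
        for bucket, places in counts.items()
    }
```

A real analysis would need local time zones, many more observations per user, and careful handling of privacy, but the counting step would look similar.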
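One simple way to approach the "common friends" idea is to count, for every Twitter account that is not yet a U-Reporter, how many existing U-Reporters have it as a friend. The sketch below assumes the friend lists have already been collected into a dictionary; the function name and the `min_common` threshold are illustrative choices.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def suggest_candidates(
    friends_of: Dict[str, Set[str]],
    min_common: int = 2,
) -> List[Tuple[str, int]]:
    """Rank non-U-Reporter accounts by how many U-Reporters know them.

    friends_of maps each existing U-Reporter to the set of accounts they
    are friends with. Sets ensure each friendship is counted only once.
    """
    reporters = set(friends_of)
    counts: Counter = Counter()
    for friends in friends_of.values():
        for account in friends:
            if account not in reporters:
                counts[account] += 1
    # most_common() sorts by count, highest first.
    return [(acc, n) for acc, n in counts.most_common() if n >= min_common]
```

Accounts that many U-Reporters are connected to would be natural first candidates for an invitation; the same counts restricted to existing U-Reporters could serve as a crude first measure of influence.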
It is clear that the possibilities are endless if the necessary data is stored and analyzed in the most appropriate way. It may take some time and effort, but the outcome will make it worthwhile... so, let's get to work!
I could not have asked for a better way to complete this Fulbright Fellowship Program. After months of academic activities, I was able to put into practice what I learned and finally combine my expertise in IT with my passion: doing good, bringing change in innovative ways and taking steps towards making this world a better place.