#rstatsnyc 2019
The fifth annual (and second for me) Rstats NYC conference happened on May 10th and 11th. You can see all the action on Twitter under the #rstatsnyc
tag. Here are my key takeaways.
Data Science Education at Amplify
Ludmila Janda, a member of my team, presented on the work she is doing on a new series of computer science lessons. Alongside a team of curriculum developers, designers, engineers, and project managers, she has been the consulting subject matter expert for a unit about data analysis. They have been adapting the Scratch block programming system to handle data manipulating and graphing. She presented the design process she has gone through with the team to bring the tidy data and grammar of graphics approaches into the Scratch system. I think we’ve taken David Robinson’s “teach the tidyverse first” advice to the extreme! If the project survives the piloting phase, it has the potential to introduce these concepts to millions of students.
Crossing the Valley of Heartbreak
The image below is a slide from Noam Ross’s debut rollout of the redoc
package. Noam was describing the effort of working in two worlds at once: program outputs that are edited by hand. Ideally our programs would generate awesome editable docs, and those edits could flow back to our programs. The fact that this doesn’t happen leads to much heartbreak.
I think this image could serve as the banner for all my favorite talks at the conference. Our team of data scientists at Amplify is fairly small and scrappy, and I’d say we’re all at our best when we’re generalists—working across all areas of the company within each person’s product group. Every stakeholder brings their own challenges and limitations. Business and product can’t program, engineering doesn’t understand our data analysis stack, data engineering doesn’t have all the business context. Each of these forays into a different functional team is a crossing of the valley of heartbreak.
At the conference there were a few talks that were, in a way, about building bridges across this valley of heartbreak.
Bringing R to our business partners. Can
redoc
be the bridge to non-programming collaborators? This is Noam’s pitch and the reversibility of Word documents is almost too good to be true. I can’t wait to try this out on, for example, the next white paper my group produces.Bringing R to our data engineers. Amanda Dobbyn spoke about
drake
, a data analyis workflow package. Candrake
be the bridge to prototyping pipelines with data engineering? Our data engineering team uses Matillion for ETL, and data science uses a mix of SQL, R, and lookML. When data science has a transformation we’d like stood up in Matillion, the pieces of code and documentation that data engineering needs is spread across different places. I’m wondering ifdrake
can’t be a central place to tie this all together, enabling data engineering to visualize and understand the new business logic that data science is asking them to implement in their systems. Data engineering will be adopting one of my bespoke data transformation pipelines over the summer and I think I’m going to givedrake
a go.Bringing R to our customers. There were a few talks that addressed moving from a reproducible-but-still-ad-hoc analysis to production software. Emily Dodwell discussed the process her team went through in developing a distributed workflow for the implementation of their analyses. And the dynamic duo of Jacqueline Nolis and Heather Nolis spoke about how to run R in production in Docker containers. At Amplify, we already have R code running in production (using Docker) serving data to teachers. Still, I’m thinking of what else we can do, and since it’s so easy, if there aren’t also internal tools we can make.
Bring R to our other Python workflows. The Apache Arrow project (and the
arrow
andfeather
packages) that Wes Mikinney spoke about is all about being a bridge between languages and tools. Arrow is a common data layer that has libraries across multiple languages to support easy and fast data exchange. We usefeather
to pass data between Python and R, and even between R processes because it’s so fast. I am very much looking forward to thefeather
file format benefiting from an enhanced Arrow backend.
RLadies representation
The majority of speakers were women, and the #rladiesnyc group played a large role in helping that happen. This is yet another example of the way that R users are leading the charge on diversity and inclusion in the tech world. I couldn’t be happier to support and be a part of such a community.
Lightning round of other random thoughts: Jared Lander is an excellent host; the food was delicious and very New York; the venue was too small but here’s hoping they’re able to secure an bigger venue next year without increasing the registration fee; I really really enjoyed the single-track format. What an excellent conference.
— Samuel