Performing Dimension Reduction at Scale with Applications to Public Sentiment Models

We discuss our experience with dimension reduction for large datasets. We examine how the performance of our public sentiment models degrades, in a controlled way, under transformations that reduce the number of features in the dataset. Reducing the feature count speeds up our real-time data science tools and helps counter the curse of dimensionality. We outline the Python workflow that produces these transformations and validates their quality at scale in the AWS ecosystem, and we detail our programming and design choices, touching on the scikit-learn API, configuration versus code, SQL templatization, and our open source API client.
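
As a rough illustration of the kind of check described above, the sketch below compares a model trained on all features with one trained on a reduced feature set, using scikit-learn's pipeline API. It is not the workflow from the talk: the synthetic data, the choice of PCA as the reducing transformation, the logistic regression model, and the target of 30 components are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide feature matrix (assumption: not the talk's data).
X, y = make_classification(
    n_samples=2000, n_features=200, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: the model sees all 200 features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Reduced: PCA (an assumed choice of transformation) keeps 30 components.
reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, random_state=0),
    LogisticRegression(max_iter=1000),
)

# Fit both pipelines and score them on the same held-out split.
baseline_acc = baseline.fit(X_train, y_train).score(X_test, y_test)
reduced_acc = reduced.fit(X_train, y_train).score(X_test, y_test)
n_reduced = reduced.named_steps["pca"].n_components_

print(f"features: {X.shape[1]} -> {n_reduced}")
print(f"held-out accuracy: {baseline_acc:.3f} -> {reduced_acc:.3f}")
```

Because both variants are ordinary scikit-learn pipelines, the reducing step can be swapped for another transformer without changing the comparison code, which is one reason the pipeline interface suits this kind of validation.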

Speaker: Walt Askew, Civis Analytics