Call for Papers

Small data are the digital traces that individuals generate as a byproduct of daily activities, such as sending e-mail or exercising with fitness trackers. Small data production is poised to explode in richness and variety as users’ online activities expand and wearable and mobile devices proliferate. The transformations brought to large organizations by big data promise to be mirrored in innovations in wellness, productivity and interaction at a personal scale driven by small data. Realizing the promise of small data will require experimental research in data analytics and modeling, security and privacy, and user experience design. On June 15-16, 2015, we will hold a small, invitation-only NSF workshop to explore the research opportunities, challenges and infrastructure requirements posed by small data.

We invite short 2-page papers that describe specific technical challenges and opportunities around small data, in particular:

  1. parsing, fusion, and modeling of diverse and noisy data streams to create informative and predictive models from personal data;
  2. security and privacy for personal-data streams and applications; and
  3. user experience design for novel interactions based on small data.

Each selected workshop participant will be required to review the technical description of a proposed infrastructure (sdX). The workshop itself will be a participatory working meeting to further define technical designs and priorities for sdX. In particular:

  • All invited attendees will be asked to submit their reactions to a proposed set of research infrastructure capabilities, including their potential research needs and contributions to such a research infrastructure.
  • At the workshop we will cluster participants into themed sessions and capture participants’ 2-page papers, presentations, and discussions in a synthesized report of research infrastructure functionality, priorities, and challenges. Our intention is that the workshop will spawn a committed research community that will continue to participate in designing, creating, and using the research infrastructure.
  • On the second day of the workshop we will present a description of the community-defined infrastructure and proposed research to an industry panel for feedback on the opportunities and challenges that they anticipate in the creation and use of the research infrastructure.
  • We will synthesize our larger workshop findings and prepare a revised document outlining proposed capabilities, challenges and tradeoffs. This revised document will be posted on the web for public distribution and comment.

Objective of the workshop

The fusion of multiple personal data streams promises to fuel many novel applications, just as GPS data spawned concepts like geocaching and check-ins, yet exploring such opportunities is often infeasible for academia and industry alike. Currently, CISE research groups struggle to access study populations and personal data streams large and diverse enough for meaningful and informative research. For commercial organizations, data-source fragmentation (e.g., a multiplicity of apps) and the risks of public and regulatory backlash pose barriers to individual-scale mining of broadly harvested personal data. The purpose of this workshop is to articulate the utility, requirements, and specifications for a shared research infrastructure to support small data experimentation, sdX ("small data experimentation").

The envisioned living infrastructure would be designed to remove the barriers that CISE researchers face in performing rich, interactive, and long-lived research with small data. It would enable studies with the thousands of participants required to make generalizable claims about target users and operating environments. In particular, sdX would:

  • allow recruited study participants to share a broad spectrum of personal data streams easily, securely, and selectively with researchers.
  • streamline and automate processes that today are unworkable at scale.
  • include tools for participant recruitment and on-boarding, study and data management, data analytics, and data privacy and integrity.
  • follow best practices of IRB-approved participant recruitment and informed consent to protect participants and their data.

As an ambitious research platform for long-lived studies with thousands of users, sdX would entail sustained professional development, evaluation, and refinement of software infrastructure and tools, an effort well beyond the scope of traditional research funding programs. sdX would aim to achieve deep economies of scale by providing replicable, stable, and reusable infrastructure and tools for the CISE research community and beyond. sdX would also evolve and improve over time as the community innovates around and enriches its core functionality.

We invite brief 2-page descriptions of technical mechanisms and user-facing applications that would use sdX for prototype development and pilot evaluation.

Background

Small data are the digital traces that individuals generate as a byproduct of their daily activities, such as sending e-mail, texting or talking via phone, buying groceries or take-out, going to work on foot or by car, watching TV shows or movies at home, or playing games on mobile devices or game consoles. Small data production is poised for an explosion in richness and variety as consumers acquire wearable devices, as mobile device sensors improve, and as our online lives generate increasingly comprehensive ambient personal data.

Today, service providers use small data to target advertisements, recommend products and optimize system performance. In the future, the promise of transformative advances in wellness and personal productivity will drive interest in letting individuals benefit from their own data directly. Individuals will then be empowered to gain insights into their own behavior, personalize their own care, motivate achievement of their own goals, and broadly improve their own quality of life.

Commercial services are emerging to exploit these opportunities. At the same time, researchers have articulated both broad visions and specific motivating examples of services that rely on access to users’ diverse small data streams [Estrin14, Campbell14, Wang14, Westeyn11, Kientz08, Hong09, Baumer12, Consolvo08, Ma]. Unlocking the potential of small data, though, will require new analytical and system techniques to transform small data into meaningful and trustworthy end-user experiences. In particular, innovations are needed in user modeling and personal data mining across multiple data streams, in security and privacy mechanisms, and in user experience design.

We argue that critical research needed to make progress is being held back by the lack of a community research infrastructure (CRI) to support sustained, naturalistic, multi-data-stream, small-data experiments. The three primary existing contexts for small data experimentation all fall short of providing researchers with this needed infrastructure:

  • Large commercial organizations: Major consumer service providers such as Facebook, Google, Apple, Samsung, and Microsoft have access to extensive repositories of small data. However, their commercial terms of service and internal policies prohibit many forms of experimentation and are tightening due to post-Snowden consumer sensitivities and backlash against episodes such as Facebook’s "mood" experimentation. Examples include Microsoft changing its terms of use in 2014 to exclude use of personal communications and data in targeting ads and Apple taking a strong position restricting commercial use of HealthKit data.
  • Mobile devices: Mobile apps and wearable devices can collect specific application data at scale, particularly those that make users’ data available to them through an authenticated API (e.g., Fitbit, Foursquare). However, mobile apps and wearable devices do not provide access to many of the most interesting data sources that reside outside of mobile ecosystems (e.g., consumer transactions, game-console activity, media viewing, email communications, and search).
  • Existing academic experiments: Participatory research studies and controlled experiments are well established in non-technical fields such as public health, social psychology, and medical research, and broadly practiced at smaller scales in the fields of HCI and Ubicomp. However, when academic researchers need ongoing, automated, and secure access to dozens of data streams for each research participant, they struggle with the logistical overhead required to perform experiments at scales beyond tens of individuals and tens of days. In all but rare cases, the resources required to engage users securely in meaningful, bold, and careful experimentation are simply too great.

The ubiquity of social media and mobile app use has caused a profound and problematic gap between experimental research and reality. Major industry advances in such areas as search, statistical machine translation, and spam filtering have arisen from massive growth in the aggregate data volume harvested by service providers, underscoring the transformative potential of increased data availability. However, aggregated and sanitized data sets do not provide the same utility as access to the original small data. Furthermore, many important concepts do not scale down to the small scale of typical academic experiments. Fundamental advances in small data science would thus require experiments that encompass diverse participants and their small data streams, capture participants’ everyday activities over extended durations from different channels, and leverage fine-grained control and analytics to achieve detailed, systematic, sustained, and iterative experimentation and evaluation.

An envisioned small data eXperimentation (sdX) infrastructure would provide researchers with a data software infrastructure that supports naturalistic, interactive, multi-stream studies for extended durations with user populations of meaningful size; it would provide unprecedented visibility into user behaviors and opportunities for in-depth research. sdX would enable multifaceted interactive experiments with hundreds to thousands of participants, running continuously for months to years. This meso-scale was chosen because it is (a) large enough to capture statistically significant results from field experiments and quasi-experiments with targeted populations, and (b) modest enough for researchers to manage the detail of highly instrumented, intimate, and repeated experiments with feedback from real users. sdX would also provide commercial value, catalyzing innovation and readying small data applications for the marketplace.

Some of the key enabling features of sdX would be: (1) Robust small-data capture modules that connect participants and their small data with sdX (e.g., browser plugins, email parsers, mobile apps, authorizations to commercial APIs) and tools to facilitate onboarding of study participants; (2) Native security and privacy for user and data management using HIPAA-compliant cloud services, strong user authentication, secure channels for device-to-cloud data transfer, and privacy filters and audit trails on researcher access to user data; (3) Small-data analytics modules for preprocessing, filtering, and fusing small data streams; and (4) A Research Interface (portal) (Rx) for configuring, iterating, and reusing experimental designs and tools, and also for accessing, managing, and adjusting experiments and experimental analytics.

sdX would thus provide end-to-end support for the diverse, meso-scale, cross-system, interactive, and sustained experiments needed by the small-data research community, while enabling more personally engaging and intimate experiments than large commercial platforms permit. By creating a standard, modular, open API and open source (not open data) platform and facilitating reuse of experimental designs and tools, sdX would achieve community-wide economies of scale, bringing new small-data experimental capabilities and meso-scale experimentation affordably within the reach of research groups of all sizes. sdX would support research and stimulate infrastructure innovations that transcend sdX and apply to commercial platforms, big data, and beyond. Far more than most CISE research infrastructure projects and testbeds, sdX would aggressively blend systems, devices, and people.

Workshop Committee

Organizing Committee: Deborah Estrin, Ari Juels, JP Pollak, Cornell NYC Tech

Tentative Program Committee (pending confirmation):

  • Gregory Abowd, Georgia Tech
  • AJ Brush, MSR
  • Jeff Burke, UCLA
  • Andrew Campbell, Dartmouth
  • Tanzeem Choudhury, Cornell
  • Beki Grinter, Georgia Tech
  • Julie Kientz, University of Washington
  • Pedja Klasnja, University of Michigan
  • Yoshi Kohno, University of Washington
  • James Landay, Stanford University
  • Ratul Mahajan, MSR
  • Jennifer Mankoff, CMU
  • Amelie Marian, Rutgers
  • Katie Shilton, University of Maryland
  • Vitaly Shmatikov, UT Austin, Visiting Cornell Tech

Travel funding and logistics

Travel funds of up to $700 will be available for each of two presenters per paper, on a reimbursement basis. Preference will be given to presenter pairs that include students. At least one author per paper must commit to attending and presenting at the workshop before final selection of the paper. All travel arrangements, including visa and other requirements, will be the responsibility of the presenter. We will provide hotel and other local information at the appropriate time.

Funding for this workshop comes from the National Science Foundation. Additional support is provided by Cornell Tech.

The workshop will be held on the interim campus of Cornell Tech, 111 8th Avenue, Suite 302, New York, New York.

For additional information, please contact destrin@cs.cornell.edu.

Submission Instructions

March 15, 2015: Submit to destrin@cs.cornell.edu a 2-page PDF document describing specific research project(s) in the area of small data techniques and applications. Your document should include your name, institutional affiliation, research webpage URL, and email address. Please do not include any proprietary information in the submission, so that it can be readily shared with other participants and on the webpages for the workshop.

May 15, 2015: Communication of workshop presentation decisions to authors, along with a link to the workshop working document for shared comments.

June 1, 2015: Each attendee is required to comment on the shared working document that we will use to frame the workshop discussion.

June 15-16, 2015: Workshop in New York City. The workshop will begin on the morning of June 15 and end at midday on June 16.

Submissions may have multiple authors, but only one or two of the authors will be invited to attend due to space and budget limitations.