Calendar
During the fall 2018 semester, the Computational Social Science (CSS) and the Computational Sciences and Informatics (CSI) Programs have merged their seminar/colloquium series, in which students, faculty, and guest speakers present their latest research. These seminars are free and open to the public. The series takes place on Fridays from 3:00 to 4:30 p.m. in the Center for Social Complexity Suite, located on the third floor of Research Hall.
If you would like to join the seminar mailing list, please email Karen Underwood.
COLLOQUIUM ON COMPUTATIONAL SCIENCES AND INFORMATICS
Eduardo Lopez, Assistant Professor
Department of Computational and Data Sciences
George Mason University
A Network Theory of Inter-Firm Labor Flows
Monday, March 5, 4:30-5:45
Exploratory Hall, Room 3301
Abstract: Using detailed administrative microdata for two countries, we build a modeling framework that yields new explanations for the origin of firm sizes, the firm contributions to unemployment, and the job-to-job mobility of workers between firms. Firms are organized as nodes in networks where connections represent low mobility barriers for workers. These labor flow networks are determined empirically and serve as the substrate in which workers transition between jobs. We show that highly skewed firm size distributions are a direct consequence of the connectivity of firms. Further, our model permits the reconceptualization of unemployment as a local phenomenon, induced by individual firms, leading to the notion of firm-specific unemployment, which is also highly skewed. In coupling the study of job mobility and firm dynamics, the model provides a new analytical tool for industrial organization and may make it possible to synthesize more targeted policies for managing job mobility.
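To make the mechanism concrete, the sketch below simulates workers moving between firms along the edges of a labor flow network, with firm "size" emerging as the worker count at each node. It is an illustrative toy, not the speaker's model; the graph type and job-switch probability are assumptions.

```python
# Toy labor flow network: workers hop between connected firms; the skew of the
# resulting firm-size distribution reflects the network's connectivity.
import random
from collections import Counter

import networkx as nx

random.seed(42)

n_firms, n_workers, n_steps = 500, 5000, 200
G = nx.barabasi_albert_graph(n_firms, 2)            # assumed heterogeneous connectivity
location = [random.randrange(n_firms) for _ in range(n_workers)]

for _ in range(n_steps):
    for w in range(n_workers):
        if random.random() < 0.1:                   # assumed job-switch probability
            neighbors = list(G[location[w]])
            if neighbors:
                location[w] = random.choice(neighbors)

sizes = Counter(location)                           # workers per firm
print(sorted(sizes.values(), reverse=True)[:10])    # a few firms dominate
```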
COLLOQUIUM ON COMPUTATIONAL SCIENCES AND INFORMATICS
James Glasbrenner, Assistant Professor
Department of Computational and Data Sciences
George Mason University
Using data science and materials simulations to control the corkscrew magnetism of MnAu₂
Monday, March 19, 4:30-5:45
Exploratory Hall, Room 3301
Materials occupy a foundational role in our society, from the silicon-based chips in our smartphones to the metals used to manufacture automobiles and construct buildings. The sheer variety in materials properties enables this wide range of use, and studying the atoms that bond together to form solids reveals the microscopic origin behind these properties. Remarkably, many properties can be traced to the behavior of and interaction between electrons, and computational simulations such as density functional theory calculations are used to study the features and macroscopic effects of this electronic structure. This computational approach can be further enhanced through recent advances in data science, which provide powerful tools and methods for analyzing and modeling data and for handling and storing large datasets.
In this talk, I will: 1) introduce the basic concepts of computational materials science and density functional theory in an accessible manner, and 2) present calculations on the material MnAu₂, where I use density functional theory and modeling to analyze its magnetic properties. The MnAu₂ structure is layered, and its magnetic ground state forms a noncollinear corkscrew that rotates approximately 50° between neighboring manganese layers. Using the results of my calculations, I will explain the electronic origin of this corkscrew state and how to control its angle using external pressure and chemical substitution. In addition to discussing the electron physics, I will place particular emphasis on the connection to data science and on how modeling was used to analyze and interpret the density functional theory calculations. This will include a new, critical reexamination of my model fitting procedure using cross-validation and feature selection techniques, which will formally test the underlying assumptions I made in the original study.
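As a rough illustration of the cross-validation and feature-selection step mentioned above, the sketch below scores every subset of candidate descriptors by cross-validated fit quality. The feature names and data are hypothetical placeholders, not the actual MnAu₂ results.

```python
# Exhaustive cross-validated feature selection for a small linear model
# (illustrative only; features and data are synthetic stand-ins).
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 40
X_all = rng.normal(size=(n, 4))                                   # candidate descriptors
y = 2.0 * X_all[:, 0] - 1.0 * X_all[:, 1] + 0.1 * rng.normal(size=n)  # target (e.g. corkscrew angle)
features = ["pressure", "doping", "J1", "J2"]                     # assumed descriptor names

best = None
for k in range(1, len(features) + 1):
    for subset in itertools.combinations(range(len(features)), k):
        score = cross_val_score(LinearRegression(), X_all[:, subset], y, cv=5).mean()
        if best is None or score > best[0]:
            best = (score, [features[i] for i in subset])

print("best CV R^2: %.3f with features %s" % best)
```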
COLLOQUIUM ON COMPUTATIONAL SCIENCES AND INFORMATICS
Dr. Peer Kröger, Professor
Chair of Database Systems and Data Mining
Ludwig-Maximilians-University Munich
TBA
Monday, March 26, 4:30-5:45
Exploratory Hall, Room 3301
Details coming soon.
COLLOQUIUM ON COMPUTATIONAL SCIENCES AND INFORMATICS
Olga Papaemmanouil, Associate Professor
Department of Computer Science at Brandeis University
Data Management Expert Discussion Seminar:
Learning-based Cost Management for Cloud Databases
Monday, April 16, 4:30-5:45
Exploratory Hall, Room 3301
Cloud computing has become one of the most active areas of computer science research, in large part because it allows computing to behave like a general utility that is always available on demand. While existing cloud infrastructures and services significantly reduce application development time, cloud data management applications still require significant effort to manage their monetary cost, which often depends on a number of decisions including, but not limited to, performance goals, resource provisioning, and workload allocation. These tasks depend on application-specific workload characteristics and performance objectives, and today their implementation burden falls on application developers.
We argue for a substantial shift away from human-crafted solutions and towards leveraging machine learning algorithms to address the above challenges. These algorithms can be trained on application-specific properties and customized performance goals to automatically learn how to provision resources as well as schedule the execution of incoming query workloads with low cost. Towards this vision, we have developed WiSeDB, a learning-based cost management service for cloud-deployed data management applications. In this talk, I will discuss how WiSeDB leverages (a) supervised learning to automatically learn cost-effective models for guiding query placement, scheduling, and resource provisioning decisions for batch processing, and (b) reinforcement learning to offer low cost online processing solutions, while being adaptive to resource availability and decoupled from notoriously inaccurate performance prediction models.
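For readers unfamiliar with learning-based cost management, the toy sketch below trains a supervised model on synthetic workload features to predict the monetary cost of running a query on a candidate VM, which could then guide placement decisions. This is not the WiSeDB implementation; the features and cost formula are invented for illustration.

```python
# Toy supervised cost model for query placement (synthetic data, illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 500
# hypothetical features: estimated query runtime (s), VM hourly price ($), queue length
X = np.column_stack([
    rng.uniform(1, 600, n),
    rng.choice([0.1, 0.2, 0.4], n),
    rng.integers(0, 10, n),
])
# hypothetical cost: runtime-proportional charge plus a penalty for a crowded queue
cost = X[:, 0] / 3600 * X[:, 1] + 0.05 * np.maximum(0, X[:, 2] - 5)

model = DecisionTreeRegressor(max_depth=5).fit(X, cost)
candidate_vms = np.array([[120.0, 0.1, 2], [120.0, 0.4, 0]])
print("predicted cost per candidate VM:", model.predict(candidate_vms))  # pick the cheaper one
```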
Speaker Bio: Dr. Papaemmanouil is an Associate Professor in the Department of Computer Science at Brandeis University. Her research interests lie in the area of data management, with a recent focus on cloud databases, data exploration, query optimization, and query performance prediction. She received her undergraduate degree in Computer Science and Informatics from the University of Patras, Greece, in 1999. In 2001, she received her Sc.M. in Information Systems from the Athens University of Economics and Business, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D. in Computer Science in 2008. She is the recipient of an NSF CAREER Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).
Notice and Invitation
Oral Defense of Doctoral Dissertation
Doctor of Philosophy in Computational Sciences and Informatics
Department of Computational and Data Sciences
College of Science
George Mason University
John T. Rigsby
Bachelor of Science, Mississippi State University, 1999
Master of Science, George Mason University, 2005
Automated Storytelling: Generating and Evaluating Story Chains
Monday, April 30, 2018, 11:00 a.m.
Research Hall, Room 162
All are invited to attend.
Committee
Daniel Barbara, Dissertation Director
Estela Blaisten
Carlotta Domeniconi
Igor Griva
Abstract: Automated storytelling attempts to create a chain of documents linking one article to another while telling a coherent and cohesive story that explains events connecting the two article end points. The need to understand the relationship between documents is a common problem for analysts; they often have two snippets of information and want to find the other pieces that relate them. These two snippets of information form the bookends (beginning and ending) of a story chain. The story chain formed using automated storytelling provides the analyst with better situational awareness by collecting and parsing intermediary documents to form a coherent story that explains the relationships of people, places, and events.
The promise of the Data Age is that the truth really is in there, somewhere. But our age has a curse, too: apophenia, the tendency to see patterns that may or may not exist. — Daniel Conover, Post and Courier, Charleston, South Carolina, 30 Aug. 2004
The above quote expresses a common problem in all areas of pattern recognition and data mining. For text data mining, several fields of study are dedicated to solving aspects of this problem. Some of these include literature-based discovery (LBD), topic detection and tracking (TDT), and automated storytelling. Methods to pull the signal from the noise are often the first step in text data analytics. This step usually takes the form of organizing the data into groups (i.e., clustering). Another common step is understanding the vocabulary of the dataset; this could be as simple as phrase frequency analysis or as complex as topic modeling. TDT and automated storytelling come into play once the analyst has specific documents for which they want more information.
In our world of ever more numerous sources of information, which include scientific publications, news articles, web resources, emails, blogs, tweets, etc., automated storytelling mitigates information overload by presenting readers with the clarified chain of information most pertinent to their needs. Sometimes referred to as connecting the dots, automated storytelling attempts to create a chain of documents linking one article to another that tells a coherent and cohesive story and explains the events that connect the two articles. In the crafted story, articles next to each other should have enough similarity that readers easily comprehend why the next article in the chain was chosen. However, adjacent articles should also be different enough to move the reader farther along the chain of events, with each successive article making significant progress toward the destination article.
The research in this thesis concentrates on three areas:
- story chain generation
- quantitative storytelling evaluation
- focusing storytelling with signal injection.
Evaluating the quality of the created stories is difficult and has routinely involved human judgment. Existing storytelling evaluation methodologies have been qualitative in nature, based on results from crowdsourcing and subject matter experts. Limited quantitative evaluation methods currently exist, and they are generally only used for filtering results before qualitative evaluation. In addition, quantitative evaluation methods become essential to discern good stories from bad when two or more story chains exist for the same bookends. The work described herein extends the state of the art by providing quantitative methods of story quality evaluation which are shown to have good agreement with human judgment. Two methods of automated storytelling evaluation are developed, dispersion and coherence, which are later used as criteria for a storytelling algorithm. Dispersion, a measure of story flow, ascertains how well the generated story flows away from the beginning document and toward the ending document. Coherence measures how well the articles in the middle of the story provide information about the relationship of the beginning and ending document pair. Kullback-Leibler divergence (KLD) is used to measure how well the vocabulary of the beginning and ending story documents can be encoded using the set of middle documents in the story. The dispersion and coherence methodologies developed here have the added benefit that they do not require parameterization or user inputs and are easily automated.
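As a rough sketch of the KLD-based idea described above, the snippet below compares the word distribution of the bookend documents with that of the middle documents. The smoothing constant and exact formulation are assumptions and may differ from the dissertation's definitions.

```python
# KL divergence between bookend and middle-document word distributions
# (illustrative only; lower divergence suggests the middle documents better
# encode the bookend vocabulary).
from collections import Counter

import numpy as np

def word_dist(texts, vocab, eps=1e-6):
    counts = Counter(w for t in texts for w in t.lower().split())
    p = np.array([counts[w] + eps for w in vocab], dtype=float)
    return p / p.sum()

def kld(p, q):
    return float(np.sum(p * np.log(p / q)))

bookends = ["merger announced between firms", "regulator blocks the merger"]
middles = ["shareholders debate merger terms", "regulator opens review of the deal"]
vocab = sorted({w for t in bookends + middles for w in t.lower().split()})

p = word_dist(bookends, vocab)
q = word_dist(middles, vocab)
print("KLD(bookends || middles) = %.3f" % kld(p, q))
```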
An automated storytelling algorithm is proposed as a multi-criteria optimization problem that maximizes dispersion and coherence simultaneously. The developed storytelling methodologies allow for the automated identification of information which associates disparate documents in support of literature-based discovery and link analysis tasking. In addition, the methods provide quantitative measures of the strength of these associations.
We also present a modification of our storytelling algorithm as a multi-criteria optimization problem that allows for signal injection by the analyst without sacrificing good story flow and content. This is valuable because analysts often have an understanding of the situation, or prior knowledge, that could be used to focus the story better than the chain formed without signal injection. Storytelling with signal injection allows an analyst to create alternative stories that incorporate the analyst's domain knowledge into the story chain generation process.
Notice and Invitation
Oral Defense of Doctoral Dissertation
Doctor of Philosophy in Computational Sciences and Informatics
Department of Computational and Data Sciences
College of Science
George Mason University
Karl Battams
Bachelor of Science – Astrophysics, University College London, 2002
Master of Science – Computational Sciences, George Mason University, 2008
Reduction and Synopses of Multi-Scale Time Series with Applications to Massive Solar Data
Monday, July 30, 2018, 11:00 a.m.
Exploratory Hall, Room 3301
All are invited to attend.
Committee
Robert Weigel, Dissertation Director/Chair
Jie Zhang
Robert Meier
Huzefa Rangwala
In this dissertation, we explore new methodologies and techniques applicable to aspects of Big Solar Data to enable new analyses of temporally long, or volumetrically large, solar physics imaging data sets. Specifically, we consider observations returned by two space-based solar physics missions, the Solar Dynamics Observatory (SDO) and the Solar and Heliospheric Observatory (SOHO): the former operating for over 7 years to date and returning around 1.5 terabytes of data daily, and the latter having been operational for more than 22 years to date. Despite ongoing improvements in desktop computing performance and storage capabilities, temporally and volumetrically massive datasets in the solar physics community continue to be challenging to manipulate and analyze. While historically popular but more simplistic analysis methods continue to provide new insights, the results from those studies are often driven by improved observations rather than by the computational methods themselves. To fully exploit the increasingly high volumes of observations returned by current and future missions, computational methods must be developed that enable reduction, synopsis, and parameterization of observations to reduce the data volume while retaining the physical meaning of those data.
In the first part of this study we consider time series of 4 to 12 hours in length extracted from the high spatial and temporal resolution data recorded by the Atmospheric Imaging Assembly (AIA) instrument on the NASA Solar Dynamics Observatory (SDO). We present a new methodology that enables the reduction and parameterization of full spatial and temporal resolution SDO/AIA data sets into unique components of a model that accurately describes the power spectra of these observations. Specifically, we compute the power spectra of pixel-level time series extracted from derotated AIA image sequences in several wavelength channels of the AIA instrument, and fit one of two models to their power spectra as a function of frequency. This enables us to visualize and study the spatial dependence of the individual model parameters in each AIA channel. We find that the power spectra are well described by at least one of these models for all pixel locations, with unique model parameterizations corresponding directly to visible solar features. Computational efficiency of all aspects of this code is provided by a flexible Python-based Message Passing Interface (MPI) framework that enables distribution of all workloads across all available processing cores. Key scientific results include clear identification of numerous quasi-periodic 3- and 5-minute oscillations throughout the solar corona; identification and new characterizations of the known ~4.0-minute chromospheric oscillation, including a previously unidentified solar-cycle-driven trend in these oscillations; identification of “Coronal Bullseyes”, which present radially decaying periodicities over sunspots and sporadic foot-point regions; and identification of features we label “Penumbral Periodic Voids”, which appear as annular regions surrounding sunspots in the chromosphere, bordered by 3- and 5-minute oscillations but exhibiting no periodic features.
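A minimal sketch of the per-pixel spectral analysis described above follows. The cadence, the synthetic signal, and the simple power-law fit are stand-ins for the actual AIA data and the two spectral models used in the dissertation.

```python
# Power spectrum of a single synthetic "pixel" time series and a simple
# power-law fit to it (illustrative stand-in for the dissertation's models).
import numpy as np
from scipy.signal import periodogram

dt = 12.0                                     # assumed AIA-like cadence, seconds
t = np.arange(0, 4 * 3600, dt)                # a 4-hour time series
rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=t.size)) + 5 * np.sin(2 * np.pi * t / 300.0)  # red noise + 5-min oscillation

freqs, power = periodogram(signal, fs=1.0 / dt)
freqs, power = freqs[1:], power[1:]           # drop the zero-frequency bin

# fit log(power) ~ log(amp) - index * log(f): a simplified power-law model
slope, intercept = np.polyfit(np.log(freqs), np.log(power), 1)
print("fitted power-law index: %.2f" % -slope)
```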
The second part of this study considers the entire mission archive returned by the Large Angle Spectrometric Coronagraph (LASCO) C2 instrument, operating for more than 20 years on the joint ESA/NASA Solar and Heliospheric Observatory (SOHO) mission. We present a technique that enables the reduction of this entire data set to a fully calibrated, spatially located time series known as the LASCO Coronal Brightness Index (CBI). We compare these time series to a number of concurrent solar activity indices via correlation analyses to indicate relationships between these indices and coronal brightness, both globally across the entire corona and locally over small spatial scales within the corona, demonstrating that the LASCO observations can be reliably used to derive proxies for a number of geophysical indices. Furthermore, via analysis of these time series in the frequency domain, we highlight the effects of long-timescale variability in long solar time series, considering sources both of solar origin (e.g., solar rotation, solar cycle) and of instrumental/operational origin (e.g., spacecraft rolls, stray light contamination), and demonstrate how filtering temporally long time series reduces the impact of these uncertain variables in the signals. Primary findings include identification of a strong correlation between coronal brightness and both Total and Spectral Solar Irradiance, leading to the development of a LASCO-based proxy of solar irradiance, as well as identification of significant correlations with several other geophysical indices, with plausible driving mechanisms demonstrated via a developed correlation mapping technique. We also present a number of new results regarding LASCO data processing and instrumental stray light that are important to the calibration of the data and to its long-term stability.
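The snippet below sketches the correlation-with-filtering step on synthetic series standing in for the CBI and a solar activity index. The cadence, the simulated components, and the one-year running-mean filter are assumptions for illustration only.

```python
# Correlate two synthetic solar-like time series before and after removing
# long-timescale (solar-cycle scale) variability with a crude high-pass filter.
import numpy as np

rng = np.random.default_rng(2)
days = np.arange(3650)                                   # ~10 years of daily samples (assumed)
solar_cycle = np.sin(2 * np.pi * days / 4015.0)          # ~11-year modulation
rotation = np.sin(2 * np.pi * days / 27.0)               # ~27-day solar rotation signal
cbi = solar_cycle + 0.5 * rotation + 0.2 * rng.normal(size=days.size)
activity_index = solar_cycle + 0.3 * rotation + 0.2 * rng.normal(size=days.size)

raw_r = np.corrcoef(cbi, activity_index)[0, 1]

# crude high-pass filter: subtract a one-year running mean before correlating
kernel = np.ones(365) / 365.0
cbi_hp = cbi - np.convolve(cbi, kernel, mode="same")
idx_hp = activity_index - np.convolve(activity_index, kernel, mode="same")
filtered_r = np.corrcoef(cbi_hp, idx_hp)[0, 1]

print("correlation raw: %.2f, high-pass filtered: %.2f" % (raw_r, filtered_r))
```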
Computational Social Science Research Colloquium /
Colloquium in Computational and Data Sciences
Robert Axtell, Professor
Computational Social Science Program,
Department of Computational and Data Sciences
College of Science
and
Department of Economics
College of Humanities and Social Sciences
George Mason University
Are Cities Agglomerations of People or of Firms? Data and a Model
Friday, September 28, 3:00 p.m.
Center for Social Complexity, 3rd Floor Research Hall
All are welcome to attend.
Abstract: Business firms are not uniformly distributed over space. In every country there are large swaths of land on which there are very few or no firms, coexisting with relatively small areas on which large numbers of businesses are located: these are the cities. Since the dawn of civilization, the earliest cities have husbanded a variety of business activities. Indeed, often the raison d'être for the growth of villages into towns and then into cities was the presence of weekly markets and fairs facilitating the exchange of goods. City theorists of today tend to see cities as amalgams of people, housing, jobs, transportation, specialized skills, congestion, patents, pollution, and so on, with the role of firms demoted to merely providing jobs and wages. Reciprocally, very little of the conventional theory of the firm is grounded in the fact that most firms are located in space, generally, and in cities, specifically. Consider the well-known facts that both firm and city sizes are approximately Zipf distributed. Is it merely a coincidence that the same extreme size distribution approximately describes firms and cities? Or is it the case that skew firm sizes create skew city sizes? Perhaps it is the other way round, that skew cities permit skew firms to arise? Or is it something more intertwined and complex, the coevolution of firm and city sizes, some kind of dialectical interplay of people working in companies doing business in cities? If firm sizes were not heavy-tailed, but followed an exponential distribution instead, say, could giant cities still exist? Or if cities were not so varied in size, as they were not, apparently, in feudal times, would firm sizes be significantly attenuated? In this talk I develop the empirical foundations of this puzzle, one that has been little emphasized in the extant literatures on firms and cities, probably because these are, for the most part, distinct literatures. I then go on to describe a model of individual people (agents) who arrange themselves into both firms and cities in approximate agreement with U.S. data.
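For readers unfamiliar with the rank-size form of Zipf's law referenced above, the short sketch below checks the regularity on a synthetic heavy-tailed sample; it uses simulated draws, not the actual firm or city data.

```python
# Rank-size check of Zipf's law on a synthetic heavy-tailed sample.
import numpy as np

rng = np.random.default_rng(3)
sizes = np.sort(1.0 / rng.uniform(size=10000))[::-1]     # Pareto-like draws, largest first
ranks = np.arange(1, sizes.size + 1)

# Zipf's law predicts log(size) vs. log(rank) is roughly linear with slope near -1
slope, intercept = np.polyfit(np.log(ranks), np.log(sizes), 1)
print("rank-size slope: %.2f (Zipf implies about -1)" % slope)
```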
Computational Social Science Research Colloquium /
Colloquium in Computational and Data Sciences
Maciej Latek, Chief Technology Officer, trovero.io
Ph.D. in Computational Social Science, 2011
George Mason University
Industrializing multi-agent simulations:
The case of social media marketing, advertising and influence campaigns
Friday, October 12, 3:00 p.m.
Center for Social Complexity, 3rd Floor Research Hall
All are welcome to attend.
Abstract: System engineering approaches required to transition multi-agent simulations out of science into decision support share features with AI, machine learning and application development, but also present unique challenges. In this talk, I will use trovero as an example to illustrate how some of these challenges can be addressed.
As a platform that helps advertisers and marketers plan and implement campaigns on social media, trovero comprises social network simulations, used for optimization and automation, and network population synthesis, used to preserve people's privacy while maintaining a robust picture of social media communities. The social network simulations forecast campaign outcomes and pick the right campaigns for given KPIs. Simulation is the only viable way to make such forecasts reliably: big data methods fail here because they are fundamentally unfit for social network data. Network population synthesis enables working with aggregate data without relying on data sharing agreements with social media platforms, which are ever more reluctant to share user data with third parties after GDPR and the Cambridge Analytica debacle.
I will outline how these two approaches complement one another, what computational and data infrastructure is required to support them and how workflows and interactions with social media platforms are organized.
Computational Social Science Research Colloquium /
Colloquium in Computational and Data Sciences
J. Brent Williams
Founder and CEO
Euclidian Trust
Improved Entity Resolution as a Foundation for Model Precision
Friday, November 2, 3:00 p.m.
Center for Social Complexity, 3rd Floor Research Hall
All are welcome to attend.
Abstract: Analyzing behavior, identifying and classifying micro-differentiations, and predicting outcomes rely on a core foundation of reliable and complete data linking. Whether the data describe individuals, families, companies, or markets, acquiring data from orthogonal sources results in significant matching challenges. These challenges are difficult because attempts to eliminate (or minimize) false positives yield an increase in false negatives, and the converse is also true.
This discussion will focus on the business challenges in matching data and their primary and compounded impact on subsequent outcome analysis. Drawing on practical experience, the speaker led the development and first commercialization of a novel approach to “referential matching”. This approach leads to a more comprehensive unit data model (patient, customer, company, etc.), which enables greater computational resolution and model accuracy through hyper-accurate linking, disambiguation, and detection of obfuscation. The discussion also covers the impact of enumeration strategies, data obfuscation/hashing, and natural changes in unit data models over time.
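As a toy illustration of the false-positive/false-negative tradeoff described above (not the speaker's referential matching approach), the sketch below scores candidate record pairs with a simple string similarity and sweeps the match threshold.

```python
# Threshold-based record matching: raising the threshold tends to reduce
# false positives while increasing false negatives (toy data).
from difflib import SequenceMatcher

records = [("Jon Smith", "John Smith", True),       # same entity, noisy spelling
           ("Acme Corp.", "Acme Corporation", True),
           ("Ann Lee", "Anne Leigh", True),
           ("Ann Lee", "Alan Reed", False),         # different entities
           ("Acme Corp.", "Apex Corp.", False)]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for threshold in (0.5, 0.7, 0.9):
    fp = sum(1 for a, b, same in records if not same and similarity(a, b) >= threshold)
    fn = sum(1 for a, b, same in records if same and similarity(a, b) < threshold)
    print(f"threshold {threshold}: false positives={fp}, false negatives={fn}")
```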
Notice and Invitation
Oral Defense of Doctoral Dissertation
Doctor of Philosophy in Computational Sciences and Informatics
Department of Computational and Data Sciences
College of Science
George Mason University
Joseph Shaheen
Bachelor of Science, Murray State University, 2003
Master of Professional Studies, Georgetown University, 2011
Master of Business Administration, Georgetown University, 2013
Data Explorations in Firm Dynamics:
Firm Birth, Life, & Death Through Age, Wage, Size & Labor
Monday, November 26, 2018, 12:30 p.m.
Research Hall
All are invited to attend.
Committee
Robert Axtell, Dissertation Director
Eduardo Lopez
John Shortle
William Rand
Marc Smith
A better understanding of firm birth, life, and death yields a richer picture of firms' life cycles and dynamical labor processes. Through “big data” analysis of a collection of universal fundamental distributions, beginning with firm age, wage, and size, I discuss stationarity, their functional forms, and the consequences emanating from their defects. I describe and delineate the potential complications of the firm age defect (caused by the Great Recession) and speculate on a stark future in which a single firm may control the U.S. economy. I follow with an analysis of firm sizes, tensions in heavy-tailed model fitting, how firm growth depends on firm size, and, consequently, the apparent conflict between empirical evidence and Gibrat's Law. Included is an introduction to the U.S. firm wage distribution. The ever-changing nature of firm dynamical processes played an important role in selecting the conditional distributions of age and size, and of wage and size, in my analysis. A closer look at these dynamical processes reveals the role played by mode wage and mode size in the dynamical processes of firms and thus in the firm life cycle. Analysis of firm labor provides preliminary evidence that the firm labor distribution conforms to scaling properties, that is, that it is power-law distributed. Moreover, I report empirical evidence supporting the existence of two separate and distinct labor processes, dubbed labor regimes (a primary and a secondary), coupled with a third, unknown regime. I hypothesize that this unknown regime must be drawn from the primary labor regime and that it is either emergent from systemic fraudulent activity or the result of data corruption. The collection of explorations in this dissertation provides a fuller, richer picture of firm birth, life, and death through age, wage, size, and labor, while advancing our understanding of firm dynamics in many directions.
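As a rough illustration of the heavy-tail estimation referenced above, the sketch below applies the standard maximum-likelihood estimator for a power-law exponent to synthetic data standing in for firm labor counts; the sample and the cutoff are assumptions.

```python
# Maximum-likelihood estimate of a power-law (density) exponent above a cutoff,
# on a synthetic Pareto sample standing in for firm labor counts.
import numpy as np

rng = np.random.default_rng(4)
labor = (1.0 / rng.uniform(size=50000)) ** (1.0 / 1.1)   # Pareto sample with tail exponent 1.1
x_min = 10.0                                             # assumed lower cutoff for the tail
tail = labor[labor >= x_min]

# MLE for the continuous power-law density exponent: alpha = 1 + n / sum(log(x / x_min))
alpha_hat = 1.0 + tail.size / np.sum(np.log(tail / x_min))
print("estimated density exponent: %.2f (about 2.1 for this simulated sample)" % alpha_hat)
```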