Postdoctoral Research Associate and CITP Fellow, Princeton University
Abstract
Fairness, Privacy, and Security via Machine Learning and Natural Language Processing
My research makes heavy use of machine learning and natural language processing in novel ways to interpret big data, develop privacy and security attacks, and gain insights about humans and society. I use machine learning not only as a tool; I also analyze machine learning models’ internal representations to investigate how artificial intelligence perceives the world. This work was recently published in Science, where I showed that societal bias exists at the construct level of machine learning models, namely in word embeddings in semantic space, which act as dictionaries that machines use to understand language.
When I use machine learning as a tool to uncover privacy and security problems, I characterize and quantify human behavior in language, including programming languages, by deriving a linguistic fingerprint for each individual. By extracting linguistic features from natural language or programming language texts, I show that humans have unique linguistic fingerprints, since each person learns language on an individual basis. Based on this finding, I can de-anonymize the authors of a given text, piece of source code, or even executable binary of compiled code. This is a serious privacy threat for individuals who would like to remain anonymous, such as activists, programmers under oppressive regimes, or malware authors. At the same time, being able to identify the authors of malicious code enhances security. Identifying authors can also be used to resolve copyright disputes or detect plagiarism. The methods in this realm have been used to identify so-called doppelgängers, linking accounts that belong to the same identity across platforms, especially underground forums that serve as business platforms for cybercriminals.
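To make the idea of a linguistic fingerprint concrete, here is a minimal sketch in Python. It is an illustration only, not the method used in this research: it assumes character trigram frequencies as the sole feature set and cosine similarity as the comparison, and the function names (`char_ngram_profile`, `attribute`) and sample texts are hypothetical.

```python
from collections import Counter
from typing import Dict
import math


def char_ngram_profile(text: str, n: int = 3) -> Dict[str, float]:
    """Relative frequencies of character n-grams (a toy stylometric feature set)."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown_text: str, known_samples: Dict[str, str]) -> str:
    """Return the candidate author whose sample is most similar to the unknown text."""
    profile = char_ngram_profile(unknown_text)
    return max(
        known_samples,
        key=lambda author: cosine(profile, char_ngram_profile(known_samples[author])),
    )


# Hypothetical writing samples, invented for illustration.
samples = {
    "alice": "well, honestly, I reckon the weather is lovely, honestly.",
    "bob": "lol gonna grab pizza 2nite, u in? lol gonna be gr8",
}
guess = attribute("honestly, I reckon the garden is lovely", samples)
```

A realistic system would draw on far richer features (lexical, syntactic, and layout features, for example) and a trained classifier over many samples per author; the sketch only shows the overall shape of feature extraction followed by comparison.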
By analyzing machine learning models’ internal representations and human linguistic fingerprints, I am able to uncover facts about the world, society, and the use of language, such as bias, which have implications for privacy, security, and fairness in machine learning.
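As a rough illustration of how bias can be read off a model's internal representation, the following Python sketch measures how strongly a word's embedding associates with one attribute set versus another. It is a simplified sketch under stated assumptions, not the published procedure: the two-dimensional vectors are made-up toy values rather than real embeddings, and the `association` function is a hypothetical helper.

```python
import math
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, ...]

# Toy 2-D "embeddings" with invented values, for illustration only.
# Real word embeddings have hundreds of dimensions and are learned from text.
embeddings: Dict[str, Vector] = {
    "rose": (0.9, 0.1),
    "wasp": (0.1, 0.9),
    "love": (1.0, 0.0),
    "peace": (0.9, 0.2),
    "filth": (0.0, 1.0),
    "pain": (0.2, 0.9),
}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word: str, attrs_a: Sequence[str], attrs_b: Sequence[str]) -> float:
    """Mean similarity to attribute set A minus mean similarity to attribute set B."""
    w = embeddings[word]
    mean_a = sum(cosine(w, embeddings[a]) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(w, embeddings[b]) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b


pleasant = ["love", "peace"]
unpleasant = ["filth", "pain"]

rose_score = association("rose", pleasant, unpleasant)  # positive: closer to "pleasant"
wasp_score = association("wasp", pleasant, unpleasant)  # negative: closer to "unpleasant"
```

A positive score means the word sits closer to the first attribute set in the embedding space; comparing such scores across groups of target words is one way that associations absorbed from training text become measurable.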
Bio
Aylin Caliskan is a Postdoctoral Research Associate and a CITP Fellow at Princeton University. Her research makes heavy use of machine learning and natural language processing to characterize and quantify aspects of human behavior. Her work builds on feature extraction as the key element for rigorous analysis of large-scale corpora and machine learning models. Her recent work on fairness, accountability, and transparency, particularly on uncovering bias in language models, received great attention following the publication of “Semantics derived automatically from language corpora contain human-like biases” in Science. She continues to investigate bias in joint visual-semantic models of artificial intelligence to explore their intersections with natural intelligence and society. Her doctoral research combines machine learning with natural language processing and touches on two main realms: privacy and security. The applications of this research complement each other by enhancing security and preserving privacy. She demonstrated large-scale de-anonymization of programmers of source code and executable binaries, along with authors of micro-text in social media and l33tsp34k in cybercriminal forums, via stylometric analysis. Her joint work on semi-automated anonymization of writing received the Privacy Enhancing Technologies Symposium Best Paper Award. Aylin holds a PhD in Computer Science from Drexel University and a Master of Science in Robotics from the University of Pennsylvania.