Current Projects

Secure and flexible Information Sharing in Coalition Environments

Privacy preserving data mining and data management

Data Exploration and Navigation

Adversarial learning

Efficient and scalable RDF store with support for federated search and reasoning

Past Projects

Semantic-based Search and Data Integration Using Semantic Networks

Database Compression

XML and substring selectivity estimation and XML indexing 

Workload-aware mapping of XML to relational tables

Privacy preserving data mining and data management

With data collection and storage growing at an exponential rate, the privacy issues surrounding sensitive individual information cannot be overemphasized. Many organizations need to protect the privacy of their data while still allowing useful patterns to be discovered from it. I have been working on optimizing the tradeoff between utility and privacy in privacy-preserving data mining algorithms. This research has been funded by NSF, NIST, and MITRE.

Recent Publications:


Data Exploration and Navigation

Because large volumes of data are stored across many different databases, information overload has become one of the major obstacles for ordinary people searching for useful information in a database. My research on data exploration and navigation uses navigation and search techniques to help users quickly find useful information. The approach is novel in two ways: (1) it takes into account the diversity of user preferences, and (2) it uses probabilistic models and is robust to poor-quality data.

Recent Publications:


Adversarial Learning

With the arrival of the big data era, data mining techniques have been widely used to build detection models for cyber security applications such as spam filtering, virus and malware detection, and intrusion detection. At the same time, attackers may modify their attacks to evade detection. For example, an email spammer may drop certain words or symbols from spam emails to slip past spam filtering software, and an attacker may use a variant of a known attack to evade an intrusion detection system. My research on adversarial learning studies possible evasion attacks against cyber security protection techniques, as well as techniques to increase the robustness of those protections. This research has been funded by IBM as part of the project "Accelerating Cognitive Cyber Security".
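
The word-dropping evasion described above can be illustrated with a toy example. This is a minimal sketch, not a model from the funded project: the word weights, the threshold, and the function names (`spam_score`, `is_spam`, `evade`) are all hypothetical and chosen only for illustration.

```python
# Toy keyword-weight spam filter. The weights and threshold are made up
# for illustration and do not come from any real filter.
SPAM_WEIGHTS = {"free": 2.0, "winner": 2.5, "prize": 1.5, "click": 1.0}
THRESHOLD = 4.0


def spam_score(message: str) -> float:
    """Sum the weights of all known spam words in the message."""
    return sum(SPAM_WEIGHTS.get(w, 0.0) for w in message.lower().split())


def is_spam(message: str) -> bool:
    """Flag the message when its score reaches the threshold."""
    return spam_score(message) >= THRESHOLD


def evade(message: str) -> str:
    """Greedily drop the highest-weight word until the filter no longer fires."""
    words = message.split()
    while spam_score(" ".join(words)) >= THRESHOLD:
        worst = max(words, key=lambda w: SPAM_WEIGHTS.get(w.lower(), 0.0))
        words.remove(worst)
    return " ".join(words)


original = "click now free prize winner"
modified = evade(original)
print(is_spam(original), is_spam(modified), modified)
```

Even this simple filter is defeated by removing two words, which is why a robust detector has to anticipate how an attacker will respond to it.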

Recent Publications:

Efficient and scalable RDF store with support for federated search and reasoning

RDF is a standard way to represent data and knowledge on the Semantic Web. One challenge in the big data era is to efficiently store and process large-scale RDF data in a distributed and sometimes resource-limited environment. I am collaborating with Dr. Adina Crainiceanu from the US Naval Academy on enhancing the capability of Rya (an RDF triple store initially developed at USNA) in such environments. The current focus is on developing algorithms for federated search and reasoning over multiple Rya instances. Compared to existing work in the literature, our solution addresses limited storage and limited network connectivity. This project is funded by USNA and the Navy.

Recent Publications:

Fan Yang, Adina Crainiceanu, Zhiyuan Chen, Don Needham, Cluster-Based Join for Geographically Distributed Big RDF Data, IEEE BigData Congress, accepted, 2019. (Acceptance rate 23%).

Past Projects

Semantic-based Search and Data Integration Using Semantic Networks

There is a great need to find relevant information across diverse data sources. Existing keyword-based search techniques fail to take into account implicit relationships between data objects. For example, project managers and software developers often want to find the software modules that will be affected by a certain change. This information cannot easily be returned by a keyword search query.

My research develops a technique that uses a semantic network to capture relationships between data objects and to help users find related information. The semantic network can also be used in data integration to find relevant data sets to integrate.
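
To make the software-module example concrete, the sketch below treats a semantic network as a plain directed graph and walks it to find everything reachable from a changed module. It is an assumption-laden illustration rather than the technique developed in this project: the module names, the `NETWORK` dictionary, and the `related` function are hypothetical.

```python
from collections import deque

# Hypothetical network: each key maps to the objects directly related to it
# (here, the modules that depend on it).
NETWORK = {
    "auth_module": ["login_page", "api_gateway"],
    "api_gateway": ["mobile_app"],
    "login_page": [],
    "mobile_app": [],
    "report_tool": [],
}


def related(network, start, max_hops=2):
    """Breadth-first search returning objects within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in network.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, hops + 1))
    return found


print(related(NETWORK, "auth_module"))
```

A keyword search for "auth_module" would not surface "mobile_app", since the two are linked only through "api_gateway"; following relationships in the network does.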

Recent Publications:

Database Compression

Because CPU speed improves much faster than disk speed, it makes sense to compress data on disk to achieve better I/O performance. I have investigated novel compression techniques for databases, which must support fine-grained access units such as rows or cells, and query optimization techniques that balance I/O savings against decompression overhead.
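
The balance between I/O savings and decompression overhead can be seen in a back-of-the-envelope cost model. The sketch below is only an illustration under assumed numbers, not the cost model used in this work: the function `scan_seconds`, the table size, the disk and decompression throughputs, and the compression ratio are all hypothetical.

```python
def scan_seconds(size_mb, disk_mb_per_s, ratio=1.0, decompress_mb_per_s=None):
    """Estimate the time to scan a table.

    size_mb is the uncompressed size, ratio is uncompressed/compressed size,
    and decompress_mb_per_s is the rate at which uncompressed data is
    produced (None means the data is stored uncompressed).
    """
    io_time = (size_mb / ratio) / disk_mb_per_s
    cpu_time = 0.0 if decompress_mb_per_s is None else size_mb / decompress_mb_per_s
    return io_time + cpu_time


# Assumed figures: 1000 MB table, 100 MB/s disk, 4x compression,
# 500 MB/s decompression throughput.
plain = scan_seconds(1000, 100)
compressed = scan_seconds(1000, 100, ratio=4.0, decompress_mb_per_s=500)
print(plain, compressed)
```

Under these assumptions compression wins, but a slower decompressor or a lower compression ratio can reverse the outcome, which is the decision a query optimizer has to make.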

Publications:


XML and substring selectivity estimation and XML indexing 

This project investigates techniques to estimate the selectivity of XML and substring queries as well as XML indexing techniques.

Publications:


Workload-aware mapping of XML to relational tables

I studied the problem of storing XML data in relational storage so that the evaluation of queries over the XML data is optimized. Unlike existing work, this work takes into account the interplay between logical design (how to store XML in tables) and physical design (which indexes to select).

Publications: