
The Transformer architecture, introduced by Vaswani et al. [4] in 2017, has become
a cornerstone in the field of Natural Language Processing (NLP). Its effectiveness
at modeling sequential data, its scalability, and its ability to capture long-range
dependencies in text have established it as a foundational framework. Large Language
Models (LLMs), such as GPT, utilize the Transformer’s multi-headed self-attention
mechanism to process and generate text. These models are trained on extensive datasets
to comprehend and produce natural language. The term "zero-shot forecasting" refers
to the capability of these models to predict future events or generate insights in
various domains, including finance, without prior specific training on those tasks.
Remarkably, certain LLMs, such as GPT-3 and LLaMA-2, have demonstrated the ability
to extrapolate time series, provided the series is suitably converted into textual
format; time series forecasting then becomes a problem of predicting the next token
in text. Gruver et al. [2] introduced LLMTime,
a methodology for employing pretrained LLMs to forecast continuous time series. Examples
of converting time series data into textual inputs for pretrained LLMs are available
in the LLMTime GitHub repository. In this project, participants are tasked with applying
selected pretrained LLMs to the NYSE Daily TAQ (Trade and Quote) client dataset to
perform zero-shot forecasting of Trade Price and Trade Direction (upward vs. downward
movement of Trade Price) for subsequent trades. Participants will have access to trade
data for 94 masked stocks over 3 days, and model performance will be evaluated on a
holdout dataset.
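As a rough illustration of the LLMTime-style text encoding, here is a minimal sketch based on Gruver et al. [2]: values are rescaled, written with fixed precision, and digit-separated so that older tokenizers see one token per digit, with time steps separated by commas. The exact scaling and tokenizer handling in the LLMTime repository differ, and this sketch assumes non-negative values (as with trade prices).

```python
import numpy as np

def encode_series(values, precision=2, scale=None):
    """Encode a numeric series as text, LLMTime-style (sketch).

    Values are divided by a scale (by default the 99th percentile of
    their absolute values), written with a fixed number of decimal
    digits, the decimal point is dropped, digits are space-separated,
    and time steps are joined with ' , '.
    """
    values = np.asarray(values, dtype=float)
    if scale is None:
        scale = float(np.quantile(np.abs(values), 0.99)) or 1.0
    steps = []
    for v in values / scale:
        digits = f"{v:.{precision}f}".replace(".", "")  # e.g. 12.30 -> "1230"
        steps.append(" ".join(digits))                  # "1 2 3 0"
    return " , ".join(steps), scale

def decode_series(text, precision=2, scale=1.0):
    """Invert encode_series: parse digits back into scaled floats."""
    vals = []
    for step in text.split(","):
        num = int("".join(step.split())) / 10 ** precision
        vals.append(num * scale)
    return vals
```

For example, `encode_series([1.0, 12.3], scale=1.0)` yields the string `"1 0 0 , 1 2 3 0"`, which the LLM continues token by token; the continuation is then parsed back into prices with `decode_series`.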
Participants are expected to use the intraday data up until 3:40pm to train or fine-tune
their models, and then forecast trade prices (point predictions) at horizons of 5,
10, 15, and 20 minutes. This is also called "closing price prediction". It is important
to note that the predictability of high-frequency returns may diminish within a few
minutes [1]. Participants may develop either univariate or multivariate models depending
on the chosen LLMs, taking into account the differences among their tokenizers.
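For concreteness, the split at 3:40pm and the four forecast targets could be framed as below. This is a sketch only: the column names `timestamp` and `price` are hypothetical placeholders, not the actual TAQ schema, and the target is taken as the last trade printed within each horizon.

```python
import pandas as pd

def split_and_targets(trades, cutoff="15:40:00", horizons=(5, 10, 15, 20)):
    """Split one stock-day of trades at the cutoff and extract, for each
    horizon in minutes, the last trade price within that window and its
    direction relative to the last pre-cutoff price.

    `trades` is assumed to have a datetime 'timestamp' column and a
    float 'price' column (hypothetical names, not the TAQ schema).
    """
    trades = trades.sort_values("timestamp")
    day = trades["timestamp"].dt.normalize().iloc[0]
    cut = day + pd.Timedelta(cutoff)           # 3:40pm on this trading day
    history = trades[trades["timestamp"] <= cut]
    last_price = history["price"].iloc[-1]     # anchor for trade direction

    targets = {}
    for h in horizons:
        window = trades[(trades["timestamp"] > cut)
                        & (trades["timestamp"] <= cut + pd.Timedelta(minutes=h))]
        if window.empty:
            targets[h] = (None, None)          # no trade printed in this horizon
        else:
            p = window["price"].iloc[-1]
            targets[h] = (p, "up" if p > last_price else "down")
    return history, targets
```

The `history` frame is what would be serialized into text for the LLM, while `targets` supplies the ground-truth price and direction labels for evaluating the point predictions at each horizon.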
View the Data Here