Hurricane Harvey Webpage Collection Data Archive
Description
During natural disasters, users of social media share information with their family, friends, and others in their networks regarding the progression of the disaster, weather forecasts, response or recovery actions by the emergency agencies, or simply about their whereabouts. Often, such social media posts contain one or more embedded URLs which point to richer information on webpages from news media, government agencies, NGOs, weather channels, or multimedia storage.
Because such webpages, especially news articles, are ephemeral, the project team collaborated with researchers at the Digital Library Research Laboratory (DLRL) at Virginia Tech (VT) to archive the webpages shared during Hurricane Harvey.
Data Collection and Processing
- More than 580,000 URLs were extracted from tweets posted during Hurricane Harvey.
- Of these URLs, 457,454 webpages that had been shared at least 10 times on Twitter were archived with the help of DLRL at VT, using a web crawler running on 24 computers in parallel.
- We then filtered the archive to keep only relevant pages, defined as pages containing one of the keywords (‘harvey’, ‘hurricane’, or ‘hurricane harvey’) in their textual content or in the description meta tag. This left a total of 386,667 pages.
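The keyword filter in the last step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the names `is_relevant` and `_TextAndDescription` are hypothetical, matching is a case-insensitive substring test, and script and style contents are treated as non-textual.

```python
from html.parser import HTMLParser

# Keywords listed in the filtering step above.
KEYWORDS = ("harvey", "hurricane", "hurricane harvey")


class _TextAndDescription(HTMLParser):
    """Collects page text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.description = ""
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def is_relevant(html: str) -> bool:
    """Return True if any keyword appears in the text or the meta description."""
    parser = _TextAndDescription()
    parser.feed(html)
    parser.close()
    haystack = (" ".join(parser.text_parts) + " " + parser.description).lower()
    # Assumption: plain substring matching, so 'harvey' also matches 'harveys'.
    return any(keyword in haystack for keyword in KEYWORDS)
```

With substring matching, the third keyword is redundant (any page containing ‘hurricane harvey’ already contains ‘harvey’); it is kept here only to mirror the list above.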
Credits to Virginia Tech
Many thanks to the Director, Edward A. Fox, and the members of DLRL at Virginia Tech for collaborating on this task with us!
They have been working on disaster data archiving tasks for a while through multiple NSF projects: GETAR (IIS-1619028 and 1619371), IDEAL (IIS-1319578), CTRnet (IIS-0916733), DL-VT416 (IIS-0736055), and CBAR-tpd (CMMI-1638207).
For more details about Virginia Tech’s Events Archiving Initiative, please visit eventsarchive.org.
We provide a 10% sample of the keyword-filtered dataset (38,667 webpages).
Download “Harvey Webpage Data Sample” sampled_files.zip – 1 GB
Upon request, the project team may consider sharing the entire webpage collection.
Hurricane Sandy Tweets, 1% Sample
Description
This dataset provides 150k tweet IDs for tweets related to Hurricane Sandy posted between 10/23/2012 and 11/26/2012, sampled at random from the full dataset.
Researchers can retrieve tweets via Twitter APIs using the provided tweet IDs.
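Retrieving tweets from IDs is usually done in batches. The sketch below only groups IDs into batches and makes no network calls. It assumes that the lookup endpoint accepts at most 100 IDs per request, as Twitter's batch lookup endpoints have historically done; the function name `batched_ids` and the default batch size are illustrative, and the current API documentation should be checked before use.

```python
from typing import Iterable, Iterator, List


def batched_ids(tweet_ids: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` non-empty tweet IDs."""
    batch: List[str] = []
    for tweet_id in tweet_ids:
        tweet_id = tweet_id.strip()
        if not tweet_id:
            continue  # skip blank lines in the ID file
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch
```

Each yielded batch can then be joined with commas and passed to whichever lookup endpoint the researcher has access to. Keeping IDs as strings avoids precision loss in tools that treat large integers as floating-point numbers.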
Data Collection and Processing
- The 16M-tweet dataset was purchased from Twitter and retrieved using the keywords 'hurricane' or 'sandy'.
- 1% of the tweets were selected at random.
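The random-sampling step can be sketched as follows. This is a minimal illustration, not the procedure used for this dataset: the function name `sample_ids`, the fixed seed, and the choice of a fixed-size sample drawn without replacement are all assumptions.

```python
import random
from typing import List, Sequence


def sample_ids(tweet_ids: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Return a uniform random sample containing `fraction` of the IDs."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = round(len(tweet_ids) * fraction)
    # random.sample draws k distinct elements without replacement.
    return rng.sample(list(tweet_ids), k)
```

An alternative is to keep each tweet independently with probability equal to the fraction, which works on a stream but yields a sample whose size varies slightly from run to run.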
Download “Hurricane Sandy Tweet IDs” sandy_sample_150k.csv – 3 MB
Reference
Zou L, Lam NSN, Cai H, Qiang Y (2018) Mining Twitter Data for Improved Understanding of Disaster Resilience. Annals of the American Association of Geographers 108:1422–1441.
Hurricane Isaac Tweets, 5% Sample
Description
This dataset provides 150k tweet IDs for tweets related to Hurricane Isaac posted between 08/21/2012 and 09/17/2012, sampled at random from the full dataset.
Researchers can retrieve tweets via Twitter APIs using the provided tweet IDs.
Data Collection and Processing
- The 3M-tweet dataset was purchased from Twitter and retrieved using the keywords 'hurricane' or 'isaac'.
- 5% of the tweets were selected at random.
Download “Hurricane Isaac Tweet IDs” isaac_sample_150k.csv – 3 MB
Reference
Wang K, Lam NSN, Zou L, Mihunov V (2021) Twitter Use in Hurricane Isaac and Its Implications for Disaster Resilience. ISPRS International Journal of Geo-Information 10:116.
Hurricane Harvey Post-Landfall Tweet Collection Dataset
Authors
Shayan Shams (University of Texas Health Science Center at Houston); Kisung Lee (LSU)
Description
This dataset provides 27,178 tweet IDs for tweets related to Hurricane Harvey posted from 9/9/2017 to 10/6/2017, excluding retweets.
Researchers can retrieve tweets via Twitter APIs using the provided tweet IDs.
Data Collection and Processing
- Sampled via the real-time Twitter API using a query
"#harvey OR #hurricaneharvey OR #houston OR @txtf1 OR Coast guard OR Texas OR @houstonpolice OR #disaster OR @houstonOEM OR #houstonflodd OR @govabbott OR @txdps"
- A total of 137,656 tweets were collected; removing retweets left a dataset of 27,178 tweets.
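The retweet-removal step might look like the sketch below. The source does not say how retweets were identified, so the rule here is an assumption: a tweet is treated as a retweet if its JSON carries a `retweeted_status` object (as in Twitter API v1.1 payloads) or if its text starts with "RT @". The function names `is_retweet` and `drop_retweets` are hypothetical.

```python
from typing import Dict, Iterable, List


def is_retweet(tweet: Dict) -> bool:
    """Heuristic retweet check on a tweet decoded from JSON (assumed v1.1 layout)."""
    if "retweeted_status" in tweet:
        return True
    # Fallback for payloads without the field: conventional "RT @user" prefix.
    text = tweet.get("full_text") or tweet.get("text") or ""
    return text.startswith("RT @")


def drop_retweets(tweets: Iterable[Dict]) -> List[Dict]:
    """Keep only tweets that are not retweets."""
    return [tweet for tweet in tweets if not is_retweet(tweet)]
```

The text-prefix fallback can misclassify an original tweet that happens to begin with "RT @", so it is best used only when the structured field is unavailable.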
Download “Harvey Tweet IDs” harvey_id_excl_retweets.txt – 557 KB
Please credit the source and cite
S. Shams, S. Goswami and K. Lee, “Deep Learning-Based Spatial Analytics for Disaster-Related Tweets: An Experimental Study,” 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, Hong Kong, 2019, pp. 337-342.
doi: 10.1109/MDM.2019.00-40