Documentation for our databases can be found on our Process documentation page
https://github.com/PopHIVE/Ingest/blob/main/status.md
https://dissc-yale.github.io/dcf/report/?repo=PopHIVE/Ingest
The data shown on PopHIVE.org are found in the Ingest project in the ./Data/bundle_*/dist/ subfolders. The files are stored in either parquet or compressed csv format. In R, parquet files can be read with the arrow package. For example:
library(arrow)
ds1 <- read_parquet(url1)
Compressed csv files can be read with vroom::vroom() in R:
ds2 <- vroom::vroom(url2)
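If you have a local clone of the Ingest repository, the same files can be located on disk rather than by URL. The following is a minimal sketch, not part of the PopHIVE tooling: `repo_dir` and the helper names are hypothetical, and it only assumes the ./Data/bundle_*/dist/ layout described above.

```r
# List distribution files in a local clone of the Ingest repository.
# `repo_dir` is a hypothetical path to wherever the repository was cloned.
list_dist_files <- function(repo_dir) {
  Sys.glob(file.path(repo_dir, "Data", "bundle_*", "dist", "*"))
}

# Classify a file by extension so the matching reader can be chosen.
file_format <- function(path) {
  ifelse(grepl("\\.parquet$", path), "parquet", "compressed_csv")
}

# Read one file with the reader suggested above
# (requires the arrow or vroom package to be installed).
read_dist_file <- function(path) {
  if (file_format(path) == "parquet") {
    arrow::read_parquet(path)
  } else {
    vroom::vroom(path)
  }
}
```

For example, `lapply(list_dist_files("path/to/Ingest"), read_dist_file)` would read every distribution file into a list of data frames.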
In general, the 'value' column holds the data closest to the source and is usually drawn from it directly. Some datasets also include a 3-week moving average (value_smooth) and a smoothed value scaled to between 0 and 100 (value_smooth_scale). Exceptions to 'value' being drawn directly from the source data include:
- In some datasets where national level data were not provided by the source, a national average is calculated using a population-weighted average.
- For Epic Cosmos, if the data are based on fewer than 10 counts, the cell is suppressed. For visualization purposes, this is filled in with a value halfway between 0 and the minimum value reported for that state. These values are indicated with suppressed_flag=1.
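The derived columns and the two exceptions above can be illustrated with a short base R sketch. This is not the PopHIVE implementation: the function names are hypothetical, the moving average is assumed to be trailing (it could equally be centered), and the 0 to 100 scaling is assumed to be min-max scaling.

```r
# 3-week moving average (assumption: trailing window; first two weeks are NA)
smooth3 <- function(x) {
  as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))
}

# Rescale to 0-100 (assumption: min-max scaling within the series)
scale_0_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  100 * (x - rng[1]) / (rng[2] - rng[1])
}

# National value as a population-weighted average of state values
national_average <- function(value, population) {
  stats::weighted.mean(value, population, na.rm = TRUE)
}

# Fill suppressed cells for ONE state with half of that state's
# minimum reported value (halfway between 0 and the minimum)
fill_suppressed <- function(value, suppressed_flag) {
  min_reported <- min(value[suppressed_flag == 0], na.rm = TRUE)
  ifelse(suppressed_flag == 1, min_reported / 2, value)
}

value_smooth <- smooth3(c(3, 6, 9, 12))
value_smooth_scale <- scale_0_100(value_smooth)
```

Because filled-in values are placeholders for visualization, analyses may prefer to drop rows where suppressed_flag is 1.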
Time-stamped archives of the data are available in the Pulled Data folder.
Can I re-use the data from PopHIVE?
Yes! Much of the data are drawn from publicly available Federal datasets obtained from CDC or data.gov. Other data, including the results of research performed using Epic Cosmos or the data available through Google Health Trends, can be used with appropriate attribution. A suggested citation for these data is 'Results of research performed with Epic Cosmos were obtained from the PopHIVE platform [url for GitHub corresponding to the specific data source].'
Please cite both PopHIVE and the original source when you use the data. The DOI for PopHIVE is
Who is it for?
PopHIVE is designed for a broad audience:
- Members of the public who want to understand what’s happening in their communities.
- Clinicians who need to anticipate trends and adjust care.
- Public health departments and local governments who need up-to-date data to allocate resources.
- Researchers, journalists, and advocates working to tell stories and drive policy change.
- Policy makers and decision-makers who need to understand the basics of who, what, and where about health issues occurring in the areas they serve.
Can you show ZIP code-level data?
Because the data is de-identified, we can’t always go down to ZIP code level, especially for sensitive conditions like STIs or mental health outcomes. For some topics, like asthma or heat-related illness, we can show more granular data. Our data team is constantly working to expand local detail while protecting individual privacy.
Will you show additional conditions in the future?
Yes. PopHIVE is evolving based on user needs and feedback. As high-quality, de-identified data becomes available, we plan to expand condition-specific dashboards, such as those for diabetes, maternal health, and behavioral health. Please provide us feedback on what you’d like to see here.
How do I know the data is accurate or reliable?
PopHIVE’s data team continually evaluates the quality and representativeness of the data. In some cases (like diabetes Hemoglobin A1C data), completeness varies, and we are committed to transparency about what the data can and can’t tell us. This is an evolving platform, and we're building new functionality and insights over time.
How are you using electronic health record data from Epic? Isn’t that a violation of HIPAA?
PopHIVE doesn’t change any rules or regulations around health data sharing. We only use de-identified, aggregate data, following all existing privacy laws. We’re not sharing individual patient records. We’re simply making existing public health trends more timely and accessible for the public good.
Are you accepting additional data sets?
Yes! We welcome partnerships and are actively working to expand PopHIVE’s data offerings. If you have a reliable, de-identified dataset that could help improve public understanding of health, we’d love to hear from you. Please submit here.
How can I give feedback on this tool?
We’d love to hear from you. PopHIVE is shaped by the people who use it. Whether you have a technical suggestion, want to request a feature, or share how it helped your community, please submit here.
| Category | Source | Description | File(s) on PopHIVE | Restrictions |
|---|---|---|---|---|
| Respiratory Diseases | Google Health Trends | This represents the volume of Google searches for ‘RSV’, statistically adjusted to remove searches related to RSV immunizations. Unadjusted search volumes can be accessed here. | Weekly time series of RSV for multiple indicators | Non-commercial purposes |
| Respiratory Diseases | Epic Cosmos | Percentage of ED visits due to RSV, influenza, or COVID-19, based on ICD-10 coding | Weekly time series of RSV, influenza, and COVID-19 for multiple indicators by state | Can be used with attribution (see FAQ) |
| Respiratory Diseases | CDC National Respiratory and Enteric Virus Surveillance System | Number of positive tests for RSV, by health and human services region. | RSV positive tests by region | - |
| Respiratory Diseases | CDC National Wastewater surveillance program | Viral concentration for RSV, influenza, or SARS-CoV-2 in wastewater. | Weekly time series of RSV, influenza, and SARS-CoV-2 for multiple indicators by state | - |
| Respiratory Diseases | CDC National Syndromic Surveillance Program | Percentage of ED visits due to RSV, influenza, or COVID-19. | Weekly time series of RSV, influenza, and COVID-19 for multiple indicators by state | - |
| Respiratory Diseases | [CDC RESP-NET](https://data.cdc.gov/Public-Health-Surveillance/Rates-of-Laboratory-Confirmed-RSV-COVID-19-and-Flu/kvib-3txy/about_data "The CDC's Respiratory Virus Hospitalization Surveillance Network (RESP-NET) monitors laboratory-confirmed hospitalizations associated with influenza, COVID-19, and respiratory syncytial virus (RSV) among children and adults. The data are collected from hospitals in selected counties and county equivalents. This dataset has several important advantages: the area around the hospitals is well described, so rates of disease adjusted for population size can be accurately reported. The selected counties include ~10% of the US population and are demographically representative of the country. Detailed patient demographic information is available, and officials actively search for cases to ensure they capture all cases in the data. A limitation is that the network relies on the clinicians to perform viral ...") | Number of laboratory-confirmed hospitalizations due to the virus per 100,000 people. | Weekly time series of RSV, influenza, and COVID-19 for multiple indicators by state | - |
| Respiratory Diseases | CDC Active Bacterial Core Surveillance (ABCs) | The number of cases of invasive pneumococcal disease by age group, year, and serotype, 1998-2023. For 2018, state-specific breakdowns are provided | Serotype-specific IPD by year, Number of IPD cases by state for 2018 | - |
| Respiratory Diseases | Surveillance for serotype-specific pneumococcal pneumonia | Comparison of invasive pneumococcal disease and pneumonia | Comparison of IPD and pneumonia | - |
| Childhood Immunizations | CDC National Immunization Survey | Estimates of immunization coverage by vaccine, age, and state. | Immunization rates | - |
| Childhood Immunizations | CDC National Immunization Survey | Estimates of immunization coverage by vaccine, age, and state, and by urbanicity of the county/city of residence. | Immunization rates | - |
| Childhood Immunizations | CDC National Immunization Survey | Estimates of immunization coverage by vaccine, age, and state, and by insurance status. | Immunization rates | - |
| Chronic diseases | Epic Cosmos | Percentage of 'active users' in Epic Cosmos who have a history of measurements indicating diabetes (Hemoglobin A1C ≥7%) |
November 14, 2025 We have updated several aspects of the obesity and diabetes definitions from Epic Cosmos. The denominator population has been updated to include base patients with an encounter, and an elevated HbA1c measurement or BMI>30 measurement in the 2 years prior to the encounter. This allows for stratification over time and more accurately captures the active users. We also changed from a 10-year lookback period to a 2-year lookback period to be in line with the definitions used by the Medicare CCW. In addition to these changes, we have added two additional ways to measure diabetes and obesity prevalence based on the Epic Cosmos data. These are based on the CCW definitions, which evaluate the presence of diagnostic codes for diabetes or obesity during a 2-year lookback period. The updated file can be found here
November 21, 2025 The CDC updated their invasive pneumococcal disease file to include geographic site for 1998-2023. The file with geographic stratification by serotype has been updated accordingly, and the dashboard now shows 2023 instead of 2019.
### Create the data source folder
Run
dcf_add_source("DATASETNAME")
Edit the ingest.R file. As an example, here we add a file from data.gov using dcf_download_cdc(). The goal is to download a raw file and convert it to the standard format:
library(dplyr)
library(reshape2) # assumed source of dcast()

# load the record of previously processed raw data
process <- dcf::dcf_process_record()

# download the raw file
raw_state <- dcf::dcf_download_cdc(
  "kvib-3txy",
  "raw",
  process$raw_state
)

# only reprocess when the raw file has changed
if (!identical(process$raw_state, raw_state)) {
  # read in raw, filter, and do any formatting needed
  data1 <- vroom::vroom("raw/kvib-3txy.csv.xz") %>%
    filter(Type == "Unadjusted Rate" & Sex == "Overall" & `Race/Ethnicity` == "Overall") %>%
    rename(
      virus = `Surveillance Network`,
      age = `Age group`,
      state = Site,
      time = `Week Ending Date`
    ) %>%
    # map each surveillance network to a standard variable name
    mutate(
      virus = if_else(grepl("COVID", toupper(virus)), "rate_covid",
        if_else(grepl("RSV", toupper(virus)), "rate_rsv",
          if_else(grepl("FLU", toupper(virus)), "rate_flu",
            "rate_any"
          )
        )
      )
    ) %>%
    # one row per time/age/state, one column per virus
    dcast(time + age + state ~ virus, value.var = "Weekly Rate") %>%
    mutate(
      rate_flu = if_else(is.na(rate_flu), 0, rate_flu), # do not fill in below
      geography = if_else(state == "Overall", 0,
        cdlTools::fips(state, to = "FIPS")
      )
    ) %>%
    filter(age == "Overall") %>%
    dplyr::select(-state)

  # write standard data
  vroom::vroom_write(
    data1,
    "standard/data.csv.gz",
    ","
  )

  # record processed raw state
  process$raw_state <- raw_state
  dcf::dcf_process_record(updated = process)
}
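The recode-then-reshape step in the script above can be tried on a toy data frame without downloading anything. This is a base R sketch with made-up rates and a hypothetical helper name, shown only to illustrate the long-to-wide conversion; it is not the ingest script itself.

```r
# Map a surveillance network label to a standard variable name
recode_virus <- function(network) {
  u <- toupper(network)
  ifelse(grepl("COVID", u), "rate_covid",
    ifelse(grepl("RSV", u), "rate_rsv",
      ifelse(grepl("FLU", u), "rate_flu", "rate_any")
    )
  )
}

# Toy long-format data; the rates are invented for illustration
toy <- data.frame(
  time = rep(c("2024-01-06", "2024-01-13"), each = 3),
  network = rep(c("COVID-NET", "RSV-NET", "FluSurv-NET"), times = 2),
  rate = c(5.1, 2.0, 7.3, 4.8, 1.7, 6.9)
)
toy$virus <- recode_virus(toy$network)

# Long to wide: one row per week, one column per virus
wide <- reshape(
  toy[, c("time", "virus", "rate")],
  idvar = "time", timevar = "virus", direction = "wide"
)
# reshape() prefixes the new columns with "rate."; strip that prefix
names(wide) <- sub("^rate\\.", "", names(wide))
```

The result has the columns time, rate_covid, rate_rsv, and rate_flu, which mirrors the shape written to the standard file.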
Each variable should have an entry. For example:
"rate_any": {
  "id": "rate_any",
  "short_name": "Number of laboratory confirmed cases of RSV, influenza or COVID-19 per 100,000 people",
  "long_name": "",
  "category": "",
  "short_description": "",
  "long_description": "",
  "statement": "",
  "measure_type": "Incidence",
  "unit": "Cases per 100,000 people",
  "time_resolution": "Week",
  "restrictions": "",
  "sources": [],
  "citations": []
}
Groups of related datasets are combined into a bundle. For example, run:
dcf::dcf_process("bundle_respiratory", ".")
This creates a bundle folder for respiratory in the data folder.
Open the build.R file. This is where datasets should be combined and formatted into final 'production' formats. Output files are saved into the dist/ folder in whatever format is needed (e.g., parquet).
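As a rough illustration of what a build.R script might do, the sketch below stacks several standard files and writes one compressed csv into dist/. It uses base R only, the function names and output file name are hypothetical, and it assumes all input files share the same columns; a real build script would apply whatever combination logic the bundle needs.

```r
# Read several standard files and stack them, tagging each row with its
# source folder (e.g. "nssp" for nssp/standard/data.csv.gz).
combine_standard <- function(paths) {
  parts <- lapply(paths, function(p) {
    d <- read.csv(p) # base R reads gzip-compressed csv transparently
    d$source <- basename(dirname(dirname(p)))
    d
  })
  do.call(rbind, parts)
}

# Write the combined data to the dist/ folder as compressed csv.
write_dist_csv <- function(data, dist_dir, name = "combined.csv.gz") {
  dir.create(dist_dir, showWarnings = FALSE, recursive = TRUE)
  out <- file.path(dist_dir, name)
  con <- gzfile(out, "w")
  on.exit(close(con))
  write.csv(data, con, row.names = FALSE)
  invisible(out)
}
```

Writing parquet instead would replace the body of `write_dist_csv()` with a call to `arrow::write_parquet()`.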
Any standard format files that are used in the bundle should be referenced in process.json. For example:
"source_files": [
  "epic/standard/weekly.csv.gz",
  "gtrends/standard/data.csv.gz",
  "wastewater/standard/data.csv.gz",
  "abcs/standard/data.csv.gz",
  "abcs/standard/uad.csv.gz",
  "NREVSS/standard/data.csv.gz",
  "nssp/standard/data.csv.gz",
  "respnet/standard/data.csv.gz"
]
From the parent directory, run:
dcf_build()
These data and PopHIVE statistical outputs are provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors, contributors, or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the data or the use or other dealings in the data.
The PopHIVE statistical outputs are research tools intended for use in the fields of public health and medicine. They are not intended for clinical decision making, are not intended to be used in the diagnosis or treatment of patients and may not be useful or appropriate for any clinical purpose. Users of the PopHIVE statistical outputs should be aware of their responsibilities to ensure the ethical and appropriate use of this technology, including adherence to any applicable legal and regulatory requirements.
The content and data provided with the statistical outputs do not replace the expertise of healthcare professionals. Healthcare professionals should use their professional judgment in evaluating the PopHIVE statistical outputs.