Data Analytics at Carnegie Mellon University FAQ
Frequently asked questions related to data analytics at Carnegie Mellon University.
Centralization vs. Autonomy
How can the "Single Source of Truth" model preserve unit-level perspectives?
The Franchise model adopted by the university preserves unit-level perspectives by separating centralized data management from decentralized data application.
- Centralized Data, Decentralized Context: The central body (the Franchisor) ensures the data hub contains raw, high-quality, and governed data. The units (Franchisees) apply their specific domain expertise to this data.
- Localized Metrics: Units use the central hub to build metrics and dashboards that reflect their specific operational needs. For example, the Provost's office may use student data to calculate enrollment yield, while the Financial Aid office uses the same data to calculate need-based aid distribution. Both use the same underlying data, but their derived "data products" are tailored to their unique perspective.
- Data Views: We support this by creating specific "data views" or a semantic layer within the hub. This organizes central data into unit-specific groupings without duplicating the core data.
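To make the "data views" idea concrete, here is a minimal sketch in Python. The dataset, field names, and unit views are hypothetical illustrations; the point is that both units project the same central rows rather than copying data into unit-owned stores.

```python
# One central dataset; two unit-specific "views" over it.
# All names and figures below are hypothetical.

central_students = [
    {"id": 1, "admitted": True, "enrolled": True, "need_based_aid": 5000},
    {"id": 2, "admitted": True, "enrolled": False, "need_based_aid": 0},
    {"id": 3, "admitted": True, "enrolled": True, "need_based_aid": 12000},
]

def provost_view(rows):
    """Enrollment-focused projection: admission and enrollment flags only."""
    return [{"id": r["id"], "admitted": r["admitted"], "enrolled": r["enrolled"]}
            for r in rows]

def financial_aid_view(rows):
    """Aid-focused projection: aid amounts for enrolled students only."""
    return [{"id": r["id"], "need_based_aid": r["need_based_aid"]}
            for r in rows if r["enrolled"]]

# Both derived metrics read the same underlying rows; nothing is duplicated.
yield_rate = sum(r["enrolled"] for r in provost_view(central_students)) / len(central_students)
total_aid = sum(r["need_based_aid"] for r in financial_aid_view(central_students))
```

In practice this role is played by views or a semantic layer in the data platform rather than application code, but the principle is the same: one governed copy, many tailored projections.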
How can units retain autonomy and flexibility alongside centralized data and reporting?
Units retain autonomy by controlling the "last mile" of data product development. While the central team manages the "first mile" (provisioning and governance), units are responsible for:
- Franchisee-Owned Analytics: Units are fully responsible for developing their own localized solutions and deciding which tools/visualizations they use to analyze the data. This means they are autonomous in their choice of metrics, analysis methods (e.g., advanced analytics/AI/ML), and in creating their specific data products.
- Applying Domain Expertise: Using their operational expertise to identify problems and design solutions. Centralized data acts as a resource and standard, not a mandate on insights.
- Innovation within Boundaries: Innovating freely with data, provided they adhere to university-wide governance, security, and privacy policies.
How will the organization manage data duplication from various sources?
We are minimizing duplication by establishing the centralized data hub (Snowflake data lakehouse) as the mandatory conduit for major data initiatives.
- Centralized Ingestion: Critical operational data (e.g., student records, HR, finance) is copied only once—from the source system to the hub.
- Single Environment: Franchisees are required to use the data hub for solution development rather than pulling directly from source systems or building "shadow" data marts.
- Quality Adjudication: The Franchisor's focus on data quality and review/adjudication processes helps identify and resolve data discrepancies early. If a Franchisee proposes a data element that already exists in the hub, the process can guide them to the existing, governed data.
Data Access and Usability
How can access barriers to transaction-level data be removed?
Our intent is not to limit access; rather, by rigorously protecting sensitive elements (specifically PII), we encourage data stewards to make their data more widely available.
- Collaborative Solutions: Individuals requiring transaction-level data should work with the Data Collaborative to build specific views, reports, or dashboards that meet their needs.
- Streamlined Requests: We have made requesting access much easier through a new Data Governance process.
- Pre-Approved Access: Work is underway to create "pre-approved data products"—dashboards and datasets that will be available to the university community without requiring individual data steward approval.
How can data be made more broadly accessible to the university community?
We are actively working to democratize data access by shifting toward governed enablement. This includes the development of pre-approved data assets mentioned above. We encourage users with specific needs to partner with the Data Collaborative to design views that meet their requirements while respecting privacy constraints. Data access requests have also been centralized and simplified through the Data Governance framework.
Is improved data access measured as a key metric of success?
Yes. Improved data access is a primary metric of success. When users can easily locate and understand trusted data, operational efficiency increases. Beyond simple access statistics, we also measure success by monitoring how that data is utilized to make strategic decisions and implement change across the university.
Data Governance and Consistency
Can you explain the concept and management of data lineage?
Data lineage is a map of the journey of data—from its origin in a source system through its transformations to its final consumption in a report. Managing lineage allows us to trust the data, troubleshoot errors quickly, and meet compliance requirements. It ensures that any metric used in decision-making is traceable back to a reliable source, much like tracking a package from sender to delivery.
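The "tracking a package" analogy can be sketched as a small dependency graph. The asset names below are hypothetical; the idea is that each derived asset records its direct inputs, so any metric can be walked back to its origin system.

```python
# Minimal sketch of lineage as a graph: each asset lists its direct inputs.
# Asset names are hypothetical illustrations.

lineage = {
    "enrollment_dashboard": ["enrollment_metrics"],
    "enrollment_metrics": ["curated_students"],
    "curated_students": ["sis_raw_extract"],
    "sis_raw_extract": [],  # origin: the student information system
}

def trace_to_sources(asset, graph):
    """Walk upstream dependencies to find the origin systems of an asset."""
    inputs = graph.get(asset, [])
    if not inputs:
        return {asset}  # no inputs: this is a source
    sources = set()
    for upstream in inputs:
        sources |= trace_to_sources(upstream, graph)
    return sources
```

Lineage tooling automates exactly this traversal (and its reverse, for impact analysis when a source changes).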
How does leadership ensure data consistency and governance?
Leadership ensures consistency by establishing clear frameworks, standards, and accountability mechanisms. This includes:
- Governance Bodies: Defining roles and responsibilities (Data Stewards, Franchise) for managing quality and access.
- Tooling: Investing in technologies that support data cataloging, lineage tracking, and automated validation.
- Culture: Regular audits and communication to reinforce a culture where data stewardship is a shared responsibility.
Data Integration and Architecture
Are there plans to modernize legacy platforms and standardize data structures across tools?
Yes—there is an ongoing strategy to modernize legacy platforms and standardize data structures, but it’s executed incrementally as part of business-driven initiatives. We prioritize efforts based on business impact, risk, technical readiness, and franchise capacity. There isn’t a single enterprise-wide timeline covering all sources today.
If you have a candidate system or domain, please direct it through your franchise leadership or email enterprise-data@andrew.cmu.edu.
What are the biggest obstacles to integrating data from multiple source systems?
There are many obstacles to integrating data across systems. The hardest part isn’t the plumbing—it’s agreeing on what things mean, proving records refer to the same thing, and keeping that agreement stable as things change. The top three:
- Conceptual/semantic mismatches (beyond naming): Different systems describe the world at different levels and lenses (e.g., order header vs. line, account vs. person). That misalignment breaks things even when column names match.
- No reliable identifiers (entity resolution): The same person or company appears under different IDs, names, or email addresses. You get duplicates—or worse, merge unrelated records.
- Data quality at the source: Missing or invalid values, stale timestamps, free text where codes belong, data in the wrong field. Data may be fit for the source’s purpose but not fit for integration—problems appear only when you join it with other data.
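The entity-resolution obstacle above can be illustrated with a minimal sketch: the same person appears under differently formatted identifiers, and a naive join treats them as distinct records. The records and normalization rule are hypothetical; real entity resolution also needs fuzzy name matching, survivorship rules, and human review of ambiguous merges.

```python
# Minimal sketch of entity resolution by key normalization.
# Records and field names are hypothetical illustrations.

def normalize_email(email):
    """Lowercase and strip whitespace so cosmetic variants compare equal."""
    return email.strip().lower()

records = [
    {"system": "SIS", "email": "Jane.Doe@example.edu ", "name": "Jane Doe"},
    {"system": "HR",  "email": "jane.doe@example.edu",  "name": "J. Doe"},
    {"system": "LMS", "email": "jdoe2@example.edu",     "name": "Jane Doe"},
]

# Group records by normalized email; each group is one candidate entity.
# Note the limits: "Jane Doe" under a second address is NOT merged here,
# which is exactly the duplicate-vs-wrong-merge tension described above.
entities = {}
for rec in records:
    entities.setdefault(normalize_email(rec["email"]), []).append(rec)
```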
What is the timeline for consolidating shadow IT systems (e.g., spreadsheets) into central data sources?
Consolidation is prioritized on a case-by-case basis based on risk, compliance, and business criticality. We prioritize workflows that pose security risks or involve high manual effort. The process involves discovery, defining data definitions, prototyping in the central platform, and user validation before decommissioning the spreadsheet. If you have a specific use case, please submit it through your franchise leadership if applicable, or to enterprise-data@andrew.cmu.edu if not. We’ll review and share next steps.
Data Literacy and Training
Do end-users understand the data, and what training is needed for stakeholders?
Currently, data understanding is often siloed within the departments that produce it. To address this, we are launching resources like the Data Catalog to improve cross-departmental context. We are also expanding Data Literacy training to equip stakeholders with the skills to locate, interpret, and effectively use data for decision-making.
What is the plan for providing frequent, in-person data training opportunities for staff?
We recognize the strong demand for hands-on learning. Planning is underway to establish a regular cadence of in-person or synchronous training sessions. These will focus on the practical application of our analytics tools and on reinforcing data literacy concepts. We are currently assessing staff needs to design the most effective curriculum.
Will training be provided to facilitate the move to campus-wide data analytics?
Yes. Training will be provided to support the transition to campus-wide data analytics. While the specific structure and content are still being developed, the plan includes leveraging both internal expertise and externally sourced training opportunities to ensure comprehensive coverage of tools, concepts, and best practices. The goal is to equip staff and stakeholders with the knowledge and skills needed to effectively use new analytics resources. Planning efforts will focus on aligning these training opportunities with identified user needs and system capabilities.
Data Quality and Trust
How can we improve data quality and completeness?
Improving quality starts with identifying issues at the source. We have established a feedback loop through the Data Stewardship Council, where users can report inaccuracies. By identifying these inaccuracies and inconsistencies, we can create processes to eliminate them at the root (in the enterprise application systems) before they reach users.
How do we build confidence in the data for informed decision-making?
Confidence is built through transparency and responsiveness. We encourage users to report data quality concerns to the Data Governance Office so the Data Stewardship Council can adjudicate and resolve them. Additionally, we now publish a list of known data quality issues, ensuring users are aware of current limitations while we work toward resolution.
What processes are in place to ensure data accuracy?
We utilize a combination of automated validation and human governance. The Data Stewardship Council oversees the identification and remediation of quality issues. By centralizing ingestion and standardizing definitions, we reduce the variance that occurs when data is manually manipulated in isolated silos.
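The "automated validation" half of that combination can be sketched as declarative rules checked on ingest, so issues surface before data reaches users. The rules and field names below are hypothetical; production platforms express the same idea in validation frameworks or pipeline tests rather than hand-rolled code.

```python
# Minimal sketch of automated validation: each field has a rule, and a row
# fails if any rule does. Rules and fields are hypothetical illustrations.

rules = {
    "id":    lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "term":  lambda v: v in {"fall", "spring", "summer"},
}

def validate(row, rules):
    """Return the list of field names that are missing or fail their rule."""
    return [field for field, check in rules.items()
            if field not in row or not check(row[field])]

good = {"id": 7, "email": "a@b.edu", "term": "fall"}
bad  = {"id": 0, "email": "not-an-email", "term": "fall"}
```

Rows that fail are routed to the relevant data steward for remediation at the source, which is the human-governance half of the process.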
Impact, Action, and Culture
What are the top-down leadership initiatives for modernizing key databases?
There are many projects underway to modernize critical enterprise applications, which are a primary source of university data. Coordination of these major investments is managed through MASG (Major Administrative Systems Governance), a governing body composed of the owners and stewards of the enterprise IT systems that drive our largest IT investments. These IT leaders are responsible for the major enterprise systems used to operate the university's diverse business functions; they represent not only the concerns and needs of the systems they oversee but also act as a conduit to the business functions those systems serve.
How do we ensure accountability for moving from data analysis to concrete action and change?
Accountability is embedded in our roadmap process. We utilize Baseline and Needs Assessments to link every analytical project back to a specific strategic goal. This ensures that analysis is not performed for its own sake, but is directly tied to an intended outcome or operational improvement.
How can we determine the value of specific data, especially in the finance domain?
Data is worth pursuing if the value equation is positive: Value = (Benefit + Risk Mitigation) - Cost.
- Benefit: Strategic advantage, efficiency, or increased revenue.
- Risk Mitigation: Avoidance of fines, fraud, or reputational loss.
- Cost: The total cost of acquisition, cleaning, storage, and security.
In finance, value is measured by the utility of achieving efficiency, mitigating financial risk, and aiding strategic resource allocation.
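The value equation above can be written as a direct computation. The dollar figures are hypothetical and purely illustrative.

```python
# Value = (Benefit + Risk Mitigation) - Cost; pursue the data if positive.

def data_value(benefit, risk_mitigation, cost):
    """Net value of acquiring and maintaining a dataset."""
    return (benefit + risk_mitigation) - cost

# Hypothetical finance example: $120k efficiency gain, $50k fraud-loss
# avoidance, $90k total cost of acquisition, cleaning, storage, and security.
value = data_value(120_000, 50_000, 90_000)
worth_pursuing = value > 0
```

The hard part in practice is estimating the three terms, especially risk mitigation, which only pays off when an avoided loss would otherwise have occurred.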
What traits define successful "bridge-builders" for cross-organizational data efforts?
Successful bridge-builders play a critical role in a federated model like ours, as they connect the centralized Franchisor capabilities with the decentralized Franchisee needs. Their traits include:
- Bilingual Fluency (Technical & Business): They can fluently translate a department's operational problem (e.g., "Our course scheduling is inefficient") into a technical data requirement (e.g., "Need historical enrollment, room usage, and faculty load data integrated").
- High Emotional Intelligence and Trust: They must be trusted by both the central IT/Data teams (Franchisor) and the departmental operational staff (Franchisee). They act as a neutral party, capable of mediating competing priorities and expectations.
- Systems Thinking: They don't just solve the immediate problem; they understand how a specific data solution in one unit can potentially be reused or integrated with another unit's data product, thereby facilitating the desired cross-unit collaboration.
- Focus on Business Value: They ensure every data effort is tied back to a tangible, demonstrable outcome for the Franchisee, moving the focus from "data for data's sake" to "data for institutional benefit."
Privacy, Ethics, and Bias
Are there plans for data retention policies, especially for historically sensitive data?
Yes. A working group led by the University Libraries is currently developing comprehensive data retention policies. This initiative aims to balance the historical preservation needs of the university with privacy, security, and regulatory requirements regarding sensitive data deletion.
How do we balance data sharing with privacy and regulatory compliance?
Balancing data sharing with privacy and regulatory compliance means using only the minimum data necessary to meet business objectives while ensuring transparency, consent, and protection of personal information. This requires aligning with legal standards, applying privacy-enhancing technologies, and embedding governance practices that prioritize responsible data use.
How is the university addressing bias in data collection and analysis?
Carnegie Mellon University addresses bias in data collection and analysis by embedding ethical, transparent, and consistent practices throughout its research and assessment processes.
- The Office of Research Integrity and Compliance (ORIC) promotes responsible research through training that emphasizes accuracy, fairness, and the avoidance of bias.
- The Office of Institutional Research and Analysis (IR&A) ensures standardized data definitions and reporting to reduce inconsistencies.
Additionally, the university prioritizes self-reported demographic data to respect individual identity, using institutional data only as a secondary source to ensure completeness. Assessments incorporate representative student perspectives, analyze results across demographic groups, and use findings to guide improvement rather than judgment. Ongoing training for faculty, staff, and students on topics such as implicit bias and ethical data management further strengthens the university’s commitment to objective and responsible data practices.
Technology, AI, and Tools
Given concerns about bias, should the university be using AI for data analytics?
Yes, but with strict guardrails. AI allows us to spot patterns and improve services, but it must be used to augment human judgment, not replace it. We deploy analytics models only when we can demonstrate fairness, protect privacy, and maintain human accountability for the final decisions.
Where is human oversight necessary, and where is AI analysis inappropriate?
AI adoption is not binary; rather, the level of human oversight scales with the potential risk—specifically regarding impact, reversibility, explainability, and data sensitivity. We operate on the principle that while AI can inform analysis, humans must remain accountable for material outcomes.
We categorize the necessary oversight into three distinct tiers:
- Human-Required: Essential for high-stakes decisions affecting rights, access, or benefits; situations where errors are irreversible; or instances where models lack full explainability.
- AI-Assistive: Used for providing recommendations where a human reviews and validates the output (Human-in-the-loop).
- AI-Autonomous: Permissible only for low-risk, reversible tasks, such as system health alerts that are monitored by IT staff.
As a standard rule, if an outcome affects a student or staff member's livelihood or status, a human must always serve as the final decision-maker.
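The tiering rule above can be sketched as a simple decision function. The attribute names and thresholds are hypothetical illustrations of the policy, not an implemented system.

```python
# Minimal sketch of risk-based oversight tiering.
# Attributes are hypothetical illustrations of the policy described above.

def oversight_tier(affects_rights_or_status, irreversible, fully_explainable):
    """Map decision attributes to the required level of human oversight."""
    if affects_rights_or_status or irreversible or not fully_explainable:
        return "human-required"  # a human makes the final decision
    return "ai-assistive"        # a human reviews and validates AI output

def may_run_autonomously(tier, low_risk_and_reversible):
    """AI-autonomous operation is permitted only for low-risk, reversible tasks."""
    return tier == "ai-assistive" and low_risk_and_reversible

# e.g. an admissions outcome is always human-required; a system health
# alert that staff can simply dismiss may run autonomously.
```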
What is the long-term vision for using key data technologies like Snowflake?
Snowflake is the backbone of our data hub. It’s where we curate governed, reusable data products (tables/views) and, in many cases, encode the business logic that standardizes definitions across units. The goal is a trusted source of truth that can serve both analytics and select operational read use cases without every team having to rebuild the same pipelines.
The long-term vision includes:
- Central, governed data products: Canonical entities and metrics live here, versioned and documented, with quality checks and lineage.
- Transform in the platform (ELT): Business rules are expressed as code and run in Snowflake so they’re testable, reviewable, and repeatable.
- Built for reuse: Clear definitions, a common way to link records, standard access, and secure sharing—so everyone uses the same truth without extra copies.
- Right-time by default: Cadence scales daily → hourly → minutes based on business need. While near-real-time is supportable, we avoid making operational systems depend on Snowflake for write-time or sub-second decisions.
- Supporting AI: Enabling ML and GenAI while keeping data in a governed environment—allowing standard security, auditing, and policy controls to apply.