DATA2901: Big Data and Data Diversity (Advanced)

University of Sydney • 2025 S1

Completed

Python

SQL

Hadoop

Linux

Jupyter

Big Data

Applied Linux, SQL, Python, and Hadoop for large-scale data pipelines, integrating machine learning and database design into real-world workflows.

Learning Outcomes

Data Automation: Automated tasks on structured, semi-structured, and unstructured data (text, images, geospatial, time series).
Ingestion & Integration: Combined heterogeneous datasets and produced meaningful summaries under real-world constraints.
Declarative Querying: Wrote efficient SQL for extraction and manipulation.
Big Data Fundamentals: Applied the 4Vs (volume, velocity, variety, veracity); used indexing, compression, partitioning, and distributed frameworks such as Hadoop.
Ethics & Privacy: Considered privacy and ethical implications when handling sensitive or large-scale data.

Takeaways

This was my first advanced-level data science course, and the expectations were clearly higher than in introductory units. I learned to handle data on Linux systems, deepened my knowledge of big data concepts, and used Jupyter notebooks to connect to PostgreSQL databases. A key part of the course was designing table structures and fields that support efficient querying and retrieval, skills that are directly relevant to a data scientist’s work. In the group project, we also had to integrate machine learning techniques into our analysis, which required both technical adaptability and teamwork. The SQL tests and final exam were particularly challenging for me, but I see those difficulties as valuable growth opportunities. Beyond the technical skills, I realized how important it is to manage group dynamics effectively when tackling complex projects. Overall, the challenges were significant, but the learning and confidence I gained were even greater.