Exploring Trends of Data Science
Data science is the practice of uncovering patterns, trends, and insights from large sets of data. Unlike AI, which aims to create intelligent behavior, data science emphasizes data analysis and employs a range of techniques—from traditional statistics to more advanced machine learning models. This thread will delve into how data scientists approach problems, the methods they use (both machine learning and beyond), and the diverse applications where data science drives decisions across industries. I’ll be posting regular updates about the evolving landscape of data science, sharing the latest methods, trends, and tools that are shaping this field.
11/1/2024
April 19, 2025
The Wikimedia Foundation has announced a new initiative aimed at reducing the strain placed on Wikipedia’s servers by artificial intelligence developers who frequently scrape its content. In partnership with Kaggle, a Google-owned platform for data science and machine learning, Wikimedia has released a beta dataset containing structured Wikipedia content in English and French. This dataset is explicitly designed for machine learning workflows and offers a cleaner, more accessible alternative to scraping raw article text.
According to Wikimedia, the dataset includes machine-readable representations of Wikipedia articles in the form of structured JSON files. These contain elements such as research summaries, short descriptions, image links, infobox data, and various article sections. However, it intentionally excludes references and non-textual content like audio files. The goal is to provide a more efficient and reliable resource for tasks such as model training, fine-tuning, benchmarking, and alignment.
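As a rough illustration of what consuming such a dump might look like, here is a minimal Python sketch. The file name and field names (`name`, `abstract`, `sections`) are my assumptions for illustration; the actual schema is documented on the Kaggle dataset card.

```python
import json

# Hypothetical walk through a structured Wikipedia dump in which each line
# is one article serialized as a JSON object (field names below are
# assumptions, not the dataset's documented schema).
with open("enwiki_structured.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        title = article.get("name", "?")
        abstract = article.get("abstract", "")
        sections = article.get("sections", [])
        print(f"{title}: {abstract[:80]} ({len(sections)} sections)")
        break  # peek at the first article only
```

The point of the format is visible even in this toy: no HTML parsing, no reference stripping, just ready-to-use fields.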
While Wikimedia already maintains content-sharing agreements with large organizations such as Google and the Internet Archive, this collaboration with Kaggle is intended to broaden access to high-quality Wikipedia data, particularly for smaller companies and independent researchers. Kaggle representatives have expressed enthusiasm for the partnership, highlighting their platform’s role in supporting the machine learning community and their commitment to making this dataset widely available and useful.
That’s my take on it:
While the release of a structured dataset by the Wikimedia Foundation is a meaningful step toward reducing reliance on web scraping, its overall impact on the broader data science community—particularly those working with unstructured data—may be limited. For data scientists focused on natural language processing or machine learning applications involving encyclopedic knowledge, the dataset offers clear benefits. By providing pre-processed, machine-readable JSON files containing curated article content, it simplifies data ingestion and integration, reducing the overhead traditionally associated with scraping and cleaning raw HTML. This is particularly valuable for smaller organizations and independent researchers who may lack the infrastructure or resources to perform large-scale data extraction.
However, for those whose work depends heavily on unstructured data—such as social media analysis, customer feedback mining, or domain-specific natural language processing—the dataset does little to alleviate their ongoing need to collect data from diverse, often messy sources. The vast majority of valuable online information remains in unstructured formats, and in many cases, it is accessible only through scraping or limited APIs. As such, this initiative by Wikimedia is unlikely to replace the necessity of scraping for most real-world applications.
Even so, web scraping remains controversial, and this move is symbolically significant. It reflects a broader trend toward encouraging ethical and sustainable access to machine-learning-relevant content. By offering a public, machine-learning-friendly dataset, Wikimedia sets a precedent that could inspire other content providers to follow suit, potentially reducing the strain caused by indiscriminate scraping and fostering greater transparency. In that sense, while the immediate practical implications may be narrow, the long-term influence on data access practices could be substantial.
Link: https://www.theverge.com/news/650467/wikipedia-kaggle-partnership-ai-dataset-machine-learning
Top 10 innovative data science companies in 2025
March 21, 2025
In a year where artificial intelligence is becoming the bedrock of innovation across industries, the importance of data science has never been clearer. As Michel Tricot, CEO of Airbyte, puts it: “No data, no AI.” The 10 companies recognized by Fast Company in 2025 aren’t just building clever AI tools—they’re transforming how data is collected, processed, and used to solve real-world problems. From healthcare to crypto, supply chains to outer space, these innovators are proving that the smart use of data can power meaningful change.
1. Unstructured
Unstructured unlocks hidden business value by converting unstructured data into AI-ready formats, fueling applications like RAG and fine-tuned LLMs (a code sketch of the idea follows this list). With 10,000+ customers and partnerships with U.S. military branches, it’s become a foundational tool for enterprise AI.
2. Chainalysis
Chainalysis brings clarity to the murky world of crypto through blockchain forensics, helping trace and recover billions in illicit funds. In 2024 alone, it analyzed $4 trillion in transactions and secured a landmark legal win for crypto analytics.
3. Airbyte
Airbyte makes large-scale data integration seamless, enabling AI initiatives with plug-and-play connectors and unstructured data support. Its open-source model now powers over 170,000 deployments and a thriving ecosystem of 10,000+ community-built connectors.
4. Norstella
Norstella speeds up the drug development pipeline by analyzing billions of data points through its AI platforms, helping pharma companies make faster, smarter decisions. It has directly contributed to the launch of over 50 new drugs in the past year.
5. Makersite
Makersite empowers product teams to design more sustainably with real-time supply chain data and AI-driven life cycle analysis. In one standout case, it helped Microsoft slash the Surface Pro 10’s carbon footprint by 28%.
6. Anaconda
Anaconda is democratizing AI by enhancing Python workflows for data scientists and non-coders alike, with tools like Python in Excel and a secure AI model library. Now used by over 1 million organizations, it’s a key enabler of accessible data science.
7. Satelytics
Satelytics uses advanced geospatial analytics to detect methane leaks and monitor land health, offering quick insights from satellites and drones. Its technology helped Duke Energy detect hundreds of leaks and has expanded across multiple industries.
8. Rune Labs
Rune Labs is changing the way Parkinson’s disease is managed with real-time data from wearables and AI-driven treatment insights. Its platform has improved patient outcomes significantly, reducing ER visits and boosting medication adherence.
9. EarthDaily
EarthDaily enhances sustainability in mining through hyperspectral imaging and radar analytics that reduce environmental impact and safety risks. It provides precision tools to accelerate mineral discovery while avoiding unnecessary drilling.
10. Nominal
Nominal streamlines testing and evaluation in aerospace, defense, and high-tech sectors with a unified, real-time analytics platform. Used in everything from drone trials to spacecraft diagnostics, it’s redefining how critical systems are validated.
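Circling back to No. 1: the idea behind Unstructured’s tooling can be sketched with its open-source library. The calls below (partition, chunk_by_title) and the file name are my assumptions of a typical workflow, not the company’s enterprise product.

```python
# Hedged sketch: convert a raw document into retrieval-sized text chunks
# for a RAG pipeline using the open-source `unstructured` library.
# "quarterly_report.pdf" is a hypothetical input file.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="quarterly_report.pdf")  # auto-detects file type, extracts elements
chunks = chunk_by_title(elements)                      # group elements under section titles

for chunk in chunks[:3]:
    print(chunk.text[:80])  # preview the first few AI-ready chunks
```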
That’s my take on it:
The phrase “No data, no AI” captures more than just a technical truth—it underscores the deep interdependence between data science and artificial intelligence. No matter how advanced AI becomes, it cannot function in a vacuum. It needs clean, relevant, and well-structured data to learn, adapt, and perform effectively. And that process—collecting, cleaning, transforming, and curating data—is still very much a human-driven discipline.
The success of the companies recognized by Fast Company in 2025 highlights this reality. Whether it's transforming unstructured data into LLM-ready formats, streamlining complex supply chains, or analyzing geospatial signals from satellites, these innovations all hinge on strong data science foundations, not just AI magic. What they demonstrate is that while AI engineering skills are in high demand, non-AI data science roles—like data engineering, data quality management, and domain-specific analytics—remain absolutely essential.
Link: https://www.fastcompany.com/91269286/data-science-most-innovative-companies-2025
Python is Number 1 in TIOBE Index
Jan. 10, 2025
Python has been named TIOBE’s Programming Language of the Year 2024, an award the TIOBE Index gives to the language with the largest ratings increase over the year; Python also holds the top spot in the index itself. Nonetheless, while Python offers numerous advantages, it also faces challenges such as performance limitations and runtime errors. The TIOBE Index gauges programming language popularity from the worldwide supply of skilled engineers, courses, and third-party vendors, drawing on results from popular search engines and sites such as Google and Amazon. Positions two through five are occupied by C++, Java, C, and C#. Notably, SQL ranks eighth, R is positioned at number 18, and SAS is at number 22.
That’s my take on it:
Python's widespread popularity is largely driven by the growing demand for data science and machine learning. Its rich ecosystem of libraries and frameworks, including TensorFlow, PyTorch, and scikit-learn, makes it an ideal choice for DSML tasks. Interestingly, certain programming languages exhibit remarkable longevity. For example, JavaScript was ranked seventh in 2000 and currently holds sixth place. Similarly, Fortran, which was ranked 11th in 1995, now occupies the tenth position. The resurgence of Fortran is notable; according to TIOBE, it excels in numerical analysis and computational mathematics, both of which are increasingly relevant in artificial intelligence. Fortran is also gaining traction in image processing applications, including gaming and medical imaging.
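As a quick illustration of that ecosystem’s pull, a complete train-and-evaluate workflow takes only a few lines of scikit-learn. This is a toy example on a bundled dataset, not a benchmark of any kind.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data, split into train/test, fit a model, and score it end to end.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```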
While some languages maintain stable rankings over time, others have shown dramatic improvements. For instance, SQL was ranked 100th in 2005 but has since climbed into the top ten. Predicting the future trajectory of programming languages is challenging, underscoring the dynamic nature of the field. As the saying goes, "Never say never!"
Links:
Machine learning methods fail in genome studies
11/20/2024
A recent article titled "AI-Assisted Genome Studies Are Riddled with Errors" by Dr. Sitaraman highlights the challenges and errors associated with using artificial intelligence (AI) in large genomics studies. Researchers have employed AI to fill in gaps in patient information and improve predictions in genome-wide association studies (GWAS). However, new research from the University of Wisconsin-Madison reveals that these AI-assisted approaches can lead to false positives and misleading correlations.
For 15 years, GWAS has been used to identify genetic variants associated with traits or diseases. Despite its success, GWAS has limitations, which scientists have attempted to overcome using AI. However, AI can introduce biases, especially when working with incomplete datasets. The research highlights that AI-assisted GWAS can create false associations between gene variants and diseases. For instance, AI models showed a high correlation between certain gene variants and type II diabetes, which was not supported by conventional GWAS. Further, the use of proxy data, such as family history, in GWAS-by-proxy (GWAX) can also lead to incorrect conclusions. For example, AI approaches showed a positive correlation between educational attainment and Alzheimer's risk, contrary to established GWAS findings. The research team suggests new statistical methods to correct these biases and emphasizes the need for transparency and rigor in reporting findings from AI-assisted studies.
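To make the mechanism concrete, here is a toy simulation of my own (not the Wisconsin team’s method): a trait with no genetic effect at all is imputed by a model fitted on a small cohort, the model picks up a chance weight on a variant, and an association test against the imputed trait then looks highly significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_train, n_test = 500, 50_000

def simulate(n):
    g = rng.binomial(2, 0.3, n).astype(float)  # variant: minor-allele counts (0/1/2)
    c = rng.normal(size=n)                     # unrelated clinical covariate
    y = rng.normal(size=n)                     # trait with NO true genetic effect
    return g, c, y

# Small labeled cohort: fit a linear "imputation model" for the trait.
g_tr, c_tr, y_tr = simulate(n_train)
X_tr = np.column_stack([np.ones(n_train), g_tr, c_tr])
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # chance nonzero weight on g

# Large unlabeled cohort: the trait is imputed rather than measured.
g_te, c_te, _ = simulate(n_test)
y_imputed = beta[0] + beta[1] * g_te + beta[2] * c_te

# GWAS-style association test of the variant against the imputed trait:
# typically "significant" even though the true effect is exactly zero.
r, p = stats.pearsonr(g_te, y_imputed)
print(f"r = {r:+.3f}, p = {p:.1e}")
```

The imputed phenotype inherits the variant’s fingerprint from the imputation model, so the downstream test is no longer testing what it claims to test.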
That’s my take on it:
No doubt machine learning methods have overshadowed conventional statistics in big data analytics. However, no solution is 100% foolproof. Like conventional statistics, machine learning methods can be misguided and misused. We should avoid the mentality that “when all you have is a hammer, every problem looks like a nail”: don’t apply ML just because it is popular or powerful, and then mindlessly assume that the conclusion must be right. Rather, we must consider the nature of the data, the question being asked, and the desired outcome. When a method is experimental and the data pattern is strange, we must evaluate it with skeptical eyes. After all, skepticism is the guiding principle of Tukey’s exploratory data analysis.
Link: https://www.the-scientist.com/ai-assisted-genome-studies-are-riddled-with-errors-72339


Is Rguroo Version 2.0 a journey to data science?
11/12/2024
Today, Rguroo announced the release of Rguroo Version 2.0, titled “Journey to Data Science.” Among its major updates is an enhanced logistic regression feature. The new diagnostic tools include interactive logistic curve plotting, prediction assessments, external data prediction, model validation, and k-fold cross-validation. Rguroo is a web-based statistical platform with a graphical user interface (GUI) that provides access to R’s capabilities without requiring users to know R programming.
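For readers who want to see what those diagnostics amount to outside a GUI, here is a minimal sketch in Python with scikit-learn. The dataset and fold count are arbitrary choices for illustration; Rguroo itself runs these analyses through R.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a logistic regression and validate it with 5-fold cross-validation,
# the same kind of out-of-sample check Rguroo 2.0 adds to its interface.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy per fold: {scores.round(3)}, mean = {scores.mean():.3f}")
```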
That’s my take on it:
While Rguroo benefits analysts and students by making R’s statistical and graphical tools more accessible, I am uncertain whether this release truly represents a “journey to data science.” After exploring the software, I noticed it lacks core data science methods, such as decision trees, random forests, XGBoost, gradient boosting, and neural networks. The focus on logistic regression—while a valuable classical statistical method—reflects a model-driven, inference-centered approach, rather than the data exploration and pattern recognition that define data science and machine learning. In fact, this highlights a broader issue: many so-called “data science” tools and programs don’t fully reflect the paradigm shift toward data-centric methodologies.
Link: https://rguroo.com/


DSML trend: Best visualization tools for business intelligence
11/8/2024
According to Forbes Advisor, the top data visualization tools for business in 2024 are as follows:
· Microsoft Power BI: Leader in business intelligence (BI) with robust integration capabilities
· Tableau: Known for sophisticated interactive visualizations
· Qlik Sense: Stands out for AI integration and machine learning features
· Klipfolio: Excels in custom dashboard creation
· Looker: Provides comprehensive visualization options and data modeling
· Zoho Analytics: Seamlessly integrates with other Zoho products
· Domo: Distinguished by its custom app development capabilities
The evaluation criteria included user-friendliness, cost-effectiveness, support quality, and key features such as real-time analytics, customization options, and collaborative data sharing.
That’s my take on it:
Data visualization tools are essential for both business and academic purposes, offering powerful ways to analyze and present complex data. While the tools mentioned by Forbes are indeed popular for business intelligence, there are several excellent options for academics and other specialized purposes. For example,
· SAS Visual Analytics on SAS Viya: General-purpose analytics and visualization
· JMP Pro: General-purpose analytics and visualization
· IBM Watson Studio: General-purpose analytics and visualization
· MATLAB: Popular in engineering and scientific computing, providing robust visualization tools alongside computational capabilities.
· Wolfram Mathematica: A powerful and comprehensive computational software system that offers extensive capabilities for data visualization, scientific computing, and statistical analysis.
· Origin: Specifically designed for scientific graphing and data analysis, popular in physical sciences and engineering.
· Gephi: An open-source tool particularly useful for network analysis and visualization, popular in social sciences and complex systems research.
· Python with libraries like Matplotlib, Seaborn, and Plotly: Widely used in data science and research for its flexibility and powerful visualization options.
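To give a flavor of that last option, here is a minimal sketch drawing the same synthetic data with Matplotlib and Seaborn:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y, s=10)                               # Matplotlib: explicit, low-level control
ax1.set(title="Matplotlib", xlabel="x", ylabel="y")
sns.regplot(x=x, y=y, ax=ax2, scatter_kws={"s": 10})  # Seaborn: regression fit and CI by default
ax2.set(title="Seaborn", xlabel="x", ylabel="y")
fig.tight_layout()
plt.show()
```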
Factors for selecting visualization tools:
· Data complexity and size: Tools like SAS Viya and IBM Watson Studio are better suited for very large datasets.
· Statistical analysis needs: JMP and Python offer more advanced statistical capabilities.
· Collaboration requirements: Cloud-based solutions like IBM Watson Studio may offer better collaboration features.
· Domain-specific needs: Some fields have preferred tools (e.g., Gephi for network analysis; MATLAB and Mathematica for mathematics and engineering).
Link: https://www.forbes.com/advisor/business/software/best-data-visualization-tools/