Exploring Trends of Data Science
Data science is the practice of uncovering patterns, trends, and insights from large sets of data. Unlike AI, which aims to create intelligent behavior, data science emphasizes data analysis and employs a range of techniques—from traditional statistics to more advanced machine learning models. This thread will delve into how data scientists approach problems, the methods they use (both machine learning and beyond), and the diverse applications where data science drives decisions across industries. I’ll be posting regular updates about the evolving landscape of data science, sharing the latest methods, trends, and tools that are shaping this field.
11/1/2024
April 19, 2025
The Wikimedia Foundation has announced a new initiative aimed at reducing the strain placed on Wikipedia’s servers by artificial intelligence developers who frequently scrape its content. In partnership with Kaggle, a Google-owned platform for data science and machine learning, Wikimedia has released a beta dataset containing structured Wikipedia content in English and French. This dataset is explicitly designed for machine learning workflows and offers a cleaner, more accessible alternative to scraping raw article text.
According to Wikimedia, the dataset includes machine-readable representations of Wikipedia articles in the form of structured JSON files. These contain elements such as research summaries, short descriptions, image links, infobox data, and various article sections. However, it intentionally excludes references and non-textual content like audio files. The goal is to provide a more efficient and reliable resource for tasks such as model training, fine-tuning, benchmarking, and alignment.
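As a rough illustration of what consuming such a dump might look like, here is a minimal Python sketch. The file name and field names (`name`, `abstract`, `sections`) are my assumptions for illustration; the actual schema is documented on the Kaggle dataset card.

```python
import json

# Hypothetical walk through a structured Wikipedia dump in which each line
# is one article serialized as a JSON object (field names below are
# assumptions, not the dataset's documented schema).
with open("enwiki_structured.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        title = article.get("name", "?")
        abstract = article.get("abstract", "")
        sections = article.get("sections", [])
        print(f"{title}: {abstract[:80]} ({len(sections)} sections)")
        break  # peek at the first article only
```

The point of the format is visible even in this toy: no HTML parsing, no reference stripping, just ready-to-use fields.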
While Wikimedia already maintains content-sharing agreements with large organizations such as Google and the Internet Archive, this collaboration with Kaggle is intended to broaden access to high-quality Wikipedia data, particularly for smaller companies and independent researchers. Kaggle representatives have expressed enthusiasm for the partnership, highlighting their platform’s role in supporting the machine learning community and their commitment to making this dataset widely available and useful.
That’s my take on it:
While the release of a structured dataset by the Wikimedia Foundation is a meaningful step toward reducing reliance on web scraping, its overall impact on the broader data science community—particularly those working with unstructured data—may be limited. For data scientists focused on natural language processing or machine learning applications involving encyclopedic knowledge, the dataset offers clear benefits. By providing pre-processed, machine-readable JSON files containing curated article content, it simplifies data ingestion and integration, reducing the overhead traditionally associated with scraping and cleaning raw HTML. This is particularly valuable for smaller organizations and independent researchers who may lack the infrastructure or resources to perform large-scale data extraction.
However, for those whose work depends heavily on unstructured data—such as social media analysis, customer feedback mining, or domain-specific natural language processing—the dataset does little to alleviate their ongoing need to collect data from diverse, often messy sources. The vast majority of valuable online information remains in unstructured formats, and in many cases, it is accessible only through scraping or limited APIs. As such, this initiative by Wikimedia is unlikely to replace the necessity of scraping for most real-world applications.
Even so, web scraping remains controversial, and this move is symbolically significant. It reflects a broader trend toward encouraging ethical and sustainable access to machine-learning-relevant content. By offering a public, machine-learning-friendly dataset, Wikimedia sets a precedent that could inspire other content providers to follow suit, potentially reducing the strain caused by indiscriminate scraping and fostering greater transparency. In that sense, while the immediate practical implications may be narrow, the long-term influence on data access practices could be substantial.
Link: https://www.theverge.com/news/650467/wikipedia-kaggle-partnership-ai-dataset-machine-learning
Top 10 innovative data science companies in 2025
March 21, 2025
In a year where artificial intelligence is becoming the bedrock of innovation across industries, the importance of data science has never been clearer. As Michel Tricot, CEO of Airbyte, puts it: “No data, no AI.” The 10 companies recognized by Fast Company in 2025 aren’t just building clever AI tools—they’re transforming how data is collected, processed, and used to solve real-world problems. From healthcare to crypto, supply chains to outer space, these innovators are proving that the smart use of data can power meaningful change.
1. Unstructured
Unstructured unlocks hidden business value by converting unstructured data into AI-ready formats, fueling applications like RAG and fine-tuned LLMs (a code sketch of the idea follows this list). With 10,000+ customers and partnerships with U.S. military branches, it’s become a foundational tool for enterprise AI.
2. Chainalysis
Chainalysis brings clarity to the murky world of crypto through blockchain forensics, helping trace and recover billions in illicit funds. In 2024 alone, it analyzed $4 trillion in transactions and secured a landmark legal win for crypto analytics.
3. Airbyte
Airbyte makes large-scale data integration seamless, enabling AI initiatives with plug-and-play connectors and unstructured data support. Its open-source model now powers over 170,000 deployments and a thriving ecosystem of 10,000+ community-built connectors.
4. Norstella
Norstella speeds up the drug development pipeline by analyzing billions of data points through its AI platforms, helping pharma companies make faster, smarter decisions. It has directly contributed to the launch of over 50 new drugs in the past year.
5. Makersite
Makersite empowers product teams to design more sustainably with real-time supply chain data and AI-driven life cycle analysis. In one standout case, it helped Microsoft slash the Surface Pro 10’s carbon footprint by 28%.
6. Anaconda
Anaconda is democratizing AI by enhancing Python workflows for data scientists and non-coders alike, with tools like Python in Excel and a secure AI model library. Now used by over 1 million organizations, it’s a key enabler of accessible data science.
7. Satelytics
Satelytics uses advanced geospatial analytics to detect methane leaks and monitor land health, offering quick insights from satellites and drones. Its technology helped Duke Energy detect hundreds of leaks and has expanded across multiple industries.
8. Rune Labs
Rune Labs is changing the way Parkinson’s disease is managed with real-time data from wearables and AI-driven treatment insights. Its platform has improved patient outcomes significantly, reducing ER visits and boosting medication adherence.
9. EarthDaily
EarthDaily enhances sustainability in mining through hyperspectral imaging and radar analytics that reduce environmental impact and safety risks. It provides precision tools to accelerate mineral discovery while avoiding unnecessary drilling.
10. Nominal
Nominal streamlines testing and evaluation in aerospace, defense, and high-tech sectors with a unified, real-time analytics platform. Used in everything from drone trials to spacecraft diagnostics, it’s redefining how critical systems are validated.
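Circling back to No. 1: the idea behind Unstructured’s tooling can be sketched with its open-source library. The calls below (partition, chunk_by_title) and the file name are my assumptions of a typical workflow, not the company’s enterprise product.

```python
# Hedged sketch: convert a raw document into retrieval-sized text chunks
# for a RAG pipeline using the open-source `unstructured` library.
# "quarterly_report.pdf" is a hypothetical input file.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="quarterly_report.pdf")  # auto-detects file type, extracts elements
chunks = chunk_by_title(elements)                      # group elements under section titles

for chunk in chunks[:3]:
    print(chunk.text[:80])  # preview the first few AI-ready chunks
```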
That’s my take on it:
The phrase “No data, no AI” captures more than just a technical truth—it underscores the deep interdependence between data science and artificial intelligence. No matter how advanced AI becomes, it cannot function in a vacuum. It needs clean, relevant, and well-structured data to learn, adapt, and perform effectively. And that process—collecting, cleaning, transforming, and curating data—is still very much a human-driven discipline.
The success of the companies recognized by Fast Company in 2025 highlights this reality. Whether it's transforming unstructured data into LLM-ready formats, streamlining complex supply chains, or analyzing geospatial signals from satellites, these innovations all hinge on strong data science foundations, not just AI magic. What they demonstrate is that while AI engineering skills are in high demand, non-AI data science roles—like data engineering, data quality management, and domain-specific analytics—remain absolutely essential.
Link: https://www.fastcompany.com/91269286/data-science-most-innovative-companies-2025
Python is Number 1 in TIOBE Index
Jan. 10, 2025
Python has been named TIOBE’s Programming Language of the Year 2024, an award the TIOBE Index gives to the language with the largest ratings increase over the year; Python also holds the top spot in the index itself. Nonetheless, while Python offers numerous advantages, it also faces challenges such as performance limitations and runtime errors. The TIOBE Index gauges programming language popularity from the worldwide supply of skilled engineers, courses, and third-party vendors, drawing on results from popular search engines and sites such as Google and Amazon. Positions two through five are occupied by C++, Java, C, and C#. Notably, SQL ranks eighth, R is positioned at number 18, and SAS is at number 22.
That’s my take on it:
Python's widespread popularity is largely driven by the growing demand for data science and machine learning. Its rich ecosystem of libraries and frameworks, including TensorFlow, PyTorch, and scikit-learn, makes it an ideal choice for DSML tasks. Interestingly, certain programming languages exhibit remarkable longevity. For example, JavaScript was ranked seventh in 2000 and currently holds sixth place. Similarly, Fortran, which was ranked 11th in 1995, now occupies the tenth position. The resurgence of Fortran is notable; according to TIOBE, it excels in numerical analysis and computational mathematics, both of which are increasingly relevant in artificial intelligence. Fortran is also gaining traction in image processing applications, including gaming and medical imaging.
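As a quick illustration of that ecosystem’s pull, a complete train-and-evaluate workflow takes only a few lines of scikit-learn. This is a toy example on a bundled dataset, not a benchmark of any kind.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data, split into train/test, fit a model, and score it end to end.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```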
While some languages maintain stable rankings over time, others have shown dramatic improvements. For instance, SQL was ranked 100th in 2005 but has since climbed into the top ten. Predicting the future trajectory of programming languages is challenging, underscoring the dynamic nature of the field. As the saying goes, "Never say never!"
Links:
Machine learning methods fail in genome studies
11/20/2024
A recent article titled "AI-Assisted Genome Studies Are Riddled with Errors" by Dr. Sitaraman highlights the challenges and errors associated with using artificial intelligence (AI) in large genomics studies. Researchers have employed AI to fill in gaps in patient information and improve predictions in genome-wide association studies (GWAS). However, new research from the University of Wisconsin-Madison reveals that these AI-assisted approaches can lead to false positives and misleading correlations.
For 15 years, GWAS has been used to identify genetic variants associated with traits or diseases. Despite its success, GWAS has limitations, which scientists have attempted to overcome using AI. However, AI can introduce biases, especially when working with incomplete datasets. The research highlights that AI-assisted GWAS can create false associations between gene variants and diseases. For instance, AI models showed a high correlation between certain gene variants and type II diabetes, which was not supported by conventional GWAS. Further, the use of proxy data, such as family history, in GWAS-by-proxy (GWAX) can also lead to incorrect conclusions. For example, AI approaches showed a positive correlation between educational attainment and Alzheimer's risk, contrary to established GWAS findings. The research team suggests new statistical methods to correct these biases and emphasizes the need for transparency and rigor in reporting findings from AI-assisted studies.
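To make the mechanism concrete, here is a toy simulation of my own (not the Wisconsin team’s method): a trait with no genetic effect at all is imputed by a model fitted on a small cohort, the model picks up a chance weight on a variant, and an association test against the imputed trait then looks highly significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_train, n_test = 500, 50_000

def simulate(n):
    g = rng.binomial(2, 0.3, n).astype(float)  # variant: minor-allele counts (0/1/2)
    c = rng.normal(size=n)                     # unrelated clinical covariate
    y = rng.normal(size=n)                     # trait with NO true genetic effect
    return g, c, y

# Small labeled cohort: fit a linear "imputation model" for the trait.
g_tr, c_tr, y_tr = simulate(n_train)
X_tr = np.column_stack([np.ones(n_train), g_tr, c_tr])
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # chance nonzero weight on g

# Large unlabeled cohort: the trait is imputed rather than measured.
g_te, c_te, _ = simulate(n_test)
y_imputed = beta[0] + beta[1] * g_te + beta[2] * c_te

# GWAS-style association test of the variant against the imputed trait:
# typically "significant" even though the true effect is exactly zero.
r, p = stats.pearsonr(g_te, y_imputed)
print(f"r = {r:+.3f}, p = {p:.1e}")
```

The imputed phenotype inherits the variant’s fingerprint from the imputation model, so the downstream test is no longer testing what it claims to test.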
That’s my take on it:
No doubt machine learning methods have overshadowed conventional statistics in big data analytics. However, no solution is 100% foolproof. Like conventional statistics, machine learning methods can be misguided and misused. We should avoid the mentality that “when all you have is a hammer, every problem looks like a nail”: don’t apply ML just because it is popular or powerful, and then mindlessly assume that the conclusion must be right. Rather, we must consider the nature of the data, the question being asked, and the desired outcome. When a method is experimental and the data pattern is strange, we must evaluate it with skeptical eyes. After all, skepticism is the guiding principle of Tukey’s exploratory data analysis.
Link: https://www.the-scientist.com/ai-assisted-genome-studies-are-riddled-with-errors-72339


Is Rguroo Version 2.0 a journey to data science?
11/12/2024
Today, Rguroo announced the release of Rguroo Version 2.0, titled “Journey to Data Science.” Among its major updates is an enhanced logistic regression feature. The new diagnostic tools include interactive logistic curve plotting, prediction assessments, external data prediction, model validation, and k-fold cross-validation. Rguroo is a web-based statistical platform with a graphical user interface (GUI) that provides access to R’s capabilities without requiring users to know R programming.
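For readers who want to see what those diagnostics amount to outside a GUI, here is a minimal sketch in Python with scikit-learn. The dataset and fold count are arbitrary choices for illustration; Rguroo itself runs these analyses through R.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a logistic regression and validate it with 5-fold cross-validation,
# the same kind of out-of-sample check Rguroo 2.0 adds to its interface.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy per fold: {scores.round(3)}, mean = {scores.mean():.3f}")
```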
That’s my take on it:
While Rguroo benefits analysts and students by making R’s statistical and graphical tools more accessible, I am uncertain whether this release truly represents a “journey to data science.” After exploring the software, I noticed it lacks core data science methods, such as decision trees, random forests, XGBoost, gradient boosting, and neural networks. The focus on logistic regression—while a valuable classical statistical method—reflects a model-driven, inference-centered approach, rather than the data exploration and pattern recognition that define data science and machine learning. In fact, this highlights a broader issue: many so-called “data science” tools and programs don’t fully reflect the paradigm shift toward data-centric methodologies.
Link: https://rguroo.com/


DSML trend: Best visualization tools for business intelligence
11/8/2024
According to Forbes Advisor, the top data visualization tools for business in 2024 are as follows:
· Microsoft Power BI: Leader in business intelligence (BI) with robust integration capabilities
· Tableau: Known for sophisticated interactive visualizations
· Qlik Sense: Stands out for AI integration and machine learning features
· Klipfolio: Excels in custom dashboard creation
· Looker: Provides comprehensive visualization options and data modeling
· Zoho Analytics: Seamlessly integrates with other Zoho products
· Domo: Distinguished by its custom app development capabilities
The evaluation criteria included user-friendliness, cost-effectiveness, support quality, and key features such as real-time analytics, customization options, and collaborative data sharing.
That’s my take on it:
Data visualization tools are essential for both business and academic purposes, offering powerful ways to analyze and present complex data. While the tools mentioned by Forbes are indeed popular for business intelligence, there are several excellent options for academics and other specialized purposes. For example,
· SAS Visual Analytics on SAS Viya: General-purpose analytics and visualization
· JMP Pro: General-purpose analytics and visualization
· IBM Watson Studio: General-purpose analytics and visualization
· MATLAB: Popular in engineering and scientific computing, providing robust visualization tools alongside computational capabilities.
· Wolfram Mathematica: A powerful and comprehensive computational software system that offers extensive capabilities for data visualization, scientific computing, and statistical analysis.
· Origin: Specifically designed for scientific graphing and data analysis, popular in physical sciences and engineering.
· Gephi: An open-source tool particularly useful for network analysis and visualization, popular in social sciences and complex systems research.
· Python with libraries like Matplotlib, Seaborn, and Plotly: Widely used in data science and research for its flexibility and powerful visualization options.
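To give a flavor of that last option, here is a minimal sketch drawing the same synthetic data with Matplotlib and Seaborn:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y, s=10)                               # Matplotlib: explicit, low-level control
ax1.set(title="Matplotlib", xlabel="x", ylabel="y")
sns.regplot(x=x, y=y, ax=ax2, scatter_kws={"s": 10})  # Seaborn: regression fit and CI by default
ax2.set(title="Seaborn", xlabel="x", ylabel="y")
fig.tight_layout()
plt.show()
```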
Factors for selecting visualization tools:
· Data complexity and size: Tools like SAS Viya and IBM Watson Studio are better suited for very large datasets.
· Statistical analysis needs: JMP and Python offer more advanced statistical capabilities.
· Collaboration requirements: Cloud-based solutions like IBM Watson Studio may offer better collaboration features.
· Domain-specific needs: Some fields have preferred tools (e.g., Gephi for network analysis; MATLAB and Mathematica for mathematics and engineering).
Link: https://www.forbes.com/advisor/business/software/best-data-visualization-tools/