<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Data analysis journey]]></title><description><![CDATA[Data analysis journey]]></description><link>https://blog.dtucker.xyz</link><generator>RSS for Node</generator><lastBuildDate>Sat, 16 May 2026 11:59:35 GMT</lastBuildDate><atom:link href="https://blog.dtucker.xyz/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building an A3 Process Improvement App with Claude]]></title><description><![CDATA[I recently built an interactive A3 process improvement app using Claude Sonnet 4 - and here's the interesting part: the app itself uses Claude's API to analyze completed A3 documents. It's essentially Claude helping to build a tool that leverages Cla...]]></description><link>https://blog.dtucker.xyz/building-an-a3-process-improvement-app-with-claude</link><guid isPermaLink="true">https://blog.dtucker.xyz/building-an-a3-process-improvement-app-with-claude</guid><category><![CDATA[a3]]></category><category><![CDATA[claude.ai]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Sat, 28 Jun 2025 05:00:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751028329111/d9b7e6af-829f-4efd-88b9-64918148968e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently built an interactive <a target="_blank" href="https://claude.ai/public/artifacts/22829a62-eadf-439b-9aff-9c9c9a272106?fullscreen=false">A3 process improvement app</a> using Claude Sonnet 4 - and here's the interesting part: the app itself uses Claude's API to analyze completed A3 documents. It's essentially Claude helping to build a tool that leverages Claude's capabilities. 
This meta-approach opened up fascinating possibilities for creating intelligent, self-improving applications.</p>
<p>The app relies on a <a target="_blank" href="https://support.anthropic.com/en/articles/11649427-use-artifacts-to-visualize-and-create-ai-apps-without-ever-writing-a-line-of-code?campaignId=14008655&amp;source=i_email&amp;medium=email&amp;content=Nov2024ClaudeStyles&amp;messageTypeId=140367">new feature</a> that Anthropic released for Claude artifacts.</p>
<p><a target="_blank" href="https://claude.ai/public/artifacts/22829a62-eadf-439b-9aff-9c9c9a272106?fullscreen=false"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751027664771/568774ee-2013-4894-8cf2-f3f3594049e4.png" alt="Image of the A3 app on Claude" class="image--center mx-auto" /></a></p>
<h3 id="heading-implementing-the-a3-methodology">Implementing the A3 Methodology</h3>
<p>The app covers all eight essential A3 steps:</p>
<ol>
<li><p><strong>Background</strong> - Current situation context</p>
</li>
<li><p><strong>Problem Statement</strong> - Specific problem definition</p>
</li>
<li><p><strong>Current State Analysis</strong> - Detailed process examination</p>
</li>
<li><p><strong>Target State</strong> - Ideal future vision</p>
</li>
<li><p><strong>Root Cause Analysis</strong> - Underlying cause identification</p>
</li>
<li><p><strong>Countermeasures</strong> - Proposed solutions</p>
</li>
<li><p><strong>Implementation Plan</strong> - Execution strategy</p>
</li>
<li><p><strong>Follow-up Plan</strong> - Monitoring and sustainability</p>
</li>
</ol>
<p>Each step includes thoughtful prompts and placeholders to guide users toward comprehensive responses.</p>
<h3 id="heading-app-output">App Output</h3>
<p>The app provides an analysis of the sections and produces an executive summary, key findings, strengths, recommendations, and risk assessment.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751028248176/50be25ed-4739-4adc-8ac0-a252f2bfbf31.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751028253425/0fd44da6-552e-461a-be96-9b8bde3cc276.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751028258916/927a24d9-bc45-438f-a712-6180b555af8b.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-impressions">Impressions</h3>
<p><strong>Prompt Engineering is Critical -</strong> The quality of AI analysis depends heavily on structured prompts. I found that explicitly requesting JSON format and providing examples dramatically improved response consistency.</p>
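<p>As a rough illustration of that point, here is a minimal Python sketch of a prompt that asks for JSON and shows an example of the expected shape. The key names are hypothetical (loosely based on the report sections listed above), and both helper functions are illustrative assumptions rather than the app's actual code.</p>
<pre><code class="lang-python">import json

# Hypothetical keys, loosely mirroring the report sections the app produces.
SECTIONS = ["executive_summary", "key_findings", "strengths",
            "recommendations", "risk_assessment"]

def build_prompt(a3_text):
    """Ask for JSON only and include an example of the expected structure."""
    example = {key: "..." for key in SECTIONS}
    return (
        "Analyze the A3 document below. Respond with JSON only, "
        "using exactly these keys:\n"
        + json.dumps(example, indent=2)
        + "\n\nA3 document:\n" + a3_text
    )

def parse_analysis(reply):
    """Parse the model reply and fail loudly if a section is missing."""
    data = json.loads(reply)
    missing = [key for key in SECTIONS if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data

print(build_prompt("Background: late shipments on line 3"))
</code></pre>
<p>Validating the reply right after parsing makes an inconsistent response visible immediately instead of letting it surface later in the UI.</p>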
<p><strong>Iterative prompts -</strong> The initial design was very good, but it still took a few more prompt iterations to fine-tune the app.</p>
<p>This project represents a fascinating development pattern: using AI to build AI-powered applications. Claude helped design the interface, implement the functionality, and now provides the analytical intelligence that makes the tool valuable.</p>
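<p>Check out the <a target="_blank" href="https://claude.ai/public/artifacts/22829a62-eadf-439b-9aff-9c9c9a272106?fullscreen=false">A3 app on Claude</a>.</p>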
]]></content:encoded></item><item><title><![CDATA[Prompt Comparison app]]></title><description><![CDATA[One of my favorite ways to learn how large-language models behave is to drop the same prompt into two different models and watch the discrepancies unfold. A few weeks ago I kept switching tabs between ChatGPT and local models (LM Studio), but the pro...]]></description><link>https://blog.dtucker.xyz/prompt-comparison-app</link><guid isPermaLink="true">https://blog.dtucker.xyz/prompt-comparison-app</guid><category><![CDATA[huggingface]]></category><category><![CDATA[llm]]></category><category><![CDATA[#PromptEngineering]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 20 Jun 2025 16:04:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750434887633/4cbe442a-824f-4772-92a0-aee069dd19cb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of my favorite ways to learn how large-language models behave is to <strong>drop the same prompt into two different models and watch the discrepancies unfold</strong>. A few weeks ago I kept switching tabs between ChatGPT and local models (LM Studio), but the process felt clunky. So I built a micro-tool that does the comparison for me—and it runs entirely on Hugging Face.</p>
<p>Although it only compares open-source models and not the latest foundation models, it is still useful for seeing how the open-source models differ, at least in their lightweight versions.</p>
<h3 id="heading-why-i-built-it"><strong>Why I Built It</strong></h3>
<ul>
<li><p><strong>Rapid prompt engineering</strong></p>
<p>  Seeing two answers side-by-side helps me decide which wording and styling work best.</p>
</li>
<li><p><strong>Qualitative benchmarking</strong></p>
<p>  Formal evaluation metrics are great, but a quick visual gut-check on tone and factuality often saves me time. I like to get a ‘feel’ for each model.</p>
</li>
</ul>
<h3 id="heading-how-it-works-under-the-hood"><strong>How It Works under the Hood</strong></h3>
<ol>
<li><p><strong>Gradio front-end</strong></p>
<p> A single Textbox for the prompt and two response panes keep the UI friction-free.</p>
</li>
<li><p><strong>huggingface-hub’s InferenceClient</strong></p>
<p> Instead of loading giant weights locally, the app makes chat_completion calls to</p>
<ul>
<li><p>mistralai/Mistral-7B-Instruct-v0.2</p>
</li>
<li><p>meta-llama/Llama-2-7b-chat-hf</p>
</li>
</ul>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> InferenceClient

client = InferenceClient(<span class="hljs-string">"mistralai/Mistral-7B-Instruct-v0.2"</span>, token=HF_TOKEN)
resp = client.chat_completion(
        messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: prompt}],
        max_tokens=<span class="hljs-number">200</span>).choices[<span class="hljs-number">0</span>].message[<span class="hljs-string">"content"</span>]
</code></pre>
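<p>The side-by-side step itself can be sketched in plain Python. In this hedged example, <code>compare</code> and the stand-in backends are hypothetical names; in the real app each backend would wrap a <code>chat_completion</code> call like the one above.</p>
<pre><code class="lang-python">def compare(prompt, backends):
    """Send one prompt to every backend and collect the replies by model name."""
    replies = {}
    for name, ask in backends.items():
        try:
            replies[name] = ask(prompt)
        except Exception as exc:
            # Keep the other pane usable if one model call fails.
            replies[name] = f"[error: {exc}]"
    return replies

# Stand-in backends so the sketch runs without network access.
backends = {
    "model-a": lambda prompt: "Short answer to: " + prompt,
    "model-b": lambda prompt: "Well, let me think about: " + prompt,
}
for name, reply in compare("Explain entropy.", backends).items():
    print(name, "|", reply)
</code></pre>
<p>Catching errors per model means a timeout on one side still leaves the other response visible.</p>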
<h3 id="heading-using-the-tool"><strong>Using the Tool</strong></h3>
<ol>
<li><p><strong>Paste a prompt</strong> – e.g. <em>“Explain the second law of thermodynamics in plain English.”</em></p>
</li>
<li><p><strong>Hit “Submit.”</strong></p>
</li>
</ol>
<p>Compare the Mistral answer (usually concise and instructional) with LLaMA-2’s (often more conversational).</p>
<p><a target="_blank" href="https://huggingface.co/spaces/Unizomby/prompt_comparison"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750434900711/55175964-6273-4e36-8036-55c6f9f3ff26.png" alt="Prompt Comparison Tool" class="image--center mx-auto" /></a></p>
<h2 id="heading-observations">Observations</h2>
<p>My personal observations experimenting with the two models:</p>
<ul>
<li><p><strong>Stylistic flavor</strong> – Mistral tends to jump straight into bullet points; LLaMA-2 sprinkles more transition words.</p>
</li>
<li><p><strong>Output</strong> – Mistral packs more information than LLaMA-2.</p>
</li>
</ul>
<h2 id="heading-check-it-out">Check it out</h2>
<p>➡️ <a target="_blank" href="https://huggingface.co/spaces/Unizomby/prompt_comparison">Prompt Comparison Tool on HuggingFace</a></p>
<p>Feel free to ping me with improvements or bug reports. Happy prompting!</p>
]]></content:encoded></item><item><title><![CDATA[🤖 Building a Personalized Chatbot Powered by my Portfolio]]></title><description><![CDATA[In my latest project, I set out to build a personalized chatbot that could answer questions based on documents about me—think of it as an AI-powered assistant trained on my bio, resume, and project work. The goal was to create something interactive a...]]></description><link>https://blog.dtucker.xyz/building-a-personalized-chatbot-powered-by-portfolio</link><guid isPermaLink="true">https://blog.dtucker.xyz/building-a-personalized-chatbot-powered-by-portfolio</guid><category><![CDATA[rag chatbot]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Sat, 24 May 2025 17:06:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748106477532/e8102f7a-78d4-43ec-9bd4-00e86bae1b16.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my latest project, I set out to build a personalized chatbot that could answer questions based on documents about me—think of it as an AI-powered assistant trained on my bio, resume, and project work. The goal was to create something interactive and intelligent that could provide relevant, accurate responses whenever someone wanted to learn more about my experience or skills.</p>
<p>To bring this to life, I used LlamaIndex for document indexing and OpenAI’s language models for generating responses. I deployed the app using Gradio on Hugging Face Spaces, which provided a clean and accessible interface. The chatbot accepts questions and retrieves information from a curated set of documents I uploaded, offering a live demo of how retrieval-augmented generation (RAG) can create powerful personal or professional tools.</p>
<p>Check it out <a target="_blank" href="https://huggingface.co/spaces/Unizomby/dtuckchat">here</a></p>
<p><a target="_blank" href="https://huggingface.co/spaces/Unizomby/dtuckchat"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748106168557/d99b6b90-f5bb-4fb4-8bb7-4be6c23cc5d9.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[🎉 Apprenticeship Milestone Unlocked!]]></title><description><![CDATA[This took a bit of time to get the certificate, but circling back to follow up on a previous accomplishment — I officially completed an AI/ML Apprenticeship! [U.S. Department of Labor apprenticeship completion certificate for the AI/ML Fundamentals P...]]></description><link>https://blog.dtucker.xyz/apprenticeship-milestone-unlocked</link><guid isPermaLink="true">https://blog.dtucker.xyz/apprenticeship-milestone-unlocked</guid><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Mon, 05 May 2025 23:02:26 GMT</pubDate><content:encoded><![CDATA[<p>This took a bit of time to get the certificate, but circling back to follow up on a previous accomplishment — I officially completed an AI/ML Apprenticeship! [U.S. Department of Labor apprenticeship completion certificate for the AI/ML Fundamentals Program, that’s a mouthful]</p>
<p>As part of this journey, I built a machine learning model using XGBoost to forecast production spans. It was a hands-on project that pushed me to think critically about real-world data, apply machine learning techniques, and deliver something that could actually improve planning and operations on the shop floor.</p>
<p>What made this experience especially rewarding was seeing how AI isn’t just about theory — it’s a tool that, when applied thoughtfully, can make complex systems more predictable and efficient.</p>
<p>I’m grateful for the chance to learn, experiment, and grow through this program. Onward to the next challenge!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746486003405/9775c23a-c0d9-447e-9197-a79e7f10577d.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Building a Time-Series Forecast & Anomaly Dashboard]]></title><description><![CDATA[A time-series forecasting and anomaly-detection tool that lets users upload any dataset, automatically identifies the date and value columns, and produces dual forecasts with Prophet and auto-tuned SARIMAX—complete with Isolation Forest anomaly overl...]]></description><link>https://blog.dtucker.xyz/building-a-time-series-forecast-and-anomaly-dashboard</link><guid isPermaLink="true">https://blog.dtucker.xyz/building-a-time-series-forecast-and-anomaly-dashboard</guid><category><![CDATA[timeseries]]></category><category><![CDATA[anomaly detection]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[streamlit]]></category><category><![CDATA[forecasting]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 25 Apr 2025 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750521185234/9232fee0-5189-4f5c-9060-1558ac4a9486.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A time-series forecasting and anomaly-detection tool that lets users upload any dataset, automatically identifies the date and value columns, and produces dual forecasts with Prophet and auto-tuned SARIMAX—complete with Isolation Forest anomaly overlays.</p>
<h3 id="heading-why-i-built-it"><strong>Why I Built It</strong></h3>
<p>In many datasets, timelines and trends are everything—yet I often bounce between separate scripts for forecasting, outlier hunting, and visualization. I wanted a single, browser-based workspace where anyone could <strong>upload a CSV or Excel file and instantly see forward-looking forecasts and anomaly flags in one place</strong>. I built this tool utilizing Streamlit.</p>
<p>➡️ Check <a target="_blank" href="https://unizomby-timeseries-dash-app-xkbqpe.streamlit.app/">out the Streamlit dashboard</a>.</p>
<h3 id="heading-what-the-app-does"><strong>What the App Does</strong></h3>
<ol>
<li><p><strong>One-Click Data Ingestion</strong></p>
<p> Drop in any time-series file (or play with the built-in sample). The app automatically sniffs out the date/time and metric columns, even if you rename or reorder them.</p>
</li>
<li><p><strong>Dual Forecast Engines</strong></p>
<ul>
<li><p><strong>Prophet</strong>—great for strong seasonal patterns and holiday effects.</p>
</li>
<li><p><strong>Auto-tuned SARIMAX</strong>—handles subtle autocorrelation structures.</p>
<p>  Both models train side-by-side, and their prediction intervals are plotted together for comparison.</p>
</li>
</ul>
</li>
<li><p><strong>Isolation Forest Anomaly Layer</strong></p>
<p> After training, an Isolation Forest scans historical residuals plus new forecasts, shading points that deviate beyond an adaptive threshold.</p>
</li>
<li><p><strong>Interactive Plotly Visuals</strong></p>
<p> Hover to inspect values, toggle series on/off, zoom, or download a PNG snapshot.</p>
</li>
<li><p><strong>Instant Exports</strong></p>
<p> Click once to grab a tidy CSV of both forecasts or save the current chart to PNG for slide decks.</p>
</li>
</ol>
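<p>The column detection in step 1 could work along these lines. This is a simplified sketch using only the standard library; it assumes ISO-style dates (<code>YYYY-MM-DD</code>), and <code>detect_columns</code> is a hypothetical helper, not the dashboard's actual implementation.</p>
<pre><code class="lang-python">import csv
import io
from datetime import datetime

def _is_date(text):
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date format
        return True
    except ValueError:
        return False

def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def detect_columns(rows):
    """Return (date_column, value_column); None where no column qualifies."""
    columns = list(rows[0].keys())
    date_col = next(
        (c for c in columns if all(_is_date(r[c]) for r in rows)), None)
    value_col = next(
        (c for c in columns
         if c != date_col and all(_is_number(r[c]) for r in rows)), None)
    return date_col, value_col

# Columns are deliberately in an unusual order with arbitrary names.
sample = "units,day\n12.5,2025-01-01\n13.1,2025-01-02\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(detect_columns(rows))
</code></pre>
<p>Because the check looks at the values rather than the header names, renamed or reordered columns are still picked up.</p>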
<p><a target="_blank" href="https://unizomby-timeseries-dash-app-xkbqpe.streamlit.app/"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750521387701/3539f530-2126-49c7-ab72-d6816d4bc9ff.png" alt class="image--center mx-auto" /></a></p>
<h3 id="heading-try-it-yourself"><strong>Try It Yourself</strong></h3>
<p>The repo is open-sourced on GitHub and deploy-ready to Streamlit Cloud in under five minutes. Clone, push, and share a public link with stakeholders—no server wrangling required.</p>
<p>Check <a target="_blank" href="https://unizomby-timeseries-dash-app-xkbqpe.streamlit.app/">out the Streamlit dashboard</a>. Also check out my other time-series projects.</p>
<p>Let me know what you think!</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Creating Prompt Assistants as a Process Improvement Tool]]></title><description><![CDATA[Prompt assistants are quickly becoming powerful tools for enhancing productivity, consistency, and creativity. Recently, I had the opportunity to explore how ChatGPT's Playground can be used to create these assistants, and I shared that knowledge wit...]]></description><link>https://blog.dtucker.xyz/creating-prompt-assistants-as-a-process-improvement-tool</link><guid isPermaLink="true">https://blog.dtucker.xyz/creating-prompt-assistants-as-a-process-improvement-tool</guid><category><![CDATA[#PromptEngineering]]></category><category><![CDATA[processimprovement]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Sat, 05 Apr 2025 16:38:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743870940299/495754a9-7359-4af1-9c1f-9f58ee59969a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Prompt assistants are quickly becoming powerful tools for enhancing productivity, consistency, and creativity. Recently, I had the opportunity to explore how ChatGPT's Playground can be used to create these assistants, and I shared that knowledge with my team in a hands-on training session.</p>
<hr />
<p><strong>Why Prompt Assistants Matter for Process Improvement</strong></p>
<p>Prompt assistants are reusable, intelligent text inputs designed to guide AI behavior in a predictable and efficient way. I think of them as customizable coworkers that never forget your formatting preferences or the details that matter.</p>
<p>From a process improvement perspective, these assistants offer tangible benefits:</p>
<ul>
<li><p><strong>Standardization</strong>: They ensure that outputs—whether reports, summaries, or analysis—follow a consistent format, which is especially valuable in cross-functional teams.</p>
</li>
<li><p><strong>Efficiency</strong>: They significantly reduce the time spent on repetitive cognitive tasks, freeing up time for higher-value work.</p>
</li>
<li><p><strong>Shareability / Collaboration</strong>: These prompts can be shared with others.</p>
</li>
<li><p><strong>Training and Onboarding Support</strong>: New employees can use prompt assistants as learning tools, receiving structured and high-quality responses that reflect company standards.</p>
</li>
</ul>
<p>During our training, I shared some prompts that I had created to summarize long text, rewrite emails, and evaluate an A3 document (a tool for process improvement) for consistency and completeness.</p>
<p>Below is an assistant prompt to help me summarize text.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743870796211/ea2a22db-bbcf-4d6d-af3b-c7b8f9d1f142.png" alt="TLDR prompt in OpenAI playground" class="image--center mx-auto" /></p>
<hr />
<p><strong>Driving a Culture of Continuous Improvement</strong></p>
<p>Beyond the tools themselves, what excited me most was the mindset shift. Creating prompt assistants encouraged me to think critically about my daily processes. What could be automated? What tasks were repetitive and ripe for improvement?</p>
<p>This approach ties directly into Lean and continuous improvement principles. By identifying waste (in the form of time, inconsistency, or rework) and applying a lightweight, tech-driven solution, we were able to make small but impactful changes to how we work.</p>
<hr />
<p><strong>Final Thoughts</strong></p>
<p>Prompt assistants are more than just clever AI tricks 😝; they're enablers of better workflows, sharper communication, and more empowered teams. I know using AI can sometimes be a controversial topic.</p>
<p>I'd love to hear how you're using prompt assistants to improve your processes. Drop a comment or connect with me to share your story!</p>
]]></content:encoded></item><item><title><![CDATA[Enhancing Control Charts with Anomaly Detection]]></title><description><![CDATA[Enhancing Control Charts
Control charts have always been one of my go-to tools for monitoring process stability. They offer a clear visual representation of how a process behaves over time. In this analysis, I wanted to take things a step further by ...]]></description><link>https://blog.dtucker.xyz/enhancing-control-charts-with-anomaly-detection</link><guid isPermaLink="true">https://blog.dtucker.xyz/enhancing-control-charts-with-anomaly-detection</guid><category><![CDATA[HypothesisTesting]]></category><category><![CDATA[anomaly detection]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Sat, 15 Feb 2025 18:30:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739643559178/d8797723-8e02-4cd4-b605-0ec76627863f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-enhancing-control-charts"><strong>Enhancing Control Charts</strong></h2>
<p>Control charts have always been one of my go-to tools for monitoring process stability. They offer a clear visual representation of how a process behaves over time. In this analysis, I wanted to take things a step further by incorporating anomaly detection, specifically using Isolation Forest.</p>
<p>In the past I have also used control charts to visualize data while performing hypothesis testing. For this analysis, I updated a script I had been using so that its control chart includes anomaly detection.</p>
<p><a target="_blank" href="https://dtucker.xyz/projects/T-test_hypothesis.html">Check out the ipynb script</a></p>
<p><a target="_blank" href="https://blog.dtucker.xyz/hypothesis-testing-2-sample-t-test">Hypothesis Testing</a></p>
<h2 id="heading-my-initial-control-chart-before-vs-after"><strong>My Initial Control Chart: Before vs After</strong></h2>
<p>To start, I created a control chart to visualize two data sets—one representing the "before" phase and the other the "after" phase. The chart plots sample values over time, including key statistical indicators:</p>
<ul>
<li><p><strong>Mean Line:</strong> Represents the average value of each phase.</p>
</li>
<li><p><strong>Standard Deviation Boundaries:</strong> Upper and lower bounds, marking one standard deviation from the mean.</p>
</li>
</ul>
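<p>The mean line and boundaries described above can be computed per phase with a few lines of Python. This is a sketch with made-up numbers; it assumes the sample standard deviation, and <code>phase_limits</code> is a hypothetical helper rather than the code from my notebook.</p>
<pre><code class="lang-python">from statistics import mean, stdev

def phase_limits(values):
    """Mean line plus one-standard-deviation boundaries for one phase."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    return {"mean": centre, "lower": centre - spread, "upper": centre + spread}

# Hypothetical scores, not the data shown in the chart.
before = [52, 48, 50, 55, 45]
after = [60, 58, 62, 61, 59]
for label, values in (("before", before), ("after", after)):
    print(label, phase_limits(values))
</code></pre>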
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739643241764/f64f4453-ed23-45cb-82ce-aa2b2475820b.png" alt class="image--center mx-auto" /></p>
<p>This visualization helped me detect shifts in process performance. By comparing the "before" and "after" distributions, I could easily spot any significant changes in the mean or spread of data, indicating possible process improvements or deviations.</p>
<p>I do want to note that although control charts are typically used to check if a process is in control, there is information from the chart that can be beneficial to hypothesis testing. It can help visualize and highlight the different means or help assess the stability of the data.</p>
<h2 id="heading-bringing-in-anomaly-detection"><strong>Bringing in Anomaly Detection</strong></h2>
<p>To take my analysis further, I applied the <strong>Isolation Forest</strong> algorithm for anomaly detection. This method is great for identifying outliers because it isolates anomalies in fewer steps than normal data points. Here’s how I went about it:</p>
<ol>
<li><p><strong>Model Training:</strong> I trained the algorithm using sample scores, setting a contamination rate of 10%.</p>
</li>
<li><p><strong>Anomaly Identification:</strong> The model flagged specific data points as anomalies, which I then plotted distinctly.</p>
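<pre><code class="lang-python"> <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
 <span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest

 <span class="hljs-comment"># Anomaly detection: contamination=0.10 assumes about 10% of points are outliers</span>
 isolation_method = IsolationForest(n_estimators=<span class="hljs-number">100</span>, contamination=<span class="hljs-number">0.10</span>)
 <span class="hljs-comment"># Fit on the Scores column (scikit-learn expects 2-D input, hence the DataFrame)</span>
 isolation_method.fit(pd.DataFrame(df[<span class="hljs-string">'Scores'</span>]))
 <span class="hljs-comment"># predict() returns 1 for normal points and -1 for anomalies</span>
 df[<span class="hljs-string">'anomaly_iso'</span>] = isolation_method.predict(pd.DataFrame(df[<span class="hljs-string">'Scores'</span>]))
 a = df.loc[df[<span class="hljs-string">'anomaly_iso'</span>] == <span class="hljs-number">-1</span>, [<span class="hljs-string">'index'</span>, <span class="hljs-string">'Scores'</span>]]  <span class="hljs-comment"># Rows flagged as anomalies</span>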
</code></pre>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739643136773/7cb25401-7cbb-481b-87f2-23863a40411f.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-what-i-learned-from-the-updated-chart"><strong>What I Learned from the Updated Chart</strong></h2>
<p>After applying anomaly detection, the second control chart gave me some fresh insights:</p>
<ul>
<li><p><strong>Highlighted Anomalies:</strong> Points identified as anomalies were marked in red, making it easy to spot unusual deviations. These points also lined up with the values falling outside the standard deviation boundaries.</p>
</li>
<li><p><strong>Enhanced Process Understanding:</strong> These anomalies could indicate process breakdowns, measurement errors, or unexpected variations that needed further investigation.</p>
</li>
<li><p><strong>Better Decision-Making:</strong> Seeing anomalies in real time allowed me to take proactive measures to maintain process stability.</p>
</li>
</ul>
<p>Adding anomaly detection to control charts has been a game-changer for me. It provides an extra layer of insight that traditional control charts alone might miss. By combining classical statistical tools with modern machine learning techniques, I feel more confident in my ability to monitor processes effectively and drive continuous improvement.</p>
<p>This experience reinforced my belief that data-driven decision-making is key to maintaining operational efficiency, and I’ll definitely be using this approach in future analyses.</p>
]]></content:encoded></item><item><title><![CDATA[LP Optimization]]></title><description><![CDATA[In this project the goal was to find a method to take a group of 100 students and form 5 teams of students where the average team score was balanced.
Current State
The students are originally grouped into 7 teams shown below.
Each student has an aver...]]></description><link>https://blog.dtucker.xyz/lp-optimization</link><guid isPermaLink="true">https://blog.dtucker.xyz/lp-optimization</guid><category><![CDATA[optimization]]></category><category><![CDATA[Python]]></category><category><![CDATA[clustering]]></category><category><![CDATA[linear-programming ]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 06 Sep 2024 05:00:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1725072736628/19001550-f2b4-42cf-b27c-24dcab8986de.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this project the goal was to find a method to take a group of 100 students and form 5 teams of students where the average team score was balanced.</p>
<h3 id="heading-current-state">Current State</h3>
<p>The students are originally grouped into 7 teams shown below.</p>
<p>Each student has an average score based on the categories: Reading, Writing, Math, and Science.</p>
<p><strong>Team Distribution of Average Student Scores</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725070065463/9fe58ec7-7a2b-49df-a147-f17a81723f36.png" alt="Team Distribution of Average Student Scores" class="image--center mx-auto" /></p>
<p>The team averages ranged from 43.25 to 50.85. The difference between the teams is relatively small, but let's see if we can improve the balance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725070278659/07fbf919-3662-4d66-bd67-15aec490c7e5.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-future-state">Future State</h3>
<p>For the future state teams, I'll be using Linear Programming to minimize the difference between the average team scores, making the teams as balanced as possible.</p>
<p><strong>New team constraints:</strong></p>
<ul>
<li><p>There will be 5 new teams</p>
</li>
<li><p>Each student can only be assigned to 1 team</p>
</li>
<li><p>Teams will have the specified sizes, team_sizes = [15, 25, 20, 20, 20]</p>
</li>
<li><p>There will be at least 1 student with a score of 70 or more in each of the categories 'Reading', 'Writing', 'Math', and 'Science'.</p>
</li>
</ul>
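<p>One way to write these constraints down, offered as a sketch rather than the exact formulation used in the project: let x_ij be 1 if student i is assigned to team j, s_i the student's average score, n_j the team size from team_sizes, and h_ic a precomputed constant that is 1 if student i scores 70 or more in category c. U and L are helper variables bounding the highest and lowest team average. All of these symbols are my own notation, the last constraint is read here as applying to every team, and the 0/1 variables technically make this an integer linear program.</p>
<pre><code class="lang-latex">\begin{gather*}
\min \; U - L \\
\text{subject to} \\
\sum_{j=1}^{5} x_{ij} = 1 \quad \forall i \\
\sum_{i=1}^{100} x_{ij} = n_j \quad \forall j \\
L \le \frac{1}{n_j} \sum_{i=1}^{100} s_i \, x_{ij} \le U \quad \forall j \\
\sum_{i=1}^{100} h_{ic} \, x_{ij} \ge 1 \quad \forall j, \; \forall c \\
x_{ij} \in \{0, 1\}
\end{gather*}
</code></pre>
<p>Because the team sizes n_j are fixed, each team average is a linear expression in x_ij, which keeps the whole model linear.</p>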
<p>The constraints were used to create a Linear Programming model with the following results.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725070969792/ad1b5544-af4f-48f7-81de-9cf30b817138.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725070974681/44ab1bf4-82e6-4a27-9e5c-cb955044ac17.png" alt class="image--center mx-auto" /></p>
<p>The team averages are more balanced than the current state while following the constraints.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725071115045/d3f1de4f-fda8-484d-80bf-f61d3efa531a.png" alt class="image--center mx-auto" /></p>
<p>In the chart above you can see each student's score. I also clustered the students based on their category scores to help find students of similar skill level. The clustering did not affect the results of the linear optimization; it was used only to identify student groupings.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725071275661/cf365877-c6f1-490d-9d7c-92b6043f3318.png" alt class="image--center mx-auto" /></p>
<p>One benefit was that I was able to write some code that allowed for finding a similar student.</p>
<p>In the example below with student 99, the nearest students are student 4 and student 87.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725073682083/de86b042-796f-469f-a10c-f0c0ed1a5bb5.png" alt class="image--center mx-auto" /></p>
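<p>A similar-student lookup can be sketched as a nearest-neighbor search over the four category scores. The example below assumes Euclidean distance and uses made-up data; <code>nearest_students</code> is a hypothetical helper and not necessarily how the project computes it.</p>
<pre><code class="lang-python">import math

def nearest_students(target_id, scores, k=2):
    """Return the k student ids closest to target_id by Euclidean distance."""
    target = scores[target_id]
    others = [sid for sid in scores if sid != target_id]
    return sorted(others, key=lambda sid: math.dist(scores[sid], target))[:k]

# Hypothetical (Reading, Writing, Math, Science) scores, not the project's data.
scores = {
    1: (40, 45, 50, 55),
    2: (42, 44, 51, 54),
    3: (90, 85, 80, 95),
    4: (41, 46, 49, 56),
}
print(nearest_students(1, scores))  # prints [4, 2]
</code></pre>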
<p><strong>Conclusion</strong></p>
<p>I was able to build a model that could rebalance the students while following a set of constraints on the new teams.</p>
<p><a target="_blank" href="https://dtucker.xyz/projects/student_LP/students.html">Link to project</a></p>
]]></content:encoded></item><item><title><![CDATA[Critical Path]]></title><description><![CDATA[Using the networkx library I build and find the critical path of a network from a data frame.
The network graph created below can be found here.
# Create a NetworkX DiGraph
G = nx.DiGraph()
# Iterate through the DataFrame and add edges to the graph
f...]]></description><link>https://blog.dtucker.xyz/critical-path</link><guid isPermaLink="true">https://blog.dtucker.xyz/critical-path</guid><category><![CDATA[networkx]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 14 Jun 2024 05:00:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717796175198/d0761ccc-7b3f-439d-9ef9-87d72d97ac61.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Using the networkx library I build and find the critical path of a network from a data frame.</p>
<p>The network graph created below can be found <a target="_blank" href="https://dtucker.xyz/projects/networks/criticalpath.html">here</a>.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> networkx <span class="hljs-keyword">as</span> nx

<span class="hljs-comment"># Create a NetworkX DiGraph</span>
G = nx.DiGraph()
<span class="hljs-comment"># Iterate through the DataFrame (df) and add one weighted edge per row</span>
<span class="hljs-keyword">for</span> _, row <span class="hljs-keyword">in</span> df.iterrows():
    G.add_edge(row[<span class="hljs-string">'predecessor'</span>], row[<span class="hljs-string">'successor'</span>], duration=row[<span class="hljs-string">'duration'</span>])
</code></pre>
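<pre><code class="lang-python"><span class="hljs-comment"># Find the critical path (the longest path through the DAG, weighted by duration)</span>
critical_path = nx.dag_longest_path(G, weight=<span class="hljs-string">'duration'</span>)
print(<span class="hljs-string">f'Critical Path: <span class="hljs-subst">{critical_path}</span>'</span>)
<span class="hljs-comment"># Total duration along that path</span>
total_duration = nx.dag_longest_path_length(G, weight=<span class="hljs-string">'duration'</span>)
print(<span class="hljs-string">f'Total Duration of Critical Path: <span class="hljs-subst">{total_duration}</span>'</span>)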
</code></pre>
<p><strong>Critical Path:</strong> ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']</p>
<p><strong>Total Duration of Critical Path:</strong> 35</p>
<p>NetworkX makes it relatively simple to find the path. I manually colored the nodes to highlight the identified critical path.</p>
<p><a target="_blank" href="https://dtucker.xyz/projects/networks/criticalpath.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717796143565/09763226-642f-434a-8548-67393ccfc402.png" alt="network with critical path" class="image--center mx-auto" /></a></p>
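<p>For readers who want to try the approach without the original data, here is a small self-contained sketch. The edge list is hypothetical; only the <code>duration</code> attribute name and the NetworkX calls mirror the post.</p>

```python
import networkx as nx

# Hypothetical precedence network: (predecessor, successor, duration)
edges = [
    ("A", "B", 3),
    ("A", "C", 2),
    ("B", "D", 4),
    ("C", "D", 1),
]

G = nx.DiGraph()
for predecessor, successor, duration in edges:
    G.add_edge(predecessor, successor, duration=duration)

# A longest path is only well defined when the network has no cycles
assert nx.is_directed_acyclic_graph(G)

critical_path = nx.dag_longest_path(G, weight="duration")
total_duration = nx.dag_longest_path_length(G, weight="duration")
print(critical_path, total_duration)
```

<p>In this toy network the path through B (3 + 4) is longer than the path through C (2 + 1), so A, B, D is the critical path.</p>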
]]></content:encoded></item><item><title><![CDATA[🌽 Crop Classification Model]]></title><description><![CDATA[This Kaggle dataset contains soil characteristics, used to recommend what type of farm crop to plant in that soil with a machine learning classification model.
I created a baseline by comparing the performance of 5 different classification models, me...]]></description><link>https://blog.dtucker.xyz/crop-classification-model</link><guid isPermaLink="true">https://blog.dtucker.xyz/crop-classification-model</guid><category><![CDATA[classification]]></category><category><![CDATA[Python]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 07 Jun 2024 14:25:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717769734531/47e432f7-e9eb-4fe4-8bdf-4264f3ad1e11.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This Kaggle <a target="_blank" href="https://www.kaggle.com/datasets/varshitanalluri/crop-recommendation-dataset">dataset</a> contains soil characteristics, used to recommend what type of farm crop to plant in that soil with a machine learning classification model.</p>
<p>I created a baseline by comparing the performance of 5 different classification models, measuring accuracy.</p>
<p>First, I set up a dictionary of models.</p>
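<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> GradientBoostingClassifier, RandomForestClassifier
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression
<span class="hljs-keyword">from</span> sklearn.naive_bayes <span class="hljs-keyword">import</span> GaussianNB
<span class="hljs-keyword">from</span> sklearn.neighbors <span class="hljs-keyword">import</span> KNeighborsClassifier

<span class="hljs-comment"># Put models in a dictionary</span>
models = {<span class="hljs-string">"KNN"</span>: KNeighborsClassifier(),
          <span class="hljs-string">"Logistic Regression"</span>: LogisticRegression(),
          <span class="hljs-string">"Random Forest"</span>: RandomForestClassifier(),
          <span class="hljs-string">"GradientBoost"</span>: GradientBoostingClassifier(),
          <span class="hljs-string">"GaussianNB"</span>: GaussianNB(),
          }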
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Create function to fit and score models</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit_and_score</span>(<span class="hljs-params">models, X_train, X_test, y_train, y_test</span>):</span>
    <span class="hljs-string">"""
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data
    X_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """</span>
    <span class="hljs-comment"># Random seed for reproducible results</span>
    np.random.seed(<span class="hljs-number">42</span>)
    <span class="hljs-comment"># Make a dict to keep model scores</span>
    model_scores = {}
    <span class="hljs-comment"># Loop through models</span>
    <span class="hljs-keyword">for</span> name, model <span class="hljs-keyword">in</span> models.items():
        <span class="hljs-comment"># Fit the model to the data</span>
        model.fit(X_train, y_train)
        <span class="hljs-comment"># Evaluate the model and store its score in model_scores</span>
        model_scores[name] = model.score(X_test, y_test)
    <span class="hljs-keyword">return</span> model_scores
</code></pre>
<pre><code class="lang-python">{<span class="hljs-string">'KNN'</span>: <span class="hljs-number">0.9568181818181818</span>,
 <span class="hljs-string">'Logistic Regression'</span>: <span class="hljs-number">0.9636363636363636</span>,
 <span class="hljs-string">'Random Forest'</span>: <span class="hljs-number">0.9931818181818182</span>,
 <span class="hljs-string">'GradientBoost'</span>: <span class="hljs-number">0.9818181818181818</span>,
 <span class="hljs-string">'GaussianNB'</span>: <span class="hljs-number">0.9954545454545455</span>}
</code></pre>
<p>The baseline scores show Gaussian Naive Bayes (0.995) and Random Forest (0.993) performing best.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717769682064/ece9cb4a-7fef-4a58-82a5-ce5a79fa7896.png" alt="Model Comparison" class="image--center mx-auto" /></p>
<p>The next step is to select either the Gaussian Naive Bayes or the Random Forest model and tune its hyperparameters with a cross-validated grid search.</p>
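<p>As a rough idea of what that tuning step could look like, here is a minimal sketch using scikit-learn's <code>GridSearchCV</code>. It is not the project's code: the iris data bundled with scikit-learn stands in for the crop dataset, and the <code>var_smoothing</code> grid is an assumed starting point.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: iris ships with scikit-learn (this is not the crop dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Assumed search grid over GaussianNB's variance smoothing term
param_grid = {"var_smoothing": [1e-9, 1e-7, 1e-5, 1e-3]}

# 5-fold cross-validated grid search, scored on accuracy
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

<p>The same pattern applies to a Random Forest by swapping the estimator and using a grid over parameters such as <code>n_estimators</code> and <code>max_depth</code>.</p>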
]]></content:encoded></item><item><title><![CDATA[Hypothesis Testing (2 Sample T-test)]]></title><description><![CDATA[I recently created a template for performing a 2 sample T-test, to determine if changes in a process are statistically significant. In the process I check to see if the normality and equal variance assumptions are valid. I also provide 3 different T-...]]></description><link>https://blog.dtucker.xyz/hypothesis-testing-2-sample-t-test</link><guid isPermaLink="true">https://blog.dtucker.xyz/hypothesis-testing-2-sample-t-test</guid><category><![CDATA[hypothesis testing]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Sun, 02 Jun 2024 14:30:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717338266870/59e60991-0d1c-4eee-9fbc-5b235c597436.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently created a template for performing a 2 sample T-test, to determine if changes in a process are statistically significant. In the process I check to see if the normality and equal variance assumptions are valid. I also provide 3 different T-test depending on the type of data you have.</p>
<p>The data consists of 20 "before" and 20 "after" samples of student test scores.</p>
<p>You can find the complete Jupyter notebook <a target="_blank" href="https://dtucker.xyz/projects/T-test_hypothesis.html">here</a>.</p>
<hr />
<p><strong>Null Hypothesis (H0)</strong>: The mean scores of the two samples are the same (first 20 samples = last 20 samples).<br /><strong>Alternative Hypothesis (H1)</strong>: The mean scores of the two samples are different.</p>
<hr />
<p>The first step after importing the data is creating some visuals to explore it.</p>
<p><a target="_blank" href="https://dtucker.xyz/projects/hypot_point.html">Interactive plotly point plot</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717337594620/922f7f81-a6d2-414d-aa05-76dca62197b5.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://dtucker.xyz/projects/hypot_line.html">Plotly line chart</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717337602570/b7c906c0-4e5f-4958-91c1-d4842a8509d0.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://dtucker.xyz/projects/hypot_box.html">Plotly boxplot</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717337606314/da805653-77ee-4cc2-bdfa-7abdd3d2369b.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://dtucker.xyz/projects/hypot_histogram.html">Plotly histogram</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717337609052/7447aaaf-0c2c-4740-ad2d-38286339f571.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717337617885/fced55c5-6c3c-4755-a370-4f8ce5bb951a.png" alt class="image--center mx-auto" /></p>
<p>This last chart makes it easier to visualize the 2 samples over time, in a control-chart-style format.</p>
<p>After checking for normality and equal variance, I perform the hypothesis test using a function that checks the p-value and returns an interpretation.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> scipy.stats <span class="hljs-keyword">as</span> st

<span class="hljs-comment"># Perform t-test</span>
t_score, p_value = st.ttest_ind(a=df_new[<span class="hljs-string">'first_twenty'</span>],
                                b=df_new[<span class="hljs-string">'last_twenty'</span>],
                                alternative=<span class="hljs-string">'two-sided'</span>) <span class="hljs-comment"># use 'greater' or 'less' for a one-sided hypothesis</span>

print(<span class="hljs-string">f'T-score: <span class="hljs-subst">{t_score}</span>'</span>)
p_value_reader(p_value, alpha=<span class="hljs-number">0.05</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717338143485/c3fdb076-f541-4117-9f28-e0a172b88fe7.png" alt class="image--center mx-auto" /></p>
<p><strong>Conclusion:</strong> Reject the null hypothesis; there is a statistically significant difference in the means.</p>
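<p>The <code>p_value_reader</code> helper called above is not shown in this post. A minimal sketch of what such a helper might look like follows; the message wording and the return value are assumptions, and the notebook's real implementation may differ.</p>

```python
def p_value_reader(p_value, alpha=0.05):
    """Print and return a plain-language reading of a p-value.

    Hypothetical helper: the notebook's real version may differ.
    """
    if p_value >= alpha:
        message = "Fail to reject the null hypothesis"
    else:
        message = "Reject the null hypothesis"
    print(f"p-value: {p_value:.4f} (alpha = {alpha}) -> {message}")
    return message


p_value_reader(0.01)
p_value_reader(0.20)
```

<p>If the equal variance check fails, <code>st.ttest_ind</code> also accepts <code>equal_var=False</code>, which runs Welch's t-test instead of the standard one.</p>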
]]></content:encoded></item><item><title><![CDATA[Time series forecasting with Prophet]]></title><description><![CDATA[In this example, I am forecasting the future price of the AAPL stock, using data from the kaggle dataset starting with prices after 2015. View complete notebook here.
This model uses the Prophet library
from prophet import Prophet

After importing th...]]></description><link>https://blog.dtucker.xyz/time-series-forecasting-with-prophet</link><guid isPermaLink="true">https://blog.dtucker.xyz/time-series-forecasting-with-prophet</guid><category><![CDATA[prophet]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 15 Mar 2024 05:00:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1709312648905/472e312c-8f5a-46f8-a0e7-2fed98956f48.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this example, I am forecasting the future price of the AAPL stock, using data from the kaggle <a target="_blank" href="https://www.kaggle.com/datasets/guillemservera/aapl-stock-data">dataset</a> starting with prices after 2015. View complete notebook <a target="_blank" href="https://github.com/Unizomby/online_retail_eda/blob/main/Prophet_02_Forecast_AAPL.ipynb">here</a>.</p>
<p>This model uses the Prophet library.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> prophet <span class="hljs-keyword">import</span> Prophet
</code></pre>
<p>After importing the AAPL price data from csv, I created a graph of the price over time.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709312101264/33010ba7-e57c-44a7-8e91-691e8704bb44.png" alt class="image--center mx-auto" /></p>
<p>The dataset includes a column for each day's Volume, which I will use as a regressor in the Prophet model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709312284831/bf172f3f-c709-40a7-b8a6-2a69fc256f65.png" alt class="image--center mx-auto" /></p>
<p>I trained the model on the whole dataset and forecast 31 days into the future.</p>
<pre><code class="lang-python"><span class="hljs-comment"># building the model</span>
m = Prophet(
            seasonality_mode=sm,
            seasonality_prior_scale=sps,
            changepoint_prior_scale=cps)
m.add_regressor(<span class="hljs-string">'Volume'</span>)
m.fit(training)
</code></pre>
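<p>The model code above relies on <code>training</code>, <code>future</code>, <code>sm</code>, <code>sps</code> and <code>cps</code>, which are defined in the full notebook. As a sketch only, the inputs could be prepared with pandas as below. The numbers are made up, the hyperparameter values are placeholders, and reusing the mean volume for future dates is an assumption made purely for illustration.</p>

```python
import pandas as pd

# Hypothetical prices and volumes, not real AAPL data
raw = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "Close": [100.0, 101.5, 99.8, 102.3, 103.1],
    "Volume": [5.0e7, 6.2e7, 5.8e7, 7.1e7, 6.4e7],
})

# Prophet expects the date column to be named ds and the target to be named y
training = raw.rename(columns={"Date": "ds", "Close": "y"})[["ds", "y", "Volume"]]

# Placeholder hyperparameters (assumed values; the post does not list them)
sm, sps, cps = "additive", 10.0, 0.05

# Future frame covering the 31 days after the last training date
horizon = 31
first_future_day = training["ds"].max() + pd.Timedelta(days=1)
future = pd.DataFrame({"ds": pd.date_range(first_future_day, periods=horizon, freq="D")})

# Every added regressor must also be present in the future frame.
# Assumption: future volume is unknown, so reuse the historical mean.
future["Volume"] = training["Volume"].mean()

print(training.tail(2))
print(future.head(2))
```

<p>Because volume is not known in advance, whatever values are supplied for future dates directly shape the forecast, so this choice deserves some care.</p>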
<p>Creating predictions is pretty simple: just pass the future data frame to the model.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Forecasting</span>
forecast = m.predict(future)
forecast.head(<span class="hljs-number">5</span>)
</code></pre>
<p>Plotting the predictions</p>
<pre><code class="lang-python"><span class="hljs-comment"># Plot the forecast produced above</span>
m.plot(forecast);
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709312463082/7b4bdf93-9449-4a29-89a4-3f25e82e96ad.png" alt class="image--center mx-auto" /></p>
<p>The model shows a downward trend for the next 31 days.</p>
<p>Prophet also makes plotting the model components easy, so you can see the yearly and weekly trends.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Components</span>
m.plot_components(forecast);
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709312612634/fd4f9ffd-85f0-46fa-a816-ee7345a0fb38.png" alt class="image--center mx-auto" /></p>
<p>I will be forecasting this same dataset using some other forecasting models to compare the results and to combine them.</p>
]]></content:encoded></item><item><title><![CDATA[Creating Network Diagrams using PyVis]]></title><description><![CDATA[Recently I was working with large precedence networks, performing critical path analysis and finding node successors. I wanted to visualize the network and started learning about the PyVis Library in Python.
It is a pretty easy to use and creates int...]]></description><link>https://blog.dtucker.xyz/creating-network-diagrams-using-pyvis</link><guid isPermaLink="true">https://blog.dtucker.xyz/creating-network-diagrams-using-pyvis</guid><category><![CDATA[Pyvis]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 08 Mar 2024 06:00:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1706376318953/c0b2e304-0c49-401c-a490-b210dfaf4a6b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently I was working with large precedence networks, performing critical path analysis and finding node successors. I wanted to visualize the network and started learning about the PyVis Library in Python.</p>
<p>It is pretty easy to use and creates interactive network diagrams.</p>
<p>I created a visual from the Kaggle Marvel Character Social Network <a target="_blank" href="https://www.kaggle.com/datasets/csanhueza/the-marvel-universe-social-network">dataset</a> which shows the relationships between all the Marvel Characters. The one downside to the visuals created with PyVis is that the larger models can take some time to load on the screen. In this example I had to subset the data to just 3 characters.</p>
<p>But the diagrams are interactive and move based on customizable physics.</p>
<p><a target="_blank" href="https://dtucker.xyz/buckey.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706376009401/fabaf069-7d5f-4091-be56-273c38138d89.png" alt class="image--center mx-auto" /></a></p>
<p>I also created a smaller diagram to show the relationship between different skills that I have learned.</p>
<p>You can interact with the diagram <a target="_blank" href="https://dtucker.xyz/skills/skill_diagram.html">here</a>. I was able to customize the node size, node color, edge colors, and labels.</p>
<p><a target="_blank" href="https://dtucker.xyz/skills/skill_diagram.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706376120722/a88784c2-07ca-45d2-95f7-d0dfb4867a37.png" alt class="image--center mx-auto" /></a></p>
<p>Check out the PyVis <a target="_blank" href="https://pyvis.readthedocs.io/en/latest/tutorial.html">tutorial</a> for more.</p>
]]></content:encoded></item><item><title><![CDATA[NLP Binary Classification of tweets]]></title><description><![CDATA[Wanted to share my steps in predicting if a random tweet was about a current disaster or if it was just a tweet, not about a disaster. This model was trained on the Kaggle Disaster Tweet dataset. It also uses a TensorFlow Hub pretrained universal sen...]]></description><link>https://blog.dtucker.xyz/nlp-binary-classification-of-tweets</link><guid isPermaLink="true">https://blog.dtucker.xyz/nlp-binary-classification-of-tweets</guid><category><![CDATA[nlp]]></category><category><![CDATA[Python]]></category><category><![CDATA[TensorFlow]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 01 Mar 2024 06:00:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686863721/7f05b9a0-3288-44f7-8e52-099e45175864.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Wanted to share my <a target="_blank" href="https://www.kaggle.com/code/glenn23/nlp-tweets-classification-use-sequential-api">steps</a> in predicting if a random tweet was about a current disaster or if it was just a tweet, not about a disaster. This model was trained on the Kaggle Disaster Tweet <a target="_blank" href="https://www.kaggle.com/competitions/nlp-getting-started/data">dataset</a>. It also uses a TensorFlow Hub pretrained universal sentence encoder, <a target="_blank" href="https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/tensorFlow2/variations/universal-sentence-encoder/versions/2?tfhub-redirect=true">USE</a> in a Sequential Model.</p>
<p>First, I loaded the needed libraries and imported the Kaggle dataset.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Import libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">import</span> tensorflow_hub <span class="hljs-keyword">as</span> hub
<span class="hljs-keyword">from</span> tensorflow.keras <span class="hljs-keyword">import</span> layers
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> wordcloud <span class="hljs-keyword">import</span> WordCloud

<span class="hljs-comment"># Import data</span>
train_df = pd.read_csv(<span class="hljs-string">"/kaggle/input/nlp-getting-started/train.csv"</span>)
test_df = pd.read_csv(<span class="hljs-string">"/kaggle/input/nlp-getting-started/test.csv"</span>)
train_df.head()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686410584/42c58657-d232-4c45-8b02-ac9fd3dc15a1.png" alt class="image--center mx-auto" /></p>
<p>Shuffled and split the data</p>
<pre><code class="lang-python"><span class="hljs-comment"># Shuffle training dataframe</span>
train_df_shuffled = train_df.sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>)
train_df_shuffled.head()

<span class="hljs-comment"># Use train_test_split to split training data into training and validation sets</span>
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled[<span class="hljs-string">"text"</span>].to_numpy(),
                                                                            train_df_shuffled[<span class="hljs-string">"target"</span>].to_numpy(),
                                                                            test_size=<span class="hljs-number">0.1</span>, 
                                                                            random_state=<span class="hljs-number">42</span>) 

<span class="hljs-comment"># Create sentences and labels</span>
whole_train_sentences = train_df_shuffled[<span class="hljs-string">'text'</span>].to_numpy()
whole_train_labels =  train_df_shuffled[<span class="hljs-string">'target'</span>].to_numpy() 

len(whole_train_sentences) , len(whole_train_labels)
</code></pre>
<p>Created the Keras Layer using the pretrained universal sentence encoder.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create a Keras layer using the USE pretrained layer from tensorflow hub</span>
sentence_encoder_layer = hub.KerasLayer(<span class="hljs-string">"https://tfhub.dev/google/universal-sentence-encoder/4"</span>,
                                        input_shape=[], 
                                        dtype=tf.string, 
                                        trainable=<span class="hljs-literal">False</span>,
                                        name=<span class="hljs-string">"USE"</span>
                                        )
</code></pre>
<p>Created a Sequential model</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create model using the Sequential API</span>
model = tf.keras.Sequential([
  sentence_encoder_layer, 
  layers.Dense(<span class="hljs-number">64</span> , activation =<span class="hljs-string">'relu'</span>),
  layers.Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">"sigmoid"</span>)
])

<span class="hljs-comment"># Compile model</span>
model.compile(loss=<span class="hljs-string">"binary_crossentropy"</span>,
                optimizer=tf.keras.optimizers.legacy.Adam(),
                metrics=[<span class="hljs-string">"accuracy"</span>])

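<span class="hljs-comment"># Train a classifier on top of pretrained embeddings</span>
<span class="hljs-comment"># Note: fitting on the full training set means the validation rows are seen in training, so val accuracy is optimistic</span>
model_history = model.fit(whole_train_sentences,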
                              whole_train_labels,
                              epochs=<span class="hljs-number">5</span>,
                              validation_data=(val_sentences, val_labels))
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686936554/daa8faed-caac-4dfe-b0b8-de3ba9a2eb20.png" alt class="image--center mx-auto" /></p>
<p>Make predictions</p>
<pre><code class="lang-python"><span class="hljs-comment"># Make predictions with the model</span>
pred_probs = model.predict(test_df[<span class="hljs-string">'text'</span>].to_numpy())
</code></pre>
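<p>The sigmoid output is a probability per tweet, so it still has to be turned into 0/1 labels before submitting. The sketch below shows one way to do that using the standard library only. The ids and probabilities are made up, and the 0.5 cut-off and the <code>id</code>/<code>target</code> column layout are assumptions based on the competition's sample submission.</p>

```python
import csv
import io

# Hypothetical tweet ids and sigmoid outputs (stand-ins for test_df["id"] and pred_probs)
ids = [0, 2, 3, 9]
pred_probs = [0.91, 0.12, 0.50, 0.49]

# Turn each probability into a 0/1 label using a 0.5 cut-off
labels = [int(p >= 0.5) for p in pred_probs]

# Write an id,target table to an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "target"])
writer.writerows(zip(ids, labels))

print(buffer.getvalue())
```

<p>With the real model output, which has shape (n, 1), the array would need to be flattened first, for example with <code>pred_probs.squeeze()</code>.</p>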
<p>Submitted my predictions in the Kaggle Competition. It just beat the AutoML Benchmark.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686367505/6ff61220-7497-45dc-ae88-7935ca129583.png" alt class="image--center mx-auto" /></p>
<p>I also <a target="_blank" href="https://blog.dtucker.xyz/visualizing-text-data">created</a> some visuals to view the keywords in the tweet data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686479574/68b58c84-98f3-4fbc-bc08-e1f0bd9c0991.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686483535/1e982964-e385-4b27-a11f-d6594d3c4710.png" alt class="image--center mx-auto" /></p>
<p>I also used the TensorFlow projector tool to visualize the embeddings. It produced a neat clustering of related words.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705686553776/7a70fffb-9318-4903-bd7d-a69e6e32cb7b.png" alt class="image--center mx-auto" /></p>
<hr />
<p>You can find the Kaggle notebook <a target="_blank" href="https://www.kaggle.com/code/glenn23/nlp-tweets-classification-use-sequential-api">here</a>.</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Visualizing text data]]></title><description><![CDATA[As part of exploring the Kaggle dataset of tweets used to train and predict whether a tweet was about a real disaster or not, I explored a couple ways of visualizing the text data.
Python Bar Chart
Part of the dataset is a column for the tweet keywor...]]></description><link>https://blog.dtucker.xyz/visualizing-text-data</link><guid isPermaLink="true">https://blog.dtucker.xyz/visualizing-text-data</guid><category><![CDATA[Python]]></category><category><![CDATA[nlp]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 23 Feb 2024 06:00:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1705521393104/18101393-32bf-4128-8ded-f7596512d9f8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As part of exploring the Kaggle <a target="_blank" href="https://www.kaggle.com/competitions/nlp-getting-started/data">dataset</a> of tweets used to train and predict whether a tweet was about a real disaster or not, I explored a couple ways of visualizing the text data.</p>
<h3 id="heading-python-bar-chart">Python Bar Chart</h3>
<p>Part of the dataset is a column for the tweet keyword. I created a sorted bar chart to display the top keywords. 'Fatalities' is the top keyword by count, with a number of other keywords very close behind.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Plot tweet keywords</span>
plt.bar(keyword_df[<span class="hljs-string">'keyword'</span>].head(<span class="hljs-number">20</span>), keyword_df[<span class="hljs-string">'count'</span>].head(<span class="hljs-number">20</span>), color=<span class="hljs-string">'green'</span>)
plt.xticks(rotation = <span class="hljs-number">90</span>)
plt.ylabel(<span class="hljs-string">'Count'</span>)
plt.title(<span class="hljs-string">'Top keywords'</span>)
plt.show()
</code></pre>
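<p>The plotting code assumes a <code>keyword_df</code> with <code>keyword</code> and <code>count</code> columns, sorted by count. The post does not show how it was built; a minimal sketch with pandas, using a tiny made-up <code>train_df</code>, could look like this.</p>

```python
import pandas as pd

# Hypothetical stand-in for the keyword column of the competition's train.csv
train_df = pd.DataFrame({
    "keyword": ["fatalities", "deluge", "fatalities", None,
                "armageddon", "deluge", "fatalities"],
})

# Count each keyword (missing values are dropped), sorted from most to least frequent
keyword_df = (
    train_df["keyword"]
    .dropna()
    .value_counts()
    .rename_axis("keyword")
    .reset_index(name="count")
)

# Keyword-to-count mapping, handy for frequency-based visuals
word_frequencies = dict(zip(keyword_df["keyword"], keyword_df["count"]))

print(keyword_df)
```

<p>Note that once the data is aggregated like this, calling <code>value_counts()</code> on the <code>keyword</code> column again would return 1 for every keyword, so the <code>count</code> column is the one that carries the frequencies.</p>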
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705519570059/5cdca87a-0526-4924-a00d-f4f3561b1775.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-python-wordcloud">Python WordCloud</h3>
<p>In Python, I created a word cloud of the keywords.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create word cloud visual of keywords</span>
<span class="hljs-keyword">from</span> wordcloud <span class="hljs-keyword">import</span> WordCloud
word_frequencies = keyword_df[<span class="hljs-string">'keyword'</span>].value_counts().to_dict()

<span class="hljs-comment"># Generate the word cloud with frequencies</span>
wordcloud = WordCloud(width=<span class="hljs-number">800</span>, height=<span class="hljs-number">400</span>, background_color=<span class="hljs-string">'white'</span>)
wordcloud.generate_from_frequencies(word_frequencies)

<span class="hljs-comment"># Display the word cloud using matplotlib</span>
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>))
plt.imshow(wordcloud, interpolation=<span class="hljs-string">'bilinear'</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705519993725/9e034693-ef13-4363-a02e-319e3717d622.png" alt class="image--center mx-auto" /></p>
<p>This created a neat visual in which each word's size reflects its keyword count.</p>
<h3 id="heading-tableau-word-cloud-and-treemap">Tableau Word Cloud and TreeMap</h3>
<p>Finally, I explored visualizing in <a target="_blank" href="https://public.tableau.com/app/profile/donald.tucker4155/viz/Visualizingkeywords/TreeMap">Tableau</a>, but since the top keywords have counts very close together, the word cloud looked more like a list of words.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705520396488/1878a689-e6e2-4b33-8a0d-2e3c506f7ea3.png" alt class="image--center mx-auto" /></p>
<p>I followed this video in creating the visual.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=UHOMH5DTq14">https://www.youtube.com/watch?v=UHOMH5DTq14</a></div>
<p>Changing the Tableau chart into a TreeMap made it a little easier to read, but it still isn't as helpful as I would like.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705520620718/0f9a0d7d-2088-4127-9055-491160e55d24.png" alt class="image--center mx-auto" /></p>
<p>You can view the Kaggle <a target="_blank" href="https://www.kaggle.com/code/glenn23/nlp-tweets-classification-use-sequential-api">notebook</a> where the Python charts are used in an NLP classification model.</p>
]]></content:encoded></item><item><title><![CDATA[Time series forecasting with Tableau]]></title><description><![CDATA[In another example I looked at performing time series forecasting in excel. I like examining different tools or methods to perform similar tasks. For this example I am performing forecasting within Tableau using the same dataset.
This method is also ...]]></description><link>https://blog.dtucker.xyz/time-series-forecasting-with-tableau</link><guid isPermaLink="true">https://blog.dtucker.xyz/time-series-forecasting-with-tableau</guid><category><![CDATA[tableau]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 16 Feb 2024 06:00:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1705071533446/1be24b67-a746-40fe-9e8f-48f7610661b4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In another example I looked at performing time series <a target="_blank" href="https://blog.dtucker.xyz/time-series-forecasting-in-excel">forecasting in excel</a>. I like examining different tools or methods to perform similar tasks. For this example I am performing forecasting within Tableau using the same dataset.</p>
<p>This method is also pretty straightforward.</p>
<h3 id="heading-steps">Steps</h3>
<ol>
<li><p>Load the data into Tableau. For this example I loaded an Excel sheet with the mock data used in the Excel example.</p>
</li>
<li><p>Create a line graph with a date value on the x axis.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705071212532/4c4d77b6-4b25-49bb-b351-d74ffbf7fcbc.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>On the left hand panel, select the 'Analytics' panel. Under 'Model', select 'Forecast' and drag it on the line chart. You will see a small pop up showing that it will add a forecast.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705071403966/74ce916d-72fd-45dd-8bc2-375b50b05818.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>The chart will update and add the forecast. Here it is the light blue line with the shaded confidence interval band.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705071414399/6ecff8cc-f6bf-4da4-bd1f-800ea9b416fb.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>You can view and change settings by right-clicking on the line chart.</p>
<ol>
<li><p>You can change the forecast timeframe, model, and confidence interval.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705071428396/9135e0e3-9bde-4143-bed7-0cd157c9952c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>There is also an option to view descriptive information about the model.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705071434290/866a7aa2-7512-4f4a-809e-36314fb13b16.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
</li>
</ol>
<p>Overall, I found this very quick to produce. Compared to the Excel visual, Tableau produced what looks like a narrower confidence interval band around the forecast. Excel automatically created a table with the forecasted values. While such a table can be created and exported by adding another sheet in Tableau, it wasn't created automatically for me. Tableau is more of a visual presentation tool, so this is not a surprise, but it is helpful to know before committing to one method.</p>
<p><a target="_blank" href="https://blog.dtucker.xyz/time-series-forecasting-in-excel">Time series forecasting in excel</a></p>
]]></content:encoded></item><item><title><![CDATA[Time series forecasting in excel]]></title><description><![CDATA[I have done some time series forecasting using python in the past. Recently explored the Prophet library.
I like exploring doing similar things using different tools, so I wanted to step through doing time series forecasting in excel. Which is surpri...]]></description><link>https://blog.dtucker.xyz/time-series-forecasting-in-excel</link><guid isPermaLink="true">https://blog.dtucker.xyz/time-series-forecasting-in-excel</guid><category><![CDATA[excel]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 09 Feb 2024 06:00:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1704468576325/0f1a1182-a407-426b-b07e-81bbd5e390f9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have done some time series forecasting using python in the past. Recently explored the Prophet library.</p>
<p>I like trying the same task in different tools, so I wanted to step through time series forecasting in Excel, which turns out to be surprisingly simple.</p>
<h3 id="heading-steps">Steps</h3>
<ol>
<li><p>Highlight your data in Excel (both the x and y variables). In my example I had a 'Date' and a 'Value' column, filled with fake data that has some seasonality.</p>
</li>
<li><p>Select Data Tab -&gt; Forecast Section -&gt; Click 'Forecast Sheet'</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704468383765/d73962c9-ff37-4771-ac8b-21786a6c2af2.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Adjust any parameters in the window</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704468457175/1c42a43b-61be-4c0f-8e96-82365714be0a.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>This will output a chart and data table on a new sheet.</p>
</li>
</ol>
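<p>If you want fake data to try this with, here is a minimal sketch of one way to generate a 'Date' and 'Value' series with a trend and a yearly seasonal pattern. The weekly spacing, the trend, the amplitude, and the output file name are all assumptions for illustration, not the values I used for the screenshots above.</p>
<pre><code class="lang-python">import csv
import math
from datetime import date, timedelta

# Hypothetical parameters for illustration; not the values used in this post.
N_WEEKS = 104          # two years of weekly observations
PERIOD = 52            # one seasonal cycle per year
BASE, TREND, AMPLITUDE = 100.0, 0.5, 20.0


def make_seasonal_rows(start=date(2022, 1, 3)):
    """Return (date, value) rows: a linear trend plus a yearly sine wave."""
    rows = []
    for i in range(N_WEEKS):
        day = start + timedelta(weeks=i)
        seasonal = AMPLITUDE * math.sin(2 * math.pi * i / PERIOD)
        rows.append((day.isoformat(), round(BASE + TREND * i + seasonal, 2)))
    return rows


rows = make_seasonal_rows()

# Write a two-column CSV that Excel can open directly.
with open("fake_seasonal_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Date", "Value"])
    writer.writerows(rows)
</code></pre>
<p>The dates are evenly spaced on purpose, since the Forecast Sheet expects a timeline with consistent intervals.</p>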
<p><a target="_blank" href="https://blog.dtucker.xyz/time-series-forecasting-with-tableau">Time series forecasting in Tableau</a></p>
]]></content:encoded></item><item><title><![CDATA[Predicting housing prices]]></title><description><![CDATA[Using the Kaggle housing dataset, I practiced using machine learning to predict housing prices.

🔗 Links
Kaggle notebook
Tableau dashboard

Here is an outline of the steps I took to perform the analysis:

Data Exploration: Here are some of the helpf...]]></description><link>https://blog.dtucker.xyz/predicting-housing-prices</link><guid isPermaLink="true">https://blog.dtucker.xyz/predicting-housing-prices</guid><category><![CDATA[Python]]></category><category><![CDATA[Xgboost]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 02 Feb 2024 06:00:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/r3WAWU5Fi5Q/upload/822081a27b7b2c336b43ea14755036c1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Using the Kaggle housing <a target="_blank" href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data">dataset</a>, I practiced using machine learning to predict housing prices.</p>
<hr />
<h3 id="heading-links">🔗 Links</h3>
<p>Kaggle <a target="_blank" href="https://www.kaggle.com/code/glenn23/housing-prices/notebook">notebook</a></p>
<p>Tableau <a target="_blank" href="https://public.tableau.com/app/profile/donald.tucker4155/viz/HousingPrices_16960239787650/Sheet12">dashboard</a></p>
<hr />
<p>Here is an outline of the steps I took to perform the analysis:</p>
<ol>
<li><p>Data Exploration: Here are some of the helpful graphs created while exploring the data.</p>
<ol>
<li><p>Histogram of Home Sales Prices</p>
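<pre><code class="lang-python"> <span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
 <span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns

 sns.displot(df_data[<span class="hljs-string">'SalePrice'</span>], 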
             bins=<span class="hljs-number">50</span>, 
             aspect=<span class="hljs-number">2</span>,
             kde=<span class="hljs-literal">True</span>, 
             color=<span class="hljs-string">'darkblue'</span>)

 plt.title(<span class="hljs-string">f'Home Sales Price. Average: $<span class="hljs-subst">{(df_data.SalePrice.mean()):,<span class="hljs-number">.0</span>f}</span>'</span>)
 plt.xlabel(<span class="hljs-string">'Price ($)'</span>)
 plt.ylabel(<span class="hljs-string">'# of Homes'</span>)

 plt.show()
</code></pre>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704131992758/c0a8d540-4cad-46bd-ac79-a06b5117efd4.png" alt class="image--center mx-auto" /></p>
<p> Correlation heat map - This helps spot correlations between the features and the target variable <code>SalePrice</code></p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704132054288/4bd4c7a3-1484-44ab-a9a0-6fdc0d6c59da.png" alt class="image--center mx-auto" /></p>
<p> Average <code>SalePrice</code> over time ⬆️</p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704132120634/e5f33f80-6ed7-41d9-b768-71f7a0d4225e.png" alt class="image--center mx-auto" /></p>
<p> 📊 I also created a Tableau <a target="_blank" href="https://public.tableau.com/app/profile/donald.tucker4155/viz/HousingPrices_16960239787650/Sheet12">dashboard</a> to help me visualize the data. This wasn't necessary, but in the past I have found dashboards a useful way to spot correlations or relationships.</p>
</li>
</ol>
</li>
<li><p>Data Cleaning</p>
<ol>
<li><p>Created the <code>X</code> (predictor) variables and the <code>y</code> (target) variable.</p>
</li>
<li><p>Train/test split</p>
</li>
<li><p>Handle missing data and transform columns</p>
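<pre><code class="lang-python"> <span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
 <span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline
 <span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OneHotEncoder, StandardScaler

 <span class="hljs-comment"># Handle Missing Data</span>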
 numeric_cols = X.select_dtypes(include=[<span class="hljs-string">'number'</span>]).columns
 categorical_cols = X.select_dtypes(exclude=[<span class="hljs-string">'number'</span>]).columns

 numeric_imputer = SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)
 categorical_imputer = SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)

 <span class="hljs-comment"># Create transformers for preprocessing</span>
 numeric_transformer = Pipeline(steps=[
     (<span class="hljs-string">'imputer'</span>, numeric_imputer),
     (<span class="hljs-string">'scaler'</span>, StandardScaler())
 ])

 categorical_transformer = Pipeline(steps=[
     (<span class="hljs-string">'imputer'</span>, categorical_imputer),
     (<span class="hljs-string">'encoder'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))
 ])

 <span class="hljs-comment"># Use ColumnTransformer to apply transformers to the appropriate columns</span>
 <span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer

 preprocessor = ColumnTransformer(
     transformers=[
         (<span class="hljs-string">'num'</span>, numeric_transformer, numeric_cols),
         (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_cols)
     ])
</code></pre>
</li>
</ol>
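<p> The modeling code in the next step uses <code>X_train</code>, <code>X_valid</code>, <code>y_train</code> and <code>y_valid</code>. A minimal sketch of how such a split could be produced with scikit-learn's <code>train_test_split</code> is shown here. The toy DataFrame, the feature columns, <code>test_size</code> and <code>random_state</code> are all assumptions for illustration and may differ from the notebook.</p>
<pre><code class="lang-python"> import pandas as pd
 from sklearn.model_selection import train_test_split

 # Made-up stand-in for the housing data; the notebook loads the Kaggle CSV instead.
 df_data = pd.DataFrame({
     'LotArea': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
     'Neighborhood': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
     'SalePrice': [110, 120, 130, 140, 150, 160, 170, 180, 190, 200],
 })

 y = df_data['SalePrice']                  # target
 X = df_data.drop(columns=['SalePrice'])   # predictors

 # test_size and random_state are assumed values, not taken from the notebook.
 X_train, X_valid, y_train, y_valid = train_test_split(
     X, y, test_size=0.2, random_state=0)

 print(X_train.shape, X_valid.shape)
</code></pre>
<p> Splitting before fitting the imputers and scalers keeps the validation rows out of the preprocessing statistics, which is one reason to bundle everything in a pipeline.</p>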
</li>
<li><p>XGB model and Predictions</p>
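<pre><code class="lang-python"> <span class="hljs-keyword">from</span> xgboost <span class="hljs-keyword">import</span> XGBRegressor

 <span class="hljs-comment"># Create an XGBoost regressor</span>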
 xgb = XGBRegressor(n_estimators=<span class="hljs-number">500</span>, learning_rate=<span class="hljs-number">0.04</span>)
</code></pre>
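<pre><code class="lang-python"> <span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_error, mean_squared_error, r2_score

 <span class="hljs-comment"># Bundle preprocessing and modeling code in a pipeline</span>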
 my_pipeline = Pipeline(steps=[(<span class="hljs-string">'preprocessor'</span>, preprocessor),
                               (<span class="hljs-string">'model'</span>, xgb)
                              ])

 <span class="hljs-comment"># Preprocessing of training data, fit model </span>
 my_pipeline.fit(X_train, y_train)

 <span class="hljs-comment"># Preprocessing of validation data, get predictions</span>
 preds = my_pipeline.predict(X_valid)

 <span class="hljs-comment"># Evaluate the model</span>
 mae = mean_absolute_error(y_valid, preds)
 mse = mean_squared_error(y_valid, preds)
 r2 = r2_score(y_valid, preds)
 print(<span class="hljs-string">f"Mean Absolute Error (MAE): <span class="hljs-subst">{mae:<span class="hljs-number">.2</span>f}</span>"</span>)
 print(<span class="hljs-string">f"Mean Squared Error (MSE): <span class="hljs-subst">{mse:<span class="hljs-number">.2</span>f}</span>"</span>)
 print(<span class="hljs-string">f"R-squared (R2): <span class="hljs-subst">{r2:<span class="hljs-number">.2</span>f}</span>"</span>)
</code></pre>
</li>
<li><p>Hyper-parameter tuning</p>
<ol>
<li>Setting up parameters to tune</li>
</ol>
</li>
</ol>
<pre><code class="lang-python">param_tuning = {
    <span class="hljs-string">'model__learning_rate'</span>: [<span class="hljs-number">0.01</span>, <span class="hljs-number">0.1</span>, <span class="hljs-number">0.05</span>],
    <span class="hljs-string">'model__max_depth'</span>: [<span class="hljs-number">3</span>, <span class="hljs-number">5</span>, <span class="hljs-number">7</span>, <span class="hljs-number">10</span>],
    <span class="hljs-string">'model__min_child_weight'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">5</span>],
    <span class="hljs-string">'model__subsample'</span>: [<span class="hljs-number">0.5</span>, <span class="hljs-number">0.7</span>],
    <span class="hljs-string">'model__colsample_bytree'</span>: [<span class="hljs-number">0.5</span>, <span class="hljs-number">0.7</span>],
    <span class="hljs-string">'model__n_estimators'</span>: [<span class="hljs-number">100</span>, <span class="hljs-number">200</span>, <span class="hljs-number">500</span>, <span class="hljs-number">1000</span>],
    <span class="hljs-string">'model__objective'</span>: [<span class="hljs-string">'reg:squarederror'</span>]
}
</code></pre>
<ul>
<li>Grid Search</li>
</ul>
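<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> GridSearchCV

xgb_model = XGBRegressor()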
my_pipeline = Pipeline(steps=[(<span class="hljs-string">'preprocessor'</span>, preprocessor),
                              (<span class="hljs-string">'model'</span>, xgb_model)
                             ])


xgb_cv = GridSearchCV(estimator=my_pipeline,
                      param_grid=param_tuning,
                      scoring=<span class="hljs-string">'neg_mean_absolute_error'</span>,  <span class="hljs-comment"># MAE</span>
                      <span class="hljs-comment"># scoring='neg_mean_squared_error',  # MSE</span>
                      cv=<span class="hljs-number">5</span>)

xgb_cv.fit(X_train, y_train)
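<span class="hljs-comment"># best_score_ is the negated MAE here, so values closer to 0 are better</span>
print(<span class="hljs-string">"Best Score: "</span>, xgb_cv.best_score_)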
print(<span class="hljs-string">"Best Params: "</span>, xgb_cv.best_params_)
</code></pre>
<p>Hyper-parameter tuning helped decrease the model error 🎉</p>
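<p>A grid search tries every combination in the grid, so it is worth estimating the amount of work before running it. The sketch below copies the grid from above and counts the candidates using only the standard library; the variable names <code>n_candidates</code> and <code>n_folds</code> are mine.</p>
<pre><code class="lang-python">from math import prod

# Same grid as above, copied so the snippet runs on its own.
param_tuning = {
    'model__learning_rate': [0.01, 0.1, 0.05],
    'model__max_depth': [3, 5, 7, 10],
    'model__min_child_weight': [1, 3, 5],
    'model__subsample': [0.5, 0.7],
    'model__colsample_bytree': [0.5, 0.7],
    'model__n_estimators': [100, 200, 500, 1000],
    'model__objective': ['reg:squarederror'],
}

# Every combination is one candidate; each candidate is fit once per CV fold.
n_candidates = prod(len(values) for values in param_tuning.values())
n_folds = 5  # matches cv=5 in the GridSearchCV call

print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
</code></pre>
<p>That comes to 576 candidates and 2,880 pipeline fits, plus one final refit on the full training set since <code>refit</code> defaults to <code>True</code>. If that is too slow, a smaller grid or scikit-learn's <code>RandomizedSearchCV</code> are possible alternatives.</p>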
]]></content:encoded></item><item><title><![CDATA[Spaceship Titanic classification model]]></title><description><![CDATA[Using the Kaggle Spaceship Titanic dataset, I created a machine learning model in python to predict which passengers in the data would be transported to an alternate dimension.
Check out the Kaggle notebook
Data Exploration
The first step was loading...]]></description><link>https://blog.dtucker.xyz/spaceship-titanic-classification-model</link><guid isPermaLink="true">https://blog.dtucker.xyz/spaceship-titanic-classification-model</guid><category><![CDATA[Python]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 26 Jan 2024 06:00:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/084iI8XTfN0/upload/672d5a04035f6fa9e62dec74f0cab8c8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Using the Kaggle Spaceship Titanic <a target="_blank" href="https://www.kaggle.com/competitions/spaceship-titanic">dataset</a>, I created a machine learning model in python to predict which passengers in the data would be transported to an alternate dimension.</p>
<p>Check out the Kaggle <a target="_blank" href="https://www.kaggle.com/code/glenn23/spaceship-titanic-model">notebook</a></p>
<h3 id="heading-data-exploration"><strong>Data Exploration</strong></h3>
<p>The first step was loading the data and performing some exploration.</p>
<p>A pair plot is a helpful visual to spot correlations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703708601897/91783689-a48f-459e-b154-1d5cd5772f46.png" alt class="image--center mx-auto" /></p>
<p>Here is a visual of the ages of those transported.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703708798298/0f111bb0-aea2-4e67-b081-c1aba54e753a.png" alt class="image--center mx-auto" /></p>
<p>During the data exploration I used the <code>get_dummies</code> function to one-hot encode the 'HomePlanet' and 'Destination' columns. I also applied other steps to clean and transform the data.</p>
<pre><code class="lang-python">df_data_final = pd.get_dummies(df_data_final, columns=[<span class="hljs-string">'HomePlanet'</span>, <span class="hljs-string">'Destination'</span>], drop_first=<span class="hljs-literal">False</span>)
</code></pre>
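<p>To make the effect of <code>get_dummies</code> concrete, here is a small sketch on a made-up DataFrame. Only the column names mirror this post; the rows, the 'Age' column and the placeholder destination labels are assumptions for illustration.</p>
<pre><code class="lang-python">import pandas as pd

# Made-up rows for illustration; only the column names mirror the post.
toy = pd.DataFrame({
    'Age': [39, 24, 58],
    'HomePlanet': ['Europa', 'Earth', 'Mars'],
    'Destination': ['A', 'B', 'A'],
})

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=['HomePlanet', 'Destination'], drop_first=False)

print(list(encoded.columns))
</code></pre>
<p>With <code>drop_first=False</code> every category keeps its own column, so the three 'HomePlanet' values and two 'Destination' values become five indicator columns next to 'Age'.</p>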
<p>I created a Tableau <a target="_blank" href="https://public.tableau.com/app/profile/donald.tucker4155/viz/SpaceshipTitanic_16882551562560/PassengerDashboard">dashboard</a> to help visualize the data. This wasn't necessary but helped me see and interact with the data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703709015618/98b00561-7981-4ea1-9041-56491114777b.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-creating-the-model">Creating the model</h3>
<p>The target variable is the 'Transported' column; the remaining columns are the features.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Define the y (target) variable.</span>
y = df_data_final[<span class="hljs-string">'Transported'</span>]

<span class="hljs-comment"># Define the X (predictor) variables.</span>
X = df_data_final.drop([<span class="hljs-string">'Transported'</span>], axis = <span class="hljs-number">1</span>)
</code></pre>
<h3 id="heading-results">Results</h3>
<p>Random Forest Model Accuracy: 0.781048758049678</p>
<p>XGBoost Model Accuracy: 0.7828886844526219</p>
<p>The XGBoost model performed slightly better on the training data.</p>
<p>If you are interested, the Kaggle notebook can be found <a target="_blank" href="https://www.kaggle.com/code/glenn23/spaceship-titanic-model">here</a></p>
]]></content:encoded></item><item><title><![CDATA[Online Retail Exploratory Data Analysis with Python]]></title><description><![CDATA[Gaining some practice performing exploratory data analysis, since this is often one of the first steps in preparing data for machine learning.

Links:
Jupyter notebook
Tableau dashboard visualization

Here are some major steps

Load the dataset

Perf...]]></description><link>https://blog.dtucker.xyz/online-retail-exploratory-data-analysis-with-python</link><guid isPermaLink="true">https://blog.dtucker.xyz/online-retail-exploratory-data-analysis-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[exploratory data analysis]]></category><dc:creator><![CDATA[Donald Tucker]]></dc:creator><pubDate>Fri, 19 Jan 2024 06:00:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/LvySG1hvuzI/upload/ecf8286366d8e0df6eb24c2674f8e70d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Gaining some practice performing exploratory data analysis, since this is often one of the first steps in preparing data for machine learning.</p>
<hr />
<p><strong>Links:</strong></p>
<p>Jupyter <a target="_blank" href="https://github.com/Unizomby/online_retail_eda/blob/main/online_retail.ipynb">notebook</a></p>
<p>Tableau dashboard <a target="_blank" href="https://public.tableau.com/app/profile/donald.tucker4155/viz/OnlineRetailsalesEDA/Dashboard1">visualization</a></p>
<hr />
<h3 id="heading-here-are-some-major-steps">Here are some major steps</h3>
<ol>
<li><p>Load the dataset</p>
</li>
<li><p>Perform data cleaning by handling missing values, if any, and removing any redundant or unnecessary columns.</p>
<p> <code>df.drop_duplicates()</code>, <code>df.dropna()</code></p>
</li>
<li><p>Explore the basic statistics of the dataset</p>
<p> <code>df.describe()</code></p>
</li>
<li><p>Perform data visualization to gain insights into the dataset. Generate appropriate plots, such as histograms, scatter plots, or bar plots, to visualize different aspects of the data.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703710629476/041ede7f-8318-4b2e-b5be-ddd5da08e1ec.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Analyze the sales trends over time. Identify the busiest months and days of the week in terms of sales.</p>
<p> I used <code>groupby()</code> to aggregate the data for the sales trend charts.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703710809247/68b531a2-bd0b-4b5e-8f96-805b3873ff7f.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Explore the top-selling products and countries based on the quantity sold.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703710945549/d31526cc-9051-417e-b3a1-e1bc61dca6d9.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703710868553/c7b77e3c-af74-460d-8aea-a6d50f98761b.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Identify any outliers or anomalies in the dataset and discuss their potential impact on the analysis.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703710981349/43f2c3d5-c892-461f-b34a-87be911e6552.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Conclusions</strong>:</p>
<p> There are missing values for product <code>Description</code> and <code>CustomerId</code>. For now I will keep these rows in the data and discuss them with the customer before removing anything.</p>
<p> There are some items in the transactions that might be removed, for example items listed as AMAZON FEE, Manual, DOTCOM POSTAGE, and POSTAGE. These make it harder to compare product sales in the data.</p>
<p> There appear to be 2 transactions with very high <code>Quantity</code> values, each with the same quantity returned the same day. These appear to be returned items, so I recommend excluding these transactions.</p>
</li>
</ol>
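<p>The <code>groupby()</code> step for the sales trends can be sketched roughly as follows. The transactions are made up, and summing <code>Quantity</code> as the measure of "sales" is an assumption; the notebook may aggregate a different column.</p>
<pre><code class="lang-python">import pandas as pd

# Made-up transactions; only the column names follow the Online Retail dataset.
df = pd.DataFrame({
    'InvoiceDate': pd.to_datetime([
        '2011-11-03', '2011-11-10', '2011-11-11', '2011-12-01', '2011-12-05',
    ]),
    'Quantity': [10, 6, 4, 8, 2],
})

# Derive the grouping keys from the timestamp, then sum quantity per group.
by_month = df.groupby(df['InvoiceDate'].dt.month_name())['Quantity'].sum()
by_weekday = df.groupby(df['InvoiceDate'].dt.day_name())['Quantity'].sum()

# idxmax returns the label of the group with the largest total.
busiest_month = by_month.idxmax()
busiest_weekday = by_weekday.idxmax()

print(busiest_month, busiest_weekday)
</code></pre>
<p>The two grouped Series can be passed straight to a bar plot to get charts like the ones above.</p>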
<p>Two outliers appear in <code>UnitPrice</code>:</p>
<ul>
<li><p>First is a Manual transaction for <code>StockCode</code> 'M' with a unit price of 38,970</p>
</li>
<li><p>Second, there are 2 negative transactions for <code>StockCode</code> 'B' to 'adjust bad debt'</p>
</li>
</ul>
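<p>One common way to flag values like these is the 1.5 x IQR rule. This is a sketch of that rule on made-up prices, not necessarily the method used in the notebook, and the threshold of 1.5 is a convention rather than something derived from this dataset.</p>
<pre><code class="lang-python">import pandas as pd

# Made-up unit prices with one extreme value; not the real dataset.
prices = pd.Series([1.25, 2.10, 2.55, 3.75, 4.95, 8.50, 500.00], name='UnitPrice')

# Interquartile range and the usual 1.5 x IQR fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside the fences for manual review rather than dropping it.
outliers = prices[~prices.between(low, high)]

print(outliers.tolist())
</code></pre>
<p>Flagging instead of deleting fits the approach above of keeping questionable rows until they have been discussed with the customer.</p>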
<p><strong>Until those items above are removed, we can see that:</strong></p>
<ul>
<li><p>📅Busiest month: November</p>
</li>
<li><p>📅Busiest weekday: Thursday</p>
</li>
<li><p>🔥Most transacted product qty: World War 2 Gliders Asstd Design (85123A)</p>
</li>
<li><p>Most transacted stockcode (without description): 22197</p>
</li>
<li><p>Highest unit price item: AMAZON FEE</p>
</li>
<li><p>Highest product unit price: REGENCY CAKESTAND 3 TIER</p>
</li>
<li><p>🌍Majority of sales are in United Kingdom</p>
</li>
<li><p>Avg transaction qty: 9.6</p>
</li>
<li><p>Avg transaction unit price: 4.6</p>
</li>
<li><p>Weekly qty has an upward trend in 2011</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>