首页 文章 精选 留言 我的

精选列表

搜索[工具模块],共10000篇文章
优秀的个人博客,低调大师

[雪峰磁针石博客]数据分析工具pandas快速入门教程5-处理缺失数据

第5章 缺失数据 介绍 很少没有任何缺失值的数据集。 有许多缺失数据的表示。 在数据库中是NULL值,一些编程语言使用NA。缺失值可以是空字符串:''或者甚至是数值88或99等。Pandas显示缺失值为NaN。 本章将涵盖: 什么是缺失值 如何创建缺失值 如何重新编码并使用缺失值进行计算 什么是缺失值 可以从numpy中获得NaN值,在Python中看到缺失值使用几种方式显示:NaN,NAN或nan,他们都是相等的。 NaN不等于0或空字符串''。 In [1]: from numpy import NaN, NAN, nan In [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|') False|False|False|False In [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|') False|False|False|False In [4]: import pandas as pd In [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|') True|True|True In [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|') False|True|True 缺失值的来源 来自加载数据或数据处理 加载数据 当我们加载数据时,pandas会自动找到该缺少数据的单元格,并填充NaN值。在read_csv函数中,参数na_values, keep_default_na, na_filter用于处理缺失值。比如:na_values=[99]。na_filter设置为False,在读大文件时会提升性能。 5-1.py import pandas as pd visited_file = 'data/survey_visited.csv' print(pd.read_csv(visited_file)) print(pd.read_csv(visited_file, keep_default_na=False)) print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False)) 执行结果 $ python3 5-1.py ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 NaN 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22 ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22 ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 NaN 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22 合并数据 import pandas as pd visited = pd.read_csv('data/survey_visited.csv') survey = pd.read_csv('data/survey_survey.csv') print(visited) print(survey) vs = visited.merge(survey, left_on='ident', right_on='taken') print(vs) 执行结果 $ python3 5-2.py ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 NaN 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22 taken person quant reading 0 619 dyer rad 9.82 1 619 dyer sal 0.13 2 622 dyer rad 7.80 3 622 dyer sal 0.09 4 734 pb rad 8.41 5 734 lake sal 0.05 6 734 pb temp -21.50 7 735 pb rad 7.22 8 735 NaN sal 0.06 9 735 NaN temp -26.00 10 751 pb rad 4.35 11 751 pb temp -18.50 12 751 lake sal 0.10 13 752 lake rad 2.19 14 752 lake sal 0.09 15 752 lake temp -16.00 16 752 roe sal 41.60 17 837 lake rad 1.46 18 837 lake sal 0.21 19 837 roe sal 22.50 20 844 roe rad 11.25 ident site dated taken person quant reading 0 619 DR-1 1927-02-08 619 dyer rad 9.82 1 619 DR-1 1927-02-08 619 dyer sal 0.13 2 622 DR-1 1927-02-10 622 dyer rad 7.80 3 622 DR-1 1927-02-10 622 dyer sal 0.09 4 734 DR-3 1939-01-07 734 pb rad 8.41 5 734 DR-3 1939-01-07 734 lake sal 0.05 6 734 DR-3 1939-01-07 734 pb temp -21.50 7 735 DR-3 1930-01-12 735 pb rad 7.22 8 735 DR-3 1930-01-12 735 NaN sal 0.06 9 735 DR-3 1930-01-12 735 NaN temp -26.00 10 751 DR-3 1930-02-26 751 pb rad 4.35 11 751 DR-3 1930-02-26 751 pb temp -18.50 12 751 DR-3 1930-02-26 751 lake sal 0.10 13 752 DR-3 NaN 752 lake rad 2.19 14 752 DR-3 NaN 752 lake sal 0.09 15 752 DR-3 NaN 752 lake temp -16.00 16 752 DR-3 NaN 752 roe sal 41.60 17 837 MSK-4 1932-01-14 837 lake rad 1.46 18 837 MSK-4 1932-01-14 837 lake sal 0.21 19 837 MSK-4 1932-01-14 837 roe sal 22.50 20 844 DR-1 1932-03-22 844 roe rad 11.25 用户输入 import pandas as pd from numpy import NaN, NAN, nan num_legs = pd.Series({'goat': 4, 'amoeba': nan}) print(num_legs) scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'], 'Occupation': ['Chemist', 'Statistician'], 'Born': ['1920-07-25', '1876-06-13'], 'Died': ['1958-04-16', '1937-10-16'], 'missing': [NaN, nan]}) print(scientists) scientists['missing'] = nan print(scientists) 执行结果 $ python3 5-3.py amoeba NaN goat 4.0 dtype: float64 Born Died Name Occupation missing 0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN 1 1876-06-13 1937-10-16 William Gosset Statistician NaN Born Died Name Occupation missing 0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN 1 1876-06-13 1937-10-16 William Gosset Statistician NaN 重新索引 5-4.py import pandas as pd from numpy import NaN, NAN, nan gapminder = pd.read_csv('data/gapminder.tsv', sep='\t') life_exp = gapminder.groupby(['year'])['lifeExp'].mean() print(life_exp) print(life_exp.reindex(range(2000, 2010))) 执行结果 year 1952 49.057620 1957 51.507401 1962 53.609249 1967 55.678290 1972 57.647386 1977 59.570157 1982 61.533197 1987 63.212613 1992 64.160338 1997 65.014676 2002 65.694923 2007 67.007423 Name: lifeExp, dtype: float64 year 2000 NaN 2001 NaN 2002 65.694923 2003 NaN 2004 NaN 2005 NaN 2006 NaN 2007 67.007423 2008 NaN 2009 NaN Name: lifeExp, dtype: float64 处理缺失数据 统计缺失数据 5-5.py import pandas as pd from numpy import NaN, NAN, nan import numpy as np ebola = pd.read_csv('data/country_timeseries.csv') print(ebola.head()) print(ebola.count()) num_rows = ebola.shape[0] print("num_rows") print(num_rows) num_missing = num_rows - ebola.count() print("num_missing:") print(num_missing) print(np.count_nonzero(ebola.isnull())) print(np.count_nonzero(ebola['Cases_Guinea'].isnull())) print(ebola.Cases_Guinea.value_counts(dropna=False).head()) 执行结果 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \ 0 1/5/2015 289 2776.0 NaN 10030.0 1 1/4/2015 288 2775.0 NaN 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 NaN 8157.0 NaN 4 12/31/2014 284 2730.0 8115.0 9633.0 Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \ 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \ 0 1786.0 NaN 2977.0 NaN 1 1781.0 NaN 2943.0 NaN 2 1767.0 3496.0 2915.0 NaN 3 NaN 3496.0 NaN NaN 4 1739.0 3471.0 2827.0 NaN Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN Date 122 Day 122 Cases_Guinea 93 Cases_Liberia 83 Cases_SierraLeone 87 Cases_Nigeria 38 Cases_Senegal 25 Cases_UnitedStates 18 Cases_Spain 16 Cases_Mali 12 Deaths_Guinea 92 Deaths_Liberia 81 Deaths_SierraLeone 87 Deaths_Nigeria 38 Deaths_Senegal 22 Deaths_UnitedStates 18 Deaths_Spain 16 Deaths_Mali 12 dtype: int64 num_rows 122 num_missing: Date 0 Day 0 Cases_Guinea 29 Cases_Liberia 39 Cases_SierraLeone 35 Cases_Nigeria 84 Cases_Senegal 97 Cases_UnitedStates 104 Cases_Spain 106 Cases_Mali 110 Deaths_Guinea 30 Deaths_Liberia 41 Deaths_SierraLeone 35 Deaths_Nigeria 84 Deaths_Senegal 100 Deaths_UnitedStates 104 Deaths_Spain 106 Deaths_Mali 110 dtype: int64 1214 29 NaN 29 86.0 3 495.0 2 112.0 2 390.0 2 Name: Cases_Guinea, dtype: int64 处理缺失数据 5-6.py import pandas as pd from numpy import NaN, NAN, nan import numpy as np ebola = pd.read_csv('data/country_timeseries.csv') print(ebola.iloc[0:10, 0:5]) print(ebola.fillna(0).iloc[0:10, 0:5]) # 前向填充 print(ebola.fillna(method='ffill').iloc[0:10, 0:5]) # 后向填充 print(ebola.fillna(method='bfill').iloc[0:10, 0:5]) print(ebola.interpolate().iloc[0:10, 0:5]) print(ebola.shape) ebola_dropna = ebola.dropna() print(ebola_dropna.shape) print(ebola_dropna) ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \ ebola['Cases_SierraLeone'] ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_multiple']] print(ebola_subset.head(n=10)) print(ebola.Cases_Guinea.sum(skipna = True)) print(ebola.Cases_Guinea.sum(skipna = False)) 执行结果 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone 0 1/5/2015 289 2776.0 NaN 10030.0 1 1/4/2015 288 2775.0 NaN 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 NaN 8157.0 NaN 4 12/31/2014 284 2730.0 8115.0 9633.0 5 12/28/2014 281 2706.0 8018.0 9446.0 6 12/27/2014 280 2695.0 NaN 9409.0 7 12/24/2014 277 2630.0 7977.0 9203.0 8 12/21/2014 273 2597.0 NaN 9004.0 9 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone 0 1/5/2015 289 2776.0 0.0 10030.0 1 1/4/2015 288 2775.0 0.0 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 0.0 8157.0 0.0 4 12/31/2014 284 2730.0 8115.0 9633.0 5 12/28/2014 281 2706.0 8018.0 9446.0 6 12/27/2014 280 2695.0 0.0 9409.0 7 12/24/2014 277 2630.0 7977.0 9203.0 8 12/21/2014 273 2597.0 0.0 9004.0 9 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone 0 1/5/2015 289 2776.0 NaN 10030.0 1 1/4/2015 288 2775.0 NaN 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 2769.0 8157.0 9722.0 4 12/31/2014 284 2730.0 8115.0 9633.0 5 12/28/2014 281 2706.0 8018.0 9446.0 6 12/27/2014 280 2695.0 8018.0 9409.0 7 12/24/2014 277 2630.0 7977.0 9203.0 8 12/21/2014 273 2597.0 7977.0 9004.0 9 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone 0 1/5/2015 289 2776.0 8166.0 10030.0 1 1/4/2015 288 2775.0 8166.0 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 2730.0 8157.0 9633.0 4 12/31/2014 284 2730.0 8115.0 9633.0 5 12/28/2014 281 2706.0 8018.0 9446.0 6 12/27/2014 280 2695.0 7977.0 9409.0 7 12/24/2014 277 2630.0 7977.0 9203.0 8 12/21/2014 273 2597.0 7862.0 9004.0 9 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone 0 1/5/2015 289 2776.0 NaN 10030.0 1 1/4/2015 288 2775.0 NaN 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 2749.5 8157.0 9677.5 4 12/31/2014 284 2730.0 8115.0 9633.0 5 12/28/2014 281 2706.0 8018.0 9446.0 6 12/27/2014 280 2695.0 7997.5 9409.0 7 12/24/2014 277 2630.0 7977.0 9203.0 8 12/21/2014 273 2597.0 7919.5 9004.0 9 12/20/2014 272 2571.0 7862.0 8939.0 (122, 18) (1, 18) Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \ 19 11/18/2014 241 2047.0 7082.0 6190.0 Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \ 19 20.0 1.0 4.0 1.0 6.0 Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \ 19 1214.0 2963.0 1267.0 8.0 Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali 19 0.0 1.0 0.0 6.0 Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_multiple 0 2776.0 NaN 10030.0 NaN 1 2775.0 NaN 9780.0 NaN 2 2769.0 8166.0 9722.0 20657.0 3 NaN 8157.0 NaN NaN 4 2730.0 8115.0 9633.0 20478.0 5 2706.0 8018.0 9446.0 20170.0 6 2695.0 NaN 9409.0 NaN 7 2630.0 7977.0 9203.0 19810.0 8 2597.0 NaN 9004.0 NaN 9 2571.0 7862.0 8939.0 19372.0 84729.0 nan 参考资料 技术支持qq群144081101 591302926 567351477 钉钉免费群21745728 本文最新版本地址 本文涉及的python测试开发库 谢谢点赞! 本文相关海量书籍下载 源码下载 本文英文版书籍下载

优秀的个人博客,低调大师

分布式系统开发工具包 —— 基于Kryo的Java对象序列化

Kryo是用于Java语言的一个快速和高效的对象图序列化框架。Kryo项目的目的是快速、高效、方便地使用API。当需要持久化对象的时候,不论是持久化到文件、数据库还是网络,都可以使用Kryo。 目前Kryo已经到了4.0.1版本以上了。本文的介绍适用于V2.0+以上版本。 安装Kryo 一般适用maven来Kryo包。 使用官方版的Kryo的话可以引用下述配置代码 <dependency> <groupId>com.esotericsoftware</groupId> <artifactId>kryo</artifactId> <version>4.0.1</version> </dependency> 如果你已经在你的classpath下有了不同版本的asm了的话,上述依赖可能会碰到问题。这时你可以使用kyro-shaded jar包,它自身包含了它所需版本的asm,并且是位于在不同包里的。 <dependency> <groupId>com.esotericsoftware</groupId> <artifactId>kryo-shaded</artifactId> <version>4.0.1</version> </dependency> 如果你想试用最新的特性 <repository> <id>sonatype-snapshots</id> <name>sonatype snapshots repo</name> <url>https://oss.sonatype.org/content/repositories/snapshots</url> </repository> <dependency> <groupId>com.esotericsoftware</groupId> <artifactId>kryo</artifactId> <version>4.0.1-SNAPSHOT</version> </dependency> 开始使用Kryo库 import com.esotericsoftware.kryo.Kryo; import com.esotericsoftware.kryo.io.Output; import com.esotericsoftware.kryo.io.Input; // ... Kryo kryo = new Kryo(); // ... Output output = new Output(new FileOutputStream("file.bin")); SomeClass someObject = ... kryo.writeObject(output, someObject); output.close(); // ... Input input = new Input(new FileInputStream("file.bin")); SomeClass someObject = kryo.readObject(input, SomeClass.class); input.close(); Kryo类编排序列化,Output和Input类处理缓存字节和刷新到流中。 参考连接: Kryo官网 用于网络的Kryo Kryo为什么比Hessian快 原文发布于:http://www.yesdata.net/2018/03/17/kyro/

优秀的个人博客,低调大师

exe2hex - 使用Windows工具DEBUG.exe或PowerShell进行内联文件传输

版权声明:转载请注明出处:http://blog.csdn.net/dajitui2024 https://blog.csdn.net/dajitui2024/article/details/79396425 参考文章:1、https://howucan.gr/scripts-tools/2368-exe2hex-inline-file-transfer-using-in-built-windows-tools-debug-exe-or-powershell2、https://github.com/g0tmi1k/exe2hex 安装: exe2hex只需要Python 3的功能(Expect是Telnet和WinEXE功能的可选)。 需在$PATH变量中添加一个文件夹exe2hex即可: $ echo $PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin $ curl https://raw.githubusercontent.com/g0tmi1k/exe2hex/master/exe2hex.py > /usr/local/bin/exe2hex $ chmod 0755 /usr/local/bin/exe2hex root@kali:~# apt install -y exe2hexbat github页面有相对详细的介绍。

资源下载

更多资源
Mario

Mario

马里奥是站在游戏界顶峰的超人气多面角色。马里奥靠吃蘑菇成长,特征是大鼻子、头戴帽子、身穿背带裤,还留着胡子。与他的双胞胎兄弟路易基一起,长年担任任天堂的招牌角色。

Nacos

Nacos

Nacos /nɑ:kəʊs/ 是 Dynamic Naming and Configuration Service 的首字母简称,一个易于构建 AI Agent 应用的动态服务发现、配置管理和AI智能体管理平台。Nacos 致力于帮助您发现、配置和管理微服务及AI智能体应用。Nacos 提供了一组简单易用的特性集,帮助您快速实现动态服务发现、服务配置、服务元数据、流量管理。Nacos 帮助您更敏捷和容易地构建、交付和管理微服务平台。

Sublime Text

Sublime Text

Sublime Text具有漂亮的用户界面和强大的功能,例如代码缩略图,Python的插件,代码段等。还可自定义键绑定,菜单和工具栏。Sublime Text 的主要功能包括:拼写检查,书签,完整的 Python API , Goto 功能,即时项目切换,多选择,多窗口等等。Sublime Text 是一个跨平台的编辑器,同时支持Windows、Linux、Mac OS X等操作系统。

WebStorm

WebStorm

WebStorm 是jetbrains公司旗下一款JavaScript 开发工具。目前已经被广大中国JS开发者誉为“Web前端开发神器”、“最强大的HTML5编辑器”、“最智能的JavaScript IDE”等。与IntelliJ IDEA同源,继承了IntelliJ IDEA强大的JS部分的功能。

用户登录
用户注册