找回密码
 骑士注册

QQ登录

微博登录

搜索
❏ 站外平台:

构建你的数据科学作品集:用数据讲故事

2017-10-22 21:01    评论: 1    

计算变量

计算变量可以通过使我们的比较更加快速来加快分析速度,并且能使我们做到本无法做到的比较。我们能做的第一件事就是从分开的列 SAT Math Avg. ScoreSAT Critical Reading Avg. Score 和 SAT Writing Avg. Score 计算 SAT 成绩:

  • 将 SAT 列数值从字符转化为数字。
  • 将所有列相加以得到 sat_score,即 SAT 成绩。

In [72]:

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = data["sat_results"][c].convert_objects(convert_numeric=True)

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

接下来,我们将需要进行每所学校的坐标位置分析,以便我们制作地图。这将使我们画出每所学校的位置。在下面的代码中,我们将会:

  • Location 1 列分析出经度和维度。
  • 转化 lat(经度)和 lon(维度)为数字。

In [73]:

data["hs_directory"]['lat'] = data["hs_directory"]['Location 1'].apply(lambda x: x.split("\n")[-1].replace("(", "").replace(")", "").split(", ")[0])
data["hs_directory"]['lon'] = data["hs_directory"]['Location 1'].apply(lambda x: x.split("\n")[-1].replace("(", "").replace(")", "").split(", ")[1])

for c in ['lat', 'lon']:
    data["hs_directory"][c] = data["hs_directory"][c].convert_objects(convert_numeric=True)

现在,我们将输出每个数据集来查看我们有了什么数据:

In [74]:

for k,v in data.items():
    print(k)
    print(v.head())
math_test_results
        DBN Grade  Year      Category  Number Tested Mean Scale Score  \
111  01M034     8  2011  All Students             48              646
280  01M140     8  2011  All Students             61              665
346  01M184     8  2011  All Students             49              727
388  01M188     8  2011  All Students             49              658
411  01M292     8  2011  All Students             49              650

    Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 #  \
111        15     31.3%        22     45.8%        11     22.9%         0
280         1      1.6%        43     70.5%        17     27.9%         0
346         0        0%         0        0%         5     10.2%        44
388        10     20.4%        26     53.1%        10     20.4%         3
411        15     30.6%        25       51%         7     14.3%         2

    Level 4 % Level 3+4 # Level 3+4 %
111        0%          11       22.9%
280        0%          17       27.9%
346     89.8%          49        100%
388      6.1%          13       26.5%
411      4.1%           9       18.4%
survey
      DBN  rr_s  rr_t  rr_p    N_s   N_t    N_p  saf_p_11  com_p_11  eng_p_11  \
0  01M015   NaN    88    60    NaN  22.0   90.0       8.5       7.6       7.5
1  01M019   NaN   100    60    NaN  34.0  161.0       8.4       7.6       7.6
2  01M020   NaN    88    73    NaN  42.0  367.0       8.9       8.3       8.3
3  01M034  89.0    73    50  145.0  29.0  151.0       8.8       8.2       8.0
4  01M063   NaN   100    60    NaN  23.0   90.0       8.7       7.9       8.1

      ...      eng_t_10  aca_t_11  saf_s_11  com_s_11  eng_s_11  aca_s_11  \
0     ...           NaN       7.9       NaN       NaN       NaN       NaN
1     ...           NaN       9.1       NaN       NaN       NaN       NaN
2     ...           NaN       7.5       NaN       NaN       NaN       NaN
3     ...           NaN       7.8       6.2       5.9       6.5       7.4
4     ...           NaN       8.1       NaN       NaN       NaN       NaN

   saf_tot_11  com_tot_11  eng_tot_11  aca_tot_11
0         8.0         7.7         7.5         7.9
1         8.5         8.1         8.2         8.4
2         8.2         7.3         7.5         8.0
3         7.3         6.7         7.1         7.9
4         8.5         7.6         7.9         8.0

[5 rows x 23 columns]
ap_2010
      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39
1  01M450                 EAST SIDE COMMUNITY HS              19
2  01M515                    LOWER EASTSIDE PREP              24
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255
4  02M296  High School of Hospitality Management               s

  Total Exams Taken Number of Exams with scores 3 4 or 5
0                49                                   10
1                21                                    s
2                26                                   24
3               377                                  191
4                 s                                    s
sat_results
      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
2  01M450                     EAST SIDE COMMUNITY SCHOOL
3  01M458                      FORSYTH SATELLITE ACADEMY
4  01M509                        MARTA VALLE HIGH SCHOOL

  Num of SAT Test Takers  SAT Critical Reading Avg. Score  \
0                     29                            355.0
1                     91                            383.0
2                     70                            377.0
3                      7                            414.0
4                     44                            390.0

   SAT Math Avg. Score  SAT Writing Avg. Score  sat_score
0                404.0                   363.0     1122.0
1                423.0                   366.0     1172.0
2                402.0                   370.0     1149.0
3                401.0                   359.0     1174.0
4                433.0                   384.0     1207.0
class_size
      DBN  CSD  NUMBER OF STUDENTS / SEATS FILLED  NUMBER OF SECTIONS  \
0  01M292    1                            88.0000            4.000000
1  01M332    1                            46.0000            2.000000
2  01M378    1                            33.0000            1.000000
3  01M448    1                           105.6875            4.750000
4  01M450    1                            57.6000            2.733333

   AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  SIZE OF LARGEST CLASS  \
0           22.564286                   18.50              26.571429
1           22.000000                   21.00              23.500000
2           33.000000                   33.00              33.000000
3           22.231250                   18.25              27.062500
4           21.200000                   19.40              22.866667

   SCHOOLWIDE PUPIL-TEACHER RATIO
0                             NaN
1                             NaN
2                             NaN
3                             NaN
4                             NaN
demographics
       DBN                                              Name  schoolyear  \
6   01M015  P.S. 015 ROBERTO CLEMENTE                           20112012
13  01M019  P.S. 019 ASHER LEVY                                 20112012
20  01M020  PS 020 ANNA SILVER                                  20112012
27  01M034  PS 034 FRANKLIN D ROOSEVELT                         20112012
35  01M063  PS 063 WILLIAM MCKINLEY                             20112012

   fl_percent  frl_percent  total_enrollment prek    k grade1 grade2  \
6         NaN         89.4               189   13   31     35     28
13        NaN         61.5               328   32   46     52     54
20        NaN         92.5               626   52  102    121     87
27        NaN         99.7               401   14   34     38     36
35        NaN         78.9               176   18   20     30     21

      ...     black_num black_per hispanic_num hispanic_per white_num  \
6     ...            63      33.3          109         57.7         4
13    ...            81      24.7          158         48.2        28
20    ...            55       8.8          357         57.0        16
27    ...            90      22.4          275         68.6         8
35    ...            41      23.3          110         62.5        15

   white_per male_num male_per female_num female_per
6        2.1     97.0     51.3       92.0       48.7
13       8.5    147.0     44.8      181.0       55.2
20       2.6    330.0     52.7      296.0       47.3
27       2.0    204.0     50.9      197.0       49.1
35       8.5     97.0     55.1       79.0       44.9

[5 rows x 38 columns]
graduation
     Demographic     DBN                            School Name Cohort  \
3   Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL   2006
10  Total Cohort  01M448    UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   2006
17  Total Cohort  01M450             EAST SIDE COMMUNITY SCHOOL   2006
24  Total Cohort  01M509                MARTA VALLE HIGH SCHOOL   2006
31  Total Cohort  01M515  LOWER EAST SIDE PREPARATORY HIGH SCHO   2006

    Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
3             78              43                     55.1%                36
10           124              53                     42.7%                42
17            90              70                     77.8%                67
24            84              47                       56%                40
31           193             105                     54.4%                91

   Total Regents - % of cohort Total Regents - % of grads  \
3                        46.2%                      83.7%
10                       33.9%                      79.2%
17         74.400000000000006%                      95.7%
24                       47.6%                      85.1%
31                       47.2%                      86.7%

              ...            Regents w/o Advanced - n  \
3             ...                                  36
10            ...                                  34
17            ...                                  67
24            ...                                  23
31            ...                                  22

   Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads  \
3                               46.2%                             83.7%
10                              27.4%                             64.2%
17                74.400000000000006%                             95.7%
24                              27.4%                             48.9%
31                              11.4%                               21%

   Local - n Local - % of cohort Local - % of grads Still Enrolled - n  \
3          7                  9%              16.3%                 16
10        11                8.9%              20.8%                 46
17         3                3.3%               4.3%                 15
24         7  8.300000000000001%              14.9%                 25
31        14                7.3%              13.3%                 53

   Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort
3                         20.5%              11                     14.1%
10                        37.1%              20       16.100000000000001%
17                        16.7%               5                      5.6%
24                        29.8%               5                        6%
31                        27.5%              35       18.100000000000001%

[5 rows x 23 columns]
hs_directory
      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn
1  09X543                   High School for Violin and Dance      Bronx
2  09X327        Comprehensive Model School Project M.S. 327      Bronx
3  02M280     Manhattan Early College School for Advertising  Manhattan
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12
1          X400    718-842-0687  718-589-9849              9              12
2          X240    718-294-8111  718-294-8109              6              12
3          M520  718-935-3477             NaN              9              10
4          Q695    718-969-3155  718-969-3552              6              12

  expgrade_span_min  expgrade_span_max    ...      \
0               NaN                NaN    ...
1               NaN                NaN    ...
2               NaN                NaN    ...
3                 9               14.0    ...
4               NaN                NaN    ...

                        priority05 priority06 priority07 priority08  \
0                              NaN        NaN        NaN        NaN
1                              NaN        NaN        NaN        NaN
2  Then to New York City residents        NaN        NaN        NaN
3                              NaN        NaN        NaN        NaN
4                              NaN        NaN        NaN        NaN

  priority09  priority10                                         Location 1  \
0        NaN         NaN  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1        NaN         NaN  1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2        NaN         NaN  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3        NaN         NaN  411 Pearl Street\nNew York, NY 10038\n(40.7106...
4        NaN         NaN  160-20 Goethals Avenue\nJamaica, NY 11432\n(40...

      DBN        lat        lon
0  17K548  40.670299 -73.961648
1  09X543  40.827603 -73.904475
2  09X327  40.842414 -73.916162
3  02M280  40.710679 -74.000807
4  28Q680  40.718810 -73.806500

[5 rows x 61 columns]

合并数据集

现在我们已经完成了全部准备工作,我们可以用 DBN 列将数据组合在一起了。最终,我们将会从原始数据集得到一个有着上百列的数据集。当我们合并它们,请注意有些数据集中会丢失了 sat_result 中出现的高中。为了解决这个问题,我们需要使用 outer 方法来合并缺少行的数据集,这样我们就不会丢失数据。在实际分析中,缺少数据是很常见的。能够展示解释和解决数据缺失的能力是构建一个作品集的重要部分。

你可以在阅读关于不同类型的合并。

接下来的代码,我们将会:

  • 循环遍历 data 文件夹中的每一个条目。
  • 输出条目中的非唯一的 DBN 码数量。
  • 决定合并策略 - innerouter
  • 使用 DBN 列将条目合并到 DataFrame full 中。

In [75]:

flat_data_names = [k for k,v in data.items()]
flat_data = [data[k] for k in flat_data_names]
full = flat_data[0]
for i, f in enumerate(flat_data[1:]):
    name = flat_data_names[i+1]
    print(name)
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "inner"
    if name in ["sat_results", "ap_2010", "graduation"]:
        join_type = "outer"
    if name not in ["math_test_results"]:
        full = full.merge(f, on="DBN", how=join_type)

full.shape
survey
0
ap_2010
1
sat_results
0
class_size
0
demographics
0
graduation
0
hs_directory
0

Out[75]:

(374, 174)
查看其它分页:

最新评论

我也要发表评论

海南老王 [Safari 11.0|Mac 10.11] 2017-10-23 00:04 26 回复
好长的文章,不过可以慢慢跟着做,要是有国内的数据源就好了

LCTT 译者

Yoo-4x 🌟 🌟
共计翻译: 2 篇 | 共计贡献: 97
贡献时间:2017-01-08 -> 2017-04-14
访问我的 LCTT 主页 | 在 GitHub 上关注我

收藏

返回顶部

分享到微信

打开微信,点击顶部的“╋”,
使用“扫一扫”将网页分享至微信。