Control egress in a private instance

This section describes the high-level architecture for establishing egress control from private Cloud Data Fusion instances during the development and pipeline execution phases.

The following system architecture diagram shows how a private Cloud Data Fusion instance connects with the public internet when you develop a pipeline:

Private instance architecture diagram

You can control connections to SaaS applications and third-party public cloud services during pipeline development or execution by routing all egress traffic through customer projects. This process uses the following resources:

  • Custom VPC network route: Your VPC network routes traffic through a custom route to the gateway VMs. The route is exported to the tenant project VPC through VPC network peering.

  • Proxy VM: A proxy VM routes egress traffic from the Cloud Data Fusion tenant project out of Google Cloud to the specified destination through the public internet. You create and manage the gateway VMs in your customer projects. It's recommended that you configure them in a high-availability (HA) setup behind an internal load balancer (ILB). If multiple private Cloud Data Fusion instances use the same VPC network, they can share the same VM within the VPC.

Before you begin

Set up egress control during pipeline development

Egress control lets you control or filter the traffic that leaves your network, which is useful in VPC Service Controls environments. There is no preferred network proxy for performing this task. Examples of proxies include Squid proxy, HAProxy, and Envoy.

The examples in this guide describe how to set up an HTTP proxy for HTTP filtering on VM instances that use a Debian image. The examples use a Squid proxy server, which is one way of setting up a proxy server.

Create a proxy VM

Create a VM with IP forwarding enabled in the same VPC as your private Cloud Data Fusion instance, using the following startup script.

This script installs the Squid proxy and configures it to intercept HTTP traffic and allow the .squid-cache.org and .google.com domains. You can replace these domains with the domains that you want to connect with your Cloud Data Fusion instance.
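In Squid's dstdomain ACLs, an entry with a leading dot (such as .google.com) matches the base domain and every subdomain, while an entry without one matches only the exact host. As a quick sanity check before deploying an allowlist, you can emulate this matching in plain bash; the matches_allowlist helper below is illustrative only and is not part of Squid:

```shell
#!/bin/bash
# Emulate Squid dstdomain matching for entries in an allowlist file.
# A leading-dot entry (".google.com") matches the base domain and all
# of its subdomains; a plain entry matches the exact host only.
matches_allowlist() {
  local host="$1" file="$2" entry base
  while read -r entry; do
    [ -z "$entry" ] && continue
    if [ "${entry#.}" != "$entry" ]; then
      base="${entry#.}"
      # Match the domain itself or any "<something>.<base>" subdomain.
      if [ "$host" = "$base" ]; then return 0; fi
      case "$host" in *".$base") return 0 ;; esac
    elif [ "$host" = "$entry" ]; then
      return 0
    fi
  done < "$file"
  return 1
}

# Example allowlist matching the one in the startup script.
printf '.squid-cache.org\n.google.com\n' > /tmp/allowed_domains.txt

matches_allowlist www.google.com /tmp/allowed_domains.txt && echo "allowed"
matches_allowlist evil-google.com /tmp/allowed_domains.txt || echo "denied"
```

This check only approximates how Squid evaluates the allowlist; the authoritative behavior is Squid's own ACL processing on the proxy VM.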

Console

  1. Go to the VM instances page.

    Go to the VM instances page

  2. Click Create instance.

  3. Use the same VPC that has network peering set up with the private Cloud Data Fusion instance. For more information about VPC network peering in this scenario, see Before you begin.

  4. Enable IP forwarding for the instance in the same network as the Cloud Data Fusion instance.

  5. In the Startup script field, enter the following script:

      #! /bin/bash
      apt-get -y install squid3
      cat <<EOF > /etc/squid/conf.d/debian.conf
      #
      # Squid configuration settings for Debian
      #
      logformat squid %ts.%03tu %6tr %>a %Ss/%03>Hs %<st %rm %ru %ssl::>sni %Sh/%<a %mt
      logfile_rotate 10
      debug_options rotate=10

      # configure intercept port
      http_port 3129 intercept

      # allow only certain sites
      acl allowed_domains dstdomain "/etc/squid/allowed_domains.txt"
      http_access allow allowed_domains

      # deny all other http requests
      http_access deny all
      EOF

      # Create a file with allowed egress domains
      # Replace these example domains with the domains that you want to allow
      # egress from in Data Fusion pipelines
      cat <<EOF > /etc/squid/allowed_domains.txt
      .squid-cache.org
      .google.com
      EOF

      /etc/init.d/squid restart

      iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 3129
      echo 1 > /proc/sys/net/ipv4/ip_forward
      echo net.ipv4.ip_forward=1 > /etc/sysctl.d/11-gce-network-security.conf
      iptables -t nat -A POSTROUTING -s 0.0.0.0/0 -p tcp --dport 443 -j MASQUERADE

gcloud

export CDF_PROJECT=<cdf-project>
export PROXY_VM=<proxy-vm>
export ZONE=<vm-zone>
export SUBNET=<subnet>
export VPC_NETWORK=<vpc-network>
export COMPUTE_ENGINE_SA=<compute-engine-sa>

gcloud beta compute --project=$CDF_PROJECT instances create $PROXY_VM \
    --zone=$ZONE \
    --machine-type=e2-medium \
    --subnet=$SUBNET \
    --no-address \
    --metadata=startup-script=\#\!\ /bin/bash$'\n'apt-get\ -y\ install\ squid3$'\n'cat\ \<\<EOF\ \>\ /etc/squid/conf.d/debian.conf$'\n'\#$'\n'\#\ Squid\ configuration\ settings\ for\ Debian$'\n'\#$'\n'logformat\ squid\ \%ts.\%03tu\ \%6tr\ \%\>a\ \%Ss/\%03\>Hs\ \%\<st\ \%rm\ \%ru\ \%ssl::\>sni\ \%Sh/\%\<a\ \%mt$'\n'logfile_rotate\ 10$'\n'debug_options\ rotate=10$'\n'$'\n'\#\ configure\ intercept\ port$'\n'http_port\ 3129\ intercept$'\n'$'\n'\#\ allow\ only\ certain\ sites$'\n'acl\ allowed_domains\ dstdomain\ \"/etc/squid/allowed_domains.txt\"$'\n'http_access\ allow\ allowed_domains$'\n'$'\n'\#\ deny\ all\ other\ http\ requests$'\n'http_access\ deny\ all$'\n'EOF$'\n'$'\n'$'\n'\#\ Create\ a\ file\ with\ allowed\ egress\ domains$'\n'\#\ Replace\ these\ example\ domains\ with\ the\ domains\ that\ you\ want\ to\ allow\ $'\n'\#\ egress\ from\ in\ Data\ Fusion\ pipelines$'\n'cat\ \<\<EOF\ \>\ /etc/squid/allowed_domains.txt$'\n'.squid-cache.org$'\n'.google.com$'\n'EOF$'\n'$'\n'/etc/init.d/squid\ restart$'\n'$'\n'iptables\ -t\ nat\ -A\ PREROUTING\ -p\ tcp\ --dport\ 80\ -j\ REDIRECT\ --to-port\ 3129$'\n'echo\ 1\ \>\ /proc/sys/net/ipv4/ip_forward$'\n'echo\ net.ipv4.ip_forward=1\ \>\ /etc/sysctl.d/11-gce-network-security.conf$'\n'iptables\ -t\ nat\ -A\ POSTROUTING\ -s\ 0.0.0.0/0\ -p\ tcp\ --dport\ 443\ -j\ MASQUERADE \
    --can-ip-forward \
    --maintenance-policy=MIGRATE \
    --service-account=$COMPUTE_ENGINE_SA \
    --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
    --tags=http-server,https-server \
    --image=debian-10-buster-v20210420 \
    --image-project=debian-cloud \
    --boot-disk-size=10GB \
    --boot-disk-type=pd-balanced \
    --boot-disk-device-name=instance-1 \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --reservation-affinity=any

gcloud compute --project=$CDF_PROJECT firewall-rules create egress-allow-http \
    --direction=INGRESS --priority=1000 --network=$VPC_NETWORK \
    --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 \
    --target-tags=https-server

gcloud compute --project=$CDF_PROJECT firewall-rules create egress-allow-https \
    --direction=INGRESS --priority=1000 --network=$VPC_NETWORK \
    --action=ALLOW --rules=tcp:443 --source-ranges=0.0.0.0/0 \
    --target-tags=https-server

Create a custom route

Create a custom route to connect to the gateway VM instance that you created.

Console

To create your route in the Google Cloud console, see Adding a static route.

When you configure the route, do the following:

  • Set the Priority to a value greater than or equal to 1001.
  • Use the same project and VPC as the private Cloud Data Fusion instance.
  • Be sure that your VPC network peering configuration allows exporting routes, so that the Cloud Data Fusion tenant project VPC imports this custom route through VPC network peering.
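With the gcloud CLI, route export can be enabled on an existing peering as shown in the following sketch. The peering name is a placeholder for the peering that connects your VPC to the Cloud Data Fusion tenant project:

gcloud compute networks peerings update <peering-name> \
    --network=$VPC_NETWORK \
    --export-custom-routes \
    --project=$CDF_PROJECT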

gcloud

To create your route with the gcloud CLI:

export ROUTE=<route-name>

gcloud beta compute routes create $ROUTE --project=$CDF_PROJECT \
    --network=$VPC_NETWORK --priority=1001 \
    --destination-range=0.0.0.0/0 \
    --next-hop-instance=$PROXY_VM \
    --next-hop-instance-zone=$ZONE

Set up egress control for pipeline execution

After you're able to access the public internet through the allowed hostnames in Preview and Wrangler in your design environment, deploy your pipeline. Deployed Cloud Data Fusion pipelines run on Dataproc clusters by default.

To ensure that all public internet traffic from the Dataproc cluster goes through one or more proxy VMs, add a private DNS zone and records. This step is required because Cloud NAT doesn't support filtering.

In the DNS records, include the IP address of the proxy VM or the ILB.
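For example, if a pipeline needs to reach hosts under google.com, a private Cloud DNS zone that resolves those names to the proxy might be sketched as follows. The zone name, domain, and IP address are placeholders; substitute your own allowed domain and the address of your proxy VM or ILB:

gcloud dns managed-zones create proxy-egress-zone \
    --description="Resolve allowed egress domains to the proxy" \
    --dns-name=google.com. \
    --visibility=private \
    --networks=$VPC_NETWORK \
    --project=$CDF_PROJECT

gcloud dns record-sets create "*.google.com." \
    --zone=proxy-egress-zone \
    --type=A --ttl=300 \
    --rrdatas=<proxy-vm-or-ilb-ip> \
    --project=$CDF_PROJECT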

Deploy your pipeline

After you've verified the pipeline in the design phase, deploy your pipeline. Deployed pipelines run on Dataproc clusters by default.

To ensure that all public internet traffic from the Dataproc cluster goes through one or more Proxy VMs, add a custom route with instance tags proxy and priority 1000 to the same VPC as the Dataproc VMs:

Create custom route

Modify your pipeline to use Dataproc tags because Cloud NAT currently doesn't support any egress filtering.
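The tagged route described in this step might look like the following with the gcloud CLI. The route name is a placeholder, and the variables are the ones exported when you created the proxy VM; --tags=proxy restricts the route to VMs that carry the proxy network tag, which is why the Dataproc cluster VMs must be tagged through the pipeline settings:

gcloud compute routes create <dataproc-proxy-route> \
    --project=$CDF_PROJECT \
    --network=$VPC_NETWORK \
    --priority=1000 \
    --tags=proxy \
    --destination-range=0.0.0.0/0 \
    --next-hop-instance=$PROXY_VM \
    --next-hop-instance-zone=$ZONE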

What's next
